
Read PDF programmatically

New Here, Apr 16, 2017

Is there a way to read a PDF programmatically?

Reading the text would be a helpful start, preferably in Java.


1 Correct answer

LEGEND, Apr 17, 2017: see the full reply later in this thread, beginning "Ok, the server issue is extremely important."

LEGEND, Apr 16, 2017

On a server?

Community Expert, Apr 16, 2017

Sure, there is. Look into libraries like iText, PDFBox, PDF Clown, etc.

New Here, Apr 16, 2017

Yes, it is to be read on a server.

a) iText, PDFBox, etc. have limited capabilities. They fail if the PDF contains even slight formatting, such as tables or layout. Isn't there an API available from Adobe itself?

b) Is it possible to generate XML from all PDFs, new or old?

Community Expert, Apr 17, 2017

These libraries can extract everything (pretty much) that's available in the file. It's up to you to further analyse it and extract from it the data you're after. As mentioned, there's no such thing as a "table" in a PDF file.

New Here, Apr 17, 2017

I'm not an expert on Acrobat or ISO and am struggling to convey my question correctly. Simply put: "Is there a way to read a PDF file through a Java program on the server side? Not through PDFBox or iText, but through an actual API from Adobe."

As the image below depicts, the APIs give no uniform output: some read it column-wise, others row-wise.

Community Expert, Apr 17, 2017

This is the official Adobe PDF Library for processing PDF files using Java: http://www.datalogics.com/products/pdf/pdflibrary/

You'll need to discuss with them the issues regarding using it on a server, though.

LEGEND, Apr 17, 2017

Ok, the server issue is extremely important. Adobe's main programmatic interfaces are offered through the Acrobat SDK. This isn't really a product, just a name for documentation of the interfaces of Acrobat (not Reader). (You've posted in the Acrobat Reader forum, but I assume you know the difference between Acrobat and Acrobat Reader).

Your key issue would be that Acrobat is not for server use. Not licensed and (though it is irrelevant) not technically suitable. So there's no point looking at it if the final deployment is a server. Instead Adobe offer the Adobe PDF Library, which can be licensed (on a royalty basis, price by negotiation) for server use. It has a C/C++ interface which is in fact similar to part of the Acrobat SDK.

But it offers nothing specific to extracting text from tables, because there is no such thing as a table in a PDF. Are you familiar with the PDF specification, ISO 32000-1? There's nothing there but text and lines, with no connection between them.
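
To illustrate the point about text and lines, here is a minimal Java sketch. The content stream fragment in it is hand-written for illustration (it is not taken from any real file), and the class name `ContentStreamSketch`, the method `textWithPositions`, and the regular expression are all assumptions made for this example. It only handles an uncompressed stream that uses the exact `1 0 0 1 x y Tm (text) Tj` pattern; real PDFs use compressed streams, font encodings, and many other operators, which is why a library is needed in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentStreamSketch {
    // Hand-written, uncompressed content stream fragment (illustrative only).
    // It draws what a human would read as a small two-column table.
    static final String STREAM =
        "BT /F1 12 Tf 1 0 0 1 72 700 Tm (Name) Tj ET\n" +
        "BT /F1 12 Tf 1 0 0 1 200 700 Tm (Qty) Tj ET\n" +
        "72 690 m 300 690 l S\n" +   // a ruled line: move-to, line-to, stroke
        "BT /F1 12 Tf 1 0 0 1 72 670 Tm (Apple) Tj ET\n" +
        "BT /F1 12 Tf 1 0 0 1 200 670 Tm (3) Tj ET\n";

    // Matches "1 0 0 1 x y Tm (text) Tj": set the text position, then show a string.
    // Assumption: integer coordinates and no escaped characters inside the string.
    static final Pattern SHOW_TEXT = Pattern.compile(
        "1 0 0 1 (\\d+) (\\d+) Tm \\(([^()\\\\]*)\\) Tj");

    /** Returns each shown string as "text@x,y", in the order it appears in the stream. */
    static List<String> textWithPositions(String stream) {
        List<String> out = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(stream);
        while (m.find()) {
            out.add(m.group(3) + "@" + m.group(1) + "," + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String item : textWithPositions(STREAM)) {
            System.out.println(item);
        }
    }
}
```

Nothing in the stream says that the four strings form a table, or that the ruled line separates a header row from a data row. Any such structure has to be inferred from the positions.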

LEGEND, Apr 17, 2017

What you describe is normal. You want there to be a "correct" extraction. There is none; there is extraction of text and position, end of story. Clever software with fuzzy logic might be able to infer further properties from such data, as the human brain does. Good luck; it's leading-edge research.
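
As a rough illustration of why different libraries return the same page in different orders, the Java sketch below sorts one set of positioned text fragments in two ways. Everything in it is an assumption made for the example: the `Fragment` type, the sample coordinates, and the exact-match comparison on `x` and `y`. Real extractors report slightly different coordinates for text on the same visual row, so a practical version would need a tolerance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {
    /** A piece of text plus the position reported for it (hypothetical type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Made-up fragments laid out as a grid of three rows and two columns. */
    static List<Fragment> sample() {
        List<Fragment> fs = new ArrayList<>();
        fs.add(new Fragment("Name", 72, 700));
        fs.add(new Fragment("Qty", 200, 700));
        fs.add(new Fragment("Apple", 72, 670));
        fs.add(new Fragment("3", 200, 670));
        fs.add(new Fragment("Pear", 72, 650));
        fs.add(new Fragment("5", 200, 650));
        return fs;
    }

    /** Top to bottom first (PDF y grows upward, so larger y comes first), then left to right. */
    static String rowWise(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        return join(sorted);
    }

    /** Left to right first, then top to bottom within each x position. */
    static String columnWise(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x)
                .thenComparingDouble(f -> -f.y));
        return join(sorted);
    }

    private static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(rowWise(sample()));    // Name Qty Apple 3 Pear 5
        System.out.println(columnWise(sample())); // Name Apple Pear Qty 3 5
    }
}
```

Both orderings are legitimate readings of the same text and positions; neither is the "correct" one, and deciding which cells belong together is left to the caller.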

New Here, Apr 18, 2017

The Aspose and Datalogics APIs are slightly helpful. Just curious, what options are available on the client side to read a PDF file?

LEGEND, Apr 18, 2017

If a client has a licensed Acrobat (not Reader) installed, a plug-in to Acrobat can read PDF contents. The plug-in has to be run from Acrobat; you cannot use this to make an EXE.
