Read PDF programmatically

April 16, 2017
5 replies
4688 views

Is there a way to read a PDF programmatically?

Reading the text would be a helpful start, preferably in Java.

read auto

Correct answer Test Screen Name

Ok, the server issue is extremely important. Adobe's main programmatic interfaces are offered through the Acrobat SDK. This isn't really a product, just a name for documentation of the interfaces of Acrobat (not Reader). (You've posted in the Acrobat Reader forum, but I assume you know the difference between Acrobat and Acrobat Reader).

Your key issue would be that Acrobat is not for server use. Not licensed and (though it is irrelevant) not technically suitable. So there's no point looking at it if the final deployment is a server. Instead Adobe offer the Adobe PDF Library, which can be licensed (on a royalty basis, price by negotiation) for server use. It has a C/C++ interface which is in fact similar to part of the Acrobat SDK.

But none of these offers anything for extracting text from tables, because there is no such thing as a table in a PDF. Are you familiar with the PDF specification, ISO 32000-1? There's nothing there but text and lines, with no connection between them.
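To make the "text and lines" point concrete, here is a minimal Java sketch that reads a hand-written, uncompressed content stream fragment. The fragment, the class name `ContentStreamSketch`, and the method `shownStrings` are all made up for illustration. The operators themselves (`BT`/`ET`, `Tf`, `Td`, `Tj` for text; `m`, `l`, `S` for a stroked line) are the ones defined in ISO 32000-1. Real files usually compress their streams and use font encodings, escapes, and `TJ` arrays, so this is not a usable extractor, only a picture of what is on the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: lists the literal strings shown with the Tj operator. */
class ContentStreamSketch {

    // Hypothetical page fragment: two text runs and one stroked line.
    // Nothing in it says the two strings are column headers of a table.
    static final String SAMPLE =
        "BT /F1 12 Tf 72 700 Td (Name) Tj ET\n" +
        "BT /F1 12 Tf 200 700 Td (Amount) Tj ET\n" +
        "72 690 m 300 690 l S\n";

    // Matches "(text) Tj". Deliberately ignores escapes, nested
    // parentheses, hex strings, and TJ arrays.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    static List<String> shownStrings(String contentStream) {
        List<String> out = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shownStrings(SAMPLE)); // prints [Name, Amount]
    }
}
```

The line drawn with `m`, `l`, and `S` is a separate drawing instruction; the file records no relationship between it and the two strings.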

Test Screen Name

Legend

If a client has a licensed Acrobat (not Reader) installed, a plug-in to Acrobat can read PDF contents. The plug-in has to be run from Acrobat; you cannot use this to make an EXE.

Test Screen Name

Legend

What you describe is normal. You want there to be a "correct" extraction. There is none; there is extraction of text and position, end of story. Clever software with fuzzy logic might be able to perceive further properties, as the human brain does with such data. Good luck, it's leading-edge research.

mohds66928557 (Author)

Participant

The Aspose and Datalogic APIs are slightly helpful. Just curious, what options are available on the client side to read a PDF file?

try67

Community Expert

Sure, there is. Look into libraries like iText, PDFBox, PDF Clown, etc.

mohds66928557 (Author)

Participant

Yes, to be read on server.

a) iText, PDFBox, etc. have limited capabilities. They fail if the PDF contains even slight formatting, such as tables or layout. Isn't there an API available from Adobe itself?

b) Is it possible to generate XML from all PDFs, new or old?

try67

Community Expert

These libraries can extract everything (pretty much) that's available in the file. It's up to you to further analyse it and extract from it the data you're after. As mentioned, there's no such thing as a "table" in a PDF file.
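As a rough sketch of what that further analysis can look like, the Java below groups positioned text fragments into rows by comparing their y coordinates, then orders each row left to right. Everything here is hypothetical: the `Fragment` type, the `groupIntoRows` method, and the tolerance value are invented for illustration, and the fragments are assumed to have already been obtained from whichever library you use. It also assumes y increases upward, as in default PDF user space; some libraries report flipped coordinates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative heuristic only: guesses table rows from text positions. */
class RowGrouper {

    /** One extracted text run and the position of its origin (hypothetical type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Puts a fragment in the current row when its y is within yTolerance
     * of that row's first fragment; otherwise starts a new row.
     */
    static List<List<String>> groupIntoRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Assumes y grows upward, so descending y reads top to bottom.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            // Within a row, order cells left to right.
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> page = Arrays.asList(
            new Fragment("Amount", 200, 700),
            new Fragment("Name", 72, 700.5),
            new Fragment("42.00", 200, 680),
            new Fragment("Widget", 72, 680));
        // prints [[Name, Amount], [Widget, 42.00]]
        System.out.println(groupIntoRows(page, 2.0));
    }
}
```

A heuristic like this breaks on wrapped cells, merged cells, and rotated text, which is why no library can promise a "correct" table: the structure is being guessed, not read.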

Test Screen Name

Legend

On a server?
