Participant
April 16, 2017
Answered

Read PDF programmatically

  • April 16, 2017
  • 5 replies
  • 4687 views

Is there a way to read a PDF programmatically?

Reading the text would be a helpful start, preferably in Java.

This topic has been closed for replies.

Legend
April 18, 2017

If a client has a licensed copy of Acrobat (not Reader) installed, a plug-in to Acrobat can read PDF contents. The plug-in has to run inside Acrobat; you cannot use this to build a standalone EXE.

Legend
April 17, 2017

What you describe is normal. You want there to be a "correct" extraction. There is none; there is extraction of text and position, end of story. Clever software with fuzzy logic might be able to infer further properties, as the human brain does with such data. Good luck; it's leading-edge research.

Participant
April 18, 2017

The Aspose and Datalogic APIs are slightly helpful. Just curious: what options are available on the client side to read a PDF file?

Test Screen Name
Correct answer
Legend
April 17, 2017

Ok, the server issue is extremely important. Adobe's main programmatic interfaces are offered through the Acrobat SDK. This isn't really a product, just a name for documentation of the interfaces of Acrobat (not Reader). (You've posted in the Acrobat Reader forum, but I assume you know the difference between Acrobat and Acrobat Reader).

Your key issue would be that Acrobat is not for server use. Not licensed and (though it is irrelevant) not technically suitable. So there's no point looking at it if the final deployment is a server. Instead Adobe offer the Adobe PDF Library, which can be licensed (on a royalty basis, price by negotiation) for server use. It has a C/C++ interface which is in fact similar to part of the Acrobat SDK.

But there is nothing that deals with text extraction from tables, because there is no such thing as a table in a PDF. Are you familiar with the PDF specification, ISO 32000-1? There's nothing there but text and lines, with no connection between them.

try67
Community Expert
April 16, 2017

Sure, there is. Look into libraries like iText, PDFBox, PDF Clown, etc.

Participant
April 17, 2017

Yes, it is to be read on a server.

a) iText, PDFBox, etc. have limited capabilities. They fail if the PDF contains even slight formatting such as tables or layout. Isn't there an API available from Adobe itself?

b) Is it possible to generate XML from all PDFs, new or old?

try67
Community Expert
April 17, 2017

These libraries can extract everything (pretty much) that's available in the file. It's up to you to further analyse it and extract from it the data you're after. As mentioned, there's no such thing as a "table" in a PDF file.
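
To make the "further analysis" step a bit more concrete, here is a minimal Java sketch. It assumes you have already obtained text fragments with page coordinates from one of these libraries. The `Fragment` class, the `groupIntoRows` method and the tolerance value are hypothetical names chosen for illustration; they are not part of iText, PDFBox or any Adobe API. The idea is simply to treat fragments with nearly equal y coordinates as one row and order each row by x.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch only: rebuilds table-like rows from positioned text fragments. */
class RowGrouper {

    /** Hypothetical fragment: a run of text plus the page coordinates where it starts. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Puts a fragment into the current row when its y coordinate lies within
     * yTolerance of that row's first fragment. Rows come out top to bottom
     * (PDF y grows upward, so larger y first) and cells left to right.
     */
    static List<List<String>> groupIntoRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> row = new ArrayList<>();
                row.add(f);
                rows.add(row);
            }
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates: two visual rows, with slightly uneven baselines.
        List<Fragment> page = Arrays.asList(
                new Fragment("Qty", 200, 700.0),
                new Fragment("Item", 50, 700.4),
                new Fragment("Widget", 50, 680.0),
                new Fragment("3", 200, 679.7));
        System.out.println(groupIntoRows(page, 2.0)); // prints [[Item, Qty], [Widget, 3]]
    }
}
```

This only recovers rows. Columns, merged cells and wrapped text need further heuristics, and a fixed tolerance will misfire on pages that mix font sizes, which is the hard part described elsewhere in this thread.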

Legend
April 16, 2017

On a server?