Is there a way to read PDF files programmatically?
Reading text would be a helpful start, preferably in Java.
On a server?
Sure, it is possible. Look into libraries like iText, PDFBox, PDF Clown, etc.
Yes, the files are to be read on a server.
a) iText, PDFBox, etc. have limited capabilities. They fail if the PDF contains even slight formatting such as tables or a multi-column layout. Isn't there an API available from Adobe itself?
b) Is it possible to generate XML from all PDFs, new or old?
These libraries can extract everything (pretty much) that's available in the file. It's up to you to further analyse it and extract from it the data you're after. As mentioned, there's no such thing as a "table" in a PDF file.
I'm not an expert on Acrobat or the ISO specification and am struggling to convey my question correctly. Simply put: "Is there a way to read a PDF file from a Java program on the server side? Not through PDFBox or iText, but through an actual API from Adobe?"
As the image below shows, the APIs give no uniform output: some read the text column-wise, others row-wise.
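To illustrate why two libraries can return the same page in different orders, here is a minimal sketch in plain Java. It does not call any PDF library: the `Fragment` type and the sample coordinates are hypothetical stand-ins for the "text plus position" data that an extractor reports. The same four fragments come out row-wise or column-wise depending only on which coordinate is sorted first.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ReadingOrderDemo {

    /** A piece of text and its position on the page (hypothetical type, not from any PDF library). */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Top-to-bottom first, then left-to-right within the same height. */
    static String rowWise(List<Fragment> fragments) {
        return join(fragments,
                Comparator.comparingDouble((Fragment f) -> f.y).thenComparingDouble(f -> f.x));
    }

    /** Left-to-right first, then top-to-bottom within the same horizontal position. */
    static String columnWise(List<Fragment> fragments) {
        return join(fragments,
                Comparator.comparingDouble((Fragment f) -> f.x).thenComparingDouble(f -> f.y));
    }

    private static String join(List<Fragment> fragments, Comparator<Fragment> order) {
        return fragments.stream()
                .sorted(order)
                .map(f -> f.text)
                .collect(Collectors.joining(" "));
    }

    /** Made-up data: a 2x2 "table" as an extractor might report it. */
    static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment("Name", 50, 100),
                new Fragment("Qty", 200, 100),
                new Fragment("Apple", 50, 120),
                new Fragment("3", 200, 120));
    }

    public static void main(String[] args) {
        System.out.println(rowWise(sample()));    // Name Qty Apple 3
        System.out.println(columnWise(sample())); // Name Apple Qty 3
    }
}
```

Neither order is more "correct" than the other; the file itself only records where each piece of text is drawn.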
This is the official Adobe PDF Library for processing PDF files using Java: http://www.datalogics.com/products/pdf/pdflibrary/
You'll need to discuss with them the issues regarding using it on a server, though.
Ok, the server issue is extremely important. Adobe's main programmatic interfaces are offered through the Acrobat SDK. This isn't really a product, just a name for documentation of the interfaces of Acrobat (not Reader). (You've posted in the Acrobat Reader forum, but I assume you know the difference between Acrobat and Acrobat Reader).
Your key issue would be that Acrobat is not for server use. Not licensed and (though it is irrelevant) not technically suitable. So there's no point looking at it if the final deployment is a server. Instead Adobe offer the Adobe PDF Library, which can be licensed (on a royalty basis, price by negotiation) for server use. It has a C/C++ interface which is in fact similar to part of the Acrobat SDK.
But none of these interfaces offers anything for extracting text from tables, because there is no such thing as a table in a PDF. Are you familiar with the PDF specification, ISO 32000-1? There's nothing there but text and lines, with no connection between them.
What you describe is normal. You want there to be a "correct" extraction. There is none; there is extraction of text and position, end of story. Clever software with fuzzy logic might be able to perceive further structure in such data, as the human brain does. Good luck, it's leading-edge research.
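For a sense of what such post-processing looks like at its very simplest, here is a rough sketch in plain Java. It is a naive heuristic, not the fuzzy logic mentioned above and not any library's API: the `Fragment` type, the sample coordinates and the `yTolerance` parameter are all assumptions made for illustration. Fragments whose vertical positions lie within the tolerance are treated as one row, and each row is then ordered left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RowGroupingSketch {

    /** A piece of text and its position on the page (hypothetical type, not from any PDF library). */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into rows: a fragment joins the current row if its y is within yTolerance of the row's first fragment. */
    static List<List<String>> groupIntoRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y; // the first fragment of a row sets its reference height
            }
            rows.get(rows.size() - 1).add(f);
        }

        // Within each row, order the cells from left to right.
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    /** Made-up data: slightly uneven heights, delivered in no particular order. */
    static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment("Qty", 200, 100.4),
                new Fragment("Name", 50, 100.0),
                new Fragment("3", 200, 119.8),
                new Fragment("Apple", 50, 120.0));
    }

    public static void main(String[] args) {
        System.out.println(groupIntoRows(sample(), 2.0)); // [[Name, Qty], [Apple, 3]]
    }
}
```

A heuristic like this breaks down quickly on wrapped cells, merged cells or rotated text, which is why robust table recovery is a hard problem.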
The Aspose and Datalogics APIs are slightly helpful. Just curious: what options are available on the client side to read a PDF file?
If a client has a licensed copy of Acrobat (not Reader) installed, a plug-in to Acrobat can read PDF contents. The plug-in has to be run from within Acrobat; you cannot use this approach to build a standalone EXE.