Programmatically convert/extract text from PDF

Question

Hey there -- I have been struggling with this all week. I am trying to take a PDF that we are sent daily and extract its data (text) for placement in our database. I have tried multiple PHP classes and functions, as well as running a Perl script through PHP.

The methods I used above worked for a sample PDF I downloaded from here: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm072322.pdf

So the problem I am having is getting the PDF that our vendor sends us to convert as well. This specific PDF document is generated with Amyuni PDF Converter version 4.0.0.7, and the only difference I see between the two PDFs is in the raw data when I view them in Notepad.

Sample PDF:

%PDF-1.3%âãÏÓ

376 0 obj<< /Linearized 1 /O 379 /H [ 1063 556 ] /L 220094 /E 92903 /N 12 /T 212455 >> endobj

xref376 20 0000000016 00000 n

Vendor PDF:

%PDF-1.3%ÿÿÿÿ1 0 obj<</Title (þÿ I n t u i t _ Q B O B _ I n t e r n a l . p d f)/Producer (Amyuni PDF Converter version 4.0.0.7)/ CreationDate (D:20100830160629-07'00')>>endobj7 0 obj<< /Length 8 0 R /Filter /FlateDecode >>streamxœ ›M®ã6 €O;ä õˆú³ ’¼dÑ]Ñwƒ)º

Is there anything out there that I might be able to use to convert this particular PDF document?

Thanks!

Kevin

Anonymous · Answer

This doesn't do anything for a programmatic solution, but if you open the PDF files you have several options under both Save As and Export, including text, XML, and HTML. Perhaps that is an option? Some of the vendors that offer programs to create PDF files might also offer a library of PDF routines that you could use.
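One thing that may help with diagnosing it: the vendor dump in the question shows `/Filter /FlateDecode` on its stream, which means that stream's bytes are zlib-compressed and will not be readable until they are inflated. The Python sketch below is only a rough diagnostic, not a full PDF parser. It assumes the file is unencrypted and that each stream sits between plain `stream` / `endstream` keywords; the function name `inflate_streams` is made up for this example. Whether compression is actually why the PHP and Perl tools fail on this file is a guess.

```python
import re
import sys
import zlib

# Find the bytes between "stream" and "endstream". The lookbehind stops the
# pattern from matching the tail of the word "endstream" itself.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated contents of every stream that is valid zlib data."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        decompressor = zlib.decompressobj()
        try:
            data = decompressor.decompress(match.group(1))
        except zlib.error:
            # Not zlib data (uncompressed stream, image, other filter): skip it.
            continue
        # Keep only streams that decompressed all the way to the end marker.
        if decompressor.eof:
            inflated.append(data)
    return inflated


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inflate.py vendor.pdf   (the file name is a placeholder)
    with open(sys.argv[1], "rb") as handle:
        for index, content in enumerate(inflate_streams(handle.read())):
            print(f"--- stream {index} ---")
            print(content[:500].decode("latin-1"))
```

If the inflated page streams contain text operators such as `(some text) Tj`, the text is there in some form, although the strings can still be in a font-specific encoding that needs further mapping. If nothing like that shows up, the pages may hold text as images or outlines, and no text extractor will recover it without OCR.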
