Programmatically convert/extract text from PDF
Hey there -- I have been struggling with this all week. I am trying to take a PDF that we are sent daily and extract its data (text) for insertion into our database. I have tried multiple PHP classes and functions, as well as running a Perl script through PHP.
The methods I used above worked for a sample PDF I downloaded from here: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm072322.pdf
The problem I am having is getting the PDF that our vendor sends us to convert as well. This specific PDF document is generated with Amyuni PDF Converter version 4.0.0.7, and the only difference I can see between the two PDFs is in the raw data when I open them in Notepad.
Sample PDF:
%PDF-1.3%âãÏÓ
376 0 obj<< /Linearized 1 /O 379 /H [ 1063 556 ] /L 220094 /E 92903 /N 12 /T 212455 >> endobj
xref376 20 0000000016 00000 n
Vendor PDF:
%PDF-1.3%ÿÿÿÿ1 0 obj<</Title (þÿ I n t u i t _ Q B O B _ I n t e r n a l . p d f)/Producer (Amyuni PDF Converter version 4.0.0.7)/ CreationDate (D:20100830160629-07'00')>>endobj7 0 obj<< /Length 8 0 R /Filter /FlateDecode >>streamxœ ›M®ã6 €O;ä õˆú³ ’¼dÑ]Ñwƒ)º
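For anyone who wants to look past the Notepad view: the "xœ" right after "stream" in the vendor file is what the bytes 0x78 0x9C look like in Windows-1252, which is consistent with a zlib header and matches the /FlateDecode filter. Below is a minimal Python sketch for inflating such streams so the two files can be compared at the content level. It assumes the file is not encrypted and that the streams use plain Flate compression with no other filters; inflate_streams is a hypothetical helper name, not part of any library. The output is the raw content stream (drawing and text operators), not clean text, and fonts with custom encodings may still make the text unreadable.

```python
import re
import zlib

# Captures the bytes between the "stream" and "endstream" keywords.
# Assumption: the keywords do not occur inside the compressed data itself.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

def inflate_streams(pdf_bytes):
    """Return the decompressed body of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates stray bytes after the zlib data.
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate-compressed, or it uses a different filter; skip it.
            continue
    return results

if __name__ == "__main__":
    import sys
    # Usage: python inflate.py vendor.pdf
    with open(sys.argv[1], "rb") as handle:
        for index, body in enumerate(inflate_streams(handle.read())):
            print(f"--- stream {index}: {len(body)} bytes ---")
            print(body[:200])
```

If the inflated streams contain BT ... ET blocks with readable strings, the text is there and the extraction tools are probably tripping over the file structure; if the strings look like gibberish, the font encoding is the more likely culprit.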
Is there anything out there that I might be able to use to convert this particular PDF document?
Thanks!
Kevin