Participant
September 2, 2010
Question

Programmatically convert/extract text from PDF


Hey there -- I have been struggling with this all week. I am trying to take a PDF that we are sent daily and extract its data (text) for placement in our database. I have tried multiple PHP classes and functions, as well as running a Perl script through PHP.

The methods I used above worked for a sample PDF I downloaded from here: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm072322.pdf

So the problem I am having is getting the PDF that our vendor sends us to convert as well. This specific PDF document is generated with Amyuni PDF Converter version 4.0.0.7, and the only difference I see between the two PDFs is when I use Notepad to view the raw data.

Sample PDF:

%PDF-1.3%âãÏÓ

376 0 obj<< /Linearized 1 /O 379 /H [ 1063 556 ] /L 220094 /E 92903 /N 12 /T 212455 >> endobj

                                                     xref376 20 0000000016 00000 n

Vendor PDF:

%PDF-1.3%ÿÿÿÿ1 0 obj<</Title (þÿ I n t u i t _ Q B O B _ I n t e r n a l . p d f)/Producer (Amyuni PDF Converter version 4.0.0.7)/ CreationDate (D:20100830160629-07'00')>>endobj7 0 obj<< /Length 8 0 R /Filter /FlateDecode >>streamxœ ›M®ã6 €O&#144;;ä õˆú³  ’¼dÑ]Ñwƒ)º
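The vendor dump above shows `/Filter /FlateDecode` just before `stream`, which means the stream that follows is zlib-compressed and will not show readable text in Notepad. A minimal sketch of inflating such streams, assuming Python with only the standard library; the function name `inflate_streams` is made up for illustration, and the regular expression is a rough heuristic that ignores encryption, other filters, and unusual object layouts:

```python
import re
import zlib

# Rough heuristic: a dictionary, then the "stream" keyword, then raw data
# up to "endstream". This is not a real PDF parser.
STREAM_RE = re.compile(
    rb"<<(?P<dict>.*?)>>\s*stream\r?\n?(?P<data>.*?)endstream",
    re.DOTALL,
)


def inflate_streams(pdf_bytes):
    """Return the decompressed bytes of every /FlateDecode stream found."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        if b"/FlateDecode" not in match.group("dict"):
            continue  # uncompressed or differently filtered stream
        try:
            # decompressobj tolerates trailing bytes, such as the
            # end-of-line that sits before "endstream"
            results.append(zlib.decompressobj().decompress(match.group("data")))
        except zlib.error:
            continue  # not valid zlib data, skip it
    return results
```

The inflated bytes are page content operators (text is drawn with operators such as `Tj` and `TJ`), so turning them into clean text still depends on how the fonts in the file are encoded.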

Is there anything out there that I might be able to use to convert this particular PDF document?

Thanks!

Kevin

This topic has been closed for replies.

2 replies

September 4, 2010

This doesn't do anything for a programmatic solution, but if you open the PDF files you have several options for both Save As and Export, including text, XML, and HTML. Perhaps that is an option? Some of the vendors that offer programs to create PDF files might also offer a library of PDF routines that you could use.

Lon_Winters
Inspiring
September 3, 2010

And I didn't know such a thing could even be done!

PDFs are definitely not my area of expertise, but I'll take a stab at it. It's probably oversimplified and you're way beyond this point, but here goes anyway.

Given that the vendor PDF includes more metadata, I would guess that it was created for a higher compatibility version than that of the sample PDF, which is compatible with Acrobat 5 and later. Could you re-save the vendor PDF at the 5.0 version and see if that has any effect? I find that option when you go to Reduce File Size.
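One way to test the version guess is to read the header at the very start of each file. In the raw dumps quoted in the question, both files begin with `%PDF-1.3`. A minimal sketch, assuming Python with only the standard library; the function names are made up for illustration:

```python
import re


def pdf_header_version(first_bytes):
    """Return the version from a header such as b'%PDF-1.3', or None."""
    match = re.match(rb"%PDF-(\d+\.\d+)", first_bytes)
    return match.group(1).decode("ascii") if match else None


def read_pdf_version(path):
    """Read the first few bytes of a file and report its header version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(16))
```

Calling `read_pdf_version` on the sample and the vendor file and comparing the two results would show whether the header versions actually differ.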