Extracting Data from a PDF File

Report · Mar 11, 2020

My organization uses third-party software (OpenText Exstream) to create PDF files. Each PDF file contains multiple multi-page documents for different recipients. The first page of each document contains a line of white text (~30 characters of data) in exactly the same location across all documents. I need to extract just this data from the first page of each document and write it to an external file (preferably a text file). Any thoughts? Thanks in advance.

Report · Mar 11, 2020

Were you planning on doing this extraction with Acrobat, or do you have some other PDF tool?

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Mar 12, 2020

That is what I'm trying to determine. What are my options? Does Acrobat do what I'm describing? My research suggests the Redact functionality might be relevant. However, I don't know whether Redact can meet the requirement of extracting the data to an external file.

Report · Mar 12, 2020

In Acrobat, a script or plug-in can scan page content. So yes, Acrobat can do this.

I've written scripts for doing exactly this type of thing many times.

In JavaScript the relevant functions are "doc.getPageNthWord()" and "doc.getPageNthWordQuads()".

Here's the reference entry:

https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/#t=Acro12_MasterBook%2FJS_API_Acro...
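As a rough illustration of how those two functions fit together, here is a minimal sketch, not a tested production script. It walks every page, keeps only the words whose quads fall inside a fixed rectangle, and returns one line per page that has text there. Pages with nothing in that rectangle are skipped, so each returned line should correspond to the first page of one document. The function names `quadInRegion` and `extractRegionText`, the `region` object, and its coordinates are all made up for the example; the real rectangle has to be measured from an actual page, and it is an assumption that the white text is reported like any other text.

```javascript
// Sketch only. `doc` is an Acrobat Doc object (for example `this` in the
// JavaScript console). `region` is a hypothetical rectangle in page
// coordinates: { left, right, bottom, top }.

// True when every corner of a quad [x1, y1, ..., x4, y4] lies inside region.
function quadInRegion(quad, region) {
  for (var i = 0; i < 8; i += 2) {
    var x = quad[i];
    var y = quad[i + 1];
    if (x < region.left || x > region.right ||
        y < region.bottom || y > region.top) {
      return false;
    }
  }
  return true;
}

// Returns one string for each page that has words inside the region.
function extractRegionText(doc, region) {
  var lines = [];
  for (var p = 0; p < doc.numPages; p++) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var w = 0; w < count; w++) {
      // getPageNthWordQuads returns an array of quads for the word;
      // only the first quad is checked here.
      var quads = doc.getPageNthWordQuads(p, w);
      if (quads && quads.length > 0 && quadInRegion(quads[0], region)) {
        // Third argument false keeps punctuation and trailing whitespace,
        // so the pieces can be joined back together as they appeared.
        words.push(doc.getPageNthWord(p, w, false));
      }
    }
    if (words.length > 0) {
      lines.push(words.join("").replace(/\s+/g, " ").trim());
    }
  }
  return lines;
}
```

From the Acrobat JavaScript console this could be tried with something like `console.println(extractRegionText(this, {left: 36, right: 300, bottom: 740, top: 760}).join("\n"));`, where the four numbers are placeholders for the real location of the data line.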

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Report · Mar 11, 2020

How many pages can your files have?

Report · Mar 12, 2020

Any one file can have hundreds of thousands of pages. But the data only exists on the first page of each document.

Report · Mar 12, 2020

And how would the tool know where each "document" starts within the file?

Report · Mar 12, 2020

I don't know, as it depends on the robustness of the tool. I can tell you the data consistently resides in the same location on the page, and if it doesn't exist on a page, then nothing will be in that location.

Report · Mar 12, 2020

In that case I would not recommend doing it in Acrobat. It's just too much for a script to be able to handle.

I would do it using a stand-alone tool, which is much more robust and can process much larger files much faster. If you're interested in hiring someone to develop such a tool for you, feel free to contact me privately via [try6767 at gmail.com] to discuss it further.

Report · Mar 12, 2020

Thank you. I will add it to my list of possibilities which, as of the moment, is a list of one. LOL