Extracting Data from a PDF File

March 11, 2020
3 replies
1042 views

My organization uses third-party software (OpenText Exstream) to create PDF files. A single PDF file contains multiple multi-page documents for different recipients. The first page of each document contains a white line of data (~30 characters) in exactly the same location across all documents. I need to extract just this data from the first page of each document and write it to an external file (preferably a text file). Any thoughts? Thanks in advance.


CorrespondenceManAuthor

Participant

Thank you. I will add it to my list of possibilities which, as of the moment, is a list of one. LOL

try67

Community Expert

How many pages can your files have?

CorrespondenceManAuthor

Participant

Any one file can have hundreds of thousands of pages. But the data only exists on the first page of each document.

try67

Community Expert

I don't know as it depends on the robustness of the tool. I can tell you the data resides consistently in the same location on the page and if it doesn't exist on a page, then nothing will be in that location.

In that case I would not recommend doing it in Acrobat. It's just too much for a script to be able to handle.

I would do it using a stand-alone tool, which is much more robust and can process much larger files, much faster. If you're interested in hiring someone to develop such a tool for you feel free to contact me privately via [try6767 at gmail.com] to discuss it further.

Thom Parker

Community Expert

Were you planning on doing this extraction with Acrobat, or do you have some other PDF tool?

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

CorrespondenceManAuthor

Participant

That is what I'm trying to determine. What are my options? Does Acrobat do what I'm describing? My research suggests the Redact functionality might be relevant. However, I don't know about the extracting to an external file requirement using Redact.

Thom Parker

Community Expert

In Acrobat, a script or plug-in can scan page content. So yes, Acrobat can do this.

I have written scripts for exactly this type of thing many times.

In JavaScript, the relevant functions are "doc.getPageNthWord()" and "doc.getPageNthWordQuads()".

Here's the reference entry:

https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/#t=Acro12_MasterBook%2FJS_API_AcroJS%2FDoc_methods.htm%23TOC_getPageNthWordbc-54&rhtocid=_6_1_8_23_1_53
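
For anyone wanting a starting point, here is a minimal sketch of how those two functions might be combined to pull the words that sit inside a fixed rectangle on each page. It is not a tested production script. The function names `wordsInRegion` and `extractRegionFromAllPages` and the `region` object are made up for illustration, and the coordinates passed in would have to be measured from a real file. It assumes each entry returned by getPageNthWordQuads() is an array of eight numbers (four x, y corner pairs) and looks only at a word's first quad. Pages with nothing in that location are skipped, which matches the description that only first pages carry the data.

```javascript
// Sketch only: collect the words whose first quad is centered inside a fixed
// rectangle on one page. `region` is a hypothetical {left, bottom, right, top}
// box in page coordinates (points); the real values must be measured.
function wordsInRegion(doc, pageNum, region) {
  var found = [];
  var count = doc.getPageNumWords(pageNum);
  for (var i = 0; i < count; i++) {
    var quads = doc.getPageNthWordQuads(pageNum, i);
    if (!quads || quads.length === 0) continue;
    var q = quads[0]; // assumed layout: eight numbers, four (x, y) corners
    var xs = [q[0], q[2], q[4], q[6]];
    var ys = [q[1], q[3], q[5], q[7]];
    // Center of the word's bounding box
    var cx = (Math.min.apply(null, xs) + Math.max.apply(null, xs)) / 2;
    var cy = (Math.min.apply(null, ys) + Math.max.apply(null, ys)) / 2;
    if (cx >= region.left && cx <= region.right &&
        cy >= region.bottom && cy <= region.top) {
      // bStrip = false keeps punctuation; whitespace is normalized below
      found.push(doc.getPageNthWord(pageNum, i, false));
    }
  }
  return found.join(" ").replace(/\s+/g, " ").replace(/^\s+|\s+$/g, "");
}

// Walk every page and keep one line of text per page that has data in the region.
function extractRegionFromAllPages(doc, region) {
  var lines = [];
  for (var p = 0; p < doc.numPages; p++) {
    var text = wordsInRegion(doc, p, region);
    if (text !== "") lines.push(text);
  }
  return lines;
}
```

In Acrobat this could be tried from the JavaScript console with something like `extractRegionFromAllPages(this, {left: 0, bottom: 775, right: 200, top: 795})`, where those numbers are placeholders. Looping over every word on every page is slow, so for files with hundreds of thousands of pages the performance concern raised earlier in the thread still applies.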

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
