
Extracting Data from a PDF File

Community Beginner, Mar 11, 2020

My organization uses third-party software (OpenText - Exstream) to create PDF files. Within one PDF file are multiple, multi-page documents for different recipients. The first page of each document contains a white line of data (~30 characters) in the exact same location across all documents. I need to be able to extract just this data from the first page of each document and output it to an external file (preferably a text file). Any thoughts? Thanks in advance.

Community Expert, Mar 11, 2020

Were you planning on doing this extraction with Acrobat, or do you have some other PDF tool?

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Community Beginner, Mar 12, 2020

That is what I'm trying to determine. What are my options? Does Acrobat do what I'm describing? My research suggests the Redact functionality might be relevant. However, I don't know whether Redact can meet the requirement of extracting to an external file.

Community Expert, Mar 12, 2020

In Acrobat, a script or plug-in can scan page content. So yes, Acrobat can do this.

I have written scripts for doing exactly this type of thing many times.

In JavaScript the relevant functions are "doc.getPageNthWord()" and "doc.getPageNthWordQuads()".

Here's the reference entry:

https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/#t=Acro12_MasterBook%2FJS_API_Acro...
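
As a rough illustration of how those two functions might be combined, here is a minimal sketch, not a tested production script. It assumes the data always sits inside a known rectangle; the `region` object, its coordinates, and the helper names `quadBounds`, `isInside`, and `extractRegionText` are all hypothetical. It also assumes each quad is an array of eight numbers (four x/y corner pairs) and only looks at the first quad of each word.

```javascript
// Bounding box of one quad, assumed to be [x1, y1, x2, y2, x3, y3, x4, y4].
// Using min/max avoids depending on the order of the four corners.
function quadBounds(quad) {
  var xs = [quad[0], quad[2], quad[4], quad[6]];
  var ys = [quad[1], quad[3], quad[5], quad[7]];
  return {
    left: Math.min.apply(null, xs),
    right: Math.max.apply(null, xs),
    bottom: Math.min.apply(null, ys),
    top: Math.max.apply(null, ys)
  };
}

// True when the box lies completely inside the (hypothetical) region.
function isInside(box, region) {
  return box.left >= region.left && box.right <= region.right &&
         box.bottom >= region.bottom && box.top <= region.top;
}

// Walk every page and collect the words whose first quad falls inside `region`.
// Pages with no words in the region produce no entry.
function extractRegionText(doc, region) {
  var results = [];
  for (var p = 0; p < doc.numPages; p++) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var w = 0; w < count; w++) {
      var quads = doc.getPageNthWordQuads(p, w);
      if (!quads || quads.length === 0) continue;
      if (isInside(quadBounds(quads[0]), region)) {
        // bStrip = false asks Acrobat not to strip punctuation and white space.
        words.push(doc.getPageNthWord(p, w, false));
      }
    }
    if (words.length > 0) {
      var text = words.join("").replace(/\s+/g, " ").trim();
      results.push({ page: p, text: text });
    }
  }
  return results;
}
```

In the Acrobat JavaScript console this could be called as `extractRegionText(this, {left: 0, bottom: 0, right: 300, top: 30})`, with the rectangle replaced by the real location of the data, and each result printed with `console.println`. Because it makes one call per word, it may be slow on very large files.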

  

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often

Community Expert, Mar 11, 2020

How many pages can your files have?

Community Beginner, Mar 12, 2020

Any one file can have hundreds of thousands of pages.  But the data only exists on the first page of each document.

Community Expert, Mar 12, 2020

And how would the tool know where each "document" starts within the file?

Community Beginner, Mar 12, 2020

I don't know, as it depends on the robustness of the tool. I can tell you the data resides consistently in the same location on the page, and if it doesn't exist on a page, then nothing will be in that location.
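
One way to read that rule: a page starts a new document exactly when the fixed location contains text. A minimal sketch of that idea follows, assuming the text found at that location has already been collected per page into an array, with an empty string for pages where the location is blank. The function name `splitIntoDocuments` and the shape of its result are hypothetical.

```javascript
// pageTexts[i] is the text found at the fixed location on page i ("" if none).
// Returns one entry per document with its data line and zero-based page range.
function splitIntoDocuments(pageTexts) {
  var docs = [];
  for (var p = 0; p < pageTexts.length; p++) {
    var text = (pageTexts[p] || "").trim();
    if (text !== "") {
      // Non-empty location: this page is the first page of a new document.
      docs.push({ data: text, firstPage: p, lastPage: p });
    } else if (docs.length > 0) {
      // Empty location: this page continues the most recent document.
      docs[docs.length - 1].lastPage = p;
    }
    // Pages before the first data line belong to no document and are skipped.
  }
  return docs;
}
```

For example, `splitIntoDocuments(["A1", "", "", "B2"])` treats pages 0 to 2 as one document and page 3 as the next.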

Community Expert, Mar 12, 2020

In that case I would not recommend doing it in Acrobat. It's just too much for a script to be able to handle.

I would do it using a stand-alone tool, which is much more robust and can process much larger files, much faster. If you're interested in hiring someone to develop such a tool for you, feel free to contact me privately via [try6767 at gmail.com] to discuss it further.

Community Beginner, Mar 12, 2020

Thank you.  I will add it to my list of possibilities which, as of the moment, is a list of one. LOL
