Participant
March 11, 2020
Question

Extracting Data from a PDF File


My organization uses third-party software (OpenText Exstream) to create PDF files. A single PDF file contains multiple multi-page documents for different recipients. The first page of each document contains a white line of data (~30 characters) in exactly the same location across all documents. I need to extract just this data from the first page of each document and output it to an external file (preferably a text file). Any thoughts? Thanks in advance.


    3 replies

    Participant
    March 12, 2020

    Thank you.  I will add it to my list of possibilities which, as of the moment, is a list of one. LOL

    try67
    Community Expert
    March 11, 2020

    How many pages can your files have?

    Participant
    March 12, 2020

    Any one file can have hundreds of thousands of pages.  But the data only exists on the first page of each document.

    try67
    Community Expert
    March 12, 2020

    I don't know as it depends on the robustness of the tool.  I can tell you the data resides consistently in the same location on the page and if it doesn't exist on a page, then nothing will be in that location.


    In that case I would not recommend doing it in Acrobat. It's just too much for a script to be able to handle.

    I would do it using a stand-alone tool, which is much more robust and can process much larger files, much faster. If you're interested in hiring someone to develop such a tool for you feel free to contact me privately via [try6767 at gmail.com] to discuss it further.

    Thom Parker
    Community Expert
    March 11, 2020

    Were you planning on doing this extraction with Acrobat, or do you have some other PDF tool?

    Thom Parker - Software Developer at PDFScripting
    Use the Acrobat JavaScript Reference early and often
    Participant
    March 12, 2020

    That is what I'm trying to determine.  What are my options?  Does Acrobat do what I'm describing?  My research suggests the Redact functionality might be relevant.  However, I don't know about the extracting to an external file requirement using Redact.

    Thom Parker
    Community Expert
    March 12, 2020

    In Acrobat, a script or plug-in can scan page content. So yes, Acrobat can do this.

    I've written scripts for doing exactly this type of thing many times.

     

    In JavaScript, the relevant functions are "doc.getPageNthWord()" and "doc.getPageNthWordQuads()".

    Here's the reference entry:

    https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/#t=Acro12_MasterBook%2FJS_API_AcroJS%2FDoc_methods.htm%23TOC_getPageNthWordbc-54&rhtocid=_6_1_8_23_1_53
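
To make the idea concrete, here is a rough sketch of how those two functions might be combined: walk the words on each page, use the word's quads to check whether it sits inside a fixed rectangle, and keep the text of the words that do. The helper names (`quadBounds`, `wordInRegion`, `collectRegionText`) and the `[left, bottom, right, top]` region format are made up for illustration, and the actual coordinates would have to be measured from a real file. It also assumes each quad is an array of eight numbers (four x/y corner pairs), so treat it as a starting point rather than a finished script.

```javascript
// Bounding box of one quad, assumed to be 8 numbers: x1,y1,x2,y2,x3,y3,x4,y4.
function quadBounds(quad) {
  var xs = [], ys = [];
  for (var i = 0; i < quad.length; i += 2) {
    xs.push(quad[i]);
    ys.push(quad[i + 1]);
  }
  return {
    left: Math.min.apply(null, xs),
    right: Math.max.apply(null, xs),
    bottom: Math.min.apply(null, ys),
    top: Math.max.apply(null, ys)
  };
}

// True only if every quad of the word lies inside region = [left, bottom, right, top].
// The region format is a hypothetical convention chosen for this sketch.
function wordInRegion(quads, region) {
  if (!quads || quads.length === 0) {
    return false;
  }
  for (var i = 0; i < quads.length; i++) {
    var b = quadBounds(quads[i]);
    if (b.left < region[0] || b.bottom < region[1] ||
        b.right > region[2] || b.top > region[3]) {
      return false;
    }
  }
  return true;
}

// Scan every page and return [{ page, text }] for pages that have words in the region.
// Pages with nothing in that location are skipped, matching the description in the thread.
function collectRegionText(doc, region) {
  var results = [];
  for (var p = 0; p < doc.numPages; p++) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var w = 0; w < count; w++) {
      if (wordInRegion(doc.getPageNthWordQuads(p, w), region)) {
        // Third argument true strips surrounding punctuation and whitespace.
        words.push(doc.getPageNthWord(p, w, true));
      }
    }
    if (words.length > 0) {
      results.push({ page: p, text: words.join(" ") });
    }
  }
  return results;
}
```

From the Acrobat JavaScript console this could be called as `collectRegionText(this, [left, bottom, right, top])` with the measured coordinates filled in. Because it visits every word on every page, expect it to be slow on very large files.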

      
