
Extracting Pages Based on Matching Strings in Adobe Acrobat 2020

New Here,
Jul 19, 2023

Rookie here! I would love some help. Every week I sort through hundreds of PDF pages and combine them based on shared routing numbers... there's gotta be a better way to do this. I've thought of writing code to either reorganize the pages or extract the pages that contain matching strings, where the routing number strings come from an xlsx file I have on Windows. My version is Adobe Acrobat Standard 2020. Acrobat only takes JavaScript, which I'm unfamiliar with, but I've tried to piece together code from other pages into something that might work. Help?

 

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringsToSearchFor = ["routingnumber"];

for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n)!=-1) {
            pageArray.push(p);
            break;
        }
    }
}

if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc(); // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
        } );
    }
    // remove the first page
    d.deletePages(0);
}

Community Expert,
Jul 19, 2023

The structure of the code is good, but it doesn't actually collect the page words or search for anything.

Here's a change to the portion of the code that iterates over the pages and words.

var pageArray = [], cPageText;
var stringsToSearchFor = ["routingnumber"];

for (var p = 0; p < this.numPages; p++) {
    // Collect all words on page
    cPageText = "";
    for (var n = 0; n < this.getPageNumWords(p); n++) {
       cPageText += this.getPageNthWord(p, n);
    }
    if(stringsToSearchFor.some(function(cTest){return (cPageText.indexOf(cTest) != -1);}))
        pageArray.push(p);
}

 

This code collects all the words on the page into one string, and then searches that string for any matches in the array of routing numbers.

The reason for doing it this way is that Acrobat splits text into words at non-word characters. So if the routing number contains punctuation, it will be divided into several words. If this is not the case, and the routing number is a single continuous string of alphanumeric characters, then the code can be made more efficient by searching individual words.
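As a rough sketch of that per-word variant (assuming every routing number really does come back from Acrobat as one whole word), something like the following could work. The function name `findPagesByWord` and its `doc` parameter are made up for illustration; in the Acrobat console you would pass `this` as `doc`. Only `numPages`, `getPageNumWords` and `getPageNthWord` are actual Doc members.

```javascript
// Hypothetical helper: returns the 0-based numbers of pages on which
// at least one word is exactly equal to one of the target strings.
function findPagesByWord(doc, targets) {
    // Build a lookup table so each word can be checked without
    // looping over the whole target list.
    var lookup = {};
    for (var i = 0; i < targets.length; i++) {
        lookup[targets[i]] = true;
    }

    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            var word = doc.getPageNthWord(p, n);
            if (Object.prototype.hasOwnProperty.call(lookup, word)) {
                pages.push(p);
                break; // one hit is enough for this page
            }
        }
    }
    return pages;
}

// Assumed usage in Acrobat:
// var pageArray = findPagesByWord(this, ["routingnumber"]);
```

Because the inner loop stops at the first hit, pages with a match near the top are handled quickly, and no page-wide string has to be built.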

 

Another method is to search for a pattern using a regular expression, if the routing numbers follow a format that suits that.
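A minimal sketch of the regular expression approach is below. It assumes, purely for illustration, that a routing number is a standalone run of nine digits; adjust the pattern to whatever your numbers actually look like. `findPagesByPattern` and its `doc` parameter are hypothetical names (pass `this` as `doc` in Acrobat). The words are joined with spaces so that neighbouring words cannot run together into a false match.

```javascript
// Hypothetical helper: returns the 0-based numbers of pages whose text
// matches the given regular expression.
function findPagesByPattern(doc, pattern) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        // Collect the words of this page, separated by single spaces.
        var words = [];
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            words.push(doc.getPageNthWord(p, n));
        }
        // Reset lastIndex in case the regex was created with the "g" flag,
        // which would otherwise carry state from one page to the next.
        pattern.lastIndex = 0;
        if (pattern.test(words.join(" "))) {
            pages.push(p);
        }
    }
    return pages;
}

// Assumed usage in Acrobat (nine-digit pattern is an assumption):
// var pageArray = findPagesByPattern(this, /\b\d{9}\b/);
```

Note that this finds every page containing anything that looks like a routing number, not only the ones in your list, so it is more useful for discovery than for filtering against the spreadsheet.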

A variation on the technique is to collect the page numbers for each routing number separately, instead of mixing them in a single array. This can be done with an object.
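Here is one way the object-based grouping might look, as a sketch rather than a finished script. `groupPagesByString` and `doc` are illustrative names (pass `this` as `doc` in Acrobat). The page text is built the same way as in the code above, by concatenating the words of each page.

```javascript
// Hypothetical helper: returns an object whose keys are the target strings
// that were found and whose values are arrays of 0-based page numbers.
function groupPagesByString(doc, targets) {
    var groups = {};
    for (var p = 0; p < doc.numPages; p++) {
        // Collect all words on the page into one string.
        var pageText = "";
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            pageText += doc.getPageNthWord(p, n);
        }
        // Record this page under every target string it contains.
        for (var i = 0; i < targets.length; i++) {
            var key = targets[i];
            if (pageText.indexOf(key) !== -1) {
                if (!Object.prototype.hasOwnProperty.call(groups, key)) {
                    groups[key] = [];
                }
                groups[key].push(p);
            }
        }
    }
    return groups;
}

// Assumed usage in Acrobat:
// var groups = groupPagesByString(this, ["routingnumber"]);
```

Each array in the result could then be run through the same `insertPages` loop as in the original script, producing one new document per routing number.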

 

Yet another technique, which might be many times faster, is to use an Action script. Actions work on many documents at the same time. You can see an example of this technique in many of the Actions you can download here (such as "Find and Highlight words"):

https://acrobatusers.com/actions-exchange/

 

 

  

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
