Extract two pages from a multi-page PDF and rename it based on matching string using Regex.

Report · Feb 02, 2023

Hi all, I am using Adobe Acrobat Pro 2017 and I am trying to extract every two pages of a multi-page PDF. Each two-page extract contains an ID that can be found by searching once all the words on the page are put into a string. I have similar code that works, but it extracts every single page and renames it to the first 8-digit code it can find using a regular expression. Take a look at the code below and let me know what you think. Thanks!

/* Extract 2-page funding notice */

// Regular expression used to acquire the base name of file
var re = /\.pdf$/i;

// filename is the base name of the file Acrobat is working on
var filename = this.documentFileName.replace(re,"");

for (var i = 0; (i * 2) < this.numPages; i++) {  // Loop through the entire document
    numWords = this.getPageNumWords(i); // Find out how many words are on the page
    
    var WordString = ""; // Prepare a string
    
    for (var j = 0; j < numWords; j++) // Put all the words on the page into a string
    {WordString = WordString + " " + this.getPageNthWord(i,  j);}
    
    ID = WordString.match(/\b\d{8}\b/); // Search for the 8 digit ID control # in the string

    this.extractPages({
        nStart: i * 2,
        nEnd: (i * 2) + 1,
        cPath: "/J/myfilepath/" + "SBIC_" + ID + "-Fnew.pdf"
    });
   
}

This code does run, but not the way I want it to. It pulls the first 8 digit ID in the string and the last two pages of the document.

Report · Feb 02, 2023

Hello, not long after posting this question I found the answer, and it was pretty simple. Just change this line of code to the one below and you will be golden!

{WordString = WordString + " " + this.getPageNthWord((i*2), j);}
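
To see why this one-line change matters, here is a minimal sketch in plain JavaScript (no Acrobat objects involved) that lists, for each loop iteration, which page the words are read from and which pages are extracted. The helper name traceLoop and its readFactor parameter are made up for illustration.

```javascript
// Hypothetical helper: traces the loop from the original post without touching a PDF.
// readFactor = 1 mimics getPageNthWord(i, j); readFactor = 2 mimics getPageNthWord(i*2, j).
function traceLoop(numPages, readFactor) {
    var rows = [];
    for (var i = 0; (i * 2) < numPages; i++) {
        rows.push({
            readPage: i * readFactor, // page the ID is searched on
            nStart: i * 2,            // first extracted page
            nEnd: (i * 2) + 1         // second extracted page
        });
    }
    return rows;
}

var before = traceLoop(6, 1); // reads pages 0, 1, 2 while extracting 0-1, 2-3, 4-5
var after = traceLoop(6, 2);  // reads pages 0, 2, 4 while extracting 0-1, 2-3, 4-5
```

With the original line, the second extract (pages 2-3) is named after whatever ID sits on page 1, so the names drift away from the pages they belong to. With the change, the ID is always read from the first page of the pair being extracted.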

Report · Feb 02, 2023

This doesn't look right. Why are you multiplying the value of i by 2? If you want to skip a page, change the step part of the for loop to i+=2.

Also, since what you're looking for is a single word I don't see the need to add up all the text in the page. You can just test each word on its own.

You're also missing an if-condition checking that ID is not null, in case no matches are found, and a break command to stop the (inner) loop once the code has been identified and the pages extracted.
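
The null check matters because String.prototype.match returns null when nothing matches, and concatenating that into cPath would produce a file name containing the text "null". A small sketch in plain JavaScript, using a hypothetical helper name (buildOutputPath) and the folder and file-name pattern from the original post:

```javascript
// Hypothetical helper: returns the output path for a page's text,
// or null when the text holds no standalone 8-digit number.
function buildOutputPath(text) {
    var match = text.match(/\b\d{8}\b/); // null when nothing matches
    if (match === null) {
        return null; // caller should skip the extraction and report the page
    }
    return "/J/myfilepath/" + "SBIC_" + match[0] + "-Fnew.pdf";
}

var ok = buildOutputPath("Control No 12345678 Funding Notice");
var missing = buildOutputPath("No control number here");
```

Here ok holds a usable path, while missing is null, so the calling code can decide to skip that pair of pages instead of writing a misnamed file.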

Report · Feb 02, 2023

Hi try67, thanks for reaching out. I am multiplying the value of i by 2 because I need to extract every two pages from the document.

How would I go about testing each word on its own? When I tried researching ways to search for text, this was the only way that worked for me.

I have included a code snippet below with the line that tests the variable and groups the rest of the code. I'm not sure how to add a break command to stop the inner loop, though.

if (WordString.match(/\b\d{8}\b/)) { // Search for the word 8 digit SBA Control ID in the string
    search.matchWholeWord = true; // If we got here, we'll search for the 8 digit SBA Control ID in the document

Report · Feb 02, 2023

This is what I meant:

pagesLoop:
for (var i = 0; i < this.numPages; i += 2) { // Step through the document two pages at a time
    var numWords = this.getPageNumWords(i); // Find out how many words are on the first page of the pair
    for (var j = 0; j < numWords; j++) { // Test each word on the page on its own
        var WordString = this.getPageNthWord(i, j);
        if (/^\d{8}$/.test(WordString)) { // Is this word exactly the 8 digit ID control #?
            this.extractPages({
                nStart: i,
                nEnd: i + 1,
                cPath: "/J/myfilepath/" + "SBIC_" + WordString + "-Fnew.pdf"
            });
            continue pagesLoop; // ID found, move on to the next pair of pages
        }
    }
    console.println("ERROR! Could not find the ID on page " + (i + 1)); // Reached only when no word matched
}
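
One edge case worth guarding against: if the document has an odd number of pages, i+1 points past the last page on the final iteration, which may make extractPages fail. Below is a hedged sketch of the same idea split into two helpers that take the document as a parameter, so the logic can also be tried outside Acrobat with a stand-in object. The helper names findIdOnPage and extractPairs are made up; numPages, getPageNumWords, getPageNthWord and extractPages are the same Doc members used above.

```javascript
// Hypothetical helper: returns the first word on the page that is exactly
// 8 digits, or null when no such word exists.
function findIdOnPage(doc, pageIndex) {
    var numWords = doc.getPageNumWords(pageIndex);
    for (var j = 0; j < numWords; j++) {
        var word = doc.getPageNthWord(pageIndex, j);
        if (/^\d{8}$/.test(word)) {
            return word;
        }
    }
    return null;
}

// Hypothetical helper: extracts each pair of pages, clamping the last pair
// so nEnd never exceeds the final page index.
// Returns the 1-based page numbers where no ID was found.
function extractPairs(doc, folder) {
    var failed = [];
    for (var i = 0; i < doc.numPages; i += 2) {
        var id = findIdOnPage(doc, i);
        if (id === null) {
            failed.push(i + 1);
            continue;
        }
        doc.extractPages({
            nStart: i,
            nEnd: Math.min(i + 1, doc.numPages - 1),
            cPath: folder + "SBIC_" + id + "-Fnew.pdf"
        });
    }
    return failed;
}
```

Inside Acrobat this would be called as extractPairs(this, "/J/myfilepath/"), and the returned list could be printed with console.println to see which pairs were skipped.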

Report · Feb 02, 2023

Another, much more efficient way to do this is to use the Redaction pattern search. This search places a redact annot over all matching text, and it does so very quickly compared to JS word searches. Then the script only needs to get the locations of the redact annots, which can be deleted after collecting the naming data. To do this, though, you'll need to create a custom search pattern. Search patterns are defined in this file:

C:\Users\<user name>\AppData\Roaming\Adobe\Acrobat\DC\Redaction\ENU\SearchRedactPatterns.xml

The redaction search can be used in a batch process as the first step, then the extraction script as the second step.
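
As a rough sketch of the second step, assuming the redaction search has already marked the IDs: the script could gather the page and quads of every redact annot and then delete it. collectRedactMarks is a hypothetical name. getAnnots, type, page, quads and destroy are used here as I understand them from the Acrobat JavaScript annotation API, so check them against the reference before relying on this. Reading the marked text back from the quads is left out.

```javascript
// Hypothetical helper: collects the location of every redact annot in the
// document and, when removeAfter is true, deletes each one after recording it.
function collectRedactMarks(doc, removeAfter) {
    var marks = [];
    var annots = doc.getAnnots(); // assumed to return null when there are no annots
    if (annots === null || annots === undefined) {
        return marks;
    }
    for (var k = 0; k < annots.length; k++) {
        var annot = annots[k];
        if (annot.type !== "Redact") {
            continue; // leave other annotation types alone
        }
        marks.push({ page: annot.page, quads: annot.quads });
        if (removeAfter) {
            annot.destroy();
        }
    }
    return marks;
}
```

Inside Acrobat this would be called as collectRedactMarks(this, true). Each returned entry gives the 0-based page of a match, which is enough to decide where each two-page extract starts.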

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
