JavaScript Search for Text within a PDF

Report · Jan 20, 2018

Hello. I found a script that extracts pages based on their content. I am trying to extract pages based on "Page 1 of 1" and "Page 1 of 2 & Page 2 of 2". I cannot figure out what to put in the search line; ("page", "1", "of", and "1") doesn't work. Any help would be appreciated. I really don't have much programming experience. I'm researching the JavaScript documentation, but it hasn't been much help. I'm so close...

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";

for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}

if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc(); // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
        } );
    }
    // remove the first page
    d.deletePages(0);
}

Report · Jan 20, 2018

Have you tried to see what your search script is finding and testing by adding a "console.println" to display each word found as the script searches the page?

As I understand it, "this.getPageNthWord(p, n)" returns the "n"th word on page "p". It appears you are looking for the four words "Page", "1", "of", "1". In my experience you need to search for all four words, including the three word-separating spaces between them. Please review the Acrobat JavaScript documentation for the "getPageNthWord" method.
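One more thing worth checking in the search line itself: inside a quoted JavaScript string, "\s" is not a whitespace wildcard. The backslash is simply dropped, so the script ends up comparing each word against the literal text "pages1sofs1". "\s" only means "whitespace" inside a regular expression. A minimal illustration (the variable names are made up for this example):

```javascript
// A quoted string: "\s" is an unrecognized escape, so it collapses to a plain "s".
var asString = "page\s1\sof\s1";

// A regular expression literal: here \s matches one whitespace character.
var asRegExp = /page\s1\sof\s1/i;

var sample = "Page 1 of 1";
var stringMatches = (sample == asString);   // false
var regExpMatches = asRegExp.test(sample);  // true
```

Even with a regular expression, the comparison still has to run against text that contains the whole phrase, not against one word at a time.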

Report · Jan 20, 2018

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";

for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
        console.println
    }
}

if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc(); // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
        } );
    }
    // remove the first page
    d.deletePages(0);
}

Was that okay for inserting console.println?

Report · Jan 20, 2018

I would have added the statement inside the first loop to display each word that was found in the document for the comparison to the string of words "page 1 of 1".

console.clear();
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";

for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        // show each word as it is read, before comparing it
        console.println("Page: " + p + " word " + n + " is " + this.getPageNthWord(p, n));
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            console.println("Match found");
            pageArray.push(p);
            break;
        }
    }
}

Now my results list one word at a time, so no single word will match your string of 4 words.

You need to make a string of 4 words in a row including the word separator between the first 3 words for the comparison to work.
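A rough sketch of that idea, not a tested Acrobat script: instead of building one string with separators, it compares each run of consecutive words against the words of the phrase. The function name pageHasPhrase and the doc parameter are made up for this example; doc would be the open document (for instance "this" in the console). It assumes that getPageNthWord, called with its third argument (bStrip) set to true, returns each word without surrounding whitespace or punctuation, and that numbers such as "1" are reported as words.

```javascript
// Returns true if the words of `phrase` appear consecutively on page p.
// `doc` is assumed to be an Acrobat Doc object (hypothetical parameter name).
function pageHasPhrase(doc, p, phrase) {
    var target = phrase.toLowerCase().split(" ");
    var count = doc.getPageNumWords(p);
    // slide a window of target.length words across the page
    for (var n = 0; n + target.length <= count; n++) {
        var matched = true;
        for (var k = 0; k < target.length; k++) {
            // bStrip = true: ask for the word without whitespace/punctuation
            var word = String(doc.getPageNthWord(p, n + k, true));
            if (word.toLowerCase() !== target[k]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            return true;
        }
    }
    return false;
}
```

In the original script, the inner word loop could then be replaced by a single test such as: if (pageHasPhrase(this, p, "Page 1 of 1")) pageArray.push(p);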

Sample PDF with single and group word search.

Report · Jan 21, 2018

GKaiseril is correct. The function that acquires page text only returns one word at a time. If you want to detect phrases, you'll need to collect all the words on a page into a single string and search it for the phrase.
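Collecting the words and searching the result could look roughly like the sketch below. The helper names getPageText and findPagesMatching are invented for this example, and doc stands for the open document ("this" in the console). It assumes getPageNthWord with bStrip set to true returns bare words, so joining them with single spaces gives text like "Page 1 of 1". The pattern should be a regular expression without the "g" flag, since test() on a global pattern keeps state between calls.

```javascript
// Joins every word on page p into one space-separated string.
function getPageText(doc, p) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var n = 0; n < count; n++) {
        words.push(doc.getPageNthWord(p, n, true));
    }
    return words.join(" ");
}

// Returns the zero-based numbers of all pages whose text matches `pattern`.
function findPagesMatching(doc, pattern) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        if (pattern.test(getPageText(doc, p))) {
            pages.push(p);
        }
    }
    return pages;
}
```

For the pages asked about here, the calls might be findPagesMatching(this, /\bpage\s+1\s+of\s+1\b/i) for single-page items and findPagesMatching(this, /\bpage\s+[12]\s+of\s+2\b/i) for the two-page ones. The trailing \b keeps "Page 1 of 10" from counting as "Page 1 of 1".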

Or, a much simpler and more efficient solution is to use the Redact find tool to mark the phrases with a redact annotation. Then extract pages that contain the annots, and then delete the annots.

In fact, I created exactly this type of solution for the free search and highlight Action here:

https://acrobatusers.com/actions-exchange

Also on this page you'll find the "Extract Commented Pages" Action. If you run these two Actions back to back, you get exactly what you want. And if you can program, then you can extract and combine the scripts into a single tool.
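For anyone who wants to script the middle step of the redaction approach, here is a hedged sketch of finding the pages that carry a given kind of annotation. It is not the code from the Actions mentioned above. The function name pagesWithAnnots is made up, doc stands for the open document, and the sketch assumes that getAnnots may return null when a page has no annotations and that redaction marks report their type as "Redact"; check both against the Acrobat JavaScript Reference before relying on them.

```javascript
// Returns the zero-based numbers of pages that have at least one
// annotation whose type equals annotType (assumed "Redact" for redaction marks).
function pagesWithAnnots(doc, annotType) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        // fall back to an empty list if the page has no annotations
        var annots = doc.getAnnots({ nPage: p }) || [];
        for (var i = 0; i < annots.length; i++) {
            if (annots[i].type === annotType) {
                pages.push(p);
                break; // one hit is enough for this page
            }
        }
    }
    return pages;
}
```

The resulting page list can be fed into the same insertPages loop used in the original script, and the marks can be removed afterwards so they are never applied as real redactions.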

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often