DanielLopez0
Participant
January 20, 2018
Question

Javascript Search for Text within a PDF


Hello. I found a script that extracts pages based on their content. I am trying to extract pages that contain "Page 1 of 1", and also pages that contain "Page 1 of 2" and "Page 2 of 2". I cannot figure out what to put in the search line; ("page", "1", "of", and "1") doesn't work. Any help would be appreciated. I don't have much programming experience. I have been reading the JavaScript documentation, but it hasn't been much help. I'm so close...

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}
if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],    // insert one matching page per iteration
            nEnd: pageArray[n]
        } );
    }
    // remove the first page
    d.deletePages(0);
}


2 replies

DanielLopez0
Participant
January 20, 2018

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}
console.println
if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],    // insert one matching page per iteration
            nEnd: pageArray[n]
        } );
    }
    // remove the first page
    d.deletePages(0);
}

Was that okay for inserting console.println?

Inspiring
January 20, 2018

I would have added the statement inside the first loop to display each word found in the document, so it can be compared with the string of words "page 1 of 1".

console.clear();
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        console.println("Page: " + p + " word " + n + " is " + this.getPageNthWord(p, n));
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            console.println("Match found");
            pageArray.push(p);
            break;
        }
    }
}

Now my results list one word at a time, so no single word will match your string of four words.

You need to build a string from four consecutive words, with the word separator after each of the first three, for the comparison to work.
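One way to do that comparison, as a rough sketch rather than a tested Acrobat script: check the four words one position at a time instead of comparing a single word to the whole phrase. The function name findPagesWithPhrase is made up for this example. It assumes Acrobat reports "Page 1 of 1" as the four separate words "Page", "1", "of", "1" (the console output from the loop above should confirm that), and it ignores upper/lower case.

```javascript
// Hypothetical helper (not part of the Acrobat API).
// Returns the 0-based numbers of the pages on which the words in
// phraseWords appear one after another. In Acrobat, pass `this` as doc.
function findPagesWithPhrase(doc, phraseWords) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        // stop while a complete phrase still fits on the page
        for (var n = 0; n + phraseWords.length <= numWords; n++) {
            var matched = true;
            for (var k = 0; k < phraseWords.length; k++) {
                // third argument true: strip punctuation and whitespace
                var word = String(doc.getPageNthWord(p, n + k, true));
                if (word.toLowerCase() !== phraseWords[k].toLowerCase()) {
                    matched = false;
                    break;
                }
            }
            if (matched) {
                pages.push(p);
                break;    // one hit is enough for this page
            }
        }
    }
    return pages;
}
```

In the original script this would replace the first loop, for example: var pageArray = findPagesWithPhrase(this, ["page", "1", "of", "1"]);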

Sample PDF with single and group word search.

Inspiring
January 20, 2018

Have you tried adding a "console.println" to display each word as the script searches the page, so you can see what your search script is finding and testing?

As I understand it, "this.getPageNthWord(p, n)" returns the "n"th word on page "p". It appears you are looking for the four words "Page", "1", "of", "1". In my experience you need to search for all four words, including the three spaces that separate them. Please review the Acrobat JavaScript documentation for the "getPageNthWord" method.
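A note on the search string itself, with a sketch of another approach. Inside a quoted JavaScript string, "\s" is not a whitespace pattern; the backslash is simply dropped, so "page\s1\sof\s1" is the same string as "pages1sofs1". "\s" only means whitespace inside a regular expression. The sketch below joins the words of each page with single spaces and tests the result against a regular expression. The helper names getPageText and findPagesMatching are invented for this example, and it again assumes the page footer comes back from getPageNthWord as separate words.

```javascript
// Hypothetical helpers (not part of the Acrobat API).
// Joins every word on page p into one space-separated string.
function getPageText(doc, p) {
    var words = [];
    for (var n = 0; n < doc.getPageNumWords(p); n++) {
        words.push(doc.getPageNthWord(p, n, true));
    }
    return words.join(" ");
}

// Returns the 0-based numbers of the pages whose text matches pattern.
// Use a pattern without the "g" flag, so test() always starts at the
// beginning of the string.
function findPagesMatching(doc, pattern) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        if (pattern.test(getPageText(doc, p))) {
            pages.push(p);
        }
    }
    return pages;
}

// Patterns for the two cases in the question; "i" ignores case and
// "\b" keeps "1 of 1" from matching inside "1 of 10".
var singlePagePattern = /\bpage\s+1\s+of\s+1\b/i;
var twoPagePattern = /\bpage\s+[12]\s+of\s+2\b/i;
```

In Acrobat the call would be something like var pageArray = findPagesMatching(this, singlePagePattern); the rest of the extraction script can stay as it is.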