
Javascript Search for Text within a PDF

New Here, Jan 20, 2018

Hello. I found a script that extracts pages based on their content. I am trying to extract pages that contain "Page 1 of 1", as well as those that contain "Page 1 of 2" and "Page 2 of 2". I cannot figure out what to put in the search line; ("page", "1", "of", "1") doesn't work. Any help would be appreciated. I don't have much programming experience. I've been reading the JavaScript documentation, but it hasn't helped much. I'm so close...

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}
if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],   // insert one matching page at a time
            nEnd: pageArray[n]
        } );
    }
    // remove the first page
    d.deletePages(0);
}

LEGEND, Jan 20, 2018

Have you tried checking what your search script is finding and testing, by adding a "console.println" call to display each word as the script searches the page?

As I understand it, "this.getPageNthWord(p, n)" returns the "n"th word on page "p". It appears you are looking for the four words "Page", "1", "of", "1". In my experience, you need to search for all four words, including the three separators between them. Please review the Acrobat JavaScript documentation for the "getPageNthWord" method.

New Here, Jan 20, 2018

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}
console.println
if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],   // insert one matching page at a time
            nEnd: pageArray[n]
        } );
    }
    // remove the first page
    d.deletePages(0);
}

Was that the right way to insert console.println?

LEGEND, Jan 20, 2018

I would have added the statement inside the first loop, to display each word found in the document as it is compared to the string of words "page 1 of 1".

console.clear();
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        console.println("Page: " + p + " word " + n + " is " + this.getPageNthWord(p, n));
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            console.println("Match found");
            pageArray.push(p);
            break;
        }
    }
}

Now my results list one word at a time, so no single word will match your string of four words.

You need to build a string of four consecutive words, including the word separators between them, for the comparison to work.
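
One way to make that comparison, sketched here as an illustration rather than a tested Acrobat script, is to slide a window over consecutive words and compare them one by one. The function name and the `doc` parameter are hypothetical; `doc` is assumed to expose `numPages`, `getPageNumWords` and `getPageNthWord` the way `this` does in the scripts above. The sketch also assumes that Acrobat reports "Page", "1", "of" and "1" as four separate words with punctuation and white space stripped (the documented default when the third `bStrip` argument is omitted), which the console output from the loop above should confirm.

```javascript
// Hypothetical helper: returns the 0-based page numbers on which the
// words of `phrase` appear consecutively, ignoring case.
// `doc` is any object with numPages, getPageNumWords(p) and
// getPageNthWord(p, n); in Acrobat that would be `this`.
function findPagesWithWordSequence(doc, phrase) {
    // "page 1 of 1" -> ["page", "1", "of", "1"]
    var target = phrase.toLowerCase().split(/\s+/);
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var count = doc.getPageNumWords(p);
        // n is the index of the first word of the candidate window
        for (var n = 0; n + target.length <= count; n++) {
            var matched = true;
            for (var k = 0; k < target.length; k++) {
                var word = String(doc.getPageNthWord(p, n + k)).toLowerCase();
                if (word !== target[k]) {
                    matched = false;
                    break;
                }
            }
            if (matched) {
                pages.push(p);
                break;      // one hit per page is enough
            }
        }
    }
    return pages;
}
```

From the Acrobat console this would be called as `findPagesWithWordSequence(this, "page 1 of 1")`, and the returned array could take the place of `pageArray` in the extraction step.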

Sample PDF with single and group word search.

Community Expert, Jan 21, 2018

GKaiseril is correct. The function that acquires page text only returns one word at a time. If you want to detect phrases, you'll need to collect all the words on a page into a single string and search it for the phrase.
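
A minimal sketch of that idea follows; it is not code from the Actions mentioned in this thread. The helper names and the `doc` parameter are hypothetical, and `doc` is assumed to behave like `this` in the original script. It joins the words with single spaces and tests the result against a regular expression literal. A literal matters here: in a quoted string such as "page\s1\sof\s1" the backslashes are dropped, so the value becomes "pages1sofs1" and `\s` never acts as a white space class.

```javascript
// Hypothetical helper: all words on page p joined into one string,
// separated by single spaces.
function getPageText(doc, p) {
    var words = [];
    var count = doc.getPageNumWords(p);
    for (var n = 0; n < count; n++) {
        words.push(doc.getPageNthWord(p, n));
    }
    return words.join(" ");
}

// Hypothetical helper: 0-based page numbers whose joined text matches
// `pattern`. Pass a regular expression without the "g" flag, so that
// test() does not carry state from one page to the next.
function findPagesMatching(doc, pattern) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        if (pattern.test(getPageText(doc, p))) {
            pages.push(p);
        }
    }
    return pages;
}

// Possible calls from the Acrobat console (assumed usage):
//   findPagesMatching(this, /\bpage 1 of 1\b/i);      // single-page items
//   findPagesMatching(this, /\bpage [12] of 2\b/i);   // two-page items
```

The trailing `\b` keeps "Page 1 of 1" from also matching "Page 1 of 12". Joining with single spaces assumes the words come back stripped of surrounding white space, which is the default behavior of `getPageNthWord`.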

Or, a much simpler and more efficient solution is to use the Redact find tool to mark the phrases with a redact annotation. Then extract pages that contain the annots, and then delete the annots.

In fact, I created exactly this type of solution for the free search and highlight Action here:

https://acrobatusers.com/actions-exchange

Also on this page you'll find the "Extract Commented Pages" Action. If you run these two Actions back to back, you get exactly what you want. And if you can program, then you can extract and combine the scripts into a single tool.
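
For readers who want to see the shape of the annotation-based approach, here is a rough sketch of its last two steps: finding the pages that carry a given annotation type, then deleting those annotations. It is not the code from the Actions above. The function names and the `doc` parameter are hypothetical, and the sketch assumes that the phrases have already been marked with the Redact find tool, that `getAnnots({ nPage: p })` returns null for a page without annotations, and that the marks report their `type` as "Redact".

```javascript
// Hypothetical helper: 0-based page numbers that contain at least one
// annotation of the given type (assumed to be "Redact" for redaction marks).
function findPagesWithAnnots(doc, annotType) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        // Assumption: getAnnots returns null when the page has no annotations
        var annots = doc.getAnnots({ nPage: p }) || [];
        for (var i = 0; i < annots.length; i++) {
            if (annots[i].type === annotType) {
                pages.push(p);
                break;
            }
        }
    }
    return pages;
}

// Hypothetical helper: deletes every annotation of the given type and
// returns how many were removed.
function removeAnnots(doc, annotType) {
    var removed = 0;
    for (var p = 0; p < doc.numPages; p++) {
        var annots = doc.getAnnots({ nPage: p }) || [];
        for (var i = 0; i < annots.length; i++) {
            if (annots[i].type === annotType) {
                annots[i].destroy();
                removed++;
            }
        }
    }
    return removed;
}
```

The page list from `findPagesWithAnnots(this, "Redact")` could feed the same `insertPages` loop as the original script, with `removeAnnots` run afterwards on the new document so the marks are never applied as actual redactions.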

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
