JavaScript Search for Text within a PDF

New Here, Jan 20, 2018


Hello. I found a script that extracts pages based on their content. I am trying to extract the pages marked "Page 1 of 1", as well as the pages marked "Page 1 of 2" and "Page 2 of 2". I cannot figure out what to put in the search line; ("page", "1", "of", "1") doesn't work. Any help would be appreciated. I don't have much programming experience. I've been reading the JavaScript documentation, but it hasn't helped much. I'm so close...

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}
if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
        } );
    }
    // remove the first page
    d.deletePages(0);
}

LEGEND, Jan 20, 2018


Have you tried adding a "console.println" to see what your script actually finds and tests, so that each word is displayed as the script searches the page?

As I understand it, "this.getPageNthWord(p, n)" returns the "n"th word on page "p". It appears you are looking for the four words "Page", "1", "of", "1". In my experience you need to search for all four words, including the three spaces that separate them. Please review the Acrobat JavaScript documentation for the "getPageNthWord" method.
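To illustrate the point about one word per call, here is a minimal sketch that checks whether four consecutive words match, one word at a time. The helper name containsWordSequence and the sample data are made up for illustration; inside Acrobat the words array would be filled from this.getPageNthWord(p, n), as the comment shows.

```javascript
// Hypothetical helper: returns true when `words` contains every entry of
// `phrase` as consecutive items, ignoring upper/lower case.
function containsWordSequence(words, phrase) {
    for (var i = 0; i + phrase.length <= words.length; i++) {
        var matched = true;
        for (var j = 0; j < phrase.length; j++) {
            if (String(words[i + j]).toLowerCase() !== String(phrase[j]).toLowerCase()) {
                matched = false;
                break;
            }
        }
        if (matched) {
            return true;
        }
    }
    return false;
}

// Inside Acrobat, the words of page p could be gathered like this:
// var words = [];
// for (var n = 0; n < this.getPageNumWords(p); n++) {
//     words.push(this.getPageNthWord(p, n));
// }
var sampleWords = ["Invoice", "Page", "1", "of", "1"];   // made-up sample page
var found = containsWordSequence(sampleWords, ["page", "1", "of", "1"]);
```

Comparing word by word avoids having to guess which separator Acrobat places between the words.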

New Here, Jan 20, 2018


// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}
console.println
if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
        } );
    }
    // remove the first page
    d.deletePages(0);
}

Was that the right place to insert console.println?

LEGEND, Jan 20, 2018


I would have added the statement inside the first loop, so that each word found in the document is displayed as it is compared to the string of words "page 1 of 1".

console.clear();
var pageArray = [];
var stringToSearchFor = "page\s1\sof\s1";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        console.println("Page: " + p + " word " + n + " is " + this.getPageNthWord(p, n));
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            console.println("Match found");
            pageArray.push(p);
            break;
        }
    }
}

My results now list one word at a time, so no single word will ever match your string of four words.

For the comparison to work, you need to build a string from four consecutive words, including the word separator after each of the first three words.

Sample PDF with single and group word search.
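As a rough sketch of that idea, the snippet below joins each run of consecutive words with a single space and compares the result to the phrase. The name findPhraseStart and the sample words are hypothetical, and the sketch assumes getPageNthWord is called with its default stripping, so the words arrive without punctuation or trailing spaces. It also shows why the original search string can never match: in a string literal "\s" is not a whitespace pattern, it simply becomes the letter "s".

```javascript
// "\s" is not a recognised escape in a string literal, so the backslash is dropped.
var literal = "page\s1\sof\s1";   // the stored value is "pages1sofs1"

// Hypothetical helper: returns the index of the first word of `phrase`
// inside `words`, or -1 when the phrase does not occur.
function findPhraseStart(words, phrase) {
    var target = phrase.toLowerCase();
    var count = target.split(" ").length;   // number of words in the phrase
    for (var i = 0; i + count <= words.length; i++) {
        // Join `count` consecutive words with a single space as separator.
        var candidate = words.slice(i, i + count).join(" ").toLowerCase();
        if (candidate === target) {
            return i;
        }
    }
    return -1;
}

var sampleWords = ["Report", "Page", "1", "of", "2"];   // made-up sample page
var start = findPhraseStart(sampleWords, "page 1 of 2");
```

In the extraction script, a page would be pushed onto pageArray whenever the helper returns something other than -1.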

Community Expert, Jan 21, 2018


GKaiseril is correct. The function that acquires page text returns only one word at a time. To detect phrases, you'll need to collect all the words on a page into a single string and search that string for the phrase.
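A minimal sketch of the collect-then-search approach, assuming each page's words have already been gathered into an array (in Acrobat, via getPageNumWords and getPageNthWord). The names findMatchingPages, pagePattern, and samplePages are invented for the example, and the regular expression is only one way to cover "Page 1 of 1", "Page 1 of 2", and "Page 2 of 2".

```javascript
// Hypothetical helper: `pagesWords` holds one array of words per page.
// Returns the 0-based numbers of the pages whose joined text matches `pattern`.
function findMatchingPages(pagesWords, pattern) {
    var matches = [];
    for (var p = 0; p < pagesWords.length; p++) {
        var pageText = pagesWords[p].join(" ");   // all words on the page as one string
        if (pattern.test(pageText)) {
            matches.push(p);
        }
    }
    return matches;
}

// A regular expression literal (not a quoted string), so \s really means whitespace.
// No "g" flag, because test() on a global pattern remembers its last position.
var pagePattern = /\bpage\s+(1\s+of\s+1|[12]\s+of\s+2)\b/i;

var samplePages = [                           // made-up sample data
    ["Invoice", "Page", "1", "of", "1"],
    ["Report", "Page", "1", "of", "3"],
    ["Report", "Page", "2", "of", "2"],
    ["Summary", "Page", "1", "of", "10"]
];
var matchingPages = findMatchingPages(samplePages, pagePattern);   // [0, 2]
```

The trailing \b keeps "Page 1 of 10" from being mistaken for "Page 1 of 1".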

Or, a much simpler and more efficient solution is to use the Redact find tool to mark the phrases with redact annotations. Then extract the pages that contain those annots, and finally delete the annots.

In fact, I created exactly this type of solution for the free search and highlight Action here:

https://acrobatusers.com/actions-exchange

Also on this page you'll find the "Extract Commented Pages" Action. If you run these two Actions back to back, you get exactly what you want. And if you can program, then you can extract and combine the scripts into a single tool.
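For anyone combining the two steps by hand, here is a rough sketch of the "find the pages that contain the annots" part. It is not the code from the Actions mentioned above. It assumes the document object offers numPages and getAnnots({ nPage: p }), that getAnnots returns null for a page without annotations, and that redaction marks report the type "Redact"; please verify all three against the Acrobat JavaScript Reference. The helper name pagesWithAnnotType and the mock document are invented so the sketch can run outside Acrobat.

```javascript
// Hypothetical helper: lists the 0-based pages that carry at least one
// annotation of the given type. `doc` is expected to look like an Acrobat
// Doc object (numPages, getAnnots).
function pagesWithAnnotType(doc, annotType) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var annots = doc.getAnnots({ nPage: p });   // assumed: null when the page has none
        if (!annots) {
            continue;
        }
        for (var i = 0; i < annots.length; i++) {
            if (annots[i].type === annotType) {
                pages.push(p);
                break;                              // one hit per page is enough
            }
        }
    }
    return pages;
}

// Stand-in document, for illustration only.
var mockDoc = {
    numPages: 3,
    getAnnots: function (opts) {
        var byPage = {
            0: [{ type: "Redact" }],
            2: [{ type: "Highlight" }, { type: "Redact" }]
        };
        return byPage[opts.nPage] || null;
    }
};
var redactPages = pagesWithAnnotType(mockDoc, "Redact");
```

Inside Acrobat the call would be pagesWithAnnotType(this, "Redact"), and the resulting page numbers could feed the same insertPages loop used in the original script.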

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
