September 16, 2016
Question

Extract certain pages from a document based on key words


Hi everyone,

I am trying to extract pages from a large document based on certain keywords. If a keyword is found on a page, that page number is pushed to an array, and the array is then used to create a new document. However, my script behaves very inconsistently and does not seem able to create multiple new documents. Please note: I found almost all of this script online, written by someone else, and I am trying to adapt it to my purposes.

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var pageA = [];
var stringToSearchFor = "keyword1";
var stringToSearch = "keyword2";

for (var p = 0; p < this.numPages; p++) {
  // iterate over all words
  for (var n = 0; n < this.getPageNumWords(p); n++) {
    if (this.getPageNthWord(p, n) == stringToSearchFor) {
      pageArray.push(p);
      break;
    }
    else if (this.getPageNthWord(p, n) == stringToSearch) {
      pageA.push(p);
      break;
    }
  }
}

console.println("Test 2 of pageArray " + pageArray);

if (pageArray.length > 0) {
  // extract all pages that contain the string into a new document
  var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
  for (var n = 0; n < pageArray.length; n++) {
    d.insertPages( {
      nPage: d.numPages-1,
      cPath: this.path,
      nStart: pageArray,
      nEnd: pageArray,
    } );
    console.println(n + " pageArray " + pageArray)
  }
  // remove the first page
  d.deletePages(0);
}

if (pageA.length > 0) {
  // extract all pages that contain the string into a new document
  var q = app.newDoc();    // this will add a blank page - we need to remove that once we are done
  for (var n = 0; n < pageA.length; n++) {
    q.insertPages( {
      nPage: q.numPages-1,
      cPath: this.path,
      nStart: pageA,
      nEnd: pageA,
    } );
    console.println(n + " pageA " + pageA)
  }
  console.println(pageA)
  // remove the first page
}

Thanks!

-Forrest


try67
Adobe Expert
September 16, 2016

Is the issue that some pages that contain both words only appear in one of the final files?

By the way, you're missing the command to delete the first page of the second file, after generating it.

September 16, 2016

Thanks for the quick response. Unfortunately, no. I am using this script to sort invoices, so the keyword I am searching for is the vendor's name, and two vendors' names will not appear on the same page.

And thanks for pointing that out. I had left that line out as a troubleshooting step. Oddly enough, the script seems to work for certain words but not others, even though I can find both words by searching the document (Cmd+F). Very confusing.

try67
Adobe Expert
September 16, 2016

You seem to be describing different kinds of issues. One is with the detection of the keywords, another with the extraction of the pages to the new file (if I understood correctly). These are unrelated issues. You should focus on each one of them separately and try to solve it.

Start by disabling the extraction process. Print to the console the list of pages for each search term. If they are not correct, investigate further. If a page that is supposed to appear in the list doesn't, go back to that page and print out all the words in it, and try to find out what the issue is.
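The detection-only step described above could be sketched roughly like this. The helper names findPagesWithWord and listPageWords are made up for illustration and are not part of the Acrobat API; they only rely on numPages, getPageNumWords and getPageNthWord, which the original script already uses. Taking the document as a parameter instead of using "this" is a choice made here so the helpers can be tried against any document object.

```javascript
// Hypothetical helper: returns the 0-based page numbers on which `word`
// appears as a single extracted word. No pages are extracted here.
function findPagesWithWord(doc, word) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    var count = doc.getPageNumWords(p);
    for (var n = 0; n < count; n++) {
      if (doc.getPageNthWord(p, n) == word) {
        pages.push(p);
        break; // one hit is enough for this page
      }
    }
  }
  return pages;
}

// Hypothetical helper: returns every word the page yields, in order,
// so a missing match can be compared against what is actually extracted.
function listPageWords(doc, pageNum) {
  var words = [];
  var count = doc.getPageNumWords(pageNum);
  for (var n = 0; n < count; n++) {
    words.push(doc.getPageNthWord(pageNum, n));
  }
  return words;
}

// Possible use in the Acrobat JavaScript console (not run here):
//   console.println("keyword1: " + findPagesWithWord(this, "keyword1"));
//   console.println("keyword2: " + findPagesWithWord(this, "keyword2"));
//   console.println(listPageWords(this, 3).join(" | "));
```

If a page is missing from the list, the word dump for that page may show, for example, that the search term differs in case or is split across several words. Since getPageNthWord returns one word at a time, a multi-word vendor name would never equal a single extracted word, though whether that is the cause here is only a guess.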

This is how you debug code: You focus on a specific issue and eliminate causes until you find the cause of the problem, and then look for a solution for it. Then you move on to the next issue.


I'm seeing a potential bug in your code that might cause all kinds of strange behaviors and that will be very difficult to spot if you don't know to look for it.

You should not use the "this" keyword after you create a new document, as it will probably point to that document instead of to the original one. Instead you should keep a separate reference to the original file, something like this as the first line of your code:

var originalDoc = this;

Then replace all instances of "this" in your code with "originalDoc".
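As a rough sketch of how the extraction part might look with that change applied: the helper name extractPagesTo is made up for illustration, and passing one page number at a time as nStart and nEnd is an assumption about what the original script intended (the posted code passes the whole array). It also deletes the blank first page, which the second block in the posted script does not do.

```javascript
// Hypothetical helper: copies the listed pages (0-based) from the file at
// sourcePath into targetDoc, then removes the blank page that
// app.newDoc() puts at the front of a new document.
function extractPagesTo(targetDoc, sourcePath, pageNums) {
  for (var i = 0; i < pageNums.length; i++) {
    targetDoc.insertPages({
      nPage: targetDoc.numPages - 1, // insert after the current last page
      cPath: sourcePath,
      nStart: pageNums[i],           // assumption: one page per call
      nEnd: pageNums[i]
    });
  }
  targetDoc.deletePages(0); // remove the initial blank page
}

// Possible use in Acrobat (not run here). The reference to the original
// document is saved before any new document is created:
//   var originalDoc = this;
//   if (pageArray.length > 0) {
//     extractPagesTo(app.newDoc(), originalDoc.path, pageArray);
//   }
//   if (pageA.length > 0) {
//     extractPagesTo(app.newDoc(), originalDoc.path, pageA);
//   }
```

Because the source path is read from originalDoc rather than from "this", it should not matter which document is active by the time the second file is built.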