
Extract certain pages from a document based on key words

Guest
Sep 16, 2016

Hi everyone,

I am trying to extract pages from a large document based on certain keywords. If a keyword is found on a page, that page number is pushed to an array, and the array is then used to create a new document. However, my script behaves very inconsistently and does not seem able to create multiple new documents. Please note that I found almost all of this script online, written by someone else, and I am trying to adapt it to my purposes.

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var pageA = [];
var stringToSearchFor = "keyword1";
var stringToSearch = "keyword2";

for (var p = 0; p < this.numPages; p++) {
  // iterate over all words
  for (var n = 0; n < this.getPageNumWords(p); n++) {
    if (this.getPageNthWord(p, n) == stringToSearchFor) {
      pageArray.push(p);
      break;
    }
    else if (this.getPageNthWord(p, n) == stringToSearch) {
      pageA.push(p);
      break;
    }
  }
}

console.println("Test 2 of pageArray " + pageArray);

if (pageArray.length > 0) {
  // extract all pages that contain the string into a new document
  var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
  for (var n = 0; n < pageArray.length; n++) {
    d.insertPages( {
      nPage: d.numPages-1,
      cPath: this.path,
      nStart: pageArray,
      nEnd: pageArray,
    } );
    console.println(n + " pageArray " + pageArray)
  }
  // remove the first page
  d.deletePages(0);
}

if (pageA.length > 0) {
  // extract all pages that contain the string into a new document
  var q = app.newDoc();    // this will add a blank page - we need to remove that once we are done
  for (var n = 0; n < pageA.length; n++) {
    q.insertPages( {
      nPage: q.numPages-1,
      cPath: this.path,
      nStart: pageA,
      nEnd: pageA,
    } );
    console.println(n + " pageA " + pageA)
  }
  console.println(pageA)
  // remove the first page
}

Thanks!

-Forrest

TOPICS
Acrobat SDK and JavaScript
Community Expert
Sep 16, 2016

Is the issue that some pages that contain both words only appear in one of the final files?

By the way, you're missing the command to delete the first page of the second file, after generating it.
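
The `else if` and `break` in the posted loop stop scanning a page as soon as either keyword matches, so a page containing both keywords can only ever land in one of the two lists. A minimal sketch of searching for each keyword independently is shown here. The helper names `pageContainsWord` and `findPagesForKeywords` are hypothetical, and `doc` is assumed to be any object that exposes `numPages`, `getPageNumWords` and `getPageNthWord` the way an Acrobat Doc object does.

```javascript
// Hypothetical helper: true if any word on the page equals the keyword exactly.
function pageContainsWord(doc, pageNum, keyword) {
  var numWords = doc.getPageNumWords(pageNum);
  for (var n = 0; n < numWords; n++) {
    if (doc.getPageNthWord(pageNum, n) === keyword) {
      return true;
    }
  }
  return false;
}

// Hypothetical helper: returns one array of 0-based page numbers per keyword,
// in the same order as the keywords array. Each keyword is searched on its
// own, so a page that contains two keywords appears in both lists.
function findPagesForKeywords(doc, keywords) {
  var pagesPerKeyword = [];
  for (var k = 0; k < keywords.length; k++) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
      if (pageContainsWord(doc, p, keywords[k])) {
        pages.push(p);
      }
    }
    pagesPerKeyword.push(pages);
  }
  return pagesPerKeyword;
}
```

In the Acrobat console this could be called as `findPagesForKeywords(this, ["keyword1", "keyword2"])`. The first element of the result then plays the role of `pageArray` and the second the role of `pageA`.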

Guest
Sep 16, 2016

Thanks for the quick response. Unfortunately, no. I am using this script as part of a way to sort invoices, so the keyword I am searching for is the vendor's name, and two vendors' names will not appear on the same page.

And thanks for pointing that out. I had removed it as a troubleshooting step. Oddly enough, the script seems to work for certain words but not others, even though I can find both words by searching the document (cmd + f). Very confusing.

Guest
Sep 16, 2016

I should also point out that I added the console.println() calls to check that the arrays have values, which both of them do. So I think the issue may have something to do with the newDoc creation?

Community Expert
Sep 16, 2016

You seem to be describing different kinds of issues. One is with the detection of the keywords, another with the extraction of the pages to the new file (if I understood correctly). These are unrelated issues. You should focus on each one of them separately and try to solve it.

Start by disabling the extraction process. Print to the console the list of pages for each search term. If they are not correct, investigate further. If a page that is supposed to appear in the list doesn't, go back to that page and print out all the words in it, and try to find out what the issue is.
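
The "print out all the words" step could be sketched as follows. `getPageWords` and `pageContainsPhrase` are hypothetical helper names, not part of the Acrobat API. The sketch assumes that a mismatch comes from how the page text is split into words: `getPageNthWord` returns one word at a time (by default with punctuation stripped), and `==` is case-sensitive, so a vendor name that contains a space, or that differs in case, would never equal a single returned word even though cmd + f finds it.

```javascript
// Hypothetical helper: collects every word Acrobat reports for one page,
// so the list can be printed and compared with the search term by eye.
function getPageWords(doc, pageNum) {
  var words = [];
  var numWords = doc.getPageNumWords(pageNum);
  for (var n = 0; n < numWords; n++) {
    words.push(doc.getPageNthWord(pageNum, n));
  }
  return words;
}

// Hypothetical helper: case-insensitive check for a phrase that may span
// several words. It joins the page's words with single spaces and does a
// substring search, so it can also match inside longer words.
function pageContainsPhrase(doc, pageNum, phrase) {
  var pageText = getPageWords(doc, pageNum).join(" ").toLowerCase();
  return pageText.indexOf(phrase.toLowerCase()) !== -1;
}
```

In the Acrobat console, `console.println(getPageWords(this, 4).join(" | "));` would list the words of the fifth page (page numbers are 0-based). Because the words come back without punctuation by default, a phrase such as "Smith & Co." would likely need its punctuation removed before it is passed to `pageContainsPhrase`.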

This is how you debug code: You focus on a specific issue and eliminate causes until you find the cause of the problem, and then look for a solution for it. Then you move on to the next issue.

Community Expert
Sep 16, 2016

I'm seeing a potential bug in your code that might cause all kinds of strange behaviors and that will be very difficult to spot if you don't know to look for it.

You should not use the "this" keyword after you create a new document, as it will probably point to that document instead of to the original one. Instead you should keep a separate reference to the original file, something like this as the first line of your code:

var originalDoc = this;

Then replace all instances of "this" in your code with "originalDoc".
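
A minimal sketch of the extraction step with an explicit reference to the source document might look like this. `extractPagesToNewDoc` is a hypothetical helper, and the `createDoc` parameter is an assumption added only so the function can be exercised outside Acrobat; inside Acrobat it would simply wrap `app.newDoc()`. Note one deliberate difference from the posted script: this sketch passes a single page number (`pageNumbers[i]`) as `nStart` and `nEnd`, where the posted script passes the whole array.

```javascript
// Hypothetical helper: copies the given 0-based pages of sourceDoc into a
// newly created document and returns it, or null if there is nothing to copy.
// createDoc is an assumed factory; in Acrobat: function () { return app.newDoc(); }
function extractPagesToNewDoc(sourceDoc, pageNumbers, createDoc) {
  if (pageNumbers.length === 0) {
    return null;
  }
  var target = createDoc(); // a document from app.newDoc() starts with one blank page
  for (var i = 0; i < pageNumbers.length; i++) {
    target.insertPages({
      nPage: target.numPages - 1, // insert after the current last page
      cPath: sourceDoc.path,      // always the original file, never "this"
      nStart: pageNumbers[i],
      nEnd: pageNumbers[i]
    });
  }
  target.deletePages(0); // remove the initial blank page
  return target;
}
```

With `var originalDoc = this;` as the first line of the script, each keyword's page list could then be handled by one call, for example `extractPagesToNewDoc(originalDoc, pageArray, function () { return app.newDoc(); });` followed by the same call with `pageA`. Both documents then go through identical steps, including the removal of the blank first page.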

Guest
Sep 19, 2016

Thanks again for the suggestion, try67! Unfortunately, I still cannot get the script to work. Sometimes it creates a new document for one of the words, but never for both, and it does not create either one consistently.

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var pageA = [];
var originalDoc = this;
var stringToSearchFor = "keyword1";
var stringToSearch = "keyword2";

for (var p = 0; p < originalDoc.numPages; p++) {
  // iterate over all words
  for (var n = 0; n < originalDoc.getPageNumWords(p); n++) {
    if (originalDoc.getPageNthWord(p, n) == stringToSearchFor) {
      pageArray.push(p);
      break;
    }
    else if (originalDoc.getPageNthWord(p, n) == stringToSearch) {
      pageA.push(p);
      break;
    }
  }
}

console.println("Test 2 of pageArray " + pageArray);
console.println("Test 1 of pageA " + pageA);

if (pageArray.length > 0) {
  // extract all pages that contain the string into a new document
  var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
  for (var n = 0; n < pageArray.length; n++) {
    d.insertPages( {
      nPage: d.numPages-1,
      nStart: pageArray,
      cPath: originalDoc.path,
      nEnd: pageArray,
    } );
    console.println(n + " pageArray " + pageArray)
  }
  // remove the first page
  d.deletePages(0);
}

if (pageA.length > 0) {
  // extract all pages that contain the string into a new document
  var q = app.newDoc();    // this will add a blank page - we need to remove that once we are done
  for (var n = 0; n < pageA.length; n++) {
    q.insertPages( {
      nPage: q.numPages-1,
      nStart: pageA,
      cPath: originalDoc.path,
      nEnd: pageA,
    } );
  }
  console.println(pageA)
}

Community Expert
Sep 20, 2016

To help you further I'll need to see the actual file.
