Searching a document with search.query

Community Beginner, Sep 14, 2018

I wrote a working script to compare every word in a document to a list of words I want to flag. It works but is incredibly slow. I know that Acrobat has an indexing feature. Maybe I can tap into that?

I've been playing with search.query, but it doesn't look like it actually returns anything. Please don't tell me to look in the manual :-)... it isn't described there much.

My loop is:

for (var i = 0; i < currentDoc.numPages; i++)
{
    ...
    for (var j = 0; j < currentDoc.getPageNumWords(i); j++)
    {
        ....
        for (var n = 0; n < wordItem.length; n++)   // my list of terms I want to look for
        {
            // If found, comment-highlight the word.
        }
    }
}

Thanks for any help you can give me!

Rick

TOPICS: Acrobat SDK and JavaScript, Windows

1 Correct answer

Community Expert, Sep 14, 2018

The method search.query is a way to programmatically initiate a search. The results are available in the search panel, not to JavaScript. Unfortunately, the only way to locate words and highlight them is the way you are doing it.

That said, you can make your code FAR more efficient by not trying to add a highlight annotation while you are looping through the words. That really slows down Acrobat.

Instead, create an array of arrays for the word indices that match the terms you want to highlight. Then after all words have been identified, loop through that new array and add the highlights. Your array might look like this...

var wordsToHighlight = [
     [4, 15, 21],
     [],
     [19, 22, 25, 85]
];

...so the first page (index 0 of the array) would have words 4, 15, and 21 highlighted. The second page gets no highlights and the third page... well, you get the idea.
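
A minimal sketch of that two-pass approach, for illustration only. The function names collectMatches and addHighlights are hypothetical, and the terms are assumed to be single lowercase words. The Doc methods used (getPageNumWords, getPageNthWord, getPageNthWordQuads, addAnnot) are the standard Acrobat JavaScript ones, but the sketch has not been timed against a real document, so treat any speed-up as something to measure.

```javascript
// Pass 1: record the indices of matching words, page by page.
// `doc` is an Acrobat Doc object; `terms` is an array of lowercase words (assumed).
function collectMatches(doc, terms) {
    var wordsToHighlight = [];
    for (var p = 0; p < doc.numPages; p++) {
        var hits = [];
        var numWords = doc.getPageNumWords(p);   // read once per page, not on every loop test
        for (var w = 0; w < numWords; w++) {
            // Third argument true asks Acrobat to strip punctuation and whitespace.
            var word = doc.getPageNthWord(p, w, true).toLowerCase();
            if (terms.indexOf(word) !== -1) {
                hits.push(w);
            }
        }
        wordsToHighlight.push(hits);   // empty array when the page has no matches
    }
    return wordsToHighlight;
}

// Pass 2: add one Highlight annotation per recorded word index.
// Returns the number of annotations added.
function addHighlights(doc, wordsToHighlight) {
    var count = 0;
    for (var p = 0; p < wordsToHighlight.length; p++) {
        for (var k = 0; k < wordsToHighlight[p].length; k++) {
            doc.addAnnot({
                page: p,
                type: "Highlight",
                quads: doc.getPageNthWordQuads(p, wordsToHighlight[p][k])
            });
            count++;
        }
    }
    return count;
}
```

In Acrobat this could be called as addHighlights(this, collectMatches(this, ["redact", "bogus"])), where the two terms are placeholders.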

Community Beginner, Sep 14, 2018

Great advice... thanks! I'll give that a shot.

Rick

Community Expert, Sep 14, 2018

You'll need to use the getPageNthWord method to actually find the matches to your search string, though, and if it's more than one word it becomes even more complicated...
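
One possible way to handle a multi-word phrase, sketched under some assumptions: the phrase is split on whitespace and compared against consecutive words returned by getPageNthWord. The helper name findPhraseOnPage is hypothetical, the phrase is assumed to have no leading or trailing spaces, and phrases that run across a page break are not handled.

```javascript
// Returns the word indices on page `p` where `phrase` starts.
// `doc` is an Acrobat Doc object; matching is case-insensitive.
function findPhraseOnPage(doc, p, phrase) {
    var parts = phrase.toLowerCase().split(/\s+/);
    var numWords = doc.getPageNumWords(p);
    var starts = [];
    // Stop early enough that the whole phrase still fits on the page.
    for (var w = 0; w + parts.length <= numWords; w++) {
        var matched = true;
        for (var k = 0; k < parts.length; k++) {
            if (doc.getPageNthWord(p, w + k, true).toLowerCase() !== parts[k]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            starts.push(w);
        }
    }
    return starts;
}
```

Each returned index marks the first word of a match; the following words of the phrase would need their own quads if the whole phrase is to be highlighted.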

Community Beginner, Sep 17, 2018

I'm not sure what you mean by this. I am already using the getPageNthWord method to look at each word on a page. Maybe you could wrap a little pseudocode or something around your suggestion if you had something else in mind?

Community Expert, Sep 17, 2018

That is what I mean...

Community Beginner, Sep 17, 2018

Here comes the long-winded response.

I'm going to rewrite my function with an array, but honestly I predict I'll have fewer than five markups per document, so I don't know that it will save much processing time.

My current script takes about 30 seconds to process a 3-page document. My test is searching on 7 terms. This script will have to work on more like 50 terms, and each document will be hundreds of pages.

I'm also going to try taking my list of terms and loading them into a single string to remove one of my loops:

     var terms = "my, redact, list, is, bogus";

     var n = terms.search("redact");
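
One caveat with the single-string idea: String.search matches substrings, so a word such as "red" would be reported as found inside "redact". A possible alternative, sketched here with the hypothetical helper names buildTermLookup and isFlagged, is to load the terms into an object once and test each word with a property lookup, which also removes the inner loop.

```javascript
// Build a lookup object keyed by lowercase term (done once, before the page loop).
function buildTermLookup(terms) {
    var lookup = {};
    for (var n = 0; n < terms.length; n++) {
        lookup[terms[n].toLowerCase()] = true;
    }
    return lookup;
}

// True only when `word` is exactly one of the terms (case-insensitive).
function isFlagged(lookup, word) {
    // hasOwnProperty avoids false hits on inherited names such as "constructor".
    return Object.prototype.hasOwnProperty.call(lookup, word.toLowerCase());
}
```

Usage would look like var lookup = buildTermLookup(["my", "redact", "list", "is", "bogus"]); followed by isFlagged(lookup, word) for each word read from the page.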

I don't know anything about the indexing function in Acrobat yet, but there is a good chance that a given document will contain none of the terms. It might make sense to check against an index first to see whether I even need to search the document word by word. If search terms are found, I could likely limit the scope of terms to search for. The manual search function, which I'm guessing creates or uses an index, is so much faster.

I also thought about creating a custom dictionary to discover the terms. I'm not sure that even makes sense. Any thoughts on that?

The script actually checks all open documents, and marks up the document and creates a report at the end, so even if this takes a while to run, it should still make folks happier than doing it manually.

Rick

Community Expert, Sep 17, 2018

Forget about the search object. If you want to automate this task you can't use it. Use getPageNthWord, instead.

Community Expert, Sep 17, 2018

If you're looking for a script that will do it in Acrobat check out this (paid-for) one I've developed: Custom-made Adobe Scripts: Acrobat -- Highlight All Instances of a Word or Phrase in a PDF

I also developed a standalone version of that tool, which runs independently of Acrobat and is therefore much more robust. If you're interested in that, contact me privately (try6767 at gmail.com).

Community Expert, Sep 17, 2018

You don't want to do this in Acrobat. The fact that it can doesn't mean you should. Get yourself a decent PDF library and perform the operation on a server. The Datalogics PDF Java Toolkit by Adobe is probably the best option for this kind of thing.

The reason your code is going to take so long is that the Acrobat word finder has to assemble drawing instructions, which humans interpret as a word, into a machine-readable word, and that takes a lot of time. PDF isn't like HTML. The word finder is programmed for the worst-case scenario: a hyphenated word in a multicolumn document with irregular column widths.
