I wrote a working script to compare every word in a document to a list of words I want to flag. It works but is incredibly slow. I know that Acrobat has an indexing feature. Maybe I can tap into that?
I've been playing with search.query, but it doesn't look like it actually returns anything. Please don't tell me to look in the manual :-)... it isn't described there much.
My loop is:
for (var i = 0; i < currentDoc.numPages; i++) {
    for (var j = 0; j < currentDoc.getPageNumWords(i); j++) {
        for (var n = 0; n < wordItem.length; n++) { // my list of terms I want to look for
            // If found, comment-highlight the word.
        }
    }
}
Thanks for any help you can give me!
That said, you can make your code FAR more efficient by not trying to add a highlight annotation while you are looping through the words. That really slows down Acrobat.
Instead, create an array of arrays for the word indices that match the terms you want to highlight. Then after all words have been identified, loop through that new array and add the highlights. Your array might look like this...
var wordsToHighlight = [
...so the first page (index 0 of the array) would have words 4, 15, and 21 highlighted. The second page gets no highlights and the third page... well, you get the idea.
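A minimal sketch of that two-pass idea, for illustration only. It assumes `doc` is the Acrobat Doc object and `isFlagged` is a function you supply that returns true for a word you want to mark. The names `collectMatches` and `addHighlights` are made up for this example; `getPageNumWords`, `getPageNthWord`, `getPageNthWordQuads`, and `addAnnot` are the Doc methods it relies on.

```javascript
// Pass 1: record the matching word indices per page. No annotations yet.
function collectMatches(doc, isFlagged) {
    var wordsToHighlight = [];
    for (var p = 0; p < doc.numPages; p++) {
        var hits = [];
        // Ask for the word count once per page, not on every iteration.
        var count = doc.getPageNumWords(p);
        for (var w = 0; w < count; w++) {
            // The third argument (true) asks Acrobat to strip punctuation and whitespace.
            var word = doc.getPageNthWord(p, w, true);
            if (word && isFlagged(word)) {
                hits.push(w);
            }
        }
        wordsToHighlight.push(hits); // an empty array means no matches on this page
    }
    return wordsToHighlight;
}

// Pass 2: add the highlights after all matches have been identified.
function addHighlights(doc, wordsToHighlight) {
    for (var p = 0; p < wordsToHighlight.length; p++) {
        for (var k = 0; k < wordsToHighlight[p].length; k++) {
            doc.addAnnot({
                page: p,
                type: "Highlight",
                quads: doc.getPageNthWordQuads(p, wordsToHighlight[p][k])
            });
        }
    }
}
```

With the variable from the original loop, this might be run as `addHighlights(currentDoc, collectMatches(currentDoc, myTest))`, where `myTest` stands for whatever comparison you already use.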
Great advice... thanks! I'll give that a shot.
You'll need to use the getPageNthWord method to actually find the matches to your search string, though, and if it's more than one word it becomes even more complicated...
I'm not sure what you mean by this. I am using the getPageNthWord method to look at each word on a page. Maybe you could wrap a little pseudocode or something around your suggestion if you had something else in mind?
That is what I mean...
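On the point about search strings of more than one word: one possible approach, sketched here under assumptions and not necessarily what anyone in this thread uses, is to compare runs of consecutive words on a page. `pageWords` is assumed to be an array already filled from getPageNthWord for a single page, and `findPhrase` is a hypothetical helper name.

```javascript
// Returns the index of the first word of every occurrence of the phrase.
function findPhrase(pageWords, phrase) {
    var parts = phrase.toLowerCase().split(/\s+/);
    var starts = [];
    for (var i = 0; i + parts.length <= pageWords.length; i++) {
        var matched = true;
        for (var j = 0; j < parts.length; j++) {
            if (String(pageWords[i + j]).toLowerCase() !== parts[j]) {
                matched = false;
                break;
            }
        }
        if (matched) {
            starts.push(i);
        }
    }
    return starts;
}
```

Each start index plus the number of words in the phrase gives the word indices to highlight. A phrase that runs across a page break would not be found by this sketch.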
Here comes the long-winded response.
I'm going to rewrite my function with an array, but honestly I predict I'll have fewer than five markups per document, so I don't know that it will save much processing time.
My current script takes about 30 seconds to process a 3-page document. My test is searching on 7 terms. This script will have to work on more like 50 terms, and each document will be hundreds of pages.
I'm also going to try taking my list of terms and loading them into a single string to remove one of my loops:
var terms = "my, redact, list, is, bogus";
var n = terms.search("redact");
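One caveat with the single-string idea: `search` treats its argument as a regular expression and matches substrings, so a document word such as "act" would be found inside "redact". A hedged alternative that still removes the inner loop is a lookup object keyed by term. `buildTermLookup` and `isFlagged` are hypothetical helper names for this sketch, not Acrobat functions.

```javascript
// Build the lookup once, before looping over pages and words.
function buildTermLookup(terms) {
    var lookup = {};
    for (var i = 0; i < terms.length; i++) {
        // The "$" prefix keeps words like "constructor" from colliding
        // with properties that every object inherits.
        lookup["$" + terms[i].toLowerCase()] = true;
    }
    return lookup;
}

// Whole-word, case-insensitive test: one property check per document word.
function isFlagged(lookup, word) {
    return lookup["$" + String(word).toLowerCase()] === true;
}

var lookup = buildTermLookup(["my", "redact", "list", "is", "bogus"]);
isFlagged(lookup, "Redact"); // true
isFlagged(lookup, "act");    // false, although "act" is a substring of "redact"
```

The cost per word stays roughly constant whether the list holds 7 terms or 50.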
I don't know anything about the indexing function in Acrobat yet. Since there is a good possibility that no terms will be found in a given document, it might make sense to check against an index first to see whether I even need to search the document word by word. If search terms are found, I could likely limit the set of terms to search for. The manual search function, which I'm guessing creates or uses an index, is so much faster.
I also thought about creating a custom dictionary to discover the terms. I'm not sure that even makes sense. Any thoughts on that?
The script actually checks all open documents, and marks up the document and creates a report at the end, so even if this takes a while to run, it should still make folks happier than doing it manually.
Forget about the search object. If you want to automate this task you can't use it. Use getPageNthWord, instead.
If you're looking for a script that will do it in Acrobat check out this (paid-for) one I've developed: Custom-made Adobe Scripts: Acrobat -- Highlight All Instances of a Word or Phrase in a PDF
I also developed a standalone version of that tool, that runs independently of Acrobat and is therefore much more robust. If you're interested in that contact me privately (try6767 at gmail.com).
You don't want to do this in Acrobat. The fact that it can doesn't mean you should. Get yourself a decent PDF library and perform the operation on a server. The Datalogics PDF Java Toolkit by Adobe is probably the best option for this kind of thing.
The reason your code is going to take so long is that the Acrobat word finder has to assemble drawing instructions for what humans interpret as a word into a machine-readable word, and that takes a lot of time. PDF isn't like HTML. The word finder is programmed for the worst-case scenario: a hyphenated word in a multicolumn document with irregular column widths.