Known Participant
April 16, 2018
Answered

Searching text phrases without the Search plugin

I am looking for a smooth way of searching for text phrases, not just single words, in a PDF document using the API. I am not interested in using the Search plugin. I have started to put together a function using the word finder and a vector of words, but I am a little surprised that there does not seem to be a better way of doing this. Or is there?

Thom Parker
Community Expert
April 16, 2018

How could it be better than having the WordFinder? It's super fast and provides tons of info. If it weren't for the WordFinder you'd be parsing content streams. You are on the best route.

Thom Parker - Software Developer at PDFScripting
Use the Acrobat JavaScript Reference early and often
Known Participant
April 16, 2018

OK, thanks. Well, it is easy and fast to search word by word, as long as you do not want to search for a sentence or a small amount of text. I have not found a better way than to split the sentence at the spaces between the words and then search for each individual word, in the correct order, to match what I am searching for. It kind of feels like a hack, so I was thinking there was a better way of doing it that I was not aware of.
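
For reference, the splitting step described above might look roughly like the following. This is only a sketch in standard C++ with no Acrobat SDK calls; the function name `splitPhrase` is made up for illustration. It assumes the word finder breaks words at whitespace, which may not hold for punctuation or hyphenated words.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a search phrase into its words.
// Stream extraction skips any run of whitespace, so leading, trailing,
// and repeated spaces do not produce empty entries.
std::vector<std::string> splitPhrase(const std::string& phrase) {
    std::vector<std::string> words;
    std::istringstream in(phrase);
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

The resulting vector can then be compared, in order, against the words the word finder returns for a page.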

Thom Parker
Community Expert
Correct answer
April 16, 2018

There are no hacks, just better and worse algorithms for searching. The way I do this, and I've done this multiple times, is to create a list (a C++ std::list or an MFC CStringList) of the words that make up the phrase. Then search the page for words that match the first item. On a match, divert into a sub loop that checks each item in the list against the next word on the page. It's easy to add simple variations like case-insensitive and partial-word matching. I'll typically use the same incrementer in the sub loop so the phrase matches don't overlap, but you could do it differently depending on the search requirements. I have also done this by pre-processing the text on the page into distinct blocks of text, to ensure the search only happens within a text block. Of course, this technique will not catch phrases broken across pages, which requires identifying paragraphs, headers and footers.
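
A minimal sketch of the loop described above, in standard C++. It is not the actual plug-in code: it assumes the page's words have already been pulled out of the word finder, in reading order, into a `std::vector<std::string>`, and the names `wordsEqual` and `findPhrase` are hypothetical. It also differs slightly from the shared-incrementer version: it skips past a complete match so matches do not overlap, but restarts at the next word after a partial match.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Compare two words, optionally ignoring ASCII case.
bool wordsEqual(const std::string& a, const std::string& b, bool ignoreCase) {
    if (!ignoreCase) return a == b;
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](unsigned char x, unsigned char y) {
                          return std::tolower(x) == std::tolower(y);
                      });
}

// Return the index of the first word of every non-overlapping match of
// `phrase` in `pageWords` (assumed to be in reading order).
std::vector<std::size_t> findPhrase(const std::vector<std::string>& pageWords,
                                    const std::vector<std::string>& phrase,
                                    bool ignoreCase = false) {
    std::vector<std::size_t> hits;
    if (phrase.empty() || phrase.size() > pageWords.size()) return hits;

    std::size_t i = 0;
    while (i + phrase.size() <= pageWords.size()) {
        // Sub loop: check each phrase word against the following page words.
        std::size_t k = 0;
        while (k < phrase.size() &&
               wordsEqual(pageWords[i + k], phrase[k], ignoreCase)) {
            ++k;
        }
        if (k == phrase.size()) {
            hits.push_back(i);
            i += phrase.size();  // jump past the match so matches don't overlap
        } else {
            ++i;                 // partial or no match: try the next word
        }
    }
    return hits;
}
```

With page words `The quick brown fox the QUICK brown` and the phrase `the quick brown`, a case-insensitive search returns indices 0 and 4, while a case-sensitive search returns nothing. Restricting the search to one text block would just mean calling `findPhrase` once per block's word list instead of once per page.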
