Question

getPageNumWords in Acrobat SDK is not detecting words with '-' and '_'

May 4, 2016

We are using the Acrobat SDK to identify and link words in our PDF files. Using the getPageNumWords method we get the number of words on a page, and we use the quads and rect of each word to link it.

    for (int page = 0; page < numPages; page++) { // for each page
        // number of words
        object objNumWords = COMUtils.invokeMethod(jso, "getPageNumWords", page);
        if (objNumWords == null)
            throw new PDFProcessingException("Acrobat API Error. Cannot access doc.getPageNumWords()");
        int numWords = ConvertUtils.getInt(objNumWords);
        // Other logic goes here
    }

When there is a word like ABCD-EFGH or ABCD_EFGH in the PDF file, the above approach treats it as two words, ABCD and EFGH, instead of one.

Is it a bug in Acrobat SDK or are we not using it as it is intended?

BTW we are using Acrobat SDK 1.1

What am I missing?

Thanks,

Tippu


lrosenth

Adobe Employee

Whether or not a – or a _ is a word break character depends quite significantly on the context of the content of the document. That is why the core “WordFinder” APIs allow you to specify your own rules for how to break words.

Unfortunately, it appears that you are using our JavaScript APIs (across a COM bridge), which don’t give you that control. You would need to switch to using our C/C++ APIs directly.

gkaiseril

Inspiring

The characters you are having the issue with are considered part of the "white space" set of characters. There is an optional parameter for the getPageNthWord method that will include those characters: the bStrip logical parameter. The default value is 'true', so the punctuation and white space immediately following the word are removed. Omitting this parameter causes the default value to be in effect. Set this value to "false" and you should get the missing characters. Unfortunately there is no example provided in the documentation.

tippus29220139Author

Participating Frequently

Generally getPageNthWord is used after getPageNumWords, which automatically splits words containing '-' and '_' and counts them as multiple words. How does getPageNthWord identify such words and group them together?

try67

Community Expert

Not possible.

try67

Community Expert

This is the expected result when using this method (maybe not what you expected, but it's how it works).

tippus29220139Author

Participating Frequently

Is there any method that does exactly the same but does not split up the words? Because _ and - are very common in the identifiers people use.

Also, is this behavior fixed in subsequent versions of the SDK?

try67

Community Expert

You can do the counting yourself. Read all of the words (using the getPageNthWord method) into an array, and then combine the ones that end with an underscore or a hyphen with the word that follows them. Make sure you set the bStrip parameter to false, so the method actually retrieves these characters.
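A minimal sketch of that merging step, in Java to match the snippet in the question. It assumes the raw words have already been read into a list with getPageNthWord and bStrip set to false, and that a split word then arrives as something like "ABCD-" followed by "EFGH " (trailing punctuation and white space kept). That token shape is an assumption and worth checking against your own files. The class and method names (WordMerger, mergeJoinedWords) are made up for illustration and are not part of the Acrobat SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordMerger {

    /**
     * Merges each raw token that ends in '-' or '_' with the token after it.
     * Tokens are expected untrimmed (bStrip = false), so "ABCD- " (joiner
     * followed by white space) is NOT merged, while "ABCD-" is.
     */
    public static List<String> mergeJoinedWords(List<String> rawWords) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String raw : rawWords) {
            pending.append(raw);
            boolean endsWithJoiner = raw.endsWith("-") || raw.endsWith("_");
            if (!endsWithJoiner) {
                flush(pending, merged);
            }
        }
        // A trailing token that ended in a joiner has nothing to merge with.
        flush(pending, merged);
        return merged;
    }

    // Adds the trimmed pending text to the output (if non-empty) and resets it.
    private static void flush(StringBuilder pending, List<String> out) {
        String word = pending.toString().trim();
        if (!word.isEmpty()) {
            out.add(word);
        }
        pending.setLength(0);
    }

    public static void main(String[] args) {
        // Hypothetical raw tokens, as assumed above.
        List<String> raw = Arrays.asList("see ", "ABCD-", "EFGH ", "and ", "ABCD_", "EFGH.");
        System.out.println(mergeJoinedWords(raw));
    }
}
```

Only white space is trimmed here, so other trailing punctuation (the final "." in the sample) stays on the word. If you also need the quads for linking, you would have to remember which original word indices went into each merged word and combine their getPageNthWordQuads results yourself.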

I don't think this behaviour has changed or will change, although one can never know what the future might hold.
