
getPageNumWords in Acrobat SDK is not detecting words with '-' and '_'

New Here, May 04, 2016

We are using the Acrobat SDK to identify and link words in our PDF files. Using the getPageNumWords method we get the number of words on a page, and we link each word using its quads and rect.

    for (int page = 0; page < numPages; page++) { // for each page
        // number of words
        object objNumWords = COMUtils.invokeMethod(jso, "getPageNumWords", page);
        if (objNumWords == null)
            throw new PDFProcessingException("Acrobat API Error. Cannot access doc.getPageNumWords()");
        int numWords = ConvertUtils.getInt(objNumWords);
        // Other logic goes here
    }

When the PDF contains a word such as ABCD-EFGH or ABCD_EFGH, it is counted and returned as two words, ABCD and EFGH, instead of one.

Is this a bug in the Acrobat SDK, or are we not using it as intended?

BTW, we are using Acrobat SDK 1.1.

What am I missing?

Thanks,

Tippu

TOPICS
Acrobat SDK and JavaScript
Community Expert, May 04, 2016

This is the expected result when using this method (maybe not what you expected, but it's how it works).

New Here, May 04, 2016

Is there a method that does the same thing but does not split the words? _ and - are very common in identifiers.

Also, has this behavior changed in later versions of the SDK?

Community Expert, May 04, 2016

You can do the counting yourself. Read all of the words (using the getPageNthWord method) into an array, and then combine the ones that end with an underscore or a hyphen. Make sure you set the bStrip parameter to false, so the method actually retrieves these characters.

I don't think this behaviour has changed or will change, although one can never know what the future might hold.
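
A minimal sketch of the merge step described above, written in plain JavaScript so it can be tried outside Acrobat. It assumes the input array holds the strings returned by getPageNthWord with bStrip set to false, and that a part followed directly by a hyphen or underscore comes back with that character as its last character. The function name and the JOINERS constant are made up for this illustration.

```javascript
// Characters to treat as part of a word (taken from the question; extend as needed).
var JOINERS = "-_";

// rawWords: strings as returned by getPageNthWord(nPage, nWord, false),
// i.e. with trailing punctuation and white space still attached (assumption).
function mergeJoinedWords(rawWords) {
  var merged = [];
  var pending = "";
  for (var i = 0; i < rawWords.length; i++) {
    var raw = rawWords[i];
    pending += raw;
    var last = raw.charAt(raw.length - 1);
    var endsWithJoiner = raw.length > 0 && JOINERS.indexOf(last) !== -1;
    // A trailing joiner means the next part belongs to the same word,
    // unless this is the last part on the page.
    if (endsWithJoiner && i < rawWords.length - 1) {
      continue;
    }
    // Trim white space only; other punctuation is left in place.
    merged.push(pending.replace(/^\s+|\s+$/g, ""));
    pending = "";
  }
  return merged;
}

var example = mergeJoinedWords(["ABCD-", "EFGH ", "foo_", "bar ", "plain "]);
// example is ["ABCD-EFGH", "foo_bar", "plain"]
```

The length of the returned array can then stand in for the value of getPageNumWords. A part such as "ABCD - " ends with white space rather than a joiner, so it is left unmerged.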

New Here, May 04, 2016

Sounds like we are not using this method for what it was intended. Why does it split on '-' and '_'? This wasn't mentioned in the documentation we downloaded from Adobe, either.

Can you explain this in more detail for me?

Thanks

Community Expert, May 04, 2016

What do you mean? What are you using it for, then?

I suggest you read the documentation of the method I just mentioned.

New Here, May 04, 2016

I meant that we didn't know it would split on '_' and '-' before we used it, nor is this mentioned in the documentation here. One thing we can do, as you said, is do the check and the counting ourselves in code.

The other thing I would be interested in is whether there is any method that counts the words without splitting them on special characters. From your reply, it seems there is none.

That made me wonder why Adobe wrote getPageNumWords to split on these special characters even though they are very common in literal names, hence my request for more info.

However, the final solution to my problem seems to be that we should do all the counting and bookkeeping ourselves.

Hope I am clear now.
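
Part of that bookkeeping is the link area: once several parts are treated as one word, the link has to cover all of them. The sketch below assumes each part's quads have the shape returned by getPageNthWordQuads (a list of quads, each a list of eight numbers giving the x/y pairs of the four corners). The function name and the returned field names are made up for this illustration, and the corner order expected by whatever creates the link is left to the caller.

```javascript
// quadLists: one entry per word part, each entry being that part's list of
// quads (assumed shape: [[x1, y1, x2, y2, x3, y3, x4, y4], ...]).
function unionBounds(quadLists) {
  var b = { xMin: Infinity, yMin: Infinity, xMax: -Infinity, yMax: -Infinity };
  for (var p = 0; p < quadLists.length; p++) {
    var quads = quadLists[p];
    for (var q = 0; q < quads.length; q++) {
      var quad = quads[q];
      // Walk the corner points as x/y pairs and grow the bounding box.
      for (var k = 0; k + 1 < quad.length; k += 2) {
        b.xMin = Math.min(b.xMin, quad[k]);
        b.xMax = Math.max(b.xMax, quad[k]);
        b.yMin = Math.min(b.yMin, quad[k + 1]);
        b.yMax = Math.max(b.yMax, quad[k + 1]);
      }
    }
  }
  return b;
}

// Two adjacent parts on the same line, e.g. "ABCD-" and "EFGH".
var bounds = unionBounds([
  [[10, 20, 50, 20, 10, 8, 50, 8]],
  [[50, 20, 90, 20, 50, 8, 90, 8]]
]);
// bounds is { xMin: 10, yMin: 8, xMax: 90, yMax: 20 }
```

If a merged word is hyphenated across a line break, a single bounding box would span both lines, so separate link areas per part are probably the safer choice in that case.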

LEGEND, May 05, 2016

Extracting text from a PDF requires Acrobat to first assemble and order all the characters on the page, with their positions; to decide what makes lines, to decide by fuzzy logic where the spaces are, to apply punctuation. There is a huge amount of guesswork.

In the full API for plug-ins you have control over what characters are considered punctuation, but not other aspects like the threshold for deciding what is a space versus loosely spaced text. The very simple JavaScript API has no such control.

LEGEND, May 05, 2016

The characters you are having trouble with are treated as part of the "white space" character set. The getPageNthWord method has an optional logical parameter, bStrip, that controls whether those characters are returned. Its default value is true, so the white space immediately following the word is removed; omitting the parameter leaves the default in effect. Set it to false and you should get the missing characters. Unfortunately, the documentation provides no example.
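
Since the documentation has no example, here is a rough sketch of reading a page's words with and without stripping. readPageWords only calls getPageNumWords and getPageNthWord; inside Acrobat it could be called with the current document as the first argument. makeFakeDoc is a hypothetical stand-in so the sketch runs outside Acrobat, and the raw strings it serves are an assumption about what Acrobat returns, not captured output.

```javascript
// Collect every word on a page. `doc` must offer getPageNumWords(nPage)
// and getPageNthWord(nPage, nWord, bStrip).
function readPageWords(doc, nPage, bStrip) {
  var count = doc.getPageNumWords(nPage);
  var words = [];
  for (var i = 0; i < count; i++) {
    words.push(doc.getPageNthWord(nPage, i, bStrip));
  }
  return words;
}

// Hypothetical stand-in for a Doc object (not part of the Acrobat API).
function makeFakeDoc(rawWords) {
  return {
    getPageNumWords: function (nPage) {
      return rawWords.length;
    },
    getPageNthWord: function (nPage, nWord, bStrip) {
      var raw = rawWords[nWord];
      if (bStrip === false) {
        return raw;
      }
      // Imitate stripping: drop non-alphanumeric characters at both ends.
      return raw.replace(/^[^A-Za-z0-9]+|[^A-Za-z0-9]+$/g, "");
    }
  };
}

// Assumed raw form of the text "ABCD-EFGH and more."
var fakeDoc = makeFakeDoc(["ABCD-", "EFGH ", "and ", "more."]);
var stripped = readPageWords(fakeDoc, 0, true);  // ["ABCD", "EFGH", "and", "more"]
var raw = readPageWords(fakeDoc, 0, false);      // ["ABCD-", "EFGH ", "and ", "more."]
```

Only the unstripped form keeps enough information to tell a hyphen or underscore apart from an ordinary space between words.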

New Here, May 05, 2016

Generally getPageNthWord is used after getPageNumWords, which splits words containing '-' and '_' and counts them as multiple words. How does getPageNthWord identify such words and group them together?

Community Expert, May 05, 2016

Not possible.

LEGEND, May 05, 2016

It is up to you as the programmer to detect the trailing white space and account for this situation in your coding.

Adobe Employee, May 05, 2016

Whether or not a - or a _ is a word break character depends quite significantly on the context of the content of the document. That is why the core “WordFinder” APIs allow you to specify your own rules for how to break words.

Unfortunately, it appears that you are using our JavaScript APIs (across a COM bridge), which don’t give you that control. You would need to switch to using our C/C++ APIs directly.
