getPageNthWord combines pairs of words instead of reporting each word individually

Community Beginner, Aug 26, 2023


When I use getPageNthWord to read the words in a PDF, every sentence that wraps to a new line has the two words on either side of the wrap concatenated.

For example, assume the line wraps after the 7:

6 CFR 210.5(c), 7
CFR

Six words should be returned:

6 CFR 210.5 c 7 CFR

Instead, only five words are returned:

6 CFR 210.5 c 7CFR

The 7 is concatenated to the CFR, which is incorrect and makes the search fail when I am searching for "7 CFR".

 

I assume this is a bug in the SDK. The call comes from the JavaScript example I am using:

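// p is the zero-based page number, i is the zero-based word index on that page.
// jsObj is assumed to be the document's JavaScript object and T its Type (jsObj.GetType()).
object[] getPageNthWordParam = { p, i };
word = (String)T.InvokeMember(
    "getPageNthWord",
    BindingFlags.InvokeMethod |
    BindingFlags.Public |
    BindingFlags.Instance,
    null, jsObj, getPageNthWordParam);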

 

If this is indeed a bug, where can I report it?

Is there a workaround, such as disabling sentence wrapping?

Is there another way to read individual words besides getPageNthWord?

As it stands I cannot use my program, and it would be a lot easier to convert the PDF to DOCX and not use Acrobat at all.
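
One possible workaround on the search side, sketched here under assumptions rather than taken from the SDK: compare the search phrase against the extracted words with whitespace and most punctuation removed, so that "7 CFR" still matches when the words come back merged as "7CFR". The helper names normalize and findPhrase are hypothetical, and the sketch assumes the words have already been collected into an array with getPageNthWord. It only finds phrases that start and end on an extracted word boundary, and it handles ASCII letters and digits only.

```javascript
// Lowercase a token and drop whitespace and punctuation other than ".",
// so "7 CFR", "7CFR", and "210.5(c)," are compared on their characters only.
function normalize(text) {
  return String(text).toLowerCase().replace(/[^a-z0-9.]/g, "");
}

// Return the index of the first word at which `phrase` begins, or -1.
// `words` is the list of strings already collected from getPageNthWord.
function findPhrase(words, phrase) {
  var target = normalize(phrase);
  if (target.length === 0) return -1;
  for (var start = 0; start < words.length; start++) {
    // Skip tokens that contain nothing searchable.
    if (normalize(words[start]).length === 0) continue;
    var joined = "";
    // Glue consecutive words together until the candidate is as long as the target.
    for (var end = start; end < words.length && joined.length < target.length; end++) {
      joined += normalize(words[end]);
    }
    if (joined === target) return start;
  }
  return -1;
}

// The five words from the example above, with the merged "7CFR".
var extracted = ["6", "CFR", "210.5", "c", "7CFR"];
var hit = findPhrase(extracted, "7 CFR"); // index 4, the merged word
```

Because the comparison ignores word boundaries, the same call works whether the extractor returns "7" and "CFR" separately or as one word.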

TOPICS
Acrobat SDK and JavaScript, Windows

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Beginner, Aug 27, 2023


I spent some time looking at the problem and found that the word-wrap problem seems to occur only when the text is bulleted. I do not know how to explain it, but if the PDF form has the following

[Image: JStateson_0-1693144992434.png]

the wrapped words "and" and "to" show up as

under 7 cfr 210.8 a andto ensure resolution

If I create a new PDF and put the contents of the bulleted items into the new document, the word-wrap problem does not exist.

Unfortunately, the bulleted items are basically the bibliography and contain references to documents, and a word wrap inside a reference causes the search for that reference to fail. My application looks for document references in a PDF or DOCX file. The problem occurs in both the 2015 and the 2022 versions of Acrobat Pro.

Community Expert, Aug 27, 2023


This is not a bug, but a feature. A word can be split across multiple areas, which is why the quads property is a 2D array. Each top-level item in this array represents one rectangle (i.e., one part of the word).

This becomes useful when a word is split at the end of a line, for example, but you still want it to be treated as a single word when exporting or reading the text. It seems the application that created the file used this feature incorrectly, but it's there for a good reason.
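
To illustrate the point about quads, here is a minimal sketch that lists the words on a page whose quads contain more than one rectangle, which is how a word split across a line break would show up. It uses the Acrobat JavaScript methods getPageNumWords, getPageNthWord, and getPageNthWordQuads. The function name findSplitWords and the mockDoc object are hypothetical, the coordinates are invented, and whether the merged words in this particular file actually report two rectangles is an assumption that would need checking against the real PDF.

```javascript
// Collect every word on a page whose quads list has more than one rectangle.
function findSplitWords(doc, pageNum) {
  var split = [];
  var count = doc.getPageNumWords(pageNum);
  for (var i = 0; i < count; i++) {
    var quads = doc.getPageNthWordQuads(pageNum, i);
    if (quads.length > 1) {
      split.push({
        index: i,
        text: doc.getPageNthWord(pageNum, i, true),
        rectangles: quads.length
      });
    }
  }
  return split;
}

// Hypothetical stand-in for an Acrobat Doc object, for illustration only.
// Each quad is eight numbers (four corner points); the values are made up.
var mockWords = [
  { text: "a", quads: [[72, 700, 78, 700, 72, 688, 78, 688]] },
  { text: "andto", quads: [[500, 700, 520, 700, 500, 688, 520, 688],
                           [90, 686, 102, 686, 90, 674, 102, 674]] },
  { text: "ensure", quads: [[106, 686, 140, 686, 106, 674, 140, 674]] }
];
var mockDoc = {
  getPageNumWords: function (pageNum) { return mockWords.length; },
  getPageNthWord: function (pageNum, i, strip) { return mockWords[i].text; },
  getPageNthWordQuads: function (pageNum, i) { return mockWords[i].quads; }
};

var splitWords = findSplitWords(mockDoc, 0);
```

Inside Acrobat, the same function could be called from the JavaScript console as findSplitWords(this, 0). Words it reports are candidates for extra handling, for example by treating them as possibly merged when searching.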

Community Beginner, Aug 27, 2023


quote

This is not a bug, but a feature.

The Titanic featured 16 lifeboats that met the requirements of the Merchant Shipping Act at the time it sank. Some would consider that a bug and not a feature.

quote

This becomes useful when a word is split at the end of a line, for example, but you still want it to be treated as a single word when exporting or reading the text.

If I enable editing on the page containing the bullets, there is clearly a space after the "and", and that space is followed by the line wrap to the next word, "the". If the line wrap is dropped, the two words should still be separated by the space and be extracted as two words, not "andthe".

Opening the PDF in Word 365 and saving it as DOCX allows the search to complete on the DOCX. There is no need to use Acrobat Pro, and my clients may want to just use Word.

 

 

 

Community Expert, Aug 27, 2023


I don't see how your anecdote about the Titanic has any relevance to this situation, but fine.

 

Editing the static content of the PDF in Acrobat teaches you nothing about its internal structure.
