Known Participant
August 27, 2023
Question

getPageNthWord combines pairs of words instead of reporting each word individually


When I use getPageNthWord to obtain the words in a PDF, every wrapped line causes the pair of words on either side of the wrap to be concatenated.

 

For example, assume the line wraps after the 7:

 

6 CFR 210.5(c), 7
CFR

 

There should be 6 words obtained:

6 CFR 210.5 c 7 CFR

 

Instead, only 5 words are obtained:

6 CFR 210.5 c 7CFR

The 7 is concatenated with the CFR, which is incorrect and makes the search fail if I am searching for "7 CFR".

 

I assume this is a bug in the SDK for the JavaScript example I am using:

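// p = zero-based page number, i = zero-based index of the word on that page
object[] getPageNthWordParam = { p, i };
// T is the Type of jsObj, the Acrobat JavaScript object
word = (String)T.InvokeMember(
    "getPageNthWord",
    BindingFlags.InvokeMethod |
    BindingFlags.Public |
    BindingFlags.Instance,
    null, jsObj, getPageNthWordParam);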

 

If indeed this is a bug, where can I report it?

Is there a workaround, such as disabling sentence wrapping?

Is there another way to read individual words, other than getPageNthWord?

 

I cannot use my program as it stands, and it is a lot easier to convert the PDF to DOCX and not use Acrobat at all.


JStateson
Author
Known Participant
August 27, 2023

I spent some time looking at the problem and discovered that the word wrap problem seems to occur only when the text is bulleted. I do not know how to explain it, but if the PDF form has the following

 

 

the wrapped words "and" and "to" show up as

 

under 7 cfr 210.8 a andto ensure resolution

 

If I create a new PDF and put the contents of the bulleted items into the new document, the word wrap problem does not exist.

 

Unfortunately, the bulleted items are basically the bibliography and contain references to documents, and a word wrap inside a reference causes the search for that reference to fail. My application looks for document references in a PDF or DOCX file. The problem occurs in both the 2015 and the 2022 versions of Acrobat Pro.
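
One possible workaround on the search side, offered only as a minimal sketch: compare the search phrase against the extracted words with all white space removed, so that "7 CFR" still matches a fused "7CFR". The helper name findPhrase and the plain array of words are assumptions made for illustration; they are not part of the Acrobat SDK.

```javascript
// Hypothetical helper (not an SDK function): find a phrase in a list of
// extracted words while ignoring where the word boundaries fall.
// Returns the index of the word where the match starts, or -1.
function findPhrase(words, phrase) {
  // Compare with all white space removed and case folded.
  var target = phrase.replace(/\s+/g, "").toLowerCase();
  if (target.length === 0) {
    return -1;
  }
  for (var start = 0; start < words.length; start++) {
    var joined = "";
    for (var end = start; end < words.length; end++) {
      joined += String(words[end]).toLowerCase();
      if (joined === target) {
        return start; // phrase begins at this word
      }
      if (target.indexOf(joined) !== 0) {
        break; // joined text is no longer a prefix of the phrase
      }
    }
  }
  return -1; // not found
}
```

With the five words from the example above, findPhrase(["6", "CFR", "210.5", "c", "7CFR"], "7 CFR") returns 4, the index of the fused word. The trade-off is that word boundaries are ignored entirely, so a phrase such as "a bc" would also match the words "ab" and "c".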

try67
Community Expert
August 27, 2023

This is not a bug, but a feature. A word can be split into multiple areas, which is why the quads property is a 2D array. Each top-level item in this array represents one rectangle (i.e. one part of the word).

This becomes useful when a word is split at the end of a line, for example, but you still want it to be treated as a single word when exporting or reading the text. It seems the application that created the file used this feature incorrectly, but it's there for a good reason.
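
To see which words are affected in a given file, the quads can be inspected directly. The following is a rough sketch, assuming doc is an Acrobat Doc object and that getPageNthWordQuads returns one top-level entry per rectangle, as described above; findSplitWords is a hypothetical helper name, not an SDK function.

```javascript
// Sketch: list the words on a page that are reported with more than one
// quad, i.e. words whose text is laid out in more than one rectangle.
// `doc` is assumed to be an Acrobat Doc object (or any object exposing the
// same three methods); `page` is the zero-based page number.
function findSplitWords(doc, page) {
  var split = [];
  var count = doc.getPageNumWords(page);
  for (var i = 0; i < count; i++) {
    var quads = doc.getPageNthWordQuads(page, i);
    if (quads && quads.length > 1) {
      split.push({
        index: i,                               // word index on the page
        word: doc.getPageNthWord(page, i, true), // stripped word text
        parts: quads.length                      // number of rectangles
      });
    }
  }
  return split;
}
```

Every entry in the result is a word that spans more than one rectangle, which makes it a candidate for a fused pair such as "andto". The quads alone do not say where inside the string the break falls, so this only flags suspects; it does not split them.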

JStateson
Author
Known Participant
August 27, 2023
quote

This is not a bug, but a feature.

 

 

The Titanic featured 16 lifeboats that met the requirements of the Merchant Shipping Act at the time it sank.  Some would consider that a bug and not a feature.

 

quote

This becomes useful when a word is split at the end of a line, for example, but you still want it to be treated as a single word when exporting or reading the text.

 

If I enable editing on the page containing the bullets, there is clearly a space after the "and", and that space is followed by the line wrap to the next word, "the". If the line wrap is dropped, the two words should still be separated by that space and be extracted as two words, not "andthe".
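
One thing that might be worth trying, sketched below under stated assumptions: getPageNthWord takes an optional third parameter, bStrip, and passing false asks Acrobat not to strip punctuation and white space from the returned word. Whether Acrobat actually keeps an interior space for these wrapped words is an assumption that would have to be checked against the real file; readWordParts is a hypothetical helper name.

```javascript
// Sketch: read a word unstripped and split it on any white space it still
// contains. `doc` is assumed to be an Acrobat Doc object (or any object with
// a compatible getPageNthWord method).
function readWordParts(doc, page, index) {
  // false = do not strip punctuation and white space from the word
  var raw = String(doc.getPageNthWord(page, index, false));
  var pieces = raw.split(/\s+/);
  var out = [];
  for (var i = 0; i < pieces.length; i++) {
    // Trim leading and trailing non-alphanumeric characters (ASCII only in
    // this sketch); interior characters such as the dot in "210.5" are kept.
    var cleaned = pieces[i].replace(/^[^A-Za-z0-9]+|[^A-Za-z0-9]+$/g, "");
    if (cleaned.length > 0) {
      out.push(cleaned);
    }
  }
  return out;
}
```

If the unstripped word does come back with the space intact, the helper yields two parts for it; if Acrobat returns the word already fused, it yields a single part and this approach does not help.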

 

Opening the PDF in Word 365 and saving it as DOCX allows the search to complete on the DOCX. There is no need to use Acrobat Pro, and my clients may want to just use Word.