When using getPageNthWord to obtain the words in a PDF, every line wrap causes the word before the wrap and the word after it to be concatenated.
For example, assume the line wraps after the 7:
6 CFR 210.5(c), 7
CFR
There should be 6 words obtained:
6 CFR 210.5 c 7 CFR
Instead, only 5 words are obtained:
6 CFR 210.5 c 7CFR
The 7 is concatenated to the CFR, which is incorrect and makes the search fail when I am searching for "7 CFR".
I assume this is a bug in the SDK. This is the call from the JavaScript example I am using:
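// Requires: using System.Reflection;
// jsObj is the Acrobat JavaScript bridge object; T is assumed to be its Type (jsObj.GetType()).
// p is the zero-based page number, i is the zero-based word index on that page.
object[] getPageNthWordParam = { p, i };
word = (String)T.InvokeMember(
    "getPageNthWord",
    BindingFlags.InvokeMethod |
    BindingFlags.Public |
    BindingFlags.Instance,
    null, jsObj, getPageNthWordParam);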
If indeed this is a bug, where can I report it?
Is there a workaround, such as disabling sentence wrapping?
Is there another way to read individual words, other than getPageNthWord?
As it stands I cannot use my program, and it is a lot easier to convert the PDF to DOCX and not use Acrobat at all.
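One possible workaround on the search side, whatever Acrobat does with the wrapped words, is to compare text with all spaces and punctuation removed, so that "7CFR" and the pair "7", "CFR" both match "7 CFR". The following is only a sketch in plain JavaScript. `normalize` and `findPhrase` are hypothetical helper names, not part of the Acrobat API, and the word list is assumed to have been collected beforehand with getPageNthWord.

```javascript
// Lower-case a token and drop everything except ASCII letters and digits.
function normalize(text) {
  return String(text).toLowerCase().replace(/[^a-z0-9]+/g, "");
}

// Return the index of the word at which `phrase` starts, or -1 if not found.
// Words are glued together before comparing, so a fused word such as "7CFR"
// matches the phrase "7 CFR" just as the separate words "7", "CFR" do.
function findPhrase(words, phrase) {
  var target = normalize(phrase);
  if (target.length === 0) return -1;
  for (var start = 0; start < words.length; start++) {
    if (normalize(words[start]) === "") continue; // skip punctuation-only words
    var joined = "";
    for (var i = start; i < words.length && joined.length < target.length; i++) {
      joined += normalize(words[i]);
    }
    if (joined === target) return start;
  }
  return -1;
}

// Example with the fused word list from the question:
var words = ["6", "CFR", "210.5", "c", "7CFR"];
findPhrase(words, "7 CFR"); // returns 4
```

The match has to begin and end on word boundaries as reported by Acrobat, so a phrase that starts in the middle of a fused word (for example "to ensure" inside "andto ensure") is not found by this sketch.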
I spent some time looking at the problem and discovered that the word wrap problem seems to occur only when the text is bulleted. I do not know how to explain it, but if the PDF form has a bulleted item that wraps between "and" and "to", the wrapped words show up as:
under 7 cfr 210.8 a andto ensure resolution
If I create a new PDF and put the contents of the bulleted items into the new document, the word wrap problem does not occur.
Unfortunately, the bulleted items are basically the bibliography and contain references to documents, and a word wrap inside a reference causes the search for that reference to fail. My application looks for document references in a PDF or DOCX file. The problem occurs in both the 2015 and the 2022 versions of Acrobat Pro.
This is not a bug, but a feature. A word can be split into multiple areas, which is why the quads property is a 2D array. Each top-level item in this array represents one rectangle (i.e., one part of the word).
This becomes useful when a word is split at the end of a line, for example, but you still want it to be treated as a single word when exporting or reading the text. It seems the application that created the file used this feature incorrectly, but it's there for a good reason.
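To see which words on a page are affected, one option is to count the rectangles returned for each word. This is a minimal sketch, assuming `doc` exposes the Acrobat JavaScript Doc methods getPageNumWords, getPageNthWord and getPageNthWordQuads. `findMultiQuadWords` is a hypothetical helper name, and the sketch has not been run against the file in question.

```javascript
// List every word on a page that is drawn in more than one rectangle (quad).
// A word with two or more quads is a candidate for a wrap that Acrobat has
// joined into a single word.
function findMultiQuadWords(doc, pageNum) {
  var hits = [];
  var count = doc.getPageNumWords(pageNum);
  for (var i = 0; i < count; i++) {
    var quads = doc.getPageNthWordQuads(pageNum, i);
    if (quads && quads.length > 1) {
      hits.push({
        index: i,                              // zero-based word index
        word: doc.getPageNthWord(pageNum, i),  // the fused text, e.g. "andto"
        parts: quads.length                    // number of rectangles
      });
    }
  }
  return hits;
}
```

In the Acrobat JavaScript console this could be called as findMultiQuadWords(this, 0) for the first page, where `this` normally refers to the active document. The indexes it returns could then be used to treat those words specially during a search.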
"This is not a bug, but a feature."
The Titanic featured 16 lifeboats that met the requirements of the Merchant Shipping Act at the time it sank. Some would consider that a bug and not a feature.
"This becomes useful when a word is split at the end of a line, for example, but you still want it to be treated as a single word when exporting or reading the text."
If I enable editing on the page containing the bullets, there is clearly a space after the "and " and that space is followed by the line wrap to the next word "the". If the line wrap is dropped, the two words should still be separated by the space and be extracted as two words, not "andthe".
Opening the PDF in Word 365 and saving as DOCX allows the search to complete on the DOCX. There is no need to use Acrobat Pro, and my clients may want to just use Word.
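One way to check whether that space is visible to the API is to ask for the word without stripping. getPageNthWord accepts an optional third argument, bStrip, which defaults to true and removes punctuation and white space. Passing false returns the raw text. Whether the raw text of a fused word in this particular file still contains the space is not known here, so the following is only a sketch under that assumption. `splitRawWord` is a hypothetical helper name.

```javascript
// Split a raw (unstripped) word on any white space it still contains.
// "and the " becomes ["and", "the"]; a word without inner white space
// comes back unchanged as a one-element array.
function splitRawWord(raw) {
  var trimmed = String(raw).replace(/^\s+|\s+$/g, "");
  return trimmed.length === 0 ? [] : trimmed.split(/\s+/);
}
```

The raw string would come from a call such as getPageNthWord(p, i, false). If the raw text turns out to contain no white space either, this check at least confirms that the two words are stored as one in the file.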
I don't see how your anecdote about the Titanic has any relevance to this situation, but fine.
Editing the static content of the PDF in Acrobat teaches you nothing about its internal structure.