Acrobat Javascript extracting footnotes technique

New Here ,
Nov 21, 2019

Copy link to clipboard

Copied

Hi,

For PDFs that were converted from Word files, I'm investigating how to extract footnotes. One approach I'd like to validate is writing a search that looks for footnotes in the text (understanding that "text" is not a straightforward concept in a PDF). I've been looking at:

  • Adobe PDF Library SDK
  • Acrobat DC SDK

for scripting options.

I'm wondering whether I could first search for a number (e.g. 1) and then decide whether it's a footnote reference, either from the shape and relative offset of its rectangle or, if text properties are available, from a superscript property (if there is one). If it finds a footnote reference, the script would then find the actual footnote content at the bottom of the page and extract that.

Thanks

TOPICS
Acrobat SDK and JavaScript

Most Valuable Participant ,
Nov 21, 2019

You can find words and their locations.

To do that, you would need to use the getPageNthWord and getPageNthWordQuads methods.

However, the JavaScript API doesn't give you any additional information about them, like the font used, color, size, or whether they are superscript, underlined, italic, bold, etc.
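
As a rough sketch of how those two methods fit together (not tested against any particular PDF), the snippet below lists each word on a page with a bounding box derived from its quads. The helper names quadsToBox and listWordsOnPage are made up for illustration, and it assumes the quads come back as a list of 8-number arrays (four x,y corner pairs). It deliberately takes min/max over all corners instead of relying on the corner order.

```javascript
// Sketch only. In the Acrobat console, pass `this` as `doc`.
// quadsToBox and listWordsOnPage are hypothetical helper names.

// Collapse a quads list into one bounding box.
// Assumption: each quad is an array of 8 numbers (four x,y corner pairs).
function quadsToBox(quads) {
    var box = { left: Infinity, bottom: Infinity, right: -Infinity, top: -Infinity };
    for (var q = 0; q < quads.length; q++) {
        for (var i = 0; i < quads[q].length; i += 2) {
            var x = quads[q][i];
            var y = quads[q][i + 1];
            box.left = Math.min(box.left, x);
            box.right = Math.max(box.right, x);
            box.bottom = Math.min(box.bottom, y);
            box.top = Math.max(box.top, y);
        }
    }
    return box;
}

// Return every word on a zero-based page with its index, text and box.
function listWordsOnPage(doc, page) {
    var words = [];
    var count = doc.getPageNumWords(page);
    for (var i = 0; i < count; i++) {
        words.push({
            index: i,
            // third argument true: strip punctuation and whitespace
            text: doc.getPageNthWord(page, i, true),
            box: quadsToBox(doc.getPageNthWordQuads(page, i))
        });
    }
    return words;
}
```

In the console you would call something like listWordsOnPage(this, 0) for the first page.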

New Here ,
Nov 21, 2019

Thanks for your reply.

Would I run a search and then call getPageNthWord to get more information? Without the search I won't know the word's index (I'm using this reference, which may be out of date: https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/AcrobatDC_js_api_reference.pdf).

Or would I iterate through every word?

Most Valuable Participant ,
Nov 21, 2019

No, this has nothing to do with the search command. You call getPageNthWord in a loop to iterate over all the words in the file, looking for a match.
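
Roughly, such a loop could look like the sketch below. The function name findWord is just an illustrative choice; numPages, getPageNumWords and getPageNthWord are the Doc members from the API reference. It returns the page and word index of every exact match, which is what you would then feed into getPageNthWordQuads.

```javascript
// Sketch only. In the Acrobat console, pass `this` as `doc`.
// findWord is a hypothetical helper name.
function findWord(doc, target) {
    var hits = [];
    // Pages and word indexes are both zero-based.
    for (var p = 0; p < doc.numPages; p++) {
        var count = doc.getPageNumWords(p);
        for (var i = 0; i < count; i++) {
            // third argument true: strip punctuation and whitespace
            if (doc.getPageNthWord(p, i, true) === target) {
                hits.push({ page: p, index: i });
            }
        }
    }
    return hits;
}
```

For example, findWord(this, "1") would list every place where the word "1" occurs. Keep in mind that on a large file this visits every word, so it can be slow.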

New Here ,
Nov 21, 2019

OK, and you're saying that even if there is a text match, the Acrobat JavaScript API doesn't provide any information that would indicate that it is a footnote reference (size, position, etc.). Is there a way I could look at the geometry of a word (or character)?

Most Valuable Participant ,
Nov 21, 2019

Yes, you can look at the position of the word, using the getPageNthWordQuads method.

It works at the word-level only, though.

Most Valuable Participant ,
Nov 21, 2019

And of course it uses the same indexes as the getPageNthWord method...
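
Since the index is shared, a geometry-only heuristic for footnote references is possible in principle. The sketch below is one guess at it, not a proven method: it flags an all-digit word whose box is noticeably shorter than the previous word's box and sits raised within that word's line. The names boxOf and findRaisedNumbers and the 0.8 height ratio are arbitrary assumptions, and it only helps if Acrobat reports the superscript as its own word in your files, which I can't promise.

```javascript
// Sketch only. In the Acrobat console, pass `this` as `doc`.
// boxOf and findRaisedNumbers are hypothetical helper names.

// Vertical extent of a word, taken over all corners of all its quads.
// Assumption: each quad is an array of 8 numbers (four x,y corner pairs).
function boxOf(quads) {
    var box = { bottom: Infinity, top: -Infinity };
    for (var q = 0; q < quads.length; q++) {
        for (var i = 1; i < quads[q].length; i += 2) {
            box.bottom = Math.min(box.bottom, quads[q][i]);
            box.top = Math.max(box.top, quads[q][i]);
        }
    }
    return box;
}

// Flag all-digit words that look smaller and raised compared with the
// word just before them on the same page.
function findRaisedNumbers(doc, page) {
    var candidates = [];
    var prev = null;
    var count = doc.getPageNumWords(page);
    for (var i = 0; i < count; i++) {
        var text = doc.getPageNthWord(page, i, true);
        var box = boxOf(doc.getPageNthWordQuads(page, i));
        if (prev && /^\d+$/.test(text)) {
            var height = box.top - box.bottom;
            var prevHeight = prev.top - prev.bottom;
            var smaller = height < 0.8 * prevHeight; // arbitrary ratio
            // Raised, but still overlapping the previous word's line.
            var raised = box.bottom > prev.bottom && box.bottom < prev.top;
            if (smaller && raised) {
                candidates.push({ page: page, index: i, text: text });
            }
        }
        prev = box;
    }
    return candidates;
}
```

You would need to tune the ratio against real documents, and the footnote text at the bottom of the page would still have to be located separately by position.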

New Here ,
Nov 21, 2019

That's great - thank you - I'll give that a go. Do you have any suggestions for other approaches to the extraction?

try67
Most Valuable Participant ,
Nov 21, 2019

If you need to know the location of the word, that's the only way (using a script).
