PDF Extract - Incorrect paragraph order after a paragraph spans to the next page

July 21, 2021

I'm not sure if this is the right forum to report an issue with PDF Extract.

When a paragraph spans to the next page, Extract correctly captures the rest of the paragraph from the next page. However, immediately after it, Extract emits a paragraph that belongs to a different section further down (that is, under a different heading in the JSON output) and places it right after the spanned paragraph. The section where that paragraph actually belongs is left empty. As a result, the structured output does not match the structure of the PDF.

I can provide a sample PDF if your developers need it to troubleshoot.

PDF Extract API


Joel Geraci

Community Expert

Can you share the PDF in question?


BudSV (Author)

Participant

Thanks for your response, Joel. The PDF file is attached below, along with the JSON output from PDF Extract.
Note that in the PDF, the last paragraph on page 1 spans to page 2. Adobe Sensei AI handles this correctly in elements 26 and 27 (ParagraphSpan and ParagraphSpan[2]). However, element 28, which comes next, is a paragraph that belongs to the section "Endoform RWD ” DFU Study" (the heading in element 33). In summary, the paragraph in element 28 belongs between elements 34 and 35.
A similar problem occurs between pages 2 and 3, where another paragraph spans across the page break.
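
For anyone who wants to locate such misplaced elements in the JSON output, here is a rough sketch of one possible check. It flags any element that sits higher on the page than the element emitted just before it on the same page. It assumes each element carries "Page" and "Bounds" keys and that "Bounds" is [left, bottom, right, top] with the origin at the bottom-left of the page; please verify both against your own output. The function name and the sample data are made up for illustration and are not taken from the attached file. The heuristic only suits single-column layouts.

```python
import json


def find_backward_jumps(elements, tolerance=1.0):
    """Return indices of elements that sit higher on the page than the
    element emitted immediately before them on the same page.

    ASSUMPTION: "Bounds" is [left, bottom, right, top] in PDF coordinates
    (origin at bottom-left), so a larger "top" means higher on the page.
    """
    jumps = []
    prev = None  # (page, top) of the last element that had a position
    for i, el in enumerate(elements):
        bounds = el.get("Bounds")
        page = el.get("Page")
        if bounds is None or page is None:
            continue  # skip elements without position information
        top = bounds[3]
        if prev is not None and prev[0] == page and top > prev[1] + tolerance:
            jumps.append(i)
        prev = (page, top)
    return jumps


# Hypothetical elements mimicking the situation described above:
# a paragraph span continues onto the next page, then a paragraph from
# lower on that page is emitted too early.
sample_elements = [
    {"Path": "//Document/P[5]/ParagraphSpan", "Page": 0, "Bounds": [72, 80, 540, 140]},
    {"Path": "//Document/P[5]/ParagraphSpan[2]", "Page": 1, "Bounds": [72, 700, 540, 740]},
    {"Path": "//Document/P[6]", "Page": 1, "Bounds": [72, 300, 540, 360]},  # emitted too early
    {"Path": "//Document/P[7]", "Page": 1, "Bounds": [72, 600, 540, 680]},
    {"Path": "//Document/H1", "Page": 1, "Bounds": [72, 380, 540, 400]},
]

if __name__ == "__main__":
    # To run against real output (file name is an assumption):
    # with open("structuredData.json", encoding="utf-8") as f:
    #     sample_elements = json.load(f)["elements"]
    for i in find_backward_jumps(sample_elements):
        print(i, sample_elements[i]["Path"])
```

Note that the flagged index is the element that follows the misplaced one, since that is where the reading order jumps back up the page. The element just before each flagged index is the one to inspect.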

Also note that the characters in the heading text are not clustered properly, so our NLP engine can't tokenise the heading.

extractPdfInput.pdf

extractTextInfoFromPDF.zip