PDF Extract - Incorrect order of paragraphs after a paragraph spans across to the next page

New Here, Jul 20, 2021

I'm not sure whether this is the right forum to report an issue with PDF Extract.

When a paragraph spans across to the next page, Extract correctly captures the remainder of the paragraph from the next page. However, immediately after it, Extract places a paragraph that belongs to a different section further down (i.e. under a different heading in the JSON output). The section where that paragraph should appear is left empty. The result is an inaccurate structure in the output for the PDF.
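
One rough way to spot the "empty section" symptom described above is to scan the Extract JSON for headings that are immediately followed by another heading. This is only a minimal sketch: it assumes the output is a list of elements, each with a `Path` and `Text` field, and that heading paths end in a segment such as `H1` or `H2[3]`. The helper names are hypothetical, and a heading followed directly by a sub-heading is reported too, so the hits still need a manual check.

```python
import re

# Assumption: a heading element has a final path segment like "H1" or "H2[3]".
HEADING_RE = re.compile(r"^H\d*(\[\d+\])?$")


def last_segment(path):
    """Return the final segment of a path, e.g. 'P[4]' from '//Document/Sect/P[4]'."""
    return path.rstrip("/").rsplit("/", 1)[-1]


def is_heading(element):
    """True if the element's path ends in a heading-like segment."""
    return bool(HEADING_RE.match(last_segment(element.get("Path", ""))))


def empty_sections(elements):
    """Return (index, text) for each heading directly followed by another heading or by nothing."""
    result = []
    for i, el in enumerate(elements):
        if not is_heading(el):
            continue
        nxt = elements[i + 1] if i + 1 < len(elements) else None
        if nxt is None or is_heading(nxt):
            result.append((i, el.get("Text", "")))
    return result
```

Each index returned points at a heading whose body may have been moved elsewhere in the element order.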

I can provide a sample PDF if your developers need it to troubleshoot.

TOPICS
PDF Extract API

Views

747

Community Expert, Jul 21, 2021

Can you share the PDF in question?

New Here, Jul 21, 2021

Thanks Joel for your response. The PDF file is attached below, along with the JSON output from PDF Extract.
Note in the PDF: the last paragraph on page 1 spans to page 2. Adobe Sensei AI figures it out correctly in elements 26 and 27 (ParagraphSpan and ParagraphSpan[2]). However, what comes next, in element 28, is a paragraph that belongs to the section "Endoform RWD ” DFU Study" (the heading in element 33). In summary, the paragraph in element 28 should sit between elements 34 and 35.
There is a similar problem between pages 2 and 3, where another paragraph spans across the page break.

Also note: characters in the heading text are not clustered properly, so our NLP engine can't tokenise the heading.
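
For anyone who wants to inspect the same element range in their own output, here is a minimal sketch. It assumes the Extract result is a JSON file with a top-level `elements` list whose entries carry `Path`, `Page` and `Text` fields; the file name `structuredData.json` and all function names are assumptions for illustration, not part of any official tooling.

```python
import json


def load_elements(json_path):
    """Load the element list from an Extract JSON file (assumed top-level 'elements' key)."""
    with open(json_path, encoding="utf-8") as fh:
        return json.load(fh).get("elements", [])


def summarize(elements, start, stop):
    """Return one line per element in [start, stop): index, page, path, first 60 chars of text."""
    lines = []
    for i in range(start, min(stop, len(elements))):
        el = elements[i]
        text = (el.get("Text") or "").strip()[:60]
        lines.append(f"{i:>3}  page={el.get('Page')}  {el.get('Path')}  {text!r}")
    return lines


def page_spanning_paragraphs(elements):
    """Return each index i where elements i and i+1 are ParagraphSpan pieces on different pages."""
    hits = []
    for i in range(len(elements) - 1):
        a, b = elements[i], elements[i + 1]
        both_spans = ("ParagraphSpan" in a.get("Path", "")
                      and "ParagraphSpan" in b.get("Path", ""))
        if both_spans and a.get("Page") != b.get("Page"):
            hits.append(i)
    return hits
```

Calling `summarize(load_elements("structuredData.json"), 24, 36)` and printing the lines would show the neighbourhood of elements 26 to 35, which makes it easier to see whether the element right after a page-spanning paragraph sits under the wrong heading.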

New Here, Jul 26, 2021

Hi @Joel_Geraci ,

Any update on this? Thanks.

New Here, Jul 28, 2021

Just wondering if there is anything I need to do, e.g. file a bug with the product development team?

Having the correct order of paragraphs in a document is critical in our project.  Thanks.

Community Expert, Jul 29, 2021

The best I can do is share your files with the engineers to help train the AI. Can I have your permission to do that?

New Here, Jul 30, 2021

Thanks Joel. Please do. My file is a public document, and I don't think my PDF is unique. I hope Adobe engineers can easily reproduce this problem and retrain the model with other PDFs that have paragraphs spanning pages. Your engineers can contact me if needed.

Community Expert, Jul 30, 2021

We are actually quite good at reading paragraphs across pages. Just not with this particular file.
