
Inaccuracies in JSON produced by PDF Extract API

Jul 05, 2023

Hi,

I'm processing the JSON that I get back from the PDF Extract API. My goal is to translate it into HTML so that the resulting HTML resembles the original PDF as closely as possible with respect to structure. My implementation works pretty well in general: I get the text, images and tables OK, but see below.

 

The biggest problem is that footnotes and tables often contain mistakes in the JSON. Sometimes ordinary paragraphs end up under a /Table path, sometimes under a /Footnote path. Another problem is that the structure of the tables is sometimes incorrect: the text and the table structure in the JSON do not match the structure of the table in the PDF. I see the same problem if I ask the API for the .xlsx export of the tables, so I'm confident that the problem is in the PDF Extract service itself and not in my code.
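
To make this kind of misclassification easier to spot, here is a minimal Python sketch that counts elements per page by their top-level structure type. It assumes the extracted JSON has a top-level "elements" list whose items carry "Path" and "Page" keys, and that the file is named structuredData.json; the helper names (load_elements, structure_type, count_by_page) and the sample data are made up for illustration.

```python
import json
import re
from collections import Counter


def load_elements(json_path):
    """Load the element list (assumes a top-level "elements" key)."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f).get("elements", [])


def structure_type(path):
    """Return the first tag below the root, without an index suffix.

    "//Document/Table[2]/TR/TD/P" -> "Table"
    """
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        return parts[0] if parts else ""
    return re.sub(r"\[\d+\]$", "", parts[1])


def count_by_page(elements):
    """Count elements per (page, top-level structure type)."""
    counts = Counter()
    for el in elements:
        counts[(el.get("Page"), structure_type(el.get("Path", "")))] += 1
    return counts


# Made-up sample data in the assumed shape, not real API output.
sample = [
    {"Path": "//Document/H1", "Page": 0, "Text": "placeholder"},
    {"Path": "//Document/Table[2]/TR/TD/P", "Page": 9, "Text": "placeholder"},
    {"Path": "//Document/Footnote", "Page": 9, "Text": "placeholder"},
    {"Path": "//Document/Footnote[2]", "Page": 9, "Text": "placeholder"},
]

counts = count_by_page(sample)
for (page, kind), n in sorted(counts.items()):
    print(page, kind, n)
```

A page with an unusually high number of Footnote or Table elements is a candidate for manual comparison against the PDF. With a real file, the same function could be called as count_by_page(load_elements("structuredData.json")).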

 

As a good example, this (public domain) PDF https://unece.org/sites/default/files/2023-04/ECE-TRANS-WP.29-2023-57e%20.pdf has a few problems with the PDF Extract API:

- Table A on page 10 is OK, but the paragraphs 5.3 - 5.3.3.1 that follow it have the path //Document/Footnote, although they are clearly not footnotes.

- The table on page 20 is extracted incorrectly: in the JSON it starts at paragraph 3.4.3.1.2., which shows up as borderless rows ahead of the first actual, bordered row of the table. In addition, the first cell of the second row, "Maximum for Continuously variable...", is split into two distinct cells in the JSON, etc.

- The tables after page 24 are randomly missing cells.
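
As a possible workaround for the first problem, numbered paragraphs that were tagged as footnotes can be flagged heuristically on the client side. The following is only a sketch: it assumes each element has "Path" and "Text" keys, that real footnotes do not begin with a multi-level number such as "5.3.3.1.", and the sample texts are invented rather than taken from the PDF.

```python
import re

# Matches text starting with a multi-level number such as "5.3." or "5.3.3.1 ".
NUMBERED_PARA = re.compile(r"^\s*\d+(?:\.\d+)+\.?\s")


def suspicious_footnotes(elements):
    """Return elements tagged as Footnote whose text looks like a numbered paragraph."""
    return [
        el
        for el in elements
        if "/Footnote" in el.get("Path", "")
        and NUMBERED_PARA.match(el.get("Text", ""))
    ]


# Invented sample elements in the assumed shape.
sample_elements = [
    {"Path": "//Document/Footnote", "Text": "5.3. Sample numbered paragraph"},
    {"Path": "//Document/Footnote[2]", "Text": "1 Sample genuine footnote"},
    {"Path": "//Document/P", "Text": "5.3.1. Sample ordinary paragraph"},
]

flagged = suspicious_footnotes(sample_elements)
for el in flagged:
    print(el["Path"], "->", el["Text"])
```

Flagged elements could then be rendered as ordinary paragraphs in the HTML. This does nothing for the split or missing table cells, which cannot be reconstructed reliably from the JSON alone.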

 

I have other, similar examples where the tables are even more garbled.

 

I'm aware that PDF is a rendering format, and I understand that extracting structured information from a PDF is very difficult. But is there anything that can be done to improve the accuracy of the PDF Extract API?

 

Thanks,

Sami
