Hi,
I'm processing the JSON that I get back from the PDF Extract API. My goal is to translate it into HTML so that the resulting HTML resembles the original PDF as closely as possible with respect to structure. My implementation works pretty well in general: I get the text, images and tables OK - but see below.
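For context, my translation step is essentially a dispatch on each element's Path, something like the minimal sketch below (field names `elements`, `Path` and `Text` follow the structuredData.json schema; the exact tag mapping is my own choice, not anything prescribed by the API):

```python
import html
import re

def element_to_html(el):
    """Map one Extract element to a rough HTML tag based on its Path suffix."""
    path = el.get("Path", "")
    text = html.escape(el.get("Text", ""))
    if not text:
        return ""
    heading = re.search(r"/H(\d)\b", path)
    if heading:
        level = heading.group(1)
        return f"<h{level}>{text}</h{level}>"
    if "/Footnote" in path:
        return f'<p class="footnote">{text}</p>'
    if "/LBody" in path or "/Lbl" in path:
        return f"<li>{text}</li>"
    # Default: treat anything else (e.g. //Document/P) as a paragraph.
    return f"<p>{text}</p>"

def to_html(extract_json):
    """Concatenate per-element HTML in reading order."""
    parts = (element_to_html(e) for e in extract_json.get("elements", []))
    return "\n".join(p for p in parts if p)
```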
The biggest problem is that footnotes and tables often contain mistakes in the JSON. Sometimes ordinary paragraphs come under the /Table path, sometimes under the /Footnote path. Another problem is that the structure of the tables is sometimes incorrect: the text and table structure in the JSON do not really match the structure of the table in the PDF. I see the same problem if I ask the API for the .xlsx renditions of the tables, so I'm confident that this problem is in the PDF Extract service itself.
As a good example, this (public domain) PDF https://unece.org/sites/default/files/2023-04/ECE-TRANS-WP.29-2023-57e%20.pdf has a few problems with the PDF Extract API:
- Table A on page 10 is OK, but paragraphs 5.3 - 5.3.3.1 are assigned the path //Document/Footnote, even though these paragraphs are clearly not footnotes.
- the table on page 20 has problems: in the JSON it starts already at paragraph 3.4.3.1.2., with borderless rows, before the first actual (bordered) row of the table. The first cell of the second row, "Maximum for Continuously variable...", is split into two distinct cells in the JSON, etc.
- the tables after page 24 are randomly missing cells.
I have other similar examples where the tables are more messed up.
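For the mislabeled footnotes I currently work around the problem with a heuristic along these lines (entirely my own post-processing, not part of the API): if a /Footnote element's text starts with a clause number like "5.3.3.1", I treat it as an ordinary paragraph instead.

```python
import re

# A clause number: digits separated by at least one dot, e.g. "5.3" or
# "3.4.3.1.2.", followed by whitespace. Requiring a dot avoids reclassifying
# real footnotes that merely begin with a bare number.
CLAUSE_NUMBER = re.compile(r"^\d+(\.\d+)+\.?\s")

def effective_path(el):
    """Return the element's Path, rewriting /Footnote to /P when the text
    looks like a numbered clause rather than a footnote (heuristic)."""
    path = el.get("Path", "")
    if "/Footnote" in path and CLAUSE_NUMBER.match(el.get("Text", "")):
        return path.replace("/Footnote", "/P")
    return path
```

This obviously cannot fix the table-structure errors, which is why I'm asking whether the extraction itself can be improved.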
I'm aware that PDF is a rendering format, and I understand that extracting structured information from a PDF is a very difficult task. But is there anything that can be done to improve the accuracy of the PDF Extract API?
Thanks,
Sami