Participant
June 7, 2023
Question

Is extraction of table structure restricted by Adobe?

We are trying to extract a tagged table from a PDF, keeping its TR/TD structure along with the TD properties (col/row span, background color, border, etc.). I have tried many PDF extraction tools and libraries, and all of them extract only the positions of text objects, not the structure of tagged tables. The only tool that extracts the table structure is Adobe's proprietary HTML converter, but its conversion is not 100% accurate (sometimes the table is rendered as plain text). Is Adobe restricting the extraction of TR/TD tags and their properties? Clarification would be really helpful.

1 reply

Abambo
Community Expert
June 8, 2023

No, Adobe is not restricting the extraction of any data structure from a PDF document. Indeed, PDF documents follow a standard that is no longer in Adobe's hands (it is now an ISO standard).

 

If a structure does not extract correctly, that may be because it was never created as such a structure. PDF documents can be quite complex, but at the end of the day, they were never intended to be converted back: a PDF was meant to be an electronic copy of your print. That means you may have data in your PDF file that looks like a table but is not one. And still, it's a valid PDF file.
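
One way to tell a real tagged table from content that only looks like one is to check whether the document has a structure tree and whether that tree contains any Table elements. The following is a minimal sketch, not a definitive implementation. It assumes the document catalog has already been loaded into plain Python dicts and lists by some PDF library; that dict layout and the function name `find_tables` are hypothetical. The keys `/StructTreeRoot`, `/S` (structure type) and `/K` (kids) come from the PDF specification. Custom tag names mapped through `/RoleMap` are not handled here.

```python
def find_tables(catalog):
    """Return every structure element whose type is /Table.

    `catalog` is a hypothetical plain-dict view of the PDF document
    catalog; a real library will expose its own object types.
    """
    root = catalog.get("/StructTreeRoot")
    if root is None:
        return []  # untagged PDF: there is no logical structure at all

    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, list):
            stack.extend(node)          # /K may be an array of kids
        elif isinstance(node, dict):
            if node.get("/S") == "/Table":
                found.append(node)
            kids = node.get("/K")
            if kids is not None:
                stack.append(kids)      # /K may also be a single kid
        # Integers (marked-content IDs) carry no structure and are skipped.
    return found
```

An empty result means either that the file is untagged or that the visual table was never tagged as a table, which matches the case described above.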

ABAMBO | Hard- and Software Engineer | Photographer
Participant
June 12, 2023

Hi Abambo, thanks for your response.

The attached PDF document is tagged and has a proper table structure. Yet all we get is the x and y positions of text objects, without any TR/TD table structure.
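
For reference, the structure being asked for can in principle be read from the structure tree rather than from the page content. The sketch below is only an illustration under stated assumptions: it takes a Table structure element that some PDF library has already exposed as plain Python dicts and lists (that layout and the name `table_to_rows` are hypothetical). The keys `/S`, `/K`, `/A`, `/RowSpan`, `/ColSpan` and `/BackgroundColor` are defined in the PDF specification, but whether a given file actually carries them depends on the tool that produced it.

```python
def table_to_rows(table):
    """Flatten a /Table structure element into rows of cell records.

    `table` is a hypothetical plain-dict view of a structure element.
    """
    rows = []

    def kids(node):
        k = node.get("/K", [])
        return k if isinstance(k, list) else [k]

    def attrs(node):
        # /A is a single attribute dict or an array of dicts; arrays may
        # also contain revision numbers, which are skipped.
        a = node.get("/A", [])
        merged = {}
        for item in (a if isinstance(a, list) else [a]):
            if isinstance(item, dict):
                merged.update(item)
        return merged

    def visit(node):
        if not isinstance(node, dict):
            return  # marked-content IDs carry no structure
        if node.get("/S") == "/TR":
            cells = []
            for cell in kids(node):
                if isinstance(cell, dict) and cell.get("/S") in ("/TH", "/TD"):
                    a = attrs(cell)
                    cells.append({
                        "type": cell["/S"],
                        "rowspan": a.get("/RowSpan", 1),   # spec default is 1
                        "colspan": a.get("/ColSpan", 1),   # spec default is 1
                        "background": a.get("/BackgroundColor"),
                    })
            rows.append(cells)
        else:
            for child in kids(node):  # descend through THead/TBody/TFoot
                visit(child)

    visit(table)
    return rows
```

If a tool reports only x and y positions, it is most likely reading the page content stream and ignoring the structure tree altogether.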