Participant
December 2, 2023
Question

Deserializing PDF Extract Json File


Hi,

Is there any built-in tool or documentation that can be used to deserialize the structured JSON and get the proper DOM hierarchy of the document?


1 reply

Joel Geraci
Community Expert
December 4, 2023

There is no concept of a DOM hierarchy in PDF. There can be "Marked Content", generally known as tags, but the output from Extract is often more representative of the document structure than the tags, especially if the PDF was created by a low-quality tool.

 

If you are asking whether the flat JSON can be transformed into something hierarchical like XML/HTML, then yes. The "Path" property of each element can be used to construct such a hierarchy. I'm actually working on a sample of this that should be published soon.
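As a rough illustration of the idea (not the sample mentioned above), here is a minimal Python sketch. It assumes the Extract output has a top-level "elements" list, that each element carries a "Path" string shaped like "//Document/Sect/P[2]", and that text-bearing elements have a "Text" key. Node, build_tree and outline are made-up names for this sketch, not part of any Adobe SDK.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str                      # one path segment, e.g. "P[2]"
    text: Optional[str] = None     # text of the element, if it has any
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_tree(elements: List[dict]) -> Node:
    """Nest flat elements under a synthetic root using their "Path" value."""
    root = Node("root")
    for el in elements:
        path = el.get("Path")      # assumed key name
        if not path:
            continue
        node = root
        # "//Document/Sect/P[2]" -> ["Document", "Sect", "P[2]"]
        for segment in path.strip("/").split("/"):
            node = node.children.setdefault(segment, Node(segment))
        if "Text" in el:           # assumed key name
            node.text = el["Text"]
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, one per node."""
    lines = []
    for child in node.children.values():
        label = child.name if child.text is None else f"{child.name}: {child.text}"
        lines.append("  " * depth + label)
        lines.extend(outline(child, depth + 1))
    return lines
```

With a parsed file in hand, something like `print("\n".join(outline(build_tree(data["elements"]))))` would show the nesting, where `data` is the result of `json.load` on the structured JSON. Children are kept in first-seen order, which relies on the elements being listed in reading order.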

test5C0EAuthor
Participant
December 5, 2023

Hi @Joel Geraci 
Thank you for your reply,

Yes, I was talking about the "Path" property. Is there any documentation on what values can appear under the Path property? I've seen a lot of different values there, and in order to construct a proper hierarchy all of those values would need to be mapped properly.

Please keep the community posted on your solution, as it would help a lot of us as well.

Thanks
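
On the question of which values show up under "Path": absent a reference list, one option is to tally them from your own output files. The sketch below is only an illustration and makes the same assumptions as before about the JSON shape (an "elements" list whose items have a "Path" string such as "//Document/L/LI[3]/LBody"). tag_counts is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import List

# Matches a trailing sibling index such as "[3]" in "LI[3]".
INDEX_SUFFIX = re.compile(r"\[\d+\]$")


def tag_counts(elements: List[dict]) -> Counter:
    """Count how often each tag name occurs across all Path segments."""
    counts: Counter = Counter()
    for el in elements:
        for segment in el.get("Path", "").strip("/").split("/"):
            if segment:  # skip elements that have no Path at all
                counts[INDEX_SUFFIX.sub("", segment)] += 1
    return counts
```

Running this over a handful of representative documents gives the set of tag names that a mapping would need to cover for those files. It cannot show names that never occur in the sample.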