Copy link to clipboard
Copied
Hello everyone,
I am currently working on a project where I need to extract structured data from PDF documents using the Adobe PDF Extract API. While most of the data is extracted correctly, I am facing an issue with some headings not being recognized as headings by the API.
In the JSON output, certain headings are not recognized correctly. For instance:
{
"Path": "//Document/P[113]",
"Text": "1.4 Context of the Plan",
"Font": {
"weight": 700
},
"TextSize": 13.5
},
{
"Path": "//Document/H2",
"Text": "1.2 Transmission-Development-Plan 2023",
"Font": {
"weight": 700
},
"TextSize": 13.5
}
In the above example, the first entry with Path: "//Document/P[113]" is a heading ("1.4 Context of the Plan") but is not recognized as such (please see attached image as well). However, the second entry with Path: "//Document/H2" is recognized correctly as a heading.
I have a couple of questions regarding this issue:
This is a fairly common occurrence. The AI does an excellent job on most layouts but it does miss some things. The good news is that it's easy to compensate for.
I typically "normalize" the output from Extract prior to trying to use it in another application. One of the steps is to build a map of all of the recognized headings then look for paragraphs that have the same exact properties as any of the headings and then reassign them by updating the Path property.
Copy link to clipboard
Copied
This is a fairly common occurrence. The AI does an excellent job on most layouts but it does miss some things. The good news is that it's easy to compensate for.
I typically "normalize" the output from Extract prior to trying to use it in another application. One of the steps is to build a map of all of the recognized headings then look for paragraphs that have the same exact properties as any of the headings and then reassign them by updating the Path property.
Copy link to clipboard
Copied
Thanks a lot Joel for the reply! Just to understand it a bit more, so you look for all the headings and compare their properties with paragraph's properties and change the path of this pargraphs if they match the properties of the heading? Do you only look for parapgraphs or do you also look for other elements such as lists?
Copy link to clipboard
Copied
Right now, I'm only normalizing the headings but lists would be just as easy.