Help Needed: Adobe PDF Extract API Not Recognizing Some Headings

Jun 20, 2024

Hello everyone,

I am currently working on a project where I need to extract structured data from PDF documents using the Adobe PDF Extract API. While most of the data is extracted correctly, I am facing an issue with some headings not being recognized as headings by the API.

Example of the Issue:

In the JSON output, certain headings are not recognized correctly. For instance:

{
    "Path": "//Document/P[113]",
    "Text": "1.4 Context of the Plan",
    "Font": {
        "weight": 700
    },
    "TextSize": 13.5
},
{
    "Path": "//Document/H2",
    "Text": "1.2 Transmission-Development-Plan 2023",
    "Font": {
        "weight": 700
    },
    "TextSize": 13.5
}

In the above example, the first entry with Path: "//Document/P[113]" is a heading ("1.4 Context of the Plan") but is not recognized as such (please see attached image as well). However, the second entry with Path: "//Document/H2" is recognized correctly as a heading.

I have a couple of questions regarding this issue:

Has anyone else encountered similar issues with the Adobe PDF Extract API?
Are there any best practices or alternative approaches for reliably detecting headings in PDF documents?

Jun 20, 2024

This is a fairly common occurrence. The AI does an excellent job on most layouts but it does miss some things. The good news is that it's easy to compensate for.

I typically "normalize" the output from Extract before using it in another application. One of the steps is to build a map of all of the recognized headings, look for paragraphs that have exactly the same properties as any of those headings, and then reassign them by updating the Path property.
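For readers who want to try this, here is a minimal Python sketch of that normalization step. It is not the code used by the poster above, only one way the description could be implemented. It assumes the Extract output has already been loaded into a list of dicts shaped like the entries in the question, and that "same exact properties" means the same Font weight and TextSize; both the function names and that choice of properties are assumptions, and a real pipeline might also compare font name or other attributes.

```python
import re

# Trailing heading segment, e.g. "/H2" or "/H2[3]".
HEADING_RE = re.compile(r"/(H[1-6])(?:\[\d+\])?$")
# Trailing paragraph segment, e.g. "/P" or "/P[113]".
PARAGRAPH_RE = re.compile(r"/P(?:\[\d+\])?$")


def signature(element):
    """Style properties used for comparison (assumed: weight and size)."""
    return (element.get("Font", {}).get("weight"), element.get("TextSize"))


def normalize_headings(elements):
    """Return copies of the elements with heading-styled paragraphs retagged."""
    # Pass 1: map each style signature to the heading tag that uses it.
    heading_styles = {}
    for el in elements:
        match = HEADING_RE.search(el.get("Path", ""))
        sig = signature(el)
        if match and None not in sig:
            # Keep the first heading tag seen for a given signature.
            heading_styles.setdefault(sig, match.group(1))

    # Pass 2: rewrite the Path of paragraphs that share a heading signature.
    normalized = []
    for el in elements:
        el = dict(el)  # shallow copy so the input list is left untouched
        path = el.get("Path", "")
        tag = heading_styles.get(signature(el))
        if tag and PARAGRAPH_RE.search(path):
            el["Path"] = PARAGRAPH_RE.sub("/" + tag, path)
        normalized.append(el)
    return normalized


# The two entries from the question.
sample = [
    {"Path": "//Document/P[113]", "Text": "1.4 Context of the Plan",
     "Font": {"weight": 700}, "TextSize": 13.5},
    {"Path": "//Document/H2", "Text": "1.2 Transmission-Development-Plan 2023",
     "Font": {"weight": 700}, "TextSize": 13.5},
]
result = normalize_headings(sample)
print(result[0]["Path"])  # //Document/H2
```

Two caveats with this sketch: a bold body paragraph that happens to share a heading's weight and size would also be retagged, so adding more properties to the signature may be worthwhile, and the rewritten Path drops the original index, so several elements can end up with the same Path.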

Jul 01, 2024

Thanks a lot, Joel, for the reply! Just to understand it a bit better: you collect all the headings, compare their properties with each paragraph's properties, and change the path of the paragraphs that match a heading's properties? Do you only look at paragraphs, or do you also look at other elements such as lists?

Jul 10, 2024

Right now, I'm only normalizing the headings, but lists would be just as easy.
