Help Needed: Adobe PDF Extract API Not Recognizing Some Headings
- June 20, 2024
Hello everyone,
I am currently working on a project where I need to extract structured data from PDF documents using the Adobe PDF Extract API. While most of the data is extracted correctly, I am facing an issue with some headings not being recognized as headings by the API.
Example of the Issue:
In the JSON output, certain headings are not recognized correctly. For instance:
{
  "Path": "//Document/P[113]",
  "Text": "1.4 Context of the Plan",
  "Font": {
    "weight": 700
  },
  "TextSize": 13.5
},
{
  "Path": "//Document/H2",
  "Text": "1.2 Transmission-Development-Plan 2023",
  "Font": {
    "weight": 700
  },
  "TextSize": 13.5
}
In the example above, the first entry ("1.4 Context of the Plan") is a heading in the document, but the API tags it as a paragraph with Path "//Document/P[113]" (please see the attached image as well). The second entry ("1.2 Transmission-Development-Plan 2023") has the same font weight (700) and text size (13.5) and is correctly tagged as a heading with Path "//Document/H2".
I have a couple of questions regarding this issue:
- Has anyone else encountered similar issues with the Adobe PDF Extract API?
- Are there any best practices or alternative approaches for reliably detecting headings in PDF documents?
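One fallback I am considering is a post-processing pass over the JSON output, sketched below in Python. It is only a heuristic, not anything provided by the API: it collects the (font weight, text size) pairs of elements that were tagged as headings and then flags paragraph elements that share one of those styles and start with a section number. The function names and the numbering pattern are my own assumptions, and I am assuming the extracted elements are available as a list of dicts shaped like the entries above.

```python
import re

# Path ends in a heading tag, e.g. //Document/H2 or //Document/H1[3]
HEADING_PATH = re.compile(r"/H[1-6](\[\d+\])?$")
# Path ends in a paragraph tag, e.g. //Document/P[113]
PARAGRAPH_PATH = re.compile(r"/P(\[\d+\])?$")
# Assumed heading convention: text starts with a section number such as "1.4 "
NUMBERED = re.compile(r"^\d+(\.\d+)*\s+\S")


def style_of(element):
    """Return (font weight, text size rounded to 1 decimal) for an element."""
    weight = element.get("Font", {}).get("weight")
    size = element.get("TextSize")
    return (weight, round(size, 1) if size is not None else None)


def heading_styles(elements):
    """Collect the styles seen on elements the API did tag as headings."""
    return {
        style_of(el)
        for el in elements
        if HEADING_PATH.search(el.get("Path", ""))
    }


def likely_missed_headings(elements):
    """Flag paragraphs that look like headings (heuristic, may misfire)."""
    styles = heading_styles(elements)
    return [
        el
        for el in elements
        if PARAGRAPH_PATH.search(el.get("Path", ""))
        and style_of(el) in styles
        and NUMBERED.match(el.get("Text", ""))
    ]


# The two entries from the example above
elements = [
    {"Path": "//Document/P[113]", "Text": "1.4 Context of the Plan",
     "Font": {"weight": 700}, "TextSize": 13.5},
    {"Path": "//Document/H2", "Text": "1.2 Transmission-Development-Plan 2023",
     "Font": {"weight": 700}, "TextSize": 13.5},
]

for el in likely_missed_headings(elements):
    print(el["Path"], "->", el["Text"])
```

This would obviously miss unnumbered headings and could flag bold numbered list items that happen to share a heading's style, so I would treat the result as candidates for review rather than a definitive fix.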
