Help Needed: Adobe PDF Extract API Not Recognizing Some Headings

Question

Hello everyone,

I am currently working on a project where I need to extract structured data from PDF documents using the Adobe PDF Extract API. While most of the data is extracted correctly, I am facing an issue with some headings not being recognized as headings by the API.

Example of the issue:

In the JSON output, certain headings are not recognized correctly. For instance:

{"Path": "//Document/P[113]", "Text": "1.4 Context of the Plan", "Font": {"weight": 700}, "TextSize": 13.5},
{"Path": "//Document/H2", "Text": "1.2 Transmission-Development-Plan 2023", "Font": {"weight": 700}, "TextSize": 13.5}

In the above example, the first entry with Path: "//Document/P[113]" is a heading ("1.4 Context of the Plan") but is not recognized as such (please see attached image as well). However, the second entry with Path: "//Document/H2" is recognized correctly as a heading.

I have a couple of questions regarding this issue:

Has anyone else encountered similar issues with the Adobe PDF Extract API?
Are there any best practices or alternative approaches for reliably detecting headings in PDF documents?

Joel Geraci · Accepted Answer

This is a fairly common occurrence. The AI does an excellent job on most layouts, but it does miss some things. The good news is that it's easy to compensate for.

I typically "normalize" the output from Extract prior to trying to use it in another application. One of the steps is to build a map of all of the recognized headings, then look for paragraphs that have exactly the same properties (for example, font weight and text size) as any of those headings, and reassign them by updating their Path property.
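As a rough illustration of that normalization step, here is a minimal Python sketch. It is not code from the answer itself. It assumes the Extract JSON has already been loaded and that you have a list of element dictionaries carrying the Path, Font.weight, and TextSize fields shown in the question. The function names and the choice of (weight, size) as the "same properties" signature are hypothetical, so adjust them to whatever properties distinguish headings in your documents.

```python
import re

# Matches a heading tag (H1..H6) at the end of a Path, e.g. //Document/H2 or //Document/H2[3]
HEADING_RE = re.compile(r"/(H[1-6])(\[\d+\])?$")
# Matches a paragraph tag at the end of a Path, e.g. //Document/P[113]
PARAGRAPH_RE = re.compile(r"/P(\[\d+\])?$")


def style_key(element):
    """Return a (font weight, text size) signature; an assumed notion of 'same properties'."""
    font = element.get("Font") or {}
    return (font.get("weight"), element.get("TextSize"))


def normalize_headings(elements):
    """Return copies of the elements with heading-styled paragraphs retagged as headings."""
    # Pass 1: map each style signature seen on a recognized heading to its tag.
    heading_styles = {}
    for el in elements:
        match = HEADING_RE.search(el.get("Path", ""))
        key = style_key(el)
        if match and None not in key:
            heading_styles.setdefault(key, match.group(1))

    # Pass 2: retag paragraphs whose signature matches a known heading style.
    normalized = []
    for el in elements:
        el = dict(el)  # shallow copy so the input list is left untouched
        path = el.get("Path", "")
        tag = heading_styles.get(style_key(el))
        if tag and PARAGRAPH_RE.search(path):
            el["Path"] = PARAGRAPH_RE.sub("/" + tag, path)
        normalized.append(el)
    return normalized


if __name__ == "__main__":
    sample = [
        {"Path": "//Document/P[113]", "Text": "1.4 Context of the Plan",
         "Font": {"weight": 700}, "TextSize": 13.5},
        {"Path": "//Document/H2", "Text": "1.2 Transmission-Development-Plan 2023",
         "Font": {"weight": 700}, "TextSize": 13.5},
    ]
    for el in normalize_headings(sample):
        print(el["Path"], "|", el["Text"])
```

Two caveats with this sketch: it does not renumber the index suffix on reassigned paths, so several elements may end up sharing a path such as //Document/H2, and a bold body paragraph that happens to share a heading's weight and size would also be promoted. Adding further properties to the signature (font name, for instance, if your output includes it) may reduce false positives.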

