
Help Needed: Adobe PDF Extract API Not Recognizing Some Headings

Community Beginner, Jun 20, 2024

Hello everyone,

I am currently working on a project where I need to extract structured data from PDF documents using the Adobe PDF Extract API. While most of the data is extracted correctly, I am facing an issue with some headings not being recognized as headings by the API.

Example of the Issue:

In the JSON output, certain headings are not recognized correctly. For instance:

 

{
  "Path": "//Document/P[113]",
  "Text": "1.4 Context of the Plan",
  "Font": {
    "weight": 700
  },
  "TextSize": 13.5
},
{
  "Path": "//Document/H2",
  "Text": "1.2 Transmission-Development-Plan 2023",
  "Font": {
    "weight": 700
  },
  "TextSize": 13.5
}

 

In the above example, the first entry, with Path "//Document/P[113]", is a heading ("1.4 Context of the Plan") but is tagged as a paragraph (please see attached image as well). The second entry, with Path "//Document/H2", has the same font weight and text size and is correctly recognized as a heading.

 

I have a couple of questions regarding this issue:

 

  • Has anyone else encountered similar issues with the Adobe PDF Extract API?
  • Are there any best practices or alternative approaches for reliably detecting headings in PDF documents?

 

Correct answer

Community Expert, Jun 20, 2024

This is a fairly common occurrence. The AI does an excellent job on most layouts but it does miss some things. The good news is that it's easy to compensate for.

I typically "normalize" the output from Extract before using it in another application. One of the steps is to build a map of all the recognized headings, look for paragraphs that have exactly the same properties as any of those headings, and reassign them by updating the Path property.
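For anyone wanting to try this, here is a minimal sketch of that normalization step in Python. It is not the exact code described above, just one way it could look. It assumes the elements are plain dicts shaped like the JSON in the question, and it compares only the two properties shown there (Font weight and TextSize). The function names and the "OriginalPath" key are hypothetical, added purely for illustration.

```python
import re
from collections import Counter, defaultdict

# Last path segment is a heading tag (H1..H6), optionally indexed, e.g. /H2 or /H2[3].
HEADING_RE = re.compile(r"/(H[1-6])(\[\d+\])?$")
# Last path segment is a paragraph, optionally indexed, e.g. /P or /P[113].
PARA_RE = re.compile(r"/P(\[\d+\])?$")


def signature(element):
    """Return the properties used for matching: (font weight, text size)."""
    font = element.get("Font") or {}
    return (font.get("weight"), element.get("TextSize"))


def normalize_headings(elements):
    """Retag paragraphs whose properties exactly match a recognized heading."""
    # Step 1: map each property signature to the heading tags that use it.
    votes = defaultdict(Counter)
    for el in elements:
        match = HEADING_RE.search(el.get("Path", ""))
        sig = signature(el)
        if match and None not in sig:
            votes[sig][match.group(1)] += 1

    # If one signature appears under several heading levels, keep the most common.
    heading_for = {sig: tags.most_common(1)[0][0] for sig, tags in votes.items()}

    # Step 2: rewrite the Path of matching paragraphs, leaving the input untouched.
    result = []
    for el in elements:
        path = el.get("Path", "")
        tag = heading_for.get(signature(el))
        if tag and PARA_RE.search(path):
            # "OriginalPath" is a hypothetical key kept so the change can be audited.
            el = {**el, "Path": PARA_RE.sub("/" + tag, path), "OriginalPath": path}
        result.append(el)
    return result


elements = [
    {"Path": "//Document/P[113]", "Text": "1.4 Context of the Plan",
     "Font": {"weight": 700}, "TextSize": 13.5},
    {"Path": "//Document/H2", "Text": "1.2 Transmission-Development-Plan 2023",
     "Font": {"weight": 700}, "TextSize": 13.5},
]
normalized = normalize_headings(elements)
```

With the two elements from the question, the first one ends up with Path "//Document/H2". Matching on only weight and size can also promote bold body text that happens to share the heading size, so in practice you would probably add more properties to the signature (font name, for example) or a length check on the text.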

Community Beginner, Jul 01, 2024

Thanks a lot, Joel, for the reply! Just to understand it a bit better: you collect all the recognized headings, compare their properties with the paragraphs' properties, and change the Path of any paragraph that matches a heading? Do you only look at paragraphs, or do you also look at other elements such as lists?

Community Expert, Jul 10, 2024

Right now, I'm only normalizing the headings, but lists would be just as easy.
