Adobe's Extract API: Non-Image Elements Classified as Images

Report · Mar 07, 2024

I'm currently working on implementing an automated mechanism to enable users to apply alt text to images within a PDF file.

Here's the algorithm I'm using:

1. Use Adobe's Auto-Tag API to make the PDF accessible.
2. Extract all images using Adobe's Extract API.
3. Present each extracted image to the user, allowing them to select the images they wish to apply alt text to.
4. Apply the chosen alt text to the selected images, and generate an updated PDF with the alt text applied.
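
For step 2, a minimal sketch of how the figure entries might be collected from Extract's structuredData.json so they can be shown to the user. It assumes the JSON has a top-level "elements" list whose entries carry "Path", "Page", "Bounds" and, for rendered figures, "filePaths"; the sample data and the collect_figures helper are made up for illustration and are not part of Adobe's SDK.

```python
import json
import re

# Matches a path leaf such as "Figure" or "Figure[2]".
FIGURE_LEAF = re.compile(r"Figure(\[\d+\])?")


def collect_figures(structured_data):
    """Return one record per element whose Path ends in a Figure tag."""
    figures = []
    for index, element in enumerate(structured_data.get("elements", [])):
        path = element.get("Path", "")
        leaf = path.rsplit("/", 1)[-1]
        if not FIGURE_LEAF.fullmatch(leaf):
            continue
        figures.append({
            "element_index": index,                 # position in the elements list
            "path": path,
            "page": element.get("Page"),
            "bounds": element.get("Bounds"),
            "files": element.get("filePaths", []),  # rendered image files, if any
        })
    return figures


# Hypothetical sample shaped like an Extract result (assumed field names).
SAMPLE = """
{
  "elements": [
    {"Path": "//Document/P", "Page": 0, "Text": "Intro text"},
    {"Path": "//Document/Figure", "Page": 0,
     "Bounds": [100, 500, 300, 650],
     "filePaths": ["figures/fileoutpart0.png"]},
    {"Path": "//Document/Figure[2]", "Page": 1,
     "Bounds": [72, 700, 200, 720],
     "filePaths": ["figures/fileoutpart1.png"]}
  ]
}
"""

figures = collect_figures(json.loads(SAMPLE))
for fig in figures:
    print(fig["path"], fig["page"], fig["files"])
```

Keeping the page and bounds next to each file path means a figure can later be identified by where it sits on the page rather than by its position in the list.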

However, I'm running into a problem with this process:

The images extracted using Adobe's Extract API sometimes don't line up with the images in the accessibility tags. The discrepancy is most noticeable when equations are identified as images, so the index of an extracted image no longer matches the index of the corresponding tag. Could anyone suggest potential solutions or alternatives to address this issue? Please refer to the images attached below.
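
One possible alternative to index-based matching, sketched under assumptions: pair each extracted figure with a tagged figure by page and bounding-box overlap. This assumes both sides can supply a page number and a box as [left, bottom, right, top] in the same coordinate system. How the boxes for the tagged figures are obtained is not shown here, and the ids, the 0.5 threshold, and the helper names are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two [left, bottom, right, top] boxes."""
    left, bottom = max(a[0], b[0]), max(a[1], b[1])
    right, top = min(a[2], b[2]), min(a[3], b[3])
    if right <= left or top <= bottom:
        return 0.0  # boxes do not overlap
    inter = (right - left) * (top - bottom)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_figures(extracted, tagged, threshold=0.5):
    """Map each extracted figure id to the best-overlapping tagged figure id.

    A figure with no candidate at or above the threshold maps to None,
    and each tagged figure is used at most once.
    """
    matches = {}
    used = set()
    for ex_id, ex in extracted.items():
        best_id, best_score = None, threshold
        for tag_id, tg in tagged.items():
            if tag_id in used or tg["page"] != ex["page"]:
                continue
            score = overlap_ratio(ex["bounds"], tg["bounds"])
            if score >= best_score:
                best_id, best_score = tag_id, score
        matches[ex_id] = best_id
        if best_id is not None:
            used.add(best_id)
    return matches


# Hypothetical data: one photo, one figure with no tagged counterpart,
# and an equation that was tagged as a figure.
extracted = {
    "fig0": {"page": 0, "bounds": [100, 500, 300, 650]},
    "fig1": {"page": 1, "bounds": [50, 50, 100, 100]},
}
tagged = {
    "tag_equation": {"page": 0, "bounds": [72, 700, 200, 720]},
    "tag_photo": {"page": 0, "bounds": [101, 499, 301, 651]},
}
matches = match_figures(extracted, tagged)
print(matches)
```

With this approach an extra equation figure on either side shows up as an unmatched entry instead of shifting every later index by one.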

Report · Mar 07, 2024

I'll start by clarifying some terminology. Extract does not identify "images". It identifies "figures", meaning areas of the page that may be constructed from text, line art, images, or a combination of these things. This gives the developer the opportunity to replace the figure with the correct alt-text rather than the actual text within the figure area. In your second image, you wouldn't want the text "ten point zero ex ten minus 6" to be read. Instead, you'd want to hear "ten times ten to the negative 6th power".

Also, the Auto-tag API is dependent on Extract so you should be getting the same kind of structure from both. You can use Acrobat to add the correct alt-text to the auto-tagged PDF.

Report · Mar 08, 2024

Thank you very much for your response; it has been incredibly helpful in understanding this better. I do have a follow-up question about the consistency of structure between the Auto-tag and Extract APIs. You mentioned that I should expect similar structures from both. Does this imply that the structure visible in Adobe Acrobat (the accessibility tags) for a PDF tagged using the Auto-tag API should align with the 'Path' values in the structuredData.json file generated by the Extract API?

To provide clearer context, I've included an example:

In Image 1 (StructurePane.png), we can observe a tagged PDF where the abbreviation "CEC" is labeled as a "Figure."
In Image 2 (structuredData_json), we have the structuredData.json file generated via the Extract API. Here, the text "CEC" is part of a paragraph, but there's no nested tagging designating "CEC" as a Figure.

When running the Extract API on this PDF, it yields two figures as output. However, the 'Path' in the structuredData.json file for abbreviations differs from that of figures. The path for the extracted images is "Path": "//Document/Figure", which differs from that of the text abbreviation classified as a Figure (as shown in Image 2).
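
To make this kind of comparison concrete, here is a small sketch that counts the tag type at the end of each 'Path' value, so the totals can be checked against what the Tags pane in Acrobat shows. The list of paths is made up for illustration, and the assumption that repeated siblings carry an index suffix such as "[2]" may not hold for every file.

```python
import re
from collections import Counter

# Strips a trailing sibling index such as "[2]" from a path segment.
INDEX_SUFFIX = re.compile(r"\[\d+\]$")


def leaf_tag(path):
    """Return the last tag name of a path, without its index suffix."""
    leaf = path.rstrip("/").rsplit("/", 1)[-1]
    return INDEX_SUFFIX.sub("", leaf)


def summarize(paths):
    """Count how many elements end in each tag type."""
    return Counter(leaf_tag(p) for p in paths)


# Hypothetical 'Path' values taken from a structuredData.json file.
paths = [
    "//Document/P",
    "//Document/P[2]",
    "//Document/Figure",
    "//Document/Figure[2]",
]
summary = summarize(paths)
print(dict(summary))
```

If Acrobat lists more Figure tags than this count, the extras (such as the "CEC" abbreviation) are the entries that have no counterpart in the Extract output.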

I apologize if this question seems elementary. Your clarification on this matter would be valuable. Thank you once again.

Report · Mar 08, 2024

That's interesting. Can you share the PDF? You can send it to me privately if you don't want to post it.

Report · Mar 11, 2024

I have the same problem. A PDF is attached.
