Adobe's Extract API: Non-Image Elements Classified as Images

Question

I'm implementing an automated way for users to apply alt text to images within a PDF file.

Here's the algorithm I'm using:

1. Use Adobe's Auto-Tag API to make the PDF accessible.
2. Extract all images using Adobe's Extract API.
3. Present each extracted image to the user, allowing them to select the images they wish to apply alt text to.
4. Apply the chosen alt text to the selected images, and generate an updated PDF with the alt text applied.

However, I'm running into a problem with this process:

The images extracted using Adobe's Extract API sometimes don't align with the images in the accessibility tags. The discrepancy is most noticeable when equations are identified as images, which causes index mismatches between the two. Could anyone suggest potential solutions or alternatives? Please refer to the images attached below.

Joel Geraci · Answer

I'll start by clarifying some terminology. Extract does not identify "images"; it identifies "figures", meaning areas of the page that may be constructed from text, line art, images, or a combination of these things. This gives the developer the opportunity to replace the figure with the correct alt text rather than the literal text within the figure area. In your second image, you wouldn't want the text "ten point zero ex ten minus 6" to be read. Instead, you'd want to hear "ten times ten to the negative 6th power".

Also, the Auto-tag API depends on Extract, so you should be getting the same kind of structure from both. You can use Acrobat to add the correct alt text to the auto-tagged PDF.
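
If you want to line the extracted files up with the tag structure programmatically, one option is to key each figure by its structure path rather than by its position in the list of extracted files. The following is only a rough sketch: it assumes the Extract output JSON has an `elements` array whose entries carry `Path`, `Page`, and `filePaths` fields, and the sample data and the `list_figures` helper are made up for illustration, so check them against your own output.

```python
import json
import re

# Matches a final path segment such as "Figure" or "Figure[3]".
FIGURE_SEGMENT = re.compile(r"^Figure(\[\d+\])?$")


def list_figures(structured_data):
    """Return one record per figure element, keyed by its structure path.

    Assumes (not confirmed in this thread) that each element has a "Path"
    string and that figure elements list their renditions in "filePaths".
    """
    figures = []
    for element in structured_data.get("elements", []):
        path = element.get("Path", "")
        last_segment = path.rsplit("/", 1)[-1]
        if not FIGURE_SEGMENT.match(last_segment):
            continue  # headings, paragraphs, tables, etc.
        figures.append({
            "path": path,
            "page": element.get("Page"),
            "files": element.get("filePaths", []),
        })
    return figures


# Hypothetical sample standing in for the JSON produced by Extract.
sample = json.loads("""
{"elements": [
  {"Path": "//Document/H1", "Page": 0, "Text": "Results"},
  {"Path": "//Document/Figure", "Page": 0,
   "filePaths": ["figures/fileoutpart0.png"]},
  {"Path": "//Document/P[2]", "Page": 1, "Text": "See the equation."},
  {"Path": "//Document/Sect/Figure[2]", "Page": 1,
   "filePaths": ["figures/fileoutpart1.png"]}
]}
""")

figures = list_figures(sample)
for fig in figures:
    print(fig["page"], fig["path"], fig["files"])
```

Each record then tells you which page and which structure element a rendition belongs to, so an equation that Extract reports as a figure still shows up with its own path instead of shifting the index of everything after it.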
