I'm currently working on implementing an automated mechanism to enable users to apply alt text to images within a PDF file.
Here's the algorithm I'm using:
However, I'm encountering issues with the process:
The images extracted using Adobe's Extract API sometimes don't align with the images in the accessibility tags. The discrepancy is most noticeable when equations are identified as images, which leads to index mismatches between the two sets. Could anyone suggest potential solutions or alternatives to address this issue? Please refer to the images attached below.
I'll start by clarifying some terminology. Extract does not identify "images"; it identifies "figures", meaning areas of the page that may be constructed from text, line art, images, or a combination of these things. This gives the developer the opportunity to replace the figure with the correct alt-text rather than the actual text within the figure area. In your second image, you wouldn't want the text "ten point zero ex ten minus 6" to be read. Instead, you'd want to hear "ten times ten to the negative 6th power".
Also, the Auto-tag API is dependent on Extract so you should be getting the same kind of structure from both. You can use Acrobat to add the correct alt-text to the auto-tagged PDF.
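If it helps, here is a minimal sketch of matching on figure elements by their path instead of by image index. It assumes structuredData.json has a top-level "elements" list whose entries carry a "Path" string and, optionally, "Page", "Bounds", and "filePaths"; treat those key names as assumptions and check them against your own output. The helper names `figure_elements` and `load_figures` are hypothetical, not part of any Adobe SDK.

```python
import json
import re
from typing import Any

# Matches a final path segment such as "Figure" or "Figure[2]".
FIGURE_SEGMENT = re.compile(r"Figure(\[\d+\])?")


def figure_elements(structured: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the elements whose last Path segment is a Figure.

    Assumes (not confirmed here) that the JSON has an "elements" list and
    that each element has a "Path" such as "//Document/Figure".
    """
    figures = []
    for element in structured.get("elements", []):
        path = element.get("Path", "")
        last_segment = path.rsplit("/", 1)[-1]
        if FIGURE_SEGMENT.fullmatch(last_segment):
            figures.append({
                "path": path,
                "page": element.get("Page"),            # assumed key
                "bounds": element.get("Bounds"),        # assumed key
                "files": element.get("filePaths", []),  # assumed key
            })
    return figures


def load_figures(json_path: str) -> list[dict[str, Any]]:
    """Read a structuredData.json file and list its figure elements."""
    with open(json_path, encoding="utf-8") as fh:
        return figure_elements(json.load(fh))
```

Calling `load_figures("structuredData.json")` would give one record per figure, keyed by its path, which you could then compare against the Figure tags you see in Acrobat rather than relying on the order of the exported image files.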
Thank you very much for your response; it has been incredibly helpful in understanding this better. I do have a follow-up question regarding the consistency of structure between the Auto-tag API and the Extract API. You mentioned that I should expect similar structures from both APIs. Does this imply that the structure visible in Adobe Acrobat (the accessibility tags) for a PDF tagged using the Auto-tag API should align with the 'path' outlined in the structuredData.json file generated by the Extract API?
To provide clearer context, I've included an example:
When I run the Extract API on this PDF, it yields two figures as output. However, in the structuredData.json file, the 'path' for the abbreviations differs from that of the figures: the path for the extracted images is "Path": "//Document/Figure", which differs from the path of the text abbreviation classified as a Figure (as shown in Image 2).
I apologize if this question seems elementary. Your clarification on this matter would be valuable. Thank you once again.
That's interesting. Can you share the PDF? You can send it to me privately if you don't want to post it.