I am working on PDF remediation. I have normal PDFs. I am thinking of writing a script that reads a normal PDF, identifies content such as headers, subheaders, lists, forms, tables, and images, adds tags to the PDF content accordingly, and generates a tagged PDF that will pass the Adobe accessibility check. My idea is to reduce manual tagging effort (in Adobe Acrobat DC Pro) by at least 60 to 70%.
Are there SDKs which support adding tags programmatically to a normal PDF?
Thanks in advance
It's not impossible. However, it requires both C++ programming skills and a very deep knowledge of PDF internals: the graphics model, the text model and the tagging model, which all interact. If you have that (or the time to study) you can use the PDSEdit layer in a custom plug-in.
Bear in mind that identifying "headers, sub headers, lists, forms, tables" is all guesswork. These things are not marked in a different way, pre-tagging. A table is a mixture of lines and text which the human eye quickly recognises as having patterns that make it a table. If you are working with highly standardised documents this is much easier.
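To make the "guesswork" concrete, here is a minimal sketch of the kind of heuristic such a script would rely on. It assumes the text has already been extracted into spans carrying a font size; the `Span` class and `guess_tags` function are hypothetical names invented for this illustration, not part of any Adobe SDK. It treats the most common font size as body text, maps larger sizes to heading levels, and treats lines starting with a bullet character as list items.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical extracted text run: its text and font size in points."""
    text: str
    font_size: float


def guess_tags(spans):
    """Guess a structure tag (H1..H6, LI, P) for each span. Heuristic only."""
    if not spans:
        return []
    # Assume body text is the font size covering the most characters.
    sizes = Counter()
    for s in spans:
        sizes[s.font_size] += len(s.text)
    body_size = sizes.most_common(1)[0][0]
    # Distinct sizes larger than body text, largest first, become H1, H2, ...
    heading_sizes = sorted(
        {s.font_size for s in spans if s.font_size > body_size}, reverse=True
    )
    tags = []
    for s in spans:
        if s.font_size in heading_sizes:
            level = min(heading_sizes.index(s.font_size) + 1, 6)
            tags.append(f"H{level}")
        elif s.text.lstrip().startswith(("•", "-", "*")):
            tags.append("LI")
        else:
            tags.append("P")
    return tags


spans = [
    Span("Annual Report", 24),
    Span("Overview", 16),
    Span("This is a long paragraph of body text that dominates the page.", 11),
    Span("• first item", 11),
    Span("Details", 16),
]
print(guess_tags(spans))
```

A rule like this works tolerably on highly standardised documents and fails as soon as a document uses large type for something that is not a heading, which is why a human review pass is still needed.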
By the way, Adobe's accessibility checker is not considered the industry standard for good accessibility; if you go to this trouble you should probably aim higher.
Why C++ specifically as the programming language for interacting with or building a PDF document?
I interpret this question as asking: how, technically, is the document markup model that PDF supports represented in the format itself? Has anyone actually published guidance here? We live in a world where PDF is the afterthought format to more robust data modeling logics. How can we enable those logics to port into a PDF-friendly namespace programmatically?
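As a rough sketch of that representation, and not a complete tagging implementation: in a tagged PDF, page content is wrapped in marked-content operators (`BDC` ... `EMC`) carrying an `MCID`, and a structure element dictionary (`/Type /StructElem`) in the document's structure tree points back at that `MCID`. The helper names and the object references `5 0 R` and `3 0 R` below are hypothetical, and a valid file also needs a `/StructTreeRoot` with a parent tree and `/MarkInfo << /Marked true >>` in the catalog, which this sketch leaves out.

```python
def marked_content(tag, mcid, operators):
    """Wrap content-stream operators in a marked-content sequence."""
    return f"/{tag} <</MCID {mcid}>> BDC\n{operators}\nEMC"


def struct_elem(tag, mcid, parent_ref, page_ref):
    """Build a structure element dictionary that refers to one MCID on a page."""
    return (
        f"<< /Type /StructElem /S /{tag} "
        f"/P {parent_ref} /Pg {page_ref} /K {mcid} >>"
    )


# Hypothetical heading: the text-drawing operators as they appear on the page.
heading_ops = "BT /F1 24 Tf 72 720 Td (Annual Report) Tj ET"

print(marked_content("H1", 0, heading_ops))
print(struct_elem("H1", 0, parent_ref="5 0 R", page_ref="3 0 R"))
```

The hard part of remediation is not emitting these objects but deciding which operators in an existing content stream belong to which tag.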
This is just a personal opinion, but it is shocking that even Adobe hasn't provided more transparent, open-source ways for programmers to make their content generation tools output PDF in a way that retains the structure and semantics of the content (not just the visual layout).
Of course for reasons of access, but also as developers, data engineers, and even enthusiasts, we should be nagging the heck out of these technical gatekeepers.
This function already exists in Acrobat Pro; there is no need to reinvent the wheel.
But as explained above, automation can do a lot of things, but it's a human who has to polish the job.
We make PDFs dynamically, so it would definitely be something I would love to be able to do. It doesn't sound like a reasonable option.
Did you find any solutions?
The problem with accepting Acrobat as a solution is chiefly that it makes this a pay-to-play/optimize format. It also suggests that Adobe is the only "vendor" who can build a UI/UX to enable such needed improvements to PDF files.
The initial question is a real developer's question. If we actually demystified the programmatic process for doing what has already been done (to your point about Adobe's "overlay" method), we would institute real change in the quality of PDF as a format for end users.