How can I tell if a PDF is tagged for accessibility without opening it and looking at the Properties? I'd like a solution that can run through thousands of PDFs and tell me which ones are accessible (tagged).
Similar question: can I tell if a PDF has JavaScript in it?
++EDITED REPLY, fixed typos
Hi pdc@TD,
If I am not mistaken, all PDF-producing software uses some sort of core JavaScript engine to edit PDF objects in a document and to perform arithmetic operations (among other built-in features).
But if you were specifically asking how to find out whether a PDF contains any JavaScript while it is open in Adobe Acrobat Pro DC (requires a paid subscription), you can go to Tools => JavaScript => All JavaScript.
This method shows all of the scripts that are currently used in a PDF file.
Now, based on your primary inquiry, I will assume that in both of your questions you are referring to how to check many documents in bulk and determine if all of those files are both tagged and also contain JavaScript objects.
Is that correct?
If yes, you can use the Action Wizard tool and create a new custom action that combines other built-in Acrobat tools or a custom JavaScript script.
However, given your requirements, I wouldn't recommend doing it from Acrobat Pro because of possible crashes.
In addition, the Action Wizard opens each document one at a time as it performs the custom actions, to allow user interaction as each document is processed in real time (which makes the whole experience tedious and inefficient).
You're better off writing a batch script yourself.
Be aware that when checking for tagged PDFs with a batch script, the script only checks for a text string that may indicate the presence of a structure tree root object. If the batch script finds it, that indicates that the PDF has a document structure defined (nothing more).
Moreover, the output of a batch script (like the one I am sharing below) shouldn't be confused with an accessibility check, nor is it meant to validate whether PDFs conform to a required standard; there is so much more involved with accessible PDFs (such as PDF/UA compliance, or detecting problems with embedded fonts, for example).
Anyway, if you open a PDF that you know is tagged in a text editor (such as Notepad or Notepad++ on MS Windows, or TextEdit on macOS), the string your batch script should look for is the "StructTreeRoot" element (or property).
So, just for testing purposes, if you open a PDF with a text editor and a search for the "StructTreeRoot" string pattern finds a match, then chances are that the PDF document is tagged, because tagged files carry a StructTreeRoot entry in their document catalog. See screenshot:
You can test another file the same way by searching for the "JavaScript" string pattern; if the PDF has JavaScript in it, the search will highlight that string, indicating that one or more JavaScript objects are in use in that file.
With these observations in mind, you can now use a batch script like the example below:
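@ECHO OFF
REM Step 1: copy every PDF on the Desktop that contains "StructTreeRoot" to Tagged_PDFs.
cd /d C:\Users\userAccount\Desktop
if not exist C:\Users\userAccount\Desktop\Tagged_PDFs mkdir C:\Users\userAccount\Desktop\Tagged_PDFs
REM "delims=" keeps file names that contain spaces in one piece.
for /f "delims=" %%a in ('findstr /M "StructTreeRoot" *.pdf') do XCOPY /Y "%%a" C:\Users\userAccount\Desktop\Tagged_PDFs\ >NUL
REM Step 2: of those, copy the ones that also contain "JavaScript" to PDFs_with_JS.
cd /d C:\Users\userAccount\Desktop\Tagged_PDFs
if not exist C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS mkdir C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS
for /f "delims=" %%b in ('findstr /M "JavaScript" *.pdf') do XCOPY /Y "%%b" C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS\ >NUL
START C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS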
This batch script can be executed from any directory on a Microsoft Windows computer. It expects to find the files in the user account's Desktop folder (i.e. the C:\Users\userAccount\Desktop directory).
When you execute this batch script, the first part creates a new subfolder, "Tagged_PDFs", and then searches every PDF in the Desktop directory for the "StructTreeRoot" string pattern.
The last portion of the first part copies the matched files from C:\Users\userAccount\Desktop to the new "Tagged_PDFs" subfolder.
As soon as the first part finishes, the second part runs and switches directory from C:\Users\userAccount\Desktop to the Tagged_PDFs subfolder.
There, it creates a new subfolder named "PDFs_with_JS" and performs another string search on the files that were copied to the Tagged_PDFs folder; this time it looks for the "JavaScript" string pattern and copies the PDF files that match from the Tagged_PDFs folder to the new "PDFs_with_JS" subfolder.
The last line of the batch script just opens a File Explorer window on the "PDFs_with_JS" folder, which lists only the files that are both tagged and have JavaScript code in them.
Although this solution is NOT entirely related to Adobe Acrobat, it will allow you to run through thousands of PDFs and sort them into folders to show which files contain both tags and JavaScript objects.
ADDITIONAL NOTES:
Batch scripts that process too many files may slow down other background services or programs that are open while the script executes and parses those files.
I would suggest testing first with no more than 250 PDF files at a time to see how it performs.
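For anyone not on Windows, the same string search can be sketched in Python. This is only a minimal sketch under the same assumption as the batch script: a raw byte match on "StructTreeRoot" or "JavaScript" is a hint, not proof. The names MARKERS, scan_pdf_bytes and scan_folder are made up for illustration, and the sketch reports flags per file instead of copying files into folders.

```python
from pathlib import Path

# Marker strings searched for in the raw file bytes (same strings as the batch script).
MARKERS = {
    "tagged": b"StructTreeRoot",
    "javascript": b"JavaScript",
}


def scan_pdf_bytes(data: bytes) -> dict:
    """Return, for each marker, whether it occurs anywhere in the raw bytes."""
    return {name: marker in data for name, marker in MARKERS.items()}


def scan_folder(folder: str) -> dict:
    """Map each *.pdf file name directly inside `folder` to its marker flags."""
    results = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        # Files are read one at a time, so memory use stays at one PDF.
        results[path.name] = scan_pdf_bytes(path.read_bytes())
    return results
```

Calling scan_folder on a folder of PDFs returns a dictionary keyed by file name, which can then be printed or written out as a list of tagged files and files with JavaScript.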
Thanks for the detailed answer! I have tried to look for keywords in PDFs using Notepad++, but I wasn't sure what to look for. Are "StructTreeRoot" and "JavaScript" 100% dependable?
Surely there's a totally reliable way of doing this. I've heard of a product called SiteImprove which can tell if a PDF is tagged. How does it know? And when a screen reader like JAWS reads a PDF, it reads the tag tree. How does it do it? There's an API that can be exploited for such things. All you need is a language like C to access it.
You're welcome.
Yes, those two strings will identify 100% if PDFs are tagged or if they contain JavaScript code.
The only issue is that if some or most of the PDFs parsed by the batch script are encrypted (or password protected), some files may not be readable.
In that case, you can also modify the batch script to search for the string "/Encrypt" (the trailer key that marks an encrypted PDF) to determine whether PDFs are password protected.
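As a rough illustration of that encryption check, here is a minimal Python sketch. It assumes the marker to look for is the /Encrypt key of the PDF trailer, followed either by an indirect reference such as "9 0 R" or by an inline dictionary; that assumption comes from the PDF file format and was not tested in this thread. The names ENCRYPT_ENTRY and looks_encrypted are hypothetical.

```python
import re

# "/Encrypt" followed by an indirect reference ("9 0 R") or an inline dictionary ("<<").
# Requiring the value avoids matching longer names such as /EncryptMetadata.
ENCRYPT_ENTRY = re.compile(rb"/Encrypt\s*(?:\d+\s+\d+\s+R|<<)")


def looks_encrypted(data: bytes) -> bool:
    """Return True if the raw bytes contain what looks like a trailer /Encrypt entry."""
    return ENCRYPT_ENTRY.search(data) is not None
```

Like the other string searches, this is a text match only; it does not say which permissions are restricted or whether a password is needed to open the file.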
Anyway, all my batch script does is find text strings.
And just because a PDF document appears to be tagged after the batch script runs does not mean that it is accessible or that it conforms in any way to accessibility standards.
Programs (or online services) like SiteImprove or ExifTool must be used to determine accessibility compliance (or to perform deep PDF Tags analysis).
Such tools employ more advanced methods that examine and extract the XMP's PDF Info tags from the XML metadata object.
All that information is readily available in Adobe's PDF specification, so experimenting with APIs in other programming languages is not exactly a necessity (unless you are developing your own plug-ins, for example).
For the purpose of a batch script that would examine the XMP's PDF Marked Info tags, all you would be looking for is if the "taggedPDF" descriptor is marked as true or false.
See more here:
Searching for text strings is not 100% reliable, no. To start with, you could get a false hit on strings in metadata. Unlikely perhaps. But also because if an object is deleted and the document simply saved, all the old objects remain in the file until a SAVE AS is done, so you will find false hits.
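The stale-object case can at least be flagged for a closer look. This is a rough Python sketch, assuming that each save that appends to the file adds another %%EOF marker, so more than one marker suggests that superseded objects may still be in the file. Linearized ("fast web view") files can also contain more than one %%EOF, so a flag here is a prompt for review, not a verdict. The function names are hypothetical.

```python
def count_eof_markers(data: bytes) -> int:
    """Count the %%EOF markers in the raw bytes of a PDF."""
    return data.count(b"%%EOF")


def needs_review(data: bytes, marker: bytes) -> bool:
    """True when the marker string is present AND the file has more than one
    %%EOF, meaning the hit could come from an object that was later replaced."""
    return marker in data and count_eof_markers(data) > 1
```

For example, needs_review(data, b"StructTreeRoot") singles out files where a "tagged" hit deserves a second check in Acrobat or after a Save As.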
The Acrobat SDK exposes a C++ API for plug-ins that can walk through the objects actually within a PDF. It cannot be used to make standalone apps, on a server, or via scripting. Knowledge of the PDF specification is a must.
Thank you for clarifying @Test Screen Name and always keeping an eye.
Thanks folks. "Knowledge of the PDF specification is a must" ends it for me. I don't intend to wade into that muck.
But I figured there must be some 3rd-party tools out there written by people who have gone to the trouble. The Adobe person I first approached said she wasn't aware of any such 3rd-party tools. I had my doubts. I think she was more interested in closing the ticket than digging deeper. I'm pretty vocal about my dislike of Adobe. I'd say more, but that would violate the "be kind and respectful" guideline.
A guy I work with created a Java (not C++) program that will extract doc-level and field-level JavaScript from a PDF. I don't know if he has a good knowledge of the PDF specification. So often my colleagues and I have had to solve our PDF problems ourselves. Adobe was little or no help. This is just another example. Was this such a difficult problem? My colleague figured it all out in a couple of days.
This is not within Adobe's brief. They invented the PDF specification, but since they handed it over to ISO, there's no way to get Adobe's help in interpreting it. Sometimes the people here have knowledge to share, sometimes not.
Adobe is still part of the ISO's specification committees. Two full-time Adobe engineers are on the committees for all types of PDF and are active contributors, sometimes chairing working groups.
The PDF Association is designated by the ISO as the entity to write and develop all PDF standards, and the association has a fair amount of reference material on their website. Here are some materials you might find helpful:
... tagged is not equal to accessible.
This is correct. Not only must a PDF be tagged, all real content must be tagged — with the correct tag, as well. Structure and syntax are critical for accessible PDFs just as with any markup language...HTML, XML, JATS, etc.
Looking only for the metadata info that indicates the PDF is tagged is not sufficient. A PDF with only <P> tags is not accessible by any stretch of the imagination. And a PDF with an incorrect reading order is totally useless to those using assistive technologies.
There are many 3rd party tools and services that will examine PDFs for accessibility, and some claim to remediate the errors. But none of them can determine the status with 100% accuracy; this process still requires a human interpretation to determine what the correct, most appropriate tag and structure is for the content. AI doesn't do that well enough for most documents.
But these tools will do a much better job of evaluating PDFs than you'll have with your own script. For one, all of these vendors know the PDF and PDF/UA standards and most are on the standards committees. And they didn't develop their software tools in a day, either. Their tools have been used by the industry for 10-15 years.
Dust off your wallet as none of these tools/services are cheap:
pdc@TD,
Yeah... speaking for myself, I am very used to ungrateful feedback and harsh critiques (like yours).
I don't work for Adobe, and no, it wasn't a difficult problem.
It is easy to come back and troll your own post after community members (like myself) do most of the legwork for people like you (who seem to lack the appetite for genuine learning and ask other people to do work for them) in less than 24 hours, for free!
That was a voluntary contribution.
And I am not sure why you are picking on voluntary contributors who have helped you, just to vent a personal frustration that you yourself seem to have with Adobe Inc.
Anyway, this is what you asked:
I'd like a solution that can run through thousands of PDFs and tell me which ones are accessible (tagged).
Similar question: can I tell if a PDF has JavaScript in it?
Note where you say "accessible (tagged)".
Since a PDF document can be tagged without being accessible, it seems like it was you who was having a hard time understanding that tagged is not equal to accessible.
Just because a PDF document is tagged does not mean that it is also accessible (or that it meets accessibility compliance standards).
The batch script does detect PDFs that are tagged.
And it also detects if they have JavaScript in them.
Moreover, if you ever care to learn how to write your own scripts, the batch script can be modified to also search for PDFs that are encrypted.
You made it complicated by asking if this method was reliable, and also by bringing up the idea of exploiting APIs.
Your original question had nothing to do with how to check for accessibility compliance or how to exploit APIs using other programming languages to check a PDF for accessibility with 100% reliability.
So, I did answer your question thoroughly.
And the batch script that I am sharing is 100% reliable to check for tagged PDFs in bulk using text string searches.
It seems to me more like you didn't even try the batch script.
For what it's worth, it seems like you need to be told AGAIN that it is not reliable to check for PDF accessibility using batch scripts that search for text strings in a file.
This was already clearly explained to you twice in this thread.
But anyway, why don't you share your friend's script, written in Java, here in these community forums?
Would that be a problem?
Maybe we can all benefit (along with every other community member reading).
Thanks all.
I do understand the difference between accessible and tagged. Our team has been asked several times which of our ~3000 PDFs are accessible. There was a time when our little team did the tagging using Acrobat (yuck), but now we farm the work out to a company that has decent tools. The PDFs all get thoroughly tested with JAWS. So if a PDF has tags, chances are very good that it is accessible. I've only seen a couple of PDFs that had a "tagged yes" property, but had few or no tags. They were a mistake. So for our purposes, looking for tags is a sufficiently reliable way of spotting accessible PDFs.
As for JavaScript, we are currently having issues with viewing PDFs with Edge. PDFs with JavaScript are particularly troublesome. So we have been asked how many of our 3000 PDFs have JavaScript. The guy who came up with a solution--which in addition to showing the user-edited doc-level and field-level code, also shows the Acrobat "AF" functions that format currency and dates in text fields--did it with Java. He didn't go into detail with me, but he did mention "third-party code", and that the process of digging out JavaScript from a PDF is complicated. So no, it's not as simple as sharing his script. I knew it could be done.
And all that stuff about Adobe and ISO? I don't get it. Does that free Adobe from answering questions about their product? Do they think it's OK to let a user community answer such questions on their behalf? A company that truly cares about its customers would look for reasons to help them, not dodge them. If you went to Walmart and asked an employee where the stationery department was, would you be happy to be told to ask another customer?
TTFN