Known Participant
March 24, 2023
Question

Tagged = Yes property


How can I tell if a PDF is tagged for accessibility without opening it and looking at the Properties?  I'd like a solution that can run through thousands of PDFs and tell me which ones are accessible (tagged).  

Similar question: can I tell if a PDF has JavaScript in it?

This topic has been closed for replies.

2 replies

pdc@TD (Author)
Known Participant
March 27, 2023

Thanks for the detailed answer!  I have tried to look for keywords in PDFs using Notepad++, but I wasn't sure what to look for.  Are "StructTreeRoot" and "JavaScript" 100% dependable?

Surely there's a totally reliable way of doing this.  I've heard of a product called SiteImprove which can tell if a PDF is tagged.  How does it know?  And when a screen reader like JAWS reads a PDF, it reads the tag tree.  How does it do it?  There's an API that can be exploited for such things.  All you need is a language like C to access it.

 

Legend
March 28, 2023

Searching for text strings is not 100% reliable, no. To start with, you could get a false hit on strings in metadata; unlikely, perhaps. More importantly, if an object is deleted and the document is simply saved, the old objects remain in the file until a Save As is done, so you will find false hits.
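To make the false-hit point concrete, here is a minimal Python sketch. The byte strings are made-up fragments rather than real PDFs, and the function names are hypothetical. It shows a plain substring search firing on a metadata title, while a slightly stricter pattern (the name followed by an indirect object reference) does not. The stricter pattern is still only a heuristic: stale objects left behind by an incremental save would match it too.

```python
import re

# Hypothetical fragments for illustration only; these are not complete PDFs.
METADATA_ONLY = b"1 0 obj\n<< /Title (Notes on StructTreeRoot) >>\nendobj\n"
CATALOG_ENTRY = b"1 0 obj\n<< /Type /Catalog /StructTreeRoot 5 0 R >>\nendobj\n"

# The PDF name followed by an indirect reference such as "5 0 R".
STRICT_PATTERN = re.compile(rb"/StructTreeRoot\s+\d+\s+\d+\s+R")


def naive_hit(data: bytes) -> bool:
    """Plain substring search, equivalent to searching in a text editor."""
    return b"StructTreeRoot" in data


def stricter_hit(data: bytes) -> bool:
    """Require the name to be used as a dictionary key with a reference."""
    return STRICT_PATTERN.search(data) is not None


print(naive_hit(METADATA_ONLY))     # True  (a false hit)
print(stricter_hit(METADATA_ONLY))  # False
print(stricter_hit(CATALOG_ENTRY))  # True
```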

 

The Acrobat SDK exposes a C++ API for plug-ins that can walk through the objects actually within a PDF. It cannot be used to make standalone apps, on a server, or via scripting. Knowledge of the PDF specification is a must.

ls_rbls
Community Expert
March 28, 2023

Thank you for clarifying @Test Screen Name and always keeping an eye.

ls_rbls
Community Expert
March 27, 2023

++EDITED REPLY, fixed typos

Hi @pdc@TD,

 

If I am not mistaken, all PDF-producing software uses some sort of core JavaScript engine to edit objects in a PDF document and to perform arithmetic operations (among other built-in features).

 

But if you were specifically asking how to detect JavaScript while a PDF document is open in Adobe Acrobat Pro DC (requires a paid subscription), you can go to Tools => JavaScript => All JavaScript.

This method shows all of the scripts that are currently used in a PDF file.

 

Now, based on your primary inquiry, I will assume that both of your questions are about checking many documents in bulk to determine which files are tagged and which contain JavaScript objects.

 

Is that correct?

 

If yes, you can use the Action Wizard tool and create a new custom action that combines other built-in Acrobat tools or a custom JavaScript script.

However, given your requirements, I wouldn't recommend doing it from Acrobat Pro because of possible crashes.

In addition, the Action Wizard opens each document one at a time as it performs the custom actions, to allow user interaction as each document is processed in real time, which makes the whole experience tedious and inefficient.

You're better off writing a batch script yourself.

 

Be aware that when checking for tagged PDFs with a batch script, the script only checks for a text string that may indicate the presence of a structure tree root object. If the script finds a match, it indicates that the PDF has a document structure defined (nothing more).

Moreover, the output of a batch script (like the one I am sharing below) shouldn't be confused with an accessibility check, nor is it meant to validate whether PDFs conform to a required standard; there is much more involved with accessible PDFs (such as PDF/UA compliance, or detecting problems with embedded fonts, for example).

 

Anyway, if you open a PDF that you know is tagged in a plain-text editor (such as Notepad or Notepad++ on Windows, or TextEdit on macOS), you will see the string that your batch script should look for: "StructTreeRoot".

So, just for testing purposes, if you open a PDF in a text editor and a search for "StructTreeRoot" finds a match, chances are that the PDF is tagged, because the match usually comes from the StructTreeRoot entry in the document catalog. See screenshot:

You can test another file the same way by searching for the "JavaScript" string; if the PDF has JavaScript in it, the search will highlight that string, indicating that one or more JavaScript objects are in use in that file.
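The same manual text-editor check can be sketched in Python for a single file. This is only an illustration under assumptions: the function names are hypothetical, and the check is the same raw byte search described above, so it shares the same limits (for example, it cannot see names stored inside compressed object streams). It memory-maps the file so that large PDFs are not loaded into memory all at once.

```python
import mmap
import os


def file_contains(path: str, marker: bytes) -> bool:
    """Return True if the raw bytes of the file contain marker."""
    if os.path.getsize(path) == 0:
        return False  # an empty file cannot be memory-mapped
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view.find(marker) != -1


def check_pdf(path: str) -> dict:
    """Report which of the two strings from this thread appear in the file."""
    return {
        "StructTreeRoot": file_contains(path, b"StructTreeRoot"),
        "JavaScript": file_contains(path, b"JavaScript"),
    }
```

Calling check_pdf on the path of a PDF returns a small dictionary with one True/False entry per string.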

 

 

With these observations in mind, you can now use a batch script like the example below:

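@ECHO OFF
REM Replace userAccount with your own Windows account name.

REM Part 1: copy every PDF on the Desktop that contains "StructTreeRoot".
cd /d "C:\Users\userAccount\Desktop"
if not exist "Tagged_PDFs" mkdir "Tagged_PDFs"
for /f "delims=" %%a in ('findstr /M "StructTreeRoot" *.pdf') do XCOPY /Y "%%a" "C:\Users\userAccount\Desktop\Tagged_PDFs" >NUL

REM Part 2: among those copies, copy the ones that also contain "JavaScript".
cd /d "C:\Users\userAccount\Desktop\Tagged_PDFs"
if not exist "PDFs_with_JS" mkdir "PDFs_with_JS"
for /f "delims=" %%b in ('findstr /M "JavaScript" *.pdf') do XCOPY /Y "%%b" "C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS" >NUL

REM Open the result folder in File Explorer.
START "" "C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS"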

This batch script can be run from any directory on a Microsoft Windows computer. It expects to find the files in the user account's Desktop folder (i.e. C:\Users\userAccount\Desktop, with userAccount replaced by your own account name).

When you run the script, the first part creates a new subfolder, "Tagged_PDFs", and then searches every PDF in the Desktop folder for the "StructTreeRoot" string.

The last step of the first part copies the matching files from C:\Users\userAccount\Desktop to the "Tagged_PDFs" subfolder that was created earlier.

Once the first part finishes, the second part runs. It changes directory from C:\Users\userAccount\Desktop to the Tagged_PDFs subfolder.

There, it creates a new subfolder named "PDFs_with_JS" and performs another string search on the files that were copied to the Tagged_PDFs folder; this time it looks for the "JavaScript" string and copies the matching PDF files from the Tagged_PDFs folder to the new "PDFs_with_JS" subfolder.

The last line of the batch script opens a File Explorer window on the "PDFs_with_JS" folder, which lists only the files that are both tagged and contain JavaScript code.

Although this solution is NOT entirely related to Adobe Acrobat, it will allow you to run through thousands of PDFs and sort them into folders to show which files contain both tags and JavaScript objects.
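For anyone not on Windows, the same two-pass sort can be sketched in Python. This is a rough equivalent under assumptions rather than a tested replacement: the function name is hypothetical, the folder names follow the batch script, and it uses the same raw string search, so it inherits the same limits.

```python
import shutil
from pathlib import Path


def sort_pdfs(source: Path) -> tuple[list[str], list[str]]:
    """Copy PDFs that contain "StructTreeRoot" into Tagged_PDFs, and those
    that also contain "JavaScript" into Tagged_PDFs/PDFs_with_JS.

    Returns the file names copied at each step.
    """
    tagged_dir = source / "Tagged_PDFs"
    js_dir = tagged_dir / "PDFs_with_JS"
    js_dir.mkdir(parents=True, exist_ok=True)  # also creates Tagged_PDFs

    tagged, with_js = [], []
    # Non-recursive, like *.pdf in the batch script, so the copies made in
    # the subfolders are never scanned again.
    for pdf in sorted(source.glob("*.pdf")):
        data = pdf.read_bytes()
        if b"StructTreeRoot" not in data:
            continue
        shutil.copy2(pdf, tagged_dir / pdf.name)
        tagged.append(pdf.name)
        if b"JavaScript" in data:
            shutil.copy2(pdf, js_dir / pdf.name)
            with_js.append(pdf.name)
    return tagged, with_js
```

Unlike the batch script, this reads each file only once and returns the two lists of names, which makes it easy to write the results to a report instead of (or as well as) copying files.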

 

 

ADDITIONAL NOTES:

Batch scripts that process too many files at once may slow down other programs and background services that are running while the script reads and parses those files.

 

I suggest testing first with no more than 250 PDF files at a time to see how it performs.
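If you script the check yourself, one way to follow the 250-files-at-a-time advice is to split the file list into fixed-size batches. A minimal Python sketch, with a hypothetical helper name and the batch size from the note above as the default:

```python
from itertools import islice


def in_batches(items, size=250):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Example: 600 file names split into batches of 250, 250 and 100.
names = [f"file_{n}.pdf" for n in range(600)]
print([len(batch) for batch in in_batches(names)])  # [250, 250, 100]
```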