Known Participant
March 24, 2023
Question

Tagged = Yes property


How can I tell if a PDF is tagged for accessibility without opening it and looking at the Properties?  I'd like a solution that can run through thousands of PDFs and tell me which ones are accessible (tagged).  

Similar question: can I tell if a PDF has JavaScript in it?


2 replies

pdc@TDAuthor
Known Participant
March 27, 2023

Thanks for the detailed answer!  I have tried to look for keywords in PDFs using Notepad++, but I wasn't sure what to look for.  Are "StructTreeRoot" and "JavaScript" 100% dependable?

Surely there's a totally reliable way of doing this.  I've heard of a product called SiteImprove which can tell if a PDF is tagged.  How does it know?  And when a screen reader like JAWS reads a PDF, it reads the tag tree.  How does it do it?  There's an API that can be exploited for such things.  All you need is a language like C to access it.

 

ls_rbls
Community Expert
March 28, 2023

You're welcome.

 

For most files, yes: those two strings will tell you whether a PDF has a tag structure or contains JavaScript. They are not 100% dependable, though, because PDF 1.5 and later can store objects in compressed object streams, where the key names no longer appear as plain text.

Another issue is that if some or most of the PDFs parsed by the batch script are encrypted (password protected), some files may not be read correctly.

In that case, you may also modify the batch script to search for the string "Encrypt" (the /Encrypt entry in the file trailer) to determine whether a PDF is password protected.
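
As a rough illustration of the string checks discussed above, here is a minimal Python sketch (Python rather than batch, purely so it is easy to test). The function names and the labels in MARKERS are hypothetical, and the check for "/ObjStm" is my own addition: it only flags files that use object streams, in which case a plain-text search may miss keys that are stored compressed.

```python
from pathlib import Path

# Byte patterns to search for. The labels are hypothetical names for this sketch.
MARKERS = {
    "tagged": b"/StructTreeRoot",    # structure tree root entry
    "javascript": b"/JavaScript",    # JavaScript action or name tree
    "encrypted": b"/Encrypt",        # encryption entry in the trailer
    "object_streams": b"/ObjStm",    # compressed object streams are in use
}

def scan_pdf_bytes(data: bytes) -> dict:
    """Report which marker strings occur anywhere in the raw file bytes."""
    return {label: pattern in data for label, pattern in MARKERS.items()}

def scan_pdf(path) -> dict:
    """Read a file from disk and run the same raw-byte search on it."""
    return scan_pdf_bytes(Path(path).read_bytes())
```

If "object_streams" or "encrypted" comes back True, a negative result for the other two markers should be treated as "unknown" rather than "no".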

 

Anyway, all my batch script does is find text strings.

 

And just because a PDF document appears to be tagged after the batch script runs, it does not mean that the PDF is accessible or that it conforms in any way to accessibility standards.

 

Programs (or online services) like SiteImprove are needed to assess accessibility compliance, and a metadata tool like ExifTool can be used for a closer look at a PDF's tag-related properties.

 

Such tools employ more advanced methods that parse the PDF's objects and extract its metadata (such as the XMP metadata stream) instead of just searching for text.

 

All of that information is documented in the PDF specification, so experimenting with APIs in other programming languages is not exactly a necessity (unless you are developing your own plug-ins, for example).

 

For a batch script that examines the PDF's Mark Info properties, all you would be looking for is whether the "taggedPDF" descriptor, which corresponds to the /Marked entry of the catalog's MarkInfo dictionary, is true or false.
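
To make the true/false check concrete, here is a small Python sketch (Python, not batch). It assumes the flag in question is the /Marked entry of the catalog's MarkInfo dictionary and that the dictionary is written inline and uncompressed. The name marked_flag is hypothetical. If MarkInfo is stored as an indirect reference or inside an object stream, the sketch returns None instead of guessing.

```python
import re

# Matches a MarkInfo dictionary written inline, e.g. "/MarkInfo << /Marked true >>".
MARKED_RE = re.compile(rb"/MarkInfo\s*<<[^>]*?/Marked\s+(true|false)")

def marked_flag(data: bytes):
    """Return True or False if an inline /Marked flag is found, else None."""
    match = MARKED_RE.search(data)
    if match is None:
        return None  # not found as plain text; the file needs a real PDF parser
    return match.group(1) == b"true"
```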

 

See more here:

 

 

 

 

ls_rbls
Community Expert
March 27, 2023

++EDITED REPLY , fixed typos

 

Hi @pdc@TD ,

 

If I am not mistaken, Adobe Acrobat and many other PDF programs include a JavaScript engine that lets scripts work with objects in a PDF document and perform arithmetic operations (among other built-in features).

 

But if you were specifically referring to how to find any JavaScript while a PDF document is open in Adobe Acrobat Pro DC (requires a paid subscription), you may go to Tools => JavaScript => All JavaScript.

 

This method will show the user all of the scripts that are currently used in a PDF file.

 

Now, based on your primary inquiry, I will assume that in both of your  questions you are referring to how to check many documents in bulk and determine if all of those files are both tagged and also contain JavaScript objects.

 

Is that correct?

 

If yes, you may employ the Action Wizard tool and customize a new action in combination with other readily available Acrobat built-in tools or with a custom JavaScript script.

 

However, based on your requirement criteria I wouldn't recommend doing it from the Acrobat Pro program due to possible crashes.

 

In addition, the Action Wizard will try to open each document one at a time as it performs the custom actions, to allow user interaction as each document is processed in real time (which makes the whole experience tedious and inefficient).

 

You're better off writing a batch script yourself.

 

Be aware that when checking for tagged PDFs with a batch script, the script only checks for a text string that may indicate the presence of a structure tree root object. If the batch script finds a match, it indicates that the PDF has a document structure defined (nothing more).

 

Moreover, it is also worth noting that the output of a batch script (like the one I am sharing below) shouldn't be confused with accessibility checks, nor is it meant to validate PDFs to see if they conform to a required standard; there is much more involved with accessible PDFs (such as PDF/UA compliance, or detecting problems with embedded fonts, for example).

 

Anyway, if you open a PDF that you know is tagged in a text editor (such as Microsoft's Notepad or Notepad++ on Windows, or TextEdit on macOS), you will see that the string your batch script should look for is the "StructTreeRoot" element (or property).

 

So, just for testing purposes, if you open a PDF with a text editor and a search for the "StructTreeRoot" string pattern finds a match, then chances are that the PDF document is tagged, because the document catalog of that file contains a StructTreeRoot entry. See screenshot:

 

 

You may test another file using the same method to search for the "JavaScript" string pattern; if the PDF has JavaScript in it, the search will highlight that text string, indicating that one or more JavaScript objects are in use in that file.

 

 

With these observations in mind, you can now employ a batch script like the example below:


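@ECHO OFF

REM Work from the Desktop folder that holds the PDFs to check.
cd /d "C:\Users\userAccount\Desktop"

REM Create the output folder only if it does not exist yet.
if not exist "C:\Users\userAccount\Desktop\Tagged_PDFs" mkdir "C:\Users\userAccount\Desktop\Tagged_PDFs"

REM "delims=" keeps file names that contain spaces in one piece.
for /f "delims=" %%a in ('findstr /M "StructTreeRoot" *.pdf') do XCOPY /Y "%%a" "C:\Users\userAccount\Desktop\Tagged_PDFs" >NUL

REM Repeat the search inside Tagged_PDFs, this time for JavaScript.
cd /d "C:\Users\userAccount\Desktop\Tagged_PDFs"

if not exist "C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS" mkdir "C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS"

for /f "delims=" %%b in ('findstr /M "JavaScript" *.pdf') do XCOPY /Y "%%b" "C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS" >NUL

REM Open the result folder in File Explorer (the empty "" is the window title).
START "" "C:\Users\userAccount\Desktop\Tagged_PDFs\PDFs_with_JS"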

This batch script can be executed from any directory on a Microsoft Windows computer. It expects to find the files in the user account's Desktop folder (i.e. the C:\Users\userAccount\Desktop directory).

 

As you execute this batch script, the first action creates a new subfolder, "Tagged_PDFs", and then searches every PDF in the Desktop folder for the "StructTreeRoot" string pattern.

 

The last portion of the first script copies the matched files found in the C:\Users\userAccount\Desktop folder to the new "Tagged_PDFs" subfolder that was created earlier.

 

As soon as the first script finishes, the second script runs; it switches directory from C:\Users\userAccount\Desktop to the Tagged_PDFs subfolder.

 

In there, it creates a new subfolder named "PDFs_with_JS" and performs another string search on the files that were identified as having a tag structure and copied to the Tagged_PDFs folder; this time it looks for the "JavaScript" string pattern and copies the PDF files that match from the Tagged_PDFs folder to the new "PDFs_with_JS" subfolder.

 

The last line of the batch script just opens a new File Explorer window on the "PDFs_with_JS" folder, which lists only the files that are both tagged and have JavaScript code in them.

 

Although this solution is NOT entirely related to Adobe Acrobat, it will allow you to run through thousands of PDFs and sort into a folder the files that contain both tag properties and JavaScript objects.
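
For anyone who would rather get a report than copy files around, here is a minimal Python sketch of the same two string searches (Python, not batch). The function names are hypothetical, and it has the same limits as the batch script: it only finds the strings when they appear as plain text in the file.

```python
from pathlib import Path

def classify_folder(folder):
    """Map each PDF file name in `folder` to (has_struct_tree, has_javascript)."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        data = pdf.read_bytes()
        results[pdf.name] = (b"StructTreeRoot" in data, b"JavaScript" in data)
    return results

def tagged_with_js(results):
    """Names of files that matched both strings, like the PDFs_with_JS folder."""
    return [name for name, (tagged, has_js) in results.items() if tagged and has_js]
```

Calling classify_folder on the Desktop folder and passing the result to tagged_with_js should list the same files the batch script copies into PDFs_with_JS, without creating any folders.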

 

 

ADDITIONAL NOTES:

Batch scripts that process too many files may slow down other background services or programs that are open while the script executes and parses those files.

 

I would suggest testing first with no more than 250 PDF files at a time to see how it performs.
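
If the files are processed from a script, one simple way to follow the 250-at-a-time suggestion is to split the file list into chunks. This is only a Python sketch (not batch). The name batches is hypothetical, and 250 is just the starting figure mentioned above, not a tested limit.

```python
from itertools import islice

def batches(items, size=250):
    """Yield consecutive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each chunk can then be scanned in turn, with a pause or a progress message between chunks if the machine gets sluggish.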