Known Participant
March 24, 2023
Question

Tagged = Yes property

  • March 24, 2023
  • 2 replies
  • 3612 views

How can I tell if a PDF is tagged for accessibility without opening it and looking at the Properties?  I'd like a solution that can run through thousands of PDFs and tell me which ones are accessible (tagged).  

Similar question: can I tell if a PDF has JavaScript in it?

This topic has been closed for replies.

2 replies

pdc@TD (Author)
Known Participant
March 27, 2023

Thanks for the detailed answer!  I have tried to look for keywords in PDFs using Notepad++, but I wasn't sure what to look for.  Are "StructTreeRoot" and "JavaScript" 100% dependable?

Surely there's a totally reliable way of doing this.  I've heard of a product called SiteImprove which can tell if a PDF is tagged.  How does it know?  And when a screen reader like JAWS reads a PDF, it reads the tag tree.  How does it do it?  There's an API that can be exploited for such things.  All you need is a language like C to access it.

 

Legend
March 28, 2023

Searching for text strings is not 100% reliable, no. To start with, you could get a false hit on strings in metadata. Unlikely, perhaps. More importantly, if an object is deleted and the document is simply saved, all the old objects remain in the file until a Save As is done, so you will get false hits from content that is no longer part of the document.
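The leftover objects described above come from incremental saves, which append a new section to the end of the file instead of rewriting it. A rough way to flag files where that may have happened is to count end-of-file markers. This is only a minimal Python sketch under that assumption, not an Acrobat feature; the function names are made up for illustration. Linearized ("fast web view") files can also carry more than one marker, so a count above one is a hint and nothing more.

```python
from pathlib import Path


def count_eof_markers(data: bytes) -> int:
    """Count %%EOF markers in the raw bytes of a PDF file."""
    return data.count(b"%%EOF")


def may_hold_stale_objects(path: str) -> bool:
    """True if the file has more than one %%EOF marker.

    Assumption: extra markers usually come from incremental saves
    (or linearization), so treat a True result as a hint only.
    """
    return count_eof_markers(Path(path).read_bytes()) > 1
```

Files flagged this way are the ones where a plain string search is most likely to hit content that was deleted but never purged.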

 

The Acrobat SDK exposes a C++ API for plug-ins that can walk through the objects actually within a PDF. It cannot be used to make standalone apps, on a server, or via scripting. Knowledge of the PDF specification is a must.

pdc@TD (Author)
Known Participant
April 21, 2023

 @pdc@TD ,

 

Yeah... speaking for myself, I am very used to ungrateful feedback and harsh critiques (like yours).

 

I don't work for Adobe,  and no it wasn't a difficult problem.

 

It is easy to come back and troll your own post after community members (like myself), do most of the legwork for people like you (who seem to lack enough appetite in genuine learning and ask other people to do work for them) in roughly less than 24 hours for free!

 

That was a voluntary contribution.

 

And I am not sure why you are picking on voluntary contributors who have helped you, just to vent a personal frustration that you yourself seem to have with Adobe Inc.

 

Anyway, this is what you asked:

 

I'd like a solution that can run through thousands of PDFs and tell me which ones are accessible (tagged).  

Similar question: can I tell if a PDF has JavaScript in it?

 

Note where you say "accessible (tagged)".

 

Since a PDF document can be tagged, it seems like it was you who was having a hard time understanding that tagged is not equal to accessible.

 

Just because a PDF document is tagged does not mean that it is also accessible (or that it meets accessibility compliance standards).

 

The batch script does detect PDFs that are tagged.

 

And it also detects if they have JavaScript on them.

 

Moreover, if you ever care about learning how to script your own codes, the batch script can be modified to also search for PDFs that are encrypted.
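As a rough illustration of the same string-search idea applied to encryption, here is a minimal sketch in Python rather than batch. It assumes that an encrypted PDF exposes an /Encrypt key in its trailer or cross-reference stream dictionary, which is stored uncompressed; the function name is hypothetical, and the check shares every limitation of a raw text search.

```python
import re

# Match the /Encrypt key itself. The trailing \b keeps longer names such
# as /EncryptMetadata from matching at the same position.
ENCRYPT_KEY = re.compile(rb"/Encrypt\b")


def looks_encrypted(data: bytes) -> bool:
    """True if the raw PDF bytes contain an /Encrypt key (heuristic only)."""
    return ENCRYPT_KEY.search(data) is not None
```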

 

You made it complicated by asking if this method was reliable, and also by bringing up the idea of exploiting APIs.

 

Your original question had nothing to do with how to check for accessibility compliance, or how to exploit APIs using other programming languages to check a PDF for accessibility with 100% reliability.

 

So, I did answer your question thoroughly.

 

And the batch script that I am sharing is 100% reliable to check for tagged PDFs in bulk using text string searches.

 

It seems to me more like you didn't even try the batch script.

 

For what it's worth, it seems like you need to be told AGAIN that it is not reliable to check for PDF accessibility using batch scripts that search for text strings in a file.

 

This was already clearly explained to you twice in this thread.

 

But anyway, why don't you share your friend's script, written in Java, here in these community forums?

 

Would that be a problem  ???

 

Maybe we can all benefit (including every other community member who is reading).


Thanks all.

I do understand the difference between accessible and tagged. Our team has been asked several times which of our ~3000 PDFs are accessible.  There was a time when our little team did the tagging using Acrobat (yuck), but now we farm the work out to a company that has decent tools.  The PDFs all get thoroughly tested with JAWS.  So if a PDF has tags, chances are very good that it is accessible.  I've only seen a couple of PDFs that had a "tagged yes" property, but had few or no tags.  They were a mistake.  So for our purposes, looking for tags is a sufficiently reliable way of spotting accessible PDFs.

As for JavaScript, we are currently having issues with viewing PDFs with Edge.  PDFs with JavaScript are particularly troublesome.  So we have been asked how many of our 3000 PDFs have JavaScript.  The guy who came up with a solution--which in addition to showing the user-edited doc-level and field-level code, also shows the Acrobat "AF" functions that format currency and dates in text fields--did it with Java.  He didn't go into detail with me, but he did mention "third-party code", and that the process of digging out JavaScript from a PDF is complicated.  So no, it's not as simple as sharing his script. I knew it could be done. 

And all that stuff about Adobe and ISO?  I don't get it.  Does that free Adobe from answering questions about their product?  Do they think it's OK to let a user community answer such questions on their behalf?  A company that truly cares about its customers would look for reasons to help them, not dodge them.  If you went to Walmart and asked an employee where the stationery department was, would you be happy to be told to ask another customer?

TTFN

ls_rbls
Community Expert
March 27, 2023

++EDITED REPLY , fixed typos

 

Hi @pdc@TD ,

 

If I am not mistaken, all PDF producing software use some sort of core JavaScript engine to edit PDF objects in a PDF document and also to perform arithmetic operations (among other built-in features).

 

But if you were specifically asking how to detect the presence of any JavaScript scripts while a PDF document is open in Adobe Acrobat Pro DC (requires a paid subscription), you may go to Tools => JavaScript => All JavaScript.

 

This method will show the user all of the scripts that are currently used in a PDF file.

 

Now, based on your primary inquiry, I will assume that in both of your  questions you are referring to how to check many documents in bulk and determine if all of those files are both tagged and also contain JavaScript objects.

 

Is that correct?

 

If yes, you may employ the Action Wizard tool and create a custom action that combines other built-in Acrobat tools or a custom JavaScript script.

 

However, based on your requirement criteria I wouldn't recommend doing it from the Acrobat Pro program due to possible crashes.

 

In addition, the Action Wizard opens each document one at a time as it performs the custom actions, to allow user interaction as each document is processed in real time (which makes the whole experience tedious and inefficient).

 

You're better off writing a batch script yourself.

 

Be aware that when checking for tagged PDFs with a batch script, the script only checks for a text string that may indicate the presence of a structure tree root object. If the batch script finds it, that indicates only that the PDF has a document structure defined (nothing more).

 

Moreover, it is also worth noting that the output of a batch script (like the one I am sharing below) shouldn't be confused with an accessibility check, nor is it meant to validate PDFs against a required standard; there is so much more involved with accessible PDFs (such as PDF/UA compliance standards, or detecting problems with embedded fonts, for example).

 

Anyway, if you open a PDF that you know is tagged in a text editor (such as Microsoft Notepad or Notepad++ on Windows, or TextEdit on macOS), you will find the string that your batch script should look for: the "StructTreeRoot" element (or property).

 

So, just for testing purposes, if you open a PDF with a text editor and a search for the "StructTreeRoot" string pattern finds a match, then chances are that the PDF document is tagged, because the file's document catalog contains a StructTreeRoot entry.
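The manual search above can be sketched in a few lines of Python. This is a hedged illustration, not a validator: the function names are hypothetical, and it searches for the PDF name token /StructTreeRoot (with the leading slash) so that the bare word appearing in ordinary text is less likely to match. One assumption to keep in mind: files saved with compressed object streams may not show the token in their raw bytes at all, so a negative result does not prove that a file is untagged.

```python
import re
from pathlib import Path

# The catalog entry is written as the name /StructTreeRoot followed by
# an object reference, e.g. "/StructTreeRoot 12 0 R".
STRUCT_TREE = re.compile(rb"/StructTreeRoot\b")


def has_struct_tree_marker(data: bytes) -> bool:
    """True if the raw bytes contain a /StructTreeRoot name token."""
    return STRUCT_TREE.search(data) is not None


def pdf_has_struct_tree_marker(path: str) -> bool:
    """Read a file from disk and apply the same heuristic."""
    return has_struct_tree_marker(Path(path).read_bytes())
```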

 

 

You may test another file the same way by searching for the "JavaScript" string pattern; if the PDF has JavaScript scripts in it, the search will highlight that text string, indicating that one or more JavaScript objects are in use in that file.
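The JavaScript search can be sketched the same way in Python. This minimal example assumes that scripts show up as a /JavaScript name (the action type, or the document-level name tree) or as a /JS key holding the code; the function name is made up for illustration. Like any raw text search, it can miss scripts stored inside compressed objects and can match objects that are no longer in use.

```python
import re

# Match either name token. The trailing \b stops longer names that
# merely start with the same letters (for example /JSON) from matching.
JS_TOKENS = re.compile(rb"/(?:JavaScript|JS)\b")


def has_javascript_marker(data: bytes) -> bool:
    """True if the raw bytes contain a /JavaScript or /JS name token."""
    return JS_TOKENS.search(data) is not None
```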

 

 

With these observations in mind, now you can employ a batch script like the example below:

 

 

 

 

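@ECHO OFF
REM Folder to scan; replace userAccount with your Windows account name.
SET "SRC=C:\Users\userAccount\Desktop"
SET "TAGGED=%SRC%\Tagged_PDFs"
SET "WITHJS=%TAGGED%\PDFs_with_JS"

REM Part 1: copy every PDF that contains the string "StructTreeRoot".
cd /d "%SRC%"
if not exist "%TAGGED%" mkdir "%TAGGED%"
REM "delims=" keeps file names that contain spaces in one piece.
for /f "delims=" %%a in ('findstr /M /L /C:"StructTreeRoot" *.pdf') do XCOPY /Y "%%a" "%TAGGED%\" >NUL

REM Part 2: among those copies, copy the ones that also contain "JavaScript".
cd /d "%TAGGED%"
if not exist "%WITHJS%" mkdir "%WITHJS%"
for /f "delims=" %%b in ('findstr /M /L /C:"JavaScript" *.pdf') do XCOPY /Y "%%b" "%WITHJS%\" >NUL

REM Open the result folder in File Explorer.
START "" "%WITHJS%"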

This batch script can be executed from any directory on a Microsoft Windows computer. It expects to find the files in the user account's Desktop folder (i.e. the C:\Users\userAccount\Desktop directory).

 

When you run this batch script, it first creates a new subfolder, "Tagged_PDFs", and then searches every PDF in the Desktop folder for the "StructTreeRoot" string pattern.

 

The last portion of the first part copies the matching files from the C:\Users\userAccount\Desktop directory to the "Tagged_PDFs" subfolder that was created earlier.

 

As soon as the first part finishes, the second part runs; it switches directory from C:\Users\userAccount\Desktop to the Tagged_PDFs subfolder.

 

There, it creates a new subfolder named "PDFs_with_JS" and performs another string search on the files that were identified as having a tag structure and copied to the Tagged_PDFs folder; this time it looks for the "JavaScript" string pattern and copies the matching PDF files from the Tagged_PDFs folder to the new "PDFs_with_JS" subfolder.

 

The last line of the batch script opens a File Explorer window on the "PDFs_with_JS" folder, which lists only the files that are both tagged and contain JavaScript code.

 

Although this solution is NOT entirely related to Adobe Acrobat, it will allow you to run through thousands of PDFs and collect in a folder the files that contain both tag properties and JavaScript objects.
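Because the batch script only looks for JavaScript among the files it has already identified as tagged, it cannot answer the two original questions separately. The sketch below is one possible alternative in Python, under the same text-search assumptions: it records both markers independently for every PDF and writes a CSV report instead of copying files. The function names, column names, and report path are all hypothetical.

```python
import csv
import re
from pathlib import Path

TAGGED = re.compile(rb"/StructTreeRoot\b")
SCRIPTED = re.compile(rb"/(?:JavaScript|JS)\b")
COLUMNS = ["file", "struct_tree_marker", "javascript_marker"]


def scan_folder(folder: str) -> list[dict]:
    """Return one row per *.pdf file found under folder (recursively)."""
    rows = []
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        data = pdf.read_bytes()  # whole file in memory; fine for typical PDFs
        rows.append({
            "file": str(pdf),
            "struct_tree_marker": TAGGED.search(data) is not None,
            "javascript_marker": SCRIPTED.search(data) is not None,
        })
    return rows


def write_report(rows: list[dict], out_path: str) -> None:
    """Write the scan results as a CSV file with a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_report(scan_folder(r"C:\Users\userAccount\Desktop"), "report.csv")` would produce one line per PDF, which can then be filtered in a spreadsheet. The results are still only hints, for the reasons already given in this thread.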

 

 

ADDITIONAL NOTES:

Batch scripts that process too many files may hog resources needed by other background services or programs that are open while the script executes and parses those files.

 

I would suggest testing first with no more than 250 PDF files at a time to see how it performs.
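If the scan is scripted in something other than batch, the "250 files at a time" advice can be followed with a small helper. This is a minimal Python sketch; the helper name `in_batches` is hypothetical, and the default of 250 simply mirrors the suggestion above.

```python
from itertools import islice
from typing import Iterable, Iterator


def in_batches(items: Iterable, size: int = 250) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

A loop such as `for batch in in_batches(sorted(folder.glob("*.pdf"))):` (where `folder` is a `pathlib.Path`) then processes one group of files before moving on to the next.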