Bulk identify non-OCR files in a large library and OCR them

May 18, 2022

I have a large PDF library (13k+ files) that I indexed with Acrobat.

There is one PDF per title folder, with 1-5 title folders per author folder.

Most of these files have already been OCRed, but some have not. How can I run some kind of action to identify which files have not been OCRed, without having to open each one manually?

The Acrobat Catalog index log only records "extracting" for each file; it doesn't say whether the file was all image (unrecognized text).

I can't find a way to create an action that runs through all of the files and recognizes only the ones that have not been recognized yet.

Please help.

May 18, 2022

It's a tough one. You'd need to try to extract text from each page of every file and see whether you get any text. This could be done by programming a JavaScript action, but it's not immediately clear how it could report what it finds. If you're not constrained to running inside Acrobat, I suggest you look for a (non-Adobe) text extraction tool for PDFs and just check whether it extracts anything as you iterate over the folders. Much simpler.
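To illustrate the "extract text and see if anything comes out" idea outside Acrobat, here is a rough Python sketch. It assumes the third-party pypdf package is installed; the function names `has_text_pypdf` and `find_files_without_text` and the 20-character threshold are my own choices for illustration, not anything from Acrobat. Treat it as a starting point under those assumptions, not a tested solution for your library.

```python
from pathlib import Path
from typing import Callable, List


def has_text_pypdf(pdf_path: Path, min_chars: int = 20) -> bool:
    """Return True once at least min_chars of text have been extracted.

    Assumes the third-party pypdf package; it is imported lazily so the
    folder-walking logic below can be used with another extractor instead.
    """
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    found = 0
    for page in reader.pages:
        found += len((page.extract_text() or "").strip())
        if found >= min_chars:
            return True  # stop early: this file clearly has a text layer
    return False


def find_files_without_text(
    root,
    has_text: Callable[[Path], bool] = has_text_pypdf,
) -> List[Path]:
    """Walk root recursively and list the PDFs that yield no usable text."""
    missing = []
    for path in sorted(Path(root).rglob("*")):
        # Compare the extension case-insensitively so ".PDF" is not skipped.
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        try:
            if not has_text(path):
                missing.append(path)
        except Exception as exc:
            # Damaged or encrypted files are reported and skipped, not flagged.
            print(f"could not read {path}: {exc}")
    return missing
```

Calling `find_files_without_text("D:/Library")` (a hypothetical path) would return the PDFs that appear to have no text layer, which you could then write to a list or copy to a separate folder for OCR. Two caveats: a scanned file that carries only a stamped header or page number may still pass the threshold, and a file where only some pages were OCRed counts as having text.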

May 18, 2022

I think you can just use an Action with the Recognize Text command and run it on all your files. It will skip any files that already have "real" text in them.

May 18, 2022

"only recognizing the ones that have not been recognized yet."

It is easier to do the opposite and detect the documents that are already OCRed, which amounts to the same thing.

You can use the "Invisible text (text rendering mode 3)" Check to create a Profile, then use this Profile in an Action to sort the files.

Acrobate du PDF, InDesigner et Photoshopographe

May 19, 2022

This will require running two Actions, though. My solution only requires one.

Plus, you'll need to manually copy the files back to the original folder at the end.

May 19, 2022

Several steps can be added to a Profile or to an Action, but you're right: I misunderstood and thought the goal was only to sort the files.

Acrobate du PDF, InDesigner et Photoshopographe