Bulk identify non-OCR files in a large library and OCR them

May 18, 2022

I have a large PDF library (13k+ files) that I indexed with Acrobat.

There is one PDF per title folder, with 1-5 title folders per author folder.

Most of these files have been OCRed already; some have not. How do I run some type of action to identify which files have not been OCRed, without having to open each one manually?

The Acrobat Catalog index log only records "extracting" for each file; it doesn't say whether the file was all image (unrecognized text).

I can't find a way to create an action to run through all of the files, only recognizing the ones that have not been recognized yet.

Please help.


JR Boulay

Community Expert

"only recognizing the ones that have not been recognized yet."

It is easier to do the opposite: detect already OCRized documents, but it's the same thing.

You can use the "Invisible text (text rendering mode 3)" Check to create a Profile, then use this Profile in an Action to sort the files.

Acrobate du PDF, InDesigner et Photoshopographe

try67

Community Expert

This will require running two Actions, though. My solution only requires one.

Plus, you'll need to manually copy the files back to the original folder at the end.

JR Boulay

Community Expert

Several steps can be added in a Profile or in an Action, but you're right: I had misunderstood and thought it was only necessary to sort them out.

Acrobate du PDF, InDesigner et Photoshopographe

try67

Community Expert

I think you can just use an Action with the Recognize Text command and run it on all your files. It will skip any files that already have "real" text in them.

Test Screen Name

Legend

It's a tough one. You'd need to try to extract text from each page of every file, and see if you get any text. This could be done by programming a JavaScript action, but it's not immediately clear how it could report what it finds. If you're not constrained to running inside Acrobat, I suggest you look for a (non-Adobe) text extraction tool for PDFs, and just check whether it extracts anything as you iterate over the folders. Much simpler.
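As a rough illustration of the "iterate over the folders and check whether anything extracts" idea, here is a minimal Python sketch. It is not something from this thread: it assumes Poppler's `pdftotext` command-line tool is installed and on the PATH, and the function names and the `min_chars` threshold are hypothetical choices made for the example. Any other extractor can be swapped in through the `extract` parameter.

```python
import subprocess
from pathlib import Path
from typing import Callable


def extract_text_with_pdftotext(pdf_path: Path) -> str:
    """Return the text of a PDF via Poppler's pdftotext (assumed installed)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    # A failed extraction (corrupt or protected file) is treated as "no text",
    # so such files are also reported for manual review.
    return result.stdout if result.returncode == 0 else ""


def find_pdfs_without_text(
    root: Path,
    extract: Callable[[Path], str] = extract_text_with_pdftotext,
    min_chars: int = 20,  # hypothetical threshold; tune for your library
) -> list[Path]:
    """Walk root recursively and list PDFs with (almost) no extractable text."""
    pdfs = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    missing = []
    for pdf_path in pdfs:
        text = extract(pdf_path)
        # Ignore whitespace and form feeds when counting characters.
        if len("".join(text.split())) < min_chars:
            missing.append(pdf_path)
    return missing
```

Calling `find_pdfs_without_text(Path(...))` on the top-level library folder returns the paths that probably still need OCR; that list can then be fed to an Action or copied to a working folder. A file that has text on only some pages will pass this check, so a per-page test would be needed to catch partially recognized documents.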
