I have a large PDF library (13k+ files) that I indexed with Acrobat.
1 pdf per (title folder) with 1-5 (title folders) per (author folder).
Most of these files have already been OCRed; some have not. How do I run some type of action to identify which files have not been OCRed, without having to open each one manually?
The Acrobat Catalog index log only records "extracting" for each file; it doesn't say whether a file was all image (unrecognized text).
I can't find a way to create an action that runs through all of the files and recognizes only the ones that have not been recognized yet.
Please help.
It's a tough one. You'd need to try to extract text from each page of every file and see whether you get any. This could be done by programming a JavaScript action, but it's not immediately clear how it could report what it finds. If you're not constrained to running inside Acrobat, I suggest you look for a (non-Adobe) text extraction tool for PDFs and just check whether it extracts anything as you iterate over the folders. Much simpler.
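For what it's worth, here is a minimal sketch of that "iterate over the folders and see whether anything extracts" idea in Python. It is only an illustration, not a tested tool for your library: it assumes the third-party pypdf package is installed, and the function names (`default_page_texts`, `has_text`, `find_unrecognized`) and the `min_chars` threshold are made up for this example. It treats a file as "not OCRed" when no page yields any text, so a file where only some pages were recognized would still count as recognized.

```python
import os
from typing import Callable, Iterable, List


def default_page_texts(pdf_path: str) -> Iterable[str]:
    """Yield the extracted text of each page, using pypdf (an assumption)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        yield page.extract_text() or ""


def has_text(
    pdf_path: str,
    page_texts: Callable[[str], Iterable[str]] = default_page_texts,
    min_chars: int = 1,
) -> bool:
    """Return True once at least min_chars non-blank characters are found."""
    total = 0
    for text in page_texts(pdf_path):
        total += len(text.strip())
        if total >= min_chars:
            return True  # stop early; no need to read the remaining pages
    return False


def find_unrecognized(
    root: str,
    page_texts: Callable[[str], Iterable[str]] = default_page_texts,
) -> List[str]:
    """Walk root (author folders, title folders) and list PDFs with no text."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            try:
                if not has_text(path, page_texts):
                    missing.append(path)
            except Exception as exc:  # damaged or encrypted file, etc.
                print(f"could not read {path}: {exc}")
    return sorted(missing)
```

Calling `find_unrecognized()` on the top-level library folder and printing the returned paths gives you the list of files to feed to Recognize Text. Raising `min_chars` may help if some scans contain only a stray extracted character or two.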
I think you can just use an Action with the Recognize Text command and run it on all your files. It will skip any files that already have "real" text in them.
"only recognizing the ones that have not been recognized yet."
It is easier to do the opposite and detect the documents that have already been OCRed, but it comes to the same thing.
You can use the "Invisible text (text rendering mode 3)" Check to create a Profile, then use this Profile in an Action to sort the files.
This will require running two Actions, though. My solution only requires one.
Plus, you'll need to manually copy the files back to the original folder at the end.
Several steps can be added to a Profile or an Action, but you're right: I misunderstood and thought the files only needed to be sorted.