
Bulk Identify Non-OCR Files in a Large Library and OCR Them

New Here, May 18, 2022


I have a large PDF library (13k+ files) that I indexed with Acrobat.

It is organized as one PDF per title folder, with 1-5 title folders per author folder.

Most of these files have been OCRed already. Some have not. How do I run some type of action to identify which files have not been OCRed, without having to manually open each one?

The Acrobat Catalog index log only logs "extracting" for each file; it doesn't specify whether a file was all image (unrecognized text).

I can't find a way to create an action that runs through all of the files and recognizes text only in the ones that have not been recognized yet.

Please help.

TOPICS
Edit and convert PDFs, How to, Scan documents and OCR

LEGEND, May 18, 2022


It's a tough one. You'd need to try to extract text from each page of every file and see whether you get any. This could be done by programming a JavaScript action, but it's not immediately clear how it would report what it finds. If you're not constrained to running inside Acrobat, I suggest you look for a (non-Adobe) text extraction tool for PDFs and just check whether it extracts anything as you iterate over the folders. Much simpler.
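To make the "iterate over the folders and check whether anything extracts" idea concrete, here is a rough Python sketch. It is only an illustration, not a tested tool: the function names are made up for this example, and the default extractor assumes the third-party pypdf package is installed (any other text extraction tool could be plugged in through the `extract` parameter). A file is reported when all of its pages together yield fewer than `min_chars` non-whitespace-trimmed characters.

```python
import os
from typing import Callable, Iterator, List


def find_pdfs(root: str) -> Iterator[str]:
    """Yield every .pdf path below root (author folder / title folder / file)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def page_texts_with_pypdf(path: str) -> List[str]:
    """Return the extracted text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily; only needed for this extractor

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def has_extractable_text(page_texts: List[str], min_chars: int = 1) -> bool:
    """True if the pages together contain at least min_chars characters."""
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


def files_without_text(
    root: str,
    extract: Callable[[str], List[str]] = page_texts_with_pypdf,
    min_chars: int = 1,
) -> List[str]:
    """Return a sorted list of PDFs under root that yield no usable text."""
    return sorted(
        path
        for path in find_pdfs(root)
        if not has_extractable_text(extract(path), min_chars)
    )
```

Calling `files_without_text(r"D:\Library")` (a hypothetical path) would return the candidates for OCR. Raising `min_chars` may help with scans that carry only a stamped page number or similar stray text, and encrypted or damaged files would still need their own error handling.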

Community Expert, May 18, 2022


I think you can just use an Action with the Recognize Text command and run it on all your files. It will skip any files that already have "real" text in them.

Community Expert, May 18, 2022


"only recognizing the ones that have not been recognized yet."

It is easier to do the opposite and detect the documents that have already been OCRed, but it amounts to the same thing.


You can use the "Invisible text (text rendering mode 3)" Check to create a Profile, then use this Profile in an Action to sort the files.
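For anyone curious what that check looks for: in a PDF content stream the text rendering mode is set with the `Tr` operator, and mode 3 means the text is neither filled nor stroked, which is how OCR engines typically place hidden text behind the scanned image. The Python sketch below is only a rough heuristic and not what Preflight does internally. It assumes you already have the decoded (uncompressed) content stream bytes from some PDF library, and the function name is invented for this example.

```python
import re

# Matches the integer operand 3 followed by the Tr operator, e.g. b"3 Tr".
# The lookarounds avoid hits on operands such as "13 Tr" or names like "Trx".
INVISIBLE_TEXT_MODE = re.compile(rb"(?<![0-9A-Za-z.+\-])3\s+Tr(?![0-9A-Za-z])")


def uses_invisible_text(decoded_content: bytes) -> bool:
    """Heuristic: True if the decoded content stream sets text rendering mode 3."""
    return INVISIBLE_TEXT_MODE.search(decoded_content) is not None
```

Note that this can be fooled by the same byte sequence appearing inside a string literal, and OCR output that was saved as visible, editable text will not use mode 3 at all, so it detects one common kind of OCR layer rather than "has text" in general.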

 

[Screenshots: Capture_441.png, Capture_439.png]

Community Expert, May 19, 2022


This will require running two Actions, though. My solution only requires one.

Plus, you'll need to manually copy the files back to the original folder at the end.

Community Expert, May 19, 2022


Several steps can be added to a Profile or to an Action, but you're right: I misunderstood and thought the goal was only to sort the files.
