
Identify Non-OCR files in a Large Library and OCR them

Community Beginner, Feb 03, 2024

Hello,

 

I have around 5000 PDF files in various folders/subfolders; most of them are OCRed already, but some are not.

The thing is, when I use the OCR tool on my root folder, it also OCRs the files that are already OCRed, which consumes a lot of time and resources unnecessarily.

So my question is: How could I OCR only the files which are not OCRed already, without having to check manually?

 

Many thanks in advance!

 

TOPICS
Edit and convert PDFs , Scan documents and OCR

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert, Feb 03, 2024

1. Sort the files with an Action that runs a Preflight profile and places them in two folders (Success or Error).

Search for "mode 3" in Preflight; this Check identifies OCRized files (text rendering mode 3 is invisible text, which is how an OCR text layer is normally stored). Embed it in a custom Profile, because an Action can only use Profiles, not a Check directly.

 

2. Use an Action to OCRize those that are not.
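
To illustrate what a "mode 3" check looks for, here is a minimal Python sketch. It is not how Preflight is implemented; it only shows the idea under stated assumptions: the page content stream has already been extracted and decompressed by some other tool, and a simple pattern match on the `3 Tr` operator (set text rendering mode to invisible) is good enough. The function name `has_invisible_text` is made up for this example.

```python
import re

# "3 Tr" sets the PDF text rendering mode to 3 (invisible text).
# The lookbehind keeps operands such as "13" or "0.3" from matching.
INVISIBLE_TEXT = re.compile(rb"(?<![\d.])3\s+Tr\b")


def has_invisible_text(content_stream: bytes) -> bool:
    """Return True if a decoded page content stream selects rendering mode 3.

    Assumption: content_stream is already decompressed. This is a crude
    heuristic, since the same bytes could also occur inside a string literal.
    """
    return INVISIBLE_TEXT.search(content_stream) is not None


# Hypothetical content streams, for illustration only.
ocr_layer = b"BT 3 Tr /F1 12 Tf (hello) Tj ET"
visible_text = b"BT 0 Tr /F1 12 Tf (hello) Tj ET"
print(has_invisible_text(ocr_layer), has_invisible_text(visible_text))
```

A file with no text at all and a file with only visible text both fail this test, so it separates "has an OCR layer" from everything else rather than "needs OCR" from "does not".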

 

Capture_2402031234.png

 

Capture_2402031238.png

Community Expert, Feb 03, 2024

PS: an Action can't move files into the Success/Error folders; it has to copy them, but this isn't a real problem.

Community Beginner, Feb 03, 2024

Hi JR,

 

Thank you very much for your response.

I created the profile and action as you suggested. It created two subfolders, one with the OCRized files and one with those that are not OCRized yet.

But now, how can I OCR the files in the non-OCRized folder? And after this process, how can I move these files back to their original folders (overwriting the non-OCRized ones)?

 

Also, I have a great many folders and subfolders; the PDF files are organized as follows, with each Subfolder_x containing a varying number of PDF files:

Root_Folder\Folder_1\Subfolder_1

Root_Folder\Folder_1\Subfolder_2

Root_Folder\Folder_1\Subfolder_3

etc.

Root_Folder\Folder_2\Subfolder_1

Root_Folder\Folder_2\Subfolder_2

Root_Folder\Folder_2\Subfolder_3

Root_Folder\Folder_2\Subfolder_4

Root_Folder\Folder_2\Subfolder_5

etc.

Root_Folder\Folder_t\Subfolder_1

...

Root_Folder\Folder_t\Subfolder_n

 

How can I run an action on the Root_Folder so that every Folder_x and its subfolders are processed, and the processed/OCRized files stay in the same subfolders as before?

 

Community Expert, Feb 03, 2024

Like this:

 

Capture_2402031903.png

Community Beginner, Feb 04, 2024

Hi JR,

 

Thanks, but I don't understand how this will solve my issue. If I simply run this action as shown, it runs the OCR tool on all my PDF files, even those that are already OCRized.

I want to OCR only the files that are not OCRized yet, without having to move files back to their original folders manually.

 

I guess I need a combination of the first "sort" action you suggested and the "OCR" action, which would look something like this:

1/ detect non OCRized-files

2/ OCR those files which were detected

3/ move back files which were detected to their original folder

But how can we move files to their original folder?

 

Community Expert, Feb 04, 2024

Sorry, I misunderstood your previous post.

 

You cannot do that, since Profiles and Actions don't support conditions (if/else).

You need an Action that uses a JavaScript script.

 

Community Beginner, Feb 05, 2024

Thanks JR. But how can I do that?

What kind of script should I use? I have never done that before. Is there some code posted somewhere that I could use? What should the JavaScript check for?

Community Expert, Feb 05, 2024

I hope that another expert, better qualified than me in JavaScript, can answer you quickly; otherwise I'll do some research.

Community Beginner, Feb 09, 2024

Hello,

 

I did some research and found people discussing this, but they were using Ghostscript, xpdf, xpdvviewer, or a script written in AppleScript (https://forum.latenightsw.com/t/how-to-detect-whether-a-pdf-has-been-ocrd/1708/5), and I don't think that is usable in JavaScript, right?

 

However, there is perhaps a solution using the JavaScript posted here:

https://community.adobe.com/t5/acrobat-discussions/javascript-to-detect-scanned-pdfs-and-iso-standar...

But how do I use JavaScript with an Action in Adobe Acrobat?

 

Community Expert, Feb 09, 2024

You can't OCR with a script.

I would use an external tool to sort and split the files, and move the ones that need OCRing to another folder.

Then run an Action in Acrobat to OCR just the files in that folder.

Then use another stand-alone tool to move those files back to their original locations.
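
As a rough illustration of the "external tool" steps above, here is a minimal Python sketch. It is only one possible approach, and everything in it is an assumption rather than something Acrobat provides: the function names, the `manifest.json` file, and the `needs_ocr` test you pass in are all made up for this example. It copies the files that need OCR into one flat staging folder (outside the root folder), records where each came from, and later copies them back over the originals. It also assumes the Acrobat Action saves each OCRed file under the same name in the staging folder.

```python
import json
import shutil
from pathlib import Path
from typing import Callable


def stage_for_ocr(root: Path, staging: Path,
                  needs_ocr: Callable[[Path], bool]) -> Path:
    """Copy every PDF under root that needs OCR into a flat staging folder.

    Returns the path of a JSON manifest mapping each staged file name to
    the original file's location.
    """
    if root.resolve() == staging.resolve() or root.resolve() in staging.resolve().parents:
        raise ValueError("staging folder must be outside the root folder")
    staging.mkdir(parents=True, exist_ok=True)
    manifest = {}
    pdfs = sorted(p for p in root.rglob("*")
                  if p.is_file() and p.suffix.lower() == ".pdf")
    for index, pdf in enumerate(pdfs):
        if not needs_ocr(pdf):
            continue
        # The numeric prefix keeps same-named files from different
        # subfolders from overwriting each other in the flat folder.
        staged_name = f"{index:05d}_{pdf.name}"
        shutil.copy2(pdf, staging / staged_name)
        manifest[staged_name] = str(pdf.resolve())
    manifest_path = staging / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


def restore_from_staging(staging: Path) -> int:
    """Copy each staged file back over its original; return how many were restored."""
    manifest = json.loads((staging / "manifest.json").read_text(encoding="utf-8"))
    restored = 0
    for staged_name, original in manifest.items():
        staged_file = staging / staged_name
        if staged_file.exists():
            shutil.copy2(staged_file, original)  # overwrites the non-OCRed original
            restored += 1
    return restored
```

You would run `stage_for_ocr` first, then the OCR Action on the staging folder in Acrobat, then `restore_from_staging`. The `needs_ocr` test is left open on purpose. One option, assuming the third-party pypdf package is installed, is to open each file with `pypdf.PdfReader` and treat it as needing OCR when `extract_text()` returns nothing for every page. Try it on a copy of the library first, since the restore step overwrites files.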
