Identify Non-OCR Files in a Large Library and OCR Them

February 3, 2024

Hello,

I have around 5000 pdf files in various folders/subfolders; most of them are OCRed already, but some are not.

The problem is that when I run the OCR tool on my root folder, it also re-OCRs the files that are already OCRed, which consumes a lot of time and resources unnecessarily.

So my question is: How could I OCR only the files which are not OCRed already, without having to check manually?

Many thanks in advance!

This topic has been closed for replies.

JR Boulay

Community Expert

I hope that another expert better qualified than me in JavaScript can answer you quickly, otherwise I'll do some research.

PDF Acrobat, InDesigner and Photoshopographer

NindaPhomAuthor

New Participant

Hello,

I did some research and found some people doing this, but they were using Ghostscript, xpdf, xpdvviewer, or a script written in AppleScript (https://forum.latenightsw.com/t/how-to-detect-whether-a-pdf-has-been-ocrd/1708/5), and I don't think that is usable from JavaScript, right?

However, there is perhaps a solution with that script using javascript here:

https://community.adobe.com/t5/acrobat-discussions/javascript-to-detect-scanned-pdfs-and-iso-standards/td-p/12585045

But how do I use JavaScript with an Action in Acrobat?

try67

Community Expert

You can't OCR with a script.

I would use an external tool to sort and split the files, and move the ones that need OCRing to another folder.

Then run an Action in Acrobat to OCR just the files in that folder.

Then use another stand-alone tool to move those files back to their original locations.
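For what it's worth, the external sort-and-restore steps above could be sketched in Python roughly like this. It is only a sketch under assumptions: `stage_for_ocr`, `restore_from_staging` and the `needs_ocr` callback are hypothetical names (not an Acrobat feature), the staging folder is assumed to live outside the root folder, and the OCR Action is assumed to save its results in place inside the staging tree. How `needs_ocr` decides is left to whatever detection method you trust.

```python
import shutil
from pathlib import Path
from typing import Callable


def stage_for_ocr(root: Path, staging: Path,
                  needs_ocr: Callable[[Path], bool]) -> list[Path]:
    """Move PDFs that need OCR into `staging`, mirroring their path under `root`."""
    staged = []
    # sorted() materializes the listing before any file is moved.
    for pdf in sorted(root.rglob("*.pdf")):
        if not needs_ocr(pdf):  # hypothetical detector supplied by the caller
            continue
        target = staging / pdf.relative_to(root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(pdf), str(target))
        staged.append(target)
    return staged


def restore_from_staging(staging: Path, root: Path) -> int:
    """Copy every PDF in `staging` back to the same relative path under `root`."""
    restored = 0
    for pdf in sorted(staging.rglob("*.pdf")):
        destination = root / pdf.relative_to(staging)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(pdf, destination)  # overwrites an existing file
        pdf.unlink()
        restored += 1
    return restored
```

Because the staging tree mirrors the relative paths, no separate list of original locations is needed: run `stage_for_ocr`, point the OCR Action at the staging folder, then run `restore_from_staging`.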

JR Boulay

Community Expert

Sorry, I misunderstood your previous post.

You cannot do that, since Profiles and Actions don't support conditions (if/else).

You need an Action that uses a JavaScript script.

NindaPhomAuthor

New Participant

Thanks JR. But how can I do that?

What kind of script should I use? I have never done this before. Is there some code posted somewhere that I could use? What should the JavaScript check for?

JR Boulay

Community Expert

Like this:

NindaPhomAuthor

New Participant

Hi JR,

thanks but I don't understand how it will solve my issue. If I run this action simply like this, it runs the OCR tool on all my PDF files, even those which are already OCRized.

I want to OCR only those files which are not OCRized yet without having to move back files manually to their original folder.

I guess I need a combination of the "sort" Action you suggested before and the "OCR" Action, which would look something like this:

1/ detect non-OCRized files

2/ OCR those files which were detected

3/ move back files which were detected to their original folder

But how can we move files to their original folder?

JR Boulay

Community Expert

PS: an Action can't move files into the Success/Error folder, it has to copy them, but this isn't a real problem.
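Since the Action leaves copies rather than moving files, the originals stay where they are, and the OCRed copies can later be written back over them. A minimal Python sketch of that copy-back step follows. It assumes that the folder of OCRed copies is flat and sits outside the root folder, and that file names are kept unchanged; `copy_back_by_name` is a hypothetical helper name. Names that match zero or several originals are skipped and reported rather than guessed.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def copy_back_by_name(ocr_folder: Path, root: Path) -> tuple[list[Path], list[str]]:
    """Overwrite originals under `root` with same-named PDFs from `ocr_folder`.

    Returns (originals that were overwritten, file names that were skipped
    because they matched no original or more than one).
    """
    # Index every original by its bare file name.
    by_name = defaultdict(list)
    for original in root.rglob("*.pdf"):
        by_name[original.name].append(original)

    replaced, skipped = [], []
    for ocred in sorted(ocr_folder.glob("*.pdf")):
        matches = by_name.get(ocred.name, [])
        if len(matches) == 1:
            shutil.copy2(ocred, matches[0])  # overwrite the non-OCRed original
            replaced.append(matches[0])
        else:
            skipped.append(ocred.name)  # ambiguous or unknown: leave untouched
    return replaced, skipped
```

With several thousand files, duplicate names across subfolders are likely, so the skipped list is worth checking by hand.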

NindaPhomAuthor

New Participant

Hi JR,

thank you very much for your response.

So indeed, I created the profile and action as you suggested. It created two subfolders, one with OCRized files, and one with those that are not-OCRized yet.

But now, how can I OCR the files in the non-OCRized folder? And after this process, how can I move these files back to their original folders (overwriting the non-OCRized ones)?

Also, I have a great many folders and subfolders; the PDF files are organized as follows, with each Subfolder_x containing a varying number of PDF files:

Root_Folder\Folder_1\Subfolder_1

Root_Folder\Folder_1\Subfolder_2

Root_Folder\Folder_1\Subfolder_3

etc.

Root_Folder\Folder_2\Subfolder_1

Root_Folder\Folder_2\Subfolder_2

Root_Folder\Folder_2\Subfolder_3

Root_Folder\Folder_2\Subfolder_4

Root_Folder\Folder_2\Subfolder_5

etc.

Root_Folder\Folder_t\Subfolder_1

...

Root_Folder\Folder_t\Subfolder_n

How can I run an Action on the Root_Folder so that all Folder_x and their subfolders are processed accordingly, and the processed/OCRized files remain in the same subfolders as before?

JR Boulay

Community Expert

1. Sort files using a Preflight profile in an Action that places them in two folders (Success or Error).

Search for "mode 3" in Preflight; this Check identifies OCRized files. Embed it in a custom Profile (an Action can only use Profiles, not a Check directly).

2. Use an Action to OCRize those that are not.
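For anyone wondering what "mode 3" refers to: in PDF content streams, text rendering mode 3 (written as `3 Tr`) means invisible text, which is how OCR tools typically place the recognized text behind the scanned image. The Python sketch below looks for that operator outside Acrobat. It is a rough heuristic and not the Preflight Check itself: `has_invisible_text` is a hypothetical name, only uncompressed and Flate-compressed streams are scanned, and encrypted files, object streams and other filters are not handled. It also says nothing about PDFs with ordinary visible text, which do not need OCR either.

```python
import re
import zlib

# Body of each stream object: the bytes between "stream" and "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# The operator "3 Tr" selects text rendering mode 3 (invisible text).
INVISIBLE_TEXT_RE = re.compile(rb"(?<![\d.])3\s+Tr\b")


def has_invisible_text(pdf_bytes: bytes) -> bool:
    """Return True if any scanned stream sets text rendering mode 3."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes after the data.
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            data = raw  # not Flate-compressed: scan the raw bytes instead
        if INVISIBLE_TEXT_RE.search(data):
            return True
    return False
```

A call such as `has_invisible_text(Path("file.pdf").read_bytes())` (with `pathlib.Path` imported) could serve as a first-pass filter, with anything it reports as False checked again by Preflight before OCRing.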
