New Participant
February 3, 2024
Question

Identify Non-OCR Files in a Large Library and OCR Them


Hello,

 

I have around 5000 PDF files in various folders and subfolders; most of them are already OCRed, but some are not.

The problem is that when I run the OCR tool on my root folder, it also OCRs the files that are already OCRed, which consumes a lot of time and resources unnecessarily.

So my question is: How could I OCR only the files which are not OCRed already, without having to check manually?

 

Many thanks in advance!

 


5 replies

JR Boulay
Community Expert
February 5, 2024

I hope that another expert better qualified than me in JavaScript can answer you quickly, otherwise I'll do some research.

Acrobate du PDF, InDesigner et Photoshopographe
NindaPhom (Author)
New Participant
February 9, 2024

Hello,

 

I did some research and found some people doing this, but they were using Ghostscript, xpdf, xpdvviewer, or a script written in AppleScript (https://forum.latenightsw.com/t/how-to-detect-whether-a-pdf-has-been-ocrd/1708/5), and I don't think any of that is usable from JavaScript, right?

 

However, there may be a solution using the JavaScript posted in this thread:

https://community.adobe.com/t5/acrobat-discussions/javascript-to-detect-scanned-pdfs-and-iso-standards/td-p/12585045

But how do I use JavaScript with an Action in Acrobat?

 

try67
Community Expert
February 9, 2024

You can't OCR with a script.

I would use an external tool to sort and split the files, and move the ones that need OCRing to another folder.

Then run an Action in Acrobat to OCR just the files in that folder.

Then use another stand-alone tool to move those files back to their original locations.
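
The sort-and-move-back steps described above could be handled by a small stand-alone script. The following is only a minimal sketch in Python, not a tool recommended in this thread: the function names `stage_for_ocr` and `restore_from_staging` are hypothetical, the `needs_ocr` predicate is supplied by the caller, and the staging folder is assumed to sit outside the root folder. It copies each file that needs OCR into the staging folder under the same relative path, so the file can later be put back exactly where it came from.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def _iter_pdfs(folder: Path) -> List[Path]:
    """Return every PDF below folder (any depth, any extension case), sorted."""
    return sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def stage_for_ocr(root: Path, staging: Path,
                  needs_ocr: Callable[[Path], bool]) -> List[Path]:
    """Copy PDFs that need OCR into staging, keeping their path relative to root.

    Assumption: staging is NOT located inside root.
    """
    staged = []
    for pdf in _iter_pdfs(root):
        if needs_ocr(pdf):
            rel = pdf.relative_to(root)
            target = staging / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(pdf, target)  # the original stays in place for now
            staged.append(rel)
    return staged


def restore_from_staging(root: Path, staging: Path) -> List[Path]:
    """Move PDFs from staging back to the same relative path under root,
    overwriting the non-OCRed originals."""
    restored = []
    for pdf in _iter_pdfs(staging):
        rel = pdf.relative_to(staging)
        destination = root / rel
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(pdf, destination)  # overwrites an existing file
        pdf.unlink()                    # then remove the staged copy
        restored.append(rel)
    return restored
```

In this sketch, one would call `stage_for_ocr` first, run the OCR Action in Acrobat on the staging folder (assuming the Action saves each file in place), and then call `restore_from_staging`. Because the relative paths are preserved, every file returns to its original subfolder.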

JR Boulay
Community Expert
February 4, 2024

Sorry, I misunderstood your previous post.

 

You cannot do that, since Profiles and Actions don't support conditions (if/else).

You need an Action that uses a JavaScript script.

 

NindaPhom (Author)
New Participant
February 5, 2024

Thanks JR. But how can I do that?

What kind of script should I use? I have never done that before. Is there some code posted somewhere that I could use? What should the JavaScript check for?
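
As for what such a check might look for: one common heuristic is whether every page yields some extractable text. The sketch below illustrates that idea in Python rather than Acrobat JavaScript, which is a choice made for this example and not something suggested in the thread. The function names are hypothetical, and `page_texts_with_pypdf` assumes the third-party `pypdf` package is installed. Note that intentionally blank pages would be flagged too, so this is a rough filter rather than a definitive test.

```python
from typing import Iterable, List


def needs_ocr(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True if any page has fewer than min_chars non-whitespace characters.

    page_texts holds the text extracted from each page, in page order.
    """
    for text in page_texts:
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) < min_chars:
            return True
    return False


def page_texts_with_pypdf(path) -> List[str]:
    """Extract the text of each page (assumes the pypdf package is available)."""
    from pypdf import PdfReader  # imported lazily so needs_ocr works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

A file would then be a candidate for OCR when `needs_ocr(page_texts_with_pypdf(path))` returns `True`.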

JR Boulay
Community Expert
February 3, 2024

Like this:

 

NindaPhom (Author)
New Participant
February 4, 2024

Hi JR,

 

thanks but I don't understand how it will solve my issue. If I run this action simply like this, it runs the OCR tool on all my PDF files, even those which are already OCRized.

I want to OCR only those files which are not OCRized yet, without having to move the files back to their original folders manually.

 

I guess I need a mixture of the "sort" action you suggested earlier and the "OCR" action, which would look something like this:

1/ detect non OCRized-files

2/ OCR those files which were detected

3/ move back files which were detected to their original folder

But how can we move files to their original folder?

 

JR Boulay
Community Expert
February 3, 2024

PS: an Action can't move files into the Success/Error folder, it has to copy them, but this isn't a real problem.

NindaPhom (Author)
New Participant
February 3, 2024

Hi JR,

 

thank you very much for your response.

So indeed, I created the profile and action as you suggested. It created two subfolders, one with OCRized files, and one with those that are not-OCRized yet.

 

But now, how can I OCR the files in the non-OCRized folder? And after this process, how can I move these files back to their original folder (overwriting the non-OCRized ones)?

 

Also, I have many folders and subfolders; the PDF files are organized as follows, with each Subfolder_x containing a varying number of PDF files:

Root_Folder\Folder_1\Subfolder_1

Root_Folder\Folder_1\Subfolder_2

Root_Folder\Folder_1\Subfolder_3

etc.

Root_Folder\Folder_2\Subfolder_1

Root_Folder\Folder_2\Subfolder_2

Root_Folder\Folder_2\Subfolder_3

Root_Folder\Folder_2\Subfolder_4

Root_Folder\Folder_2\Subfolder_5

etc.

Root_Folder\Folder_t\Subfolder_1

...

Root_Folder\Folder_t\Subfolder_n

 

How can I run an Action on the Root_Folder so that all Folder_x and their subfolders are processed accordingly, and so that the processed/OCRized files remain in the same subfolders as before?

 

JR Boulay
Community Expert
February 3, 2024

1. Sort the files using a Preflight profile in an Action that places them in two folders (Success or Error).

Search for "mode 3" in Preflight: this Check identifies OCRized files. Embed it in a custom Profile (an Action can only use Profiles, not a Check directly).

 

2. Use an Action to OCRize those that are not.
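
For background, "mode 3" presumably refers to PDF text rendering mode 3, meaning invisible text, which is how OCR text is typically placed under a scanned page image. In a page's content stream this mode is set with the operator `3 Tr`. The following is a minimal Python sketch of the idea, assuming the content stream has already been decompressed; the function name is hypothetical, and the regular expression is a naive scan rather than a real PDF parser (it could, for instance, match inside a string literal).

```python
import re

# "3 Tr" sets text rendering mode 3 (invisible text) in a PDF content stream.
# The lookbehind rejects operands such as "13" or "0.3"; the lookahead rejects
# longer operator names that merely start with "Tr".
_INVISIBLE_TEXT = re.compile(rb"(?<![\w.+-])3\s+Tr(?![A-Za-z])")


def has_invisible_text(decoded_stream: bytes) -> bool:
    """Return True if the decoded content stream switches to rendering mode 3."""
    return _INVISIBLE_TEXT.search(decoded_stream) is not None
```

For example, `has_invisible_text(b"BT /F1 12 Tf 3 Tr (hello) Tj ET")` is `True`, while the same stream with `0 Tr` gives `False`.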

 

 
