Identify Non-OCR files in a Large Library and OCR them

Community Beginner, Feb 03, 2024

Hello,

I have around 5,000 PDF files in various folders and subfolders. Most of them are already OCRed, but some are not.

When I run the OCR tool on my root folder, it also re-OCRs the files that are already OCRed, which wastes a lot of time and resources.

So my question is: how can I OCR only the files that are not yet OCRed, without having to check each one manually?

Many thanks in advance!


TOPICS
Edit and convert PDFs, Scan documents and OCR
Community Expert, Feb 03, 2024

1. Sort the files using a Preflight profile in an Action that places them in two folders (Success or Error).

Search for "mode 3" in Preflight: this Check identifies OCRized files (text rendering mode 3 is invisible text, which is how an OCR text layer is usually stored). Embed it in a custom Profile, since an Action can only use Profiles, not a Check directly.

2. Use an Action to OCRize those that are not.


[Screenshot: Capture_2402031234.png]

[Screenshot: Capture_2402031238.png]


Acrobate du PDF, InDesigner et Photoshopographe
Community Expert, Feb 03, 2024

PS: An Action can't move files into the Success/Error folders; it has to copy them, but this isn't a real problem.


Acrobate du PDF, InDesigner et Photoshopographe
Community Beginner, Feb 03, 2024

Hi JR,

Thank you very much for your response.

I created the Profile and Action as you suggested. It created two subfolders, one with the OCRized files and one with those that are not OCRized yet.

But now, how can I OCRize the files in the non-OCRized folder? And after this process, how can I move these files back to their original folders (overwriting the non-OCRized ones)?

Also, I have a great many folders and subfolders. The PDF files are organized as follows, with each Subfolder_x containing a varying number of PDF files:

Root_Folder\Folder_1\Subfolder_1
Root_Folder\Folder_1\Subfolder_2
Root_Folder\Folder_1\Subfolder_3
etc.
Root_Folder\Folder_2\Subfolder_1
Root_Folder\Folder_2\Subfolder_2
Root_Folder\Folder_2\Subfolder_3
Root_Folder\Folder_2\Subfolder_4
Root_Folder\Folder_2\Subfolder_5
etc.
Root_Folder\Folder_t\Subfolder_1
...
Root_Folder\Folder_t\Subfolder_n

How can I run an Action on Root_Folder so that all Folder_x and their subfolders are processed, and the processed/OCRized files stay in the same subfolders as before?


Community Expert, Feb 03, 2024

Like this:

 

[Screenshot: Capture_2402031903.png]


Acrobate du PDF, InDesigner et Photoshopographe
Community Beginner, Feb 04, 2024

Hi JR,

Thanks, but I don't understand how this solves my issue. If I simply run this Action as shown, it runs the OCR tool on all my PDF files, even those that are already OCRized.

I want to OCR only the files that are not OCRized yet, without having to move files back to their original folders manually.

I guess I need a combination of the first "sort" Action you suggested and the "OCR" Action, something like this:

1/ detect the non-OCRized files
2/ OCR the files that were detected
3/ move the detected files back to their original folders

But how can we move files back to their original folders?


Community Expert, Feb 04, 2024

Sorry, I misunderstood your previous post.

You cannot do that, since Profiles and Actions don't support conditions (if/else).

You need an Action that uses a JavaScript script.

 


Acrobate du PDF, InDesigner et Photoshopographe
Community Beginner, Feb 05, 2024

Thanks JR. But how can I do that?

What kind of script should I use? I have never done this before. Is there some code posted somewhere that I could use? What should the JavaScript check for?

Community Expert, Feb 05, 2024

I hope that another expert better qualified than me in JavaScript can answer you quickly; otherwise, I'll do some research.


Acrobate du PDF, InDesigner et Photoshopographe
Community Beginner, Feb 09, 2024

Hello,

I did some research and found people solving this, but they were using Ghostscript, Xpdf, XpdfViewer, or a script written in AppleScript here (https://forum.latenightsw.com/t/how-to-detect-whether-a-pdf-has-been-ocrd/1708/5), and I don't think that is usable from JavaScript, right?

However, there may be a solution using JavaScript in the script here:

https://community.adobe.com/t5/acrobat-discussions/javascript-to-detect-scanned-pdfs-and-iso-standar...

But how do I use JavaScript with an Action in Acrobat?

 

Community Expert, Feb 09, 2024

You can't OCR with a script.

I would use an external tool to sort and split the files, and move the ones that need OCRing to another folder.
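
As a rough illustration of what such an external sorting tool might do, here is a minimal Python sketch (Python is only an example choice; no specific tool is named in this thread). It rests on a crude heuristic that is an assumption, not an Acrobat feature: any text in a PDF, including an invisible OCR layer, has to be drawn with a font, so a file with no /Font entry anywhere is very likely an un-OCRed scan. The reverse is not guaranteed: a file that has fonts on only some pages would be counted as already having text, and encrypted files may be misjudged. All function names are hypothetical.

```python
import re
import zlib
from pathlib import Path
from typing import List

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def has_font_entry(pdf_bytes: bytes) -> bool:
    """Heuristic (an assumption): True if a /Font name appears in the file,
    either in plain sight or inside a Flate-compressed stream."""
    if b"/Font" in pdf_bytes:
        return True
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes after the compressed data
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image), skip it
        if b"/Font" in inflated:
            return True
    return False


def needs_ocr(pdf_path: Path) -> bool:
    """Hypothetical helper: guess that a PDF without any font has no text layer."""
    return not has_font_entry(pdf_path.read_bytes())


def find_files_needing_ocr(root: Path) -> List[Path]:
    """Walk root and all its subfolders, returning the PDFs that look un-OCRed."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf" and needs_ocr(p)
    )
```

It would be wise to spot-check the result on a handful of known files before trusting it on the whole library.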

Then run an Action in Acrobat to OCR just the files in that folder.

Then use another stand-alone tool to move those files back to their original locations.
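
The move-out and move-back steps could be sketched in Python as follows (an example choice, not a tool named in this thread). The idea is to record each file's original location in a manifest before moving it, so it can be put back after OCR. The names `stage_for_ocr`, `restore_from_staging`, and `manifest.json` are hypothetical, and `needs_ocr` stands for whatever detection test you trust. The sketch assumes the Acrobat Action saves each OCRed file under the same name in the staging folder, and that the staging folder is used for one batch at a time.

```python
import json
import shutil
from pathlib import Path
from typing import Callable

MANIFEST_NAME = "manifest.json"  # hypothetical file name


def stage_for_ocr(root: Path, staging: Path,
                  needs_ocr: Callable[[Path], bool]) -> Path:
    """Move PDFs that need OCR into one flat folder, recording their origins."""
    staging.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf" and staging not in p.parents
    )
    # A numeric prefix keeps same-named files from different subfolders apart.
    manifest = {
        f"{index:05d}_{pdf.name}": str(pdf)
        for index, pdf in enumerate(pdfs)
        if needs_ocr(pdf)
    }
    # Write the manifest before moving anything, so origins are never lost.
    manifest_path = staging / MANIFEST_NAME
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    for staged_name, original in manifest.items():
        shutil.move(original, str(staging / staged_name))
    return manifest_path


def restore_from_staging(staging: Path) -> int:
    """Move each staged file back to the path recorded in the manifest."""
    manifest = json.loads((staging / MANIFEST_NAME).read_text(encoding="utf-8"))
    restored = 0
    for staged_name, original in manifest.items():
        staged_file = staging / staged_name
        if staged_file.exists():
            shutil.move(str(staged_file), original)
            restored += 1
    return restored
```

A typical run would call `stage_for_ocr` on Root_Folder with a staging folder outside it, run the OCR Action on the staging folder in Acrobat, and then call `restore_from_staging` on that same folder. Trying it on a copy of a small subfolder first is a sensible precaution.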
