
Batch conversion of PDFs to machine-readable PDFs

New Here, Nov 09, 2022

I am trying to convert 10,000 PDF files to machine-readable form using OCR in Adobe Acrobat Pro. However, some of the PDFs contain renderable data, and OCR fails to convert them into machine-readable form. I have seen a solution that converts the PDF to a .tiff file, runs OCR on it, and saves the result as a PDF again. This worked, but I cannot repeat it by hand for each of the 10,000 files. Is there any way to batch process all 10,000 files together, for example by running an Action Wizard action or something similar?

TOPICS
Edit and convert PDFs , Scan documents and OCR

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert, Nov 09, 2022

Hi @BG2022

Without being able to see the quality, nature, or resolution of the scans, it's almost impossible for me to dive too deeply into your issue.

However, I once tried to OCR a 950-page book that had already been converted to PDF. About halfway through, Acrobat locked up. Admittedly, that was about 15 years ago; I now have a lot more RAM and CPU power, and Acrobat is newer too.

Converting to TIF format was a good idea, as it will save you several steps. When you drop a batch of TIF images into Acrobat, it asks whether you want them all in one document or saved as separate documents. Then it automatically goes ahead and OCRs the PDF.

What I would suggest is to pull out 300-400 pages and see how well that works. If it does fine, add another 100 pages. All good? Add 100 more. At some point, you're going to go, "Hmm, better go back 100 pages and leave it at that."

But let me warn you: while Acrobat is doing OCR, your computer is essentially tied up. You may think you can look at your email, but as soon as "that" page is done, Acrobat will jump in front and say, "Page completed; I'm doing the next page!!!"

So, plan on doing other things: long coffee breaks, lunch, dinner, whatever.

Good luck!

New Here, Nov 09, 2022

Hi @gary_sc!
From what I read, you were talking about one single PDF that has 1,000 pages. In my case, I have 10,000 different PDFs, each with 5-10 pages. My main goal is to automate converting all of them into machine-readable form through OCR. I tried running batch OCR on all the files, but unfortunately some of the files have renderable or editable text on a few pages. OCR couldn't convert those pages into machine-readable form, so it produces blank pages. I have no way to identify which of the 10,000 documents were fully converted and which were not. I started searching for a viable solution, and that's when I found out about converting them into TIFF files, running OCR, and combining all the pages into a single PDF again. But I cannot do this manually on 10,000 different files. If there is some way to automate this process, that would be really great! I would really appreciate any ideas on how to do that!
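
Editor's sketch (not part of the original post): the "which documents were fully converted" check described above can be approximated by looking for pages whose extracted text is empty after OCR. The Python below is a minimal sketch under assumptions: it expects the per-page text to have been pulled out already by some PDF library of your choice, and the names `blank_pages`, `audit`, and the sample file names are hypothetical. A page that is legitimately blank would also be flagged, so treat the result as a review list rather than a verdict.

```python
from typing import Dict, List, Optional, Sequence


def blank_pages(page_texts: Sequence[Optional[str]]) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is missing or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not (text or "").strip()
    ]


def audit(documents: Dict[str, Sequence[Optional[str]]]) -> Dict[str, List[int]]:
    """Map each file name to its text-free pages, keeping only files that have any."""
    report: Dict[str, List[int]] = {}
    for name, pages in documents.items():
        missing = blank_pages(pages)
        if missing:
            report[name] = missing
    return report


if __name__ == "__main__":
    # Hypothetical extracted text; in practice it would come from a PDF library.
    sample = {
        "a.pdf": ["Invoice 001", "Total due"],
        "b.pdf": ["Cover letter", "   ", None],
    }
    for name, pages in audit(sample).items():
        print(f"{name}: no text on pages {pages}")
```

Running the audit over the OCR output folder would give a short list of files to reprocess instead of 10,000 files to open by hand.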

Community Expert, Nov 09, 2022

Oh! I thought you had a 10,000-page file. That's where I was coming from.

OK. Again, since I cannot see the files that are rendering nothing, I do not wish to comment without seeing the full issue. Just as a tip: bad scans never give good OCR. It can't be done.

However, to deal with many, many individual files, put a bunch of them in a folder. Now go to the Scan & OCR tool. In the middle, you can select one file or multiple files:

[Screenshot: 2022-11-09_16-04-13.png]

Once you've selected Multiple Files, go to the folder you want to process and let it process them. The rest should be fairly obvious.

[Screenshot: 2022-11-09_16-04-52.png]

I hope that's more direct help for your needs. Sorry, I misunderstood you.
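
Editor's sketch (not part of the original post): "put a bunch of them in a folder" can be scripted so that the 10,000 files are staged as fixed-size batches, one folder per run of the multiple-files OCR. This is a minimal Python sketch using only the standard library. The function names, the `batch_0001` folder naming, and the default batch size of 500 are arbitrary assumptions; nothing in the thread says how many files Acrobat handles comfortably in one run, so start small.

```python
import shutil
from pathlib import Path
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def stage_batches(source: Path, dest: Path, size: int = 500) -> List[Path]:
    """Copy the PDFs in `source` into dest/batch_0001, dest/batch_0002, and so on.

    Files are copied rather than moved, so the originals stay untouched.
    Returns the list of batch folders that were created.
    """
    pdfs = sorted(p for p in source.iterdir() if p.suffix.lower() == ".pdf")
    folders: List[Path] = []
    for number, group in enumerate(batches(pdfs, size), start=1):
        folder = dest / f"batch_{number:04d}"
        folder.mkdir(parents=True, exist_ok=True)
        for pdf in group:
            shutil.copy2(pdf, folder / pdf.name)
        folders.append(folder)
    return folders
```

Each resulting folder can then be selected in turn in the Scan & OCR tool, which also makes it easier to tell which batch a failure came from.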

New Here, Nov 10, 2022

Hi @gary_sc,
I appreciate your help, but I have already tried this method. My concern is that whenever a page in a PDF file contains editable text, that page is not converted into machine-readable format. Instead, it is converted into a blank page, with all the text and images removed. I need a workaround for that issue. I can't go through each PDF document and verify whether everything was converted accurately. I want to create some sort of automated workflow that converts all the documents accurately.
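
Editor's sketch (not part of the original post): since the failures are tied to pages that already contain editable text, one way to start such a workflow is to sort the files before OCR. The Python below is a minimal sketch, assuming that some PDF library has already reported, for every page, whether it carries extractable text. The function names and queue names are hypothetical. Files that mix text pages and image-only pages are the ones described as failing above, so they are routed to a separate queue for a workaround such as the TIFF route.

```python
from typing import Dict, List, Sequence


def classify(page_has_text: Sequence[bool]) -> str:
    """Label a document by whether its pages already carry extractable text."""
    if not page_has_text:
        return "empty"
    if all(page_has_text):
        return "text"
    if any(page_has_text):
        return "mixed"
    return "image_only"


def route(documents: Dict[str, Sequence[bool]]) -> Dict[str, List[str]]:
    """Sort file names into work queues based on their page classification."""
    queues: Dict[str, List[str]] = {
        "ocr_directly": [],        # scans only: plain batch OCR should do
        "needs_workaround": [],    # mixed pages: the case that came out blank
        "already_searchable": [],  # every page has text: no OCR needed
    }
    for name, flags in sorted(documents.items()):
        kind = classify(flags)
        if kind == "image_only":
            queues["ocr_directly"].append(name)
        elif kind == "mixed":
            queues["needs_workaround"].append(name)
        elif kind == "text":
            queues["already_searchable"].append(name)
        # Documents with no pages are skipped.
    return queues
```

For example, `route({"a.pdf": [False, False], "b.pdf": [True, False]})` places `a.pdf` in `ocr_directly` and `b.pdf` in `needs_workaround`.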

Community Expert, Nov 10, 2022 (accepted solution)

OK, now I understand better.

Have you tried flattening the document before running OCR on it?

https://www.ca4.uscourts.gov/caseinformationefiling/efiling_cm-ecf/technical-information/flatten-pdf...
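
Editor's sketch (not part of the original post): the linked page is truncated here, so its exact steps are not reproduced. As a swapped-in alternative outside Acrobat, a blunt way to get a flattening effect is to rasterize every page to an image, which is what the TIFF workaround mentioned earlier in the thread amounts to. The Python below only builds a Ghostscript command line. It assumes Ghostscript is installed and available as `gs` (on Windows the console executable is typically `gswin64c`). The function name, the 300 dpi default, and the output naming are assumptions for illustration.

```python
from pathlib import Path
from typing import List


def rasterize_command(pdf: Path, out_dir: Path, dpi: int = 300,
                      executable: str = "gs") -> List[str]:
    """Build a Ghostscript command that renders each page of `pdf` to a TIFF.

    Output files are named <stem>-001.tif, <stem>-002.tif, ... inside `out_dir`.
    The command is returned, not executed, so it can be inspected first.
    """
    pattern = out_dir / f"{pdf.stem}-%03d.tif"
    return [
        executable,
        "-dNOPAUSE",          # do not pause between pages
        "-dBATCH",            # exit when the file has been processed
        "-dSAFER",            # restrict file access from within the PDF
        "-sDEVICE=tiff24nc",  # 24-bit colour TIFF output
        f"-r{dpi}",           # rendering resolution in dots per inch
        f"-sOutputFile={pattern}",
        str(pdf),
    ]


if __name__ == "__main__":
    # Hypothetical paths, printed for inspection rather than executed.
    print(" ".join(rasterize_command(Path("input/scan.pdf"), Path("tiff"))))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` after creating the output folder. The resulting images contain no text layer, so they can be fed to the multiple-files OCR described earlier in the thread.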

New Here, Nov 11, 2022

Hi @gary_sc
Thank you so much, it worked!
