New Participant
November 9, 2022
Answered

Batch conversion of PDFs to machine-readable PDFs


I am trying to convert 10,000 PDF files to machine-readable form using OCR in Adobe Acrobat Pro. Some of the PDFs contain renderable text, and OCR fails on them. I have seen a workaround: convert the PDF to a .tiff file, run OCR on it, and save the result as a PDF again. This worked, but I cannot repeat it by hand for each of the 10,000 files. Is there a way to batch process all of them together, for example by running an action in the Action Wizard?

1 reply

gary_sc
Brainiac
November 9, 2022

Hi @BG2022,

Since I can't see the quality, nature, or resolution of the scans, it's almost impossible for me to dig very deeply into your issue.

 

However, I once tried to OCR a 950-page book that had already been converted to PDF. About halfway through, Acrobat locked up. Admittedly, that was about 15 years ago; I now have a lot more RAM and CPU power, and Acrobat is newer too.

 

Converting to TIF format was a good move, as it will save you several steps. When you drop a bunch of TIF images into Acrobat, it asks whether you want them all combined into one document or saved as separate documents. Then it automatically goes ahead and OCRs the resulting PDF.

 

What I would suggest is to pull out 300-400 pages and see how well that works. If it does fine, add another 100 pages. All good? Add 100 more. At some point you're going to say, "Hmm, better go back 100 pages and leave it at that."
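The ramp-up described above can be sketched in a few lines of Python. This is only an illustration of the trial-and-error idea, not something Acrobat provides: `run_batch` is a hypothetical callback that you would write yourself to process a batch of the given size and report whether it finished cleanly, and the `ceiling` parameter is an assumed safety limit.

```python
from typing import Callable


def find_stable_batch_size(
    run_batch: Callable[[int], bool],
    start: int = 300,
    step: int = 100,
    ceiling: int = 2000,
) -> int:
    """Grow the batch by `step` pages until `run_batch` fails.

    Returns the last size that succeeded, or 0 if even `start` failed.
    `run_batch(size)` is a hypothetical callback: it should process a
    batch of `size` pages and return True only if the run completed.
    """
    last_good = 0
    size = start
    while size <= ceiling and run_batch(size):
        last_good = size   # this size worked, remember it
        size += step       # then try a slightly larger batch
    return last_good


if __name__ == "__main__":
    # Stand-in for a real run: pretend anything above 650 pages locks up.
    print(find_stable_batch_size(lambda pages: pages <= 650))
```

With the stand-in callback shown, the search tries 300, 400, 500, 600, and 700 pages, and settles on 600.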

 

But let me warn you: while Acrobat is doing OCR, your computer is essentially closed down. You may think you can look at your email, but as soon as "that" page is done, Acrobat will jump in front and say, "Page completed; I'm doing the next page!!!"

 

So, plan on doing other things, long coffee breaks, lunch, dinner, whatever. 

 

Good luck!

BG2022
Author
New Participant
November 9, 2022

Hi @gary_sc!
From what I read, you were talking about a single PDF with around 1,000 pages. In my case I have 10,000 separate PDFs, each with 5-10 pages. My main goal is to automate converting all of them into machine-readable form through OCR. I tried running batch OCR on all the files, but unfortunately some of them have renderable or editable text on a few pages. Acrobat could not convert those pages into machine-readable form, so they come out blank. Across 10,000 files, I have no way to tell which documents were fully converted and which were not. While searching for a viable solution, I found the approach of converting the files to TIFF, running OCR, and combining all the pages into a single PDF again. But I cannot do this manually on 10,000 different files. If there is some way to automate this process, that would be great. I would really appreciate any ideas on how to do that!
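For readers who want to script the TIFF round trip described above, here is a minimal sketch. It is not an Acrobat feature: it swaps in two separate command-line tools, Ghostscript (`gs`) to render each PDF to a multi-page TIFF and Tesseract (`tesseract`) to OCR that TIFF back into a searchable PDF. It assumes both tools are installed and on the PATH; the folder names and the 300 dpi setting are placeholders.

```python
import subprocess
from pathlib import Path

GS_EXE = "gs"  # assumption: on Windows the console binary is usually "gswin64c"


def rasterize_cmd(pdf: Path, tiff: Path, dpi: int = 300) -> list[str]:
    """Ghostscript command that renders all pages of `pdf` into one multi-page TIFF."""
    return [
        GS_EXE, "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiff24nc",        # 24-bit colour TIFF output device
        f"-r{dpi}",                 # render resolution in dpi
        f"-sOutputFile={tiff}",
        str(pdf),
    ]


def ocr_cmd(tiff: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Tesseract command that writes `<out_base>.pdf` with a searchable text layer."""
    return ["tesseract", str(tiff), str(out_base), "-l", lang, "pdf"]


def convert_folder(src: Path, dst: Path, work: Path) -> list[Path]:
    """Rasterize and OCR every PDF in `src`; return the inputs that failed."""
    dst.mkdir(parents=True, exist_ok=True)
    work.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(src.glob("*.pdf")):
        tiff = work / (pdf.stem + ".tif")
        try:
            subprocess.run(rasterize_cmd(pdf, tiff), check=True)
            subprocess.run(ocr_cmd(tiff, dst / pdf.stem), check=True)
        except (subprocess.CalledProcessError, OSError):
            failed.append(pdf)              # keep going; review these later
        finally:
            tiff.unlink(missing_ok=True)    # intermediate TIFFs can be large
    return failed


if __name__ == "__main__":
    # Hypothetical folder names, for illustration only.
    problems = convert_folder(Path("input_pdfs"), Path("ocr_pdfs"), Path("tmp_tiff"))
    print(f"{len(problems)} file(s) failed")
```

Because every page is rasterized first, pages that already contain text are treated the same as scanned pages. The trade-off is that the output is image-based, so files will likely be larger and any original text layer is replaced by the OCR result.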

gary_sc
Correct answer
Brainiac
November 11, 2022

Hi @gary_sc,
I appreciate your help, but I have already tried this method. My concern is that whenever a page in a PDF contains editable text, that page is not converted into machine-readable format. Instead, it comes out as a blank page, with all the text and images removed. I need a workaround for that issue. I can't go through each PDF document and verify whether everything was converted accurately. I want to create some sort of automated workflow that converts all the documents accurately.


OK, now I understand better.

Have you tried flattening the document before running OCR on it?

 

https://www.ca4.uscourts.gov/caseinformationefiling/efiling_cm-ecf/technical-information/flatten-pdf-fillable-form
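
On the verification concern quoted above (not being able to open every converted file and check it by hand), one possible approach is to scan the output folder for pages with little or no extractable text. The sketch below is an assumption-laden illustration, not part of Acrobat: it relies on the third-party `pypdf` package for text extraction, and the 20-character threshold and folder name are arbitrary placeholders.

```python
from pathlib import Path


def suspect_pages(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is shorter than `min_chars`."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


def extract_page_texts(pdf_path: Path) -> list[str]:
    """Read the text layer of each page (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # lazy import: only needed when reading real files

    reader = PdfReader(str(pdf_path))
    return [(page.extract_text() or "") for page in reader.pages]


def audit_folder(folder: Path, min_chars: int = 20) -> dict[str, list[int]]:
    """Map each PDF file name in `folder` to its list of suspect page numbers."""
    report = {}
    for pdf in sorted(folder.glob("*.pdf")):
        bad = suspect_pages(extract_page_texts(pdf), min_chars)
        if bad:
            report[pdf.name] = bad
    return report


if __name__ == "__main__":
    # Hypothetical output folder, for illustration only.
    for name, pages in audit_folder(Path("ocr_pdfs")).items():
        print(f"{name}: check pages {pages}")
```

A page that is legitimately blank in the original will also be flagged, so the report is a shortlist for manual review, not a final verdict.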