Inspiring
March 24, 2022
Answered

Newbie workflow optimize & OCR newspaper scans

  • March 24, 2022
  • 1 reply
  • 1615 views

I am new to Acrobat and I am scanning a year of newspapers. Each edition is anywhere from 10-16 pages and comes out weekly, so I have 52 newspapers to turn into multipage PDFs. I scan as JPEG and then combine the pages into a single PDF. What I am thinking of doing is saving the optimized PDFs in one folder and then batch processing the OCR.

If I just combine and save, each PDF is approximately 200MB. Once I optimize, they are around 10-15MB before OCR.
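To put those sizes in perspective for the whole year, here is a rough back-of-the-envelope sketch. It uses only the figures quoted above (52 editions, about 200MB unoptimized, 10-15MB optimized); the function name `total_gb` and the 1 GB = 1000 MB convention are assumptions made for illustration.

```python
# Figures from the post: 52 weekly editions, ~200 MB per unoptimized PDF,
# 10-15 MB per optimized PDF (before OCR).
EDITIONS = 52
RAW_MB = 200
OPTIMIZED_MB = (10, 15)


def total_gb(per_file_mb: float, count: int = EDITIONS) -> float:
    """Total size in GB (assuming 1 GB = 1000 MB) for `count` files."""
    return per_file_mb * count / 1000


raw_total = total_gb(RAW_MB)
opt_low, opt_high = (total_gb(mb) for mb in OPTIMIZED_MB)

print(f"Unoptimized: about {raw_total:.1f} GB for the year")
print(f"Optimized:   about {opt_low:.2f}-{opt_high:.2f} GB for the year")
```

So the year comes to roughly 10.4 GB unoptimized versus well under 1 GB optimized, before OCR adds its text layer.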

 

My question is: should I OCR the pre-optimized files, or does it matter? Any workflow tips?

 

Thanks,

Randy

This topic has been closed for replies.

1 reply

gary_sc
Community Expert, Correct answer
March 24, 2022

Hi Kurtis,

 

First: what a mess of a task you have. Good luck!

 

OK, now to your problem: JPG is not a good choice for scanning for many reasons, here are two:

 

1) JPG is a lossy format. Lossy here means that the greater the compression, the more information is thrown away to make the image smaller in storage size. That loss hurts image quality most in high-contrast images SUCH AS TEXT. It's less noticeable in photos of trees and grass, even people.

2) When you have a JPG image, you have to manually start the OCR process (an extra step); more on this in a second.

 

My general preference for scanning is to save the files in the TIF format. At first, this may seem bizarre because the image files are very large in storage size. However, after running the OCR process, the files are very reasonably sized (I've found the resultant PDF to be about 0.2% of the TIF file's size).

 

The other advantage of TIF is that if you drag a TIF file onto the Acrobat icon, Acrobat will automatically convert the image into a PDF AND AUTOMATICALLY START THE OCR PROCESS. In addition, and this is important for you: if you drag two or more TIF image files onto Acrobat, it will ask you if you want to combine all of the files into one PDF, and then it will process the files with OCR while you sit and watch. It's all automatic.

 

Now, here's the one "catch": the OCR process is not fast, and to make things worse, Acrobat does not play nice with anything else you're doing while OCR is running. So, if you try to read your email or check your Facebook page, Acrobat will constantly bump itself in front of whatever you are looking at, letting you know that a page has been completed. Then the next page has been completed. (Etc.) So, plan accordingly.

 

One more bit of information: several years ago, I wrote the following blog for Adobe with many tips on getting the best quality scans for OCR. Newspapers are particularly nasty because of "bleed-through," where the text on the back of the paper shows up on the wrong side of a scan, which affects the quality of the OCR. (Note: you need to be signed in to your Adobe account to access this blog.)

 

https://community.adobe.com/t5/adobe-community-professionals/scanning-clean-searchable-pdfs/m-p/4785435?page=1#M89

 

Good luck!

 

 

 

Inspiring
March 24, 2022

Thanks gary,

 

First, I did read your article as I saw you mentioned it in another post. Thanks for that BTW.

 

Normally I do scan to TIFF, but my old book scanner only exports to USB, so I use highest-quality JPEG and I can get around 150 images on a 16GB stick. These original images are brought straight into Acrobat, so no editing is involved to lose quality. JPEG is the image file format from hell; I don't usually use it.

 

Good to know for next job;

  • "The other advantage of TIF is that if you drag a TIF file onto the Acrobat icon, Acrobat will automatically convert the image into a PDF AND AUTOMATICALLY START THE OCR PROCESS. In addition, and this is important for you if you drag two or more TIF image files onto Acrobat, it will ask you if you want to combine all of the files into one PDF and then process the files with OCR while you sit and watch. It's all automatic."

Your next comment about OCR taking time is why I am hoping to batch OCR overnight while I am not at work. I did a test this morning and a 10-page newspaper takes about 12-15 minutes.
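As a rough planning aid, that test timing can be scaled up to the whole job. This is only a sketch: it assumes OCR time grows linearly with page count (which the single test does not prove), and the function name `ocr_hours` is made up for illustration. The inputs are the figures from this thread: 52 editions, 10-16 pages each, and 12-15 minutes for a 10-page test.

```python
# Figures from the thread: 52 editions of 10-16 pages each;
# a 10-page test edition took about 12-15 minutes to OCR.
EDITIONS = 52
TEST_PAGES = 10


def ocr_hours(pages_per_edition: int, test_minutes: float,
              editions: int = EDITIONS) -> float:
    """Estimated total OCR time in hours, assuming time scales linearly per page."""
    minutes_per_page = test_minutes / TEST_PAGES
    return editions * pages_per_edition * minutes_per_page / 60


best_case = ocr_hours(pages_per_edition=10, test_minutes=12)
worst_case = ocr_hours(pages_per_edition=16, test_minutes=15)

print(f"Estimated OCR time for the year: {best_case:.1f}-{worst_case:.1f} hours")
```

Under those assumptions the full year works out to somewhere between about 10 and 21 hours of OCR, so it is probably safer to split the batch across two or more nights than to count on a single one.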

 

I am also fortunate as these newspapers do not have a lot of images so bleed through is not much of a problem.

 

I did like your bit about using Edit before OCR. Here is a screenshot I took using that. I think the client likes the brown paper though. 

 

Thanks for taking your time to write. I learned a lot; I am new to this scanning job and to Acrobat.

 

gary_sc
Community Expert
March 24, 2022

Hi Kurtis,

 

Bleedthrough will ALWAYS happen with newspapers, images or not. Just keep that in mind.

 

And while I do appreciate the client's appreciation of the brown, one has a choice: the brown color OR better quality OCR. Like most things in life, choices have to be made. Besides, if there's too much brown, it's hard to read. OCR, like reading, depends upon contrast.
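To illustrate the contrast point, here is a minimal sketch of a grayscale conversion followed by a simple levels stretch. It is not a description of what Acrobat does internally; the pixel colors, the black and white points, and the function names are all hypothetical values chosen for illustration.

```python
def luminance(rgb):
    """Approximate perceived brightness (0-255) using the common Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def stretch(value, black_point, white_point):
    """Linear levels stretch: map black_point..white_point onto 0..255, clamping outside."""
    scaled = (value - black_point) / (white_point - black_point) * 255
    return round(min(255, max(0, scaled)))


# Hypothetical sample colors: aged brown newsprint and slightly faded ink.
paper = (200, 170, 120)
ink = (60, 45, 30)

# Contrast before any adjustment: brightness gap between paper and ink.
before = luminance(paper) - luminance(ink)

# Contrast after pushing the paper toward white and the ink toward black.
BLACK_POINT, WHITE_POINT = 50, 170  # assumed levels settings
after = (stretch(luminance(paper), BLACK_POINT, WHITE_POINT)
         - stretch(luminance(ink), BLACK_POINT, WHITE_POINT))

print(f"Contrast before: {before:.0f} of 255, after: {after} of 255")
```

With these sample values the paper ends up pure white and the ink pure black, which is the kind of separation that helps OCR, at the cost of the brown tone.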

 

I do not believe my wife and I went through Antigonish when we were in NS in 2010, but I have to say we had a wonderful time. We made the mistake of circumnavigating the island without fully appreciating how big it is. But in the two weeks we were there, we did go all the way around. (Which may have something to do with our not seeing Antigonish.) You do live in a beautiful region; enjoy!

 

Delighted I may have helped you somewhat, good luck with this task.

 

Gary