Inspiring
January 13, 2024
Answered

Scan and OCR


Hello,

I have to scan a book using Acrobat Pro.

My goal is to scan this book to PDF and convert the PDF to Word so that I can work on the text of the book.

What is the best workflow to get a perfect OCR (no errors in the text of the exported Word file)?

Correct answer gary_sc

Hi, @pierret18811376 ; big task ahead of you.

 

I'm assuming that copyright is not going to be an issue? As a book author, I need to point that out.

 

Meanwhile, 100% accuracy is not possible. Please pull back on that concept of perfect OCR; it doesn't exist. I've been scanning and OCRing for about 30 years, so I do have some background in all this. But there are things that you can do to improve your quality. First off, be prepared to destroy the book. To get good clean scans, you have to destroy the binding. You cannot open the book against the glass, with the pages curling toward the spine, and assume that the OCR "will get it." It won't.

 

Beyond that, please take the time to read the following blog I wrote for Adobe a number of years ago. It covers how to achieve the best result when performing scanning and OCR.

 

https://community.adobe.com/t5/adobe-community-professionals/scanning-clean-searchable-pdfs/m-p/4785435?page=1#M89

 

Good luck!

 

6 replies

Inspiring
January 14, 2024

Hi @gary_sc , Hi @Abambo 

OK, my current scanner (HP PageWide) is not as good as yours, but thanks to your advice, I have managed to get a good OCR result.

So Gary, you currently use an Epson scanner with Silverfast as the software (which version of Silverfast and which Epson scanner?).

And you, Abambo, what are your scanner and software?

gary_sc
Community Expert
January 14, 2024

Hi, @pierret18811376, when considering scanners, there is one aspect you can't get past: which kind of sensor they have. There are two kinds of sensors: CCD and CIS (based on CMOS).

 

CCD sensors (similar to camera sensors) are more expensive to manufacture and require more power to function, but they provide much higher-quality images with less noise. 

 

CIS sensors are much cheaper to manufacture, but their sensitivity to light is lower (thus more noise), and they are limited to lower resolution. [Digitally removing the noise leaves the images slightly soft, just as happens when you remove noise digitally in (say) Photoshop.]

 

So, if it's an inexpensive scanner, it will be a CIS sensor. Plus, if you have any type of combo scanner/printer/fax/etc., it will also be a CIS sensor.

 

The good news is that any kind of high-end scanner you get will also have higher-end mechanics.

 

However, if all you are ever doing is scanning documents for OCR, either is fine, so you are wasting your money with a CCD scanner. If you need to scan images, negatives, slides, etc., you are wasting your money with a CIS sensor.

 

Also, if you are scanning documents for OCR, then Silverfast is a waste of money. If you are scanning images, negatives, slides, etc., you might be wasting money without it once you consider your time. Besides having higher scanning capabilities, I think the thing you are paying for with Silverfast is efficiency, which I do not have time to get into in a message, but I can simplify the answer with one example: if you are scanning (say) 35mm negatives on my scanner, my template can hold three across and up to six or seven down. If you have things set up to do (say) 18 images, on non-Silverfast scanners, there would be 18 separate scans. If you are scanning at high resolution, that can take a very long time. With Silverfast, it would be just six scans because it would scan the three across concurrently. Stated otherwise, that's one-third the scanning time. That means a lot.
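
The arithmetic behind that one-third figure can be sketched as follows. This is only an illustration that assumes every pass of the scanner takes about the same time; `scan_passes` is a hypothetical helper, not a Silverfast or Epson function.

```python
import math

def scan_passes(frames: int, frames_per_pass: int) -> int:
    """Number of scanner passes needed when several frames share one pass."""
    if frames < 0 or frames_per_pass < 1:
        raise ValueError("frames must be >= 0 and frames_per_pass >= 1")
    # A partly filled last row still costs a full pass, hence the ceiling.
    return math.ceil(frames / frames_per_pass)

one_at_a_time = scan_passes(18, 1)  # one frame per pass
three_across = scan_passes(18, 3)   # three frames captured per pass

print(one_at_a_time, three_across)
print(f"relative scanning time: {three_across / one_at_a_time:.2f}")
```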

 

Oh, one other ability: if I am using my Epson scanning software (which otherwise is good scanning software), and I have two photos I wish to scan, let's say one is color and one is B&W, I can only scan both as color or both as B&W. With Silverfast, I can scan one as color and one as B&W, and if they are aligned next to each other (left & right), that would be done in one scan. [If they were aligned one above the other, that would still have to be done as two scans, but I would not have to first do the color and then do the B&W; it would still be done as one sweep of the scanner.]

 

BTW, ALL scanners, regardless of manufacturer, lie about their highest resolution capability (even Epson).

 

Oh, and for the record, my Epson is the V800, now sold as the V850.

Abambo
Community Expert
January 14, 2024

I normally scan everything on my printer (with network transfer to my PC). Holding the book and operating the PC at the same time will be challenging.

ABAMBO | Hard- and Software Engineer | Photographer
Abambo
Community Expert
January 14, 2024

What is the best workflow to get a perfect OCR (no errors in the text of the exported Word file)?


By @pierret18811376

The spellchecker will be your best friend!

ABAMBO | Hard- and Software Engineer | Photographer
gary_sc
Community Expert
January 14, 2024

@Abambo is correct. 

 

One of the failings of Acrobat's OCR is that it lets you correct only one word at a time. My point is that in Word, if you find a misspelled word and that same word is misspelled many times, you can do a global fix on all of those "many times" with a few clicks. In Acrobat, you have to fix each word, one at a time.

 

So, as I show in the blog post I linked to above, look for the red lines and see whether many of them can be fixed in a few clicks.
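
The same "global fix" idea can also be applied outside Word once the text has been exported. The following is a minimal sketch, assuming the exported text is available as a plain string; `fix_all` and the sample sentence are hypothetical, and "modem" for "modern" stands in for whatever misreading recurs in your own scan.

```python
import re

def fix_all(text: str, corrections: dict[str, str]) -> tuple[str, int]:
    """Replace every whole-word occurrence of each misread word.

    Returns the corrected text and the total number of replacements.
    """
    total = 0
    for wrong, right in corrections.items():
        # \b keeps the match to whole words so longer words are not altered.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text, count = re.subn(pattern, right, text)
        total += count
    return text, total

sample = "The modem era began early. A modem reader may disagree."
fixed, replaced = fix_all(sample, {"modem": "modern"})
print(replaced, fixed)
```

A blind replacement like this also changes any place where the "wrong" word was actually intended, so it is worth reviewing the matches before accepting them.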

Inspiring
January 14, 2024

Hi Gary,

My scanner is a multifunction HP device that scans and prints.

I manage the scan entirely from Acrobat Pro, using the "Create a PDF" function and choosing "Scanner" in the menu. Here are my scan settings in Acrobat Pro:

 

1) Color mode on "Automatic" and resolution at 300 PPI (this is the maximum PPI shown by Acrobat Pro, so I suppose my scanner is limited to 300 PPI):

 

2) Optimization of the scanned PDF (I have no adaptive compression and no automatic background removal + descreening + text sharpening):

 

 

3) Quality at the maximum:

So, I don't scan to TIFF with a histogram and then convert to PDF: I make all my scans directly as PDF using Acrobat Pro (Create PDF function), and I simply remove the background manually on each page of the created PDF in Acrobat Pro.

You mention "Levels": where can I find it (is it in the scanner software)? In my case, with Acrobat Pro, what could I do "at the creation of the scan"?

Abambo
Community Expert
January 14, 2024

Levels is ideally set in the scanner software (and hardware). You define the black point (everything to the left of this point will be rendered as 100% black), the white point (everything to the right of this point will be 100% white), and the mid point (50% gray). That allows you to "cut off" impurities and show-through, possibly at the expense of accuracy (what is light gray may end up white).

 

Also, for OCR, it is best to scan in greyscale.

 

HP MFP machines deliver lousy quality compared to the standalone scanners of their time. They do their job, but the result could be better. I wouldn't scan high-fidelity colour pictures with them, but for OCR, they do a nice job.

ABAMBO | Hard- and Software Engineer | Photographer
Inspiring
January 13, 2024

Hi Gary,

Thank you.

Following your advice, here is what I did:

1) First, I set Acrobat Pro to this setting for the scan (OCR setting on "Text and image"):

 

 

2) As I cannot destroy the book binding, I laid the book as flat as possible on the scanner.

3) After scanning, I cropped each page around the text in Acrobat Pro. Then I removed the background on each page.

4) Finally, I exported it to a Word document.

Result: it seems to be far better than before (fewer errors in the Word document).

gary_sc
Community Expert
January 13, 2024

Hi, @pierret18811376, did you remove any grey from the page via Levels, as I mentioned in my blog? Also, do scan at 300 ppi unless the text is very small; in that case, go up to 600 ppi. (The latter will significantly increase the storage size of the TIF image, but once you convert the page to PDF and OCR it, it will be down to a "normal" size.)
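
As a rough illustration of how much the TIF grows, here is a small sketch of the uncompressed size at each resolution. The 6 x 9 inch page and 8-bit greyscale are assumptions made for the example, and `uncompressed_bytes` is a hypothetical helper; real files will differ once compression and metadata are involved.

```python
def uncompressed_bytes(width_in: float, height_in: float, ppi: int,
                       bytes_per_pixel: int = 1) -> int:
    """Approximate raw image size: pixels across x pixels down x bytes each."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bytes_per_pixel

# Assumed page: 6 x 9 inches, 8-bit greyscale (1 byte per pixel).
for ppi in (300, 600):
    size = uncompressed_bytes(6, 9, ppi)
    print(f"{ppi} ppi: {size / 1_000_000:.2f} MB uncompressed")
```

Doubling the resolution doubles the pixel count in both directions, so the raw image is four times larger.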

 

If you want to get the best quality, you need to do everything you can DURING THE SCAN. Even doing the same fixes in Photoshop that you're doing in the scan will not be as good as when they are done at creation in the scan.

Participant
February 23, 2025

At the outset, I must thank you for the word of caution. I am told by Adobe that I have to install Asian language support files.

 

Moreover, I have been assigned this job by the owners and authors.

 

gary_sc
Community Expert
February 23, 2025

Hi, @ESSAAR; you are jumping into a thread that is over a year old and introducing issues that I have no knowledge of.

 

If you've already talked to Adobe and have been told that you require Asian language fonts and that the original author/owners have permitted you to make alterations in a PDF, then there's not much else I can add. 

 

Earlier in this thread, I linked to a blog post I wrote about getting as good a quality scan as possible. I suggest you read that. Also, be aware that even the best quality scan will NOT make it easy to convert an entire document into editable text for subsequent editing, and certainly not for extensive editing. I have no idea how adding Asian fonts to the mix will work. I've only worked with the English language and standard fonts.

 

All the best of luck!