Hello,
I have to scan a book using Acrobat Pro.
My goal is to scan the book to PDF and then convert the PDF to Word so that I can work on the text of the book.
What is the best workflow in order to have a perfect OCR (no error in the text of the exported Word file) ?
Hi, @pierret18811376; big task ahead of you.
I'm assuming that copyright is not going to be an issue? As a book author, I need to point that out.
Meanwhile, 100% accuracy is not possible. Please let go of the idea of perfect OCR; it doesn't exist. I've been scanning and OCRing for about 30 years, so I do have some background in all this. But there are things that you can do to improve your quality. First off, be prepared to destroy the book. To get good, clean scans, you have to destroy the binding. You cannot scan pages with the curled edge caused by pressing an open book against the glass and assume that the OCR "will get it." It won't.
Beyond that, please take the time to read the following blog I wrote for Adobe a number of years ago. It covers how to achieve the best result when performing scanning and OCR.
Good luck!
At the outset, I must thank you for the word of caution. I am told by Adobe that I have to install the Asian language support files.
Moreover, I have been assigned this job by the owners and authors.
Hi, @ESSAAR; you are jumping into a thread that is over a year old and raising issues I know nothing about.
If you've already talked to Adobe and have been told that you require Asian language fonts and that the original author/owners have permitted you to make alterations in a PDF, then there's not much else I can add.
Earlier in this thread, I linked to a blog post I wrote about getting as good a quality scan as possible. I suggest you read that. Also, be aware that even the best quality scan will NOT make it easy to convert an entire document into editable text for subsequent editing, and certainly not for extensive editing. I have no idea how adding Asian fonts into the mix will work. I've only worked with the English language and standard fonts.
All the best of luck!
Hi Gary,
Thank you.
Following your advice, this is what I have done:
1) First, I set Acrobat Pro to this setting for the scan (OCR setting on "Text and image"):
2) As I cannot destroy the book binding, I laid the book as flat as possible on the scanner.
3) After the scan, in Acrobat Pro, I cropped each page around the text. Then I removed the background on each page.
4) And I exported it to a Word document.
Result: it seems to be far better than before (fewer errors in the Word document).
Hi, @pierret18811376, did you remove any grey from the page via Levels, as I mentioned in my blog? Also, do scan at 300 ppi unless the text is very small; in that case, go up to 600 ppi (the latter will significantly increase the storage size of the TIFF image, but once you PDF and OCR the page, it will be down to a "normal" size).
If you want to get the best quality, you need to do everything you can DURING THE SCAN. Even doing the same fixes in Photoshop that you're doing in the scan will not be as good as when they are done at creation in the scan.
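To put rough numbers on the 300 vs. 600 ppi point, here is a minimal sketch that estimates the uncompressed size of one greyscale page at each resolution. The 6 x 9 inch page size, the 8-bit greyscale depth, and the function name are assumptions chosen for illustration, not values from this thread; a real TIFF may be compressed and come out smaller.

```python
def uncompressed_bytes(width_in, height_in, ppi, bytes_per_pixel=1):
    """Estimate the raw size of a scanned page (no compression).

    bytes_per_pixel=1 assumes 8-bit greyscale; 24-bit color would be 3.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bytes_per_pixel


# Hypothetical book page of 6 x 9 inches (an assumption for illustration).
page = (6.0, 9.0)
size_300 = uncompressed_bytes(*page, ppi=300)
size_600 = uncompressed_bytes(*page, ppi=600)

print(f"300 ppi: {size_300 / 1_000_000:.1f} MB per page")
print(f"600 ppi: {size_600 / 1_000_000:.1f} MB per page")
print(f"ratio:   {size_600 / size_300:.0f}x")
```

Doubling the resolution doubles both the width and the height in pixels, so the raw image is four times larger, which is why 600 ppi is worth reserving for very small text.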
Hi Gary,
My scanner is a multifunction HP device that scans and prints.
What I have done is manage the scan entirely from Acrobat Pro, using the "Create a PDF" function and choosing "Scanner" in the menu. Here are my scan settings from Acrobat Pro:
1) Color mode on "Automatic" and resolution at 300 PPI (it is the maximum PPI shown by Acrobat Pro, so I suppose my scanner is limited to 300 PPI):
2) Optimization of the scanned PDF (I have no adaptive compression and no automatic background removal + descreening + text sharpening):
3) Quality at maximum:
So I don't scan to TIFF with a histogram and then convert to PDF: I scan directly to PDF using Acrobat Pro (Create PDF function), and I simply remove the background manually on each page of the created PDF in Acrobat Pro.
You mention "Levels": where can I find it (is it in the scanner software)? In my case, with Acrobat Pro, what could I do "at the creation of the scan"?
Levels is ideally in the scanner software (and hardware). You define the black point (everything to the left of this point will be treated as 100% black), the white point (everything to the right of this point will be 100% white), and the mid point (50% gray). That allows you to "cut off" impurities and shine-through, at the possible expense of accuracy (what is light gray will end up white).
Also, for OCR, it is best to scan in greyscale.
HP MFP machines have lousy quality compared to the stand-alone scanners of their time. They do their job, but the result could be better. I wouldn't scan high-fidelity colour pictures with them, but for OCR, they do a nice job.
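For anyone who wants to see what a Levels adjustment does to the numbers, here is a minimal sketch on 8-bit grey values (0 = black, 255 = white). The function name, the sample values, and the way the mid point is turned into a curve are assumptions for illustration; scanner software may implement the mid point differently.

```python
import math


def apply_levels(value, black, white, mid=None):
    """Map one 8-bit grey value through a black/white/mid point adjustment.

    Values at or below `black` become 0, values at or above `white`
    become 255, and `mid` (optional) is the input value pushed to 50% gray.
    """
    if not 0 <= black < white <= 255:
        raise ValueError("need 0 <= black < white <= 255")
    if value <= black:
        return 0
    if value >= white:
        return 255
    t = (value - black) / (white - black)  # position between the two points
    if mid is not None:
        if not black < mid < white:
            raise ValueError("mid must lie between black and white")
        m = (mid - black) / (white - black)
        # Assumed convention: a power curve chosen so that `mid` lands on 0.5.
        t = t ** (math.log(0.5) / math.log(m))
    return round(t * 255)


# Hypothetical pixel row: dark text, mid greys, greyish paper, faint shine-through.
row = [20, 60, 120, 210, 245]
cleaned = [apply_levels(v, black=40, white=200) for v in row]
print(cleaned)
```

With the white point at 200, both the greyish paper (210) and the faint shine-through (245) end up pure white, which is the "cut off" effect described above. The cost is that any genuinely light-grey detail above the white point disappears too.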
Hi, @pierret18811376, let me point out one important thing for you to understand (and I fault myself for not mentioning this earlier): Acrobat CANNOT scan anything. Rather, it relies upon software called TWAIN that gives Acrobat access to the scanner's software. [On the Macintosh side of things, long ago, Jobs did not want anything that would let one piece of software control another piece of software. So, what Apple did was to let Acrobat use TWAIN to control their own software, "Image Capture." The problem here is that Image Capture is the worst scanning software I've ever encountered. The other alternative here is for the scanner software to add a plugin for other software to use. For whatever reason, scanner companies tend not to do that for Acrobat.]
So, what this boils down to is that the software you are seeing, and your screenshot above, IS your scanner's software; not Acrobat's.
I have never used any Brother scanner, so I have no idea what options it may have. However, for example, Epson scanners (in the past) have offered various levels of options: Basic, General, Expert, or something like that. With each higher level, more options become available. It is with the Expert level one finds the Levels option.
Nonetheless, looking at the Brother's scanning software from your screenshots above, there are several options I think you may wish to see how they improve your scans.
1) If no colors are needed, please set the color mode to Black & White (or greyscale, whatever they call it). This is because the pages of an old book may occasionally have a bit of color to them, which would cause an automatic mode to scan in color. The only negative of scanning in color is that the storage size of the final document will be larger.
2) Détramage (descreening): this is unnecessary, on or off, since there are no screened* images (or are there? Please let me know).
3) Suppression de l'arrière-plan (background removal): I think you want this, and you want to set it to Élevée (high). What I think this will do is suppress the background (image) noise (for example, the text on the opposite side of the page you are working with). However, I am guessing, so you will need to try this and see what it does.
4) Amélioration de la netteté du texte (text sharpening): please set this to Élevée (high). Your scanner sensors are not high-quality (sorry), and if the software can compensate for that, it will help things.
*Screened images: these are images that are composed of dots of varying sizes. Newspapers traditionally have these.
@gary_sc , it's HP!
@Abambo, Ah, thanks for the correction. I was helping someone with a Brother scanner, and my brain hadn't refreshed! :>)
My first scanner was an Apple (1988); when that got crushed in one of California's earthquakes, I did have an HP for a while, but then I bought an Epson. I've had Epsons ever since. Epson's software is good, but I use SilverFast now because it is worth every penny.
What is the best workflow in order to have a perfect OCR (no error in the text of the exported Word file) ?
By pierret18811376
The spellchecker will be your best friend!
@Abambo is correct.
One of the failings of Acrobat's OCR is that it will only correct one word at a time. My point is that in Word, if you find a word misspelled and that same word is misspelled many times, you can do a global fix on all of those "many times" with a few clicks. In Acrobat, you have to do each word, one at a time.
So, as I show in that Blog I linked to above, look for the red lines and try to see if there are many that can be fixed in a few clicks.
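The "global fix" idea can also be scripted once the text is out of Acrobat. Below is a minimal sketch, assuming the exported text has been saved as plain text; the function name, the sample sentence, and the misreadings in the corrections table are hypothetical, and this is not a feature of Acrobat or Word.

```python
import re


def apply_corrections(text, corrections):
    """Replace every whole-word occurrence of each known OCR misreading.

    `corrections` maps the wrong word to the right one (case-sensitive).
    Returns the corrected text and the number of replacements made.
    """
    if not corrections:
        return text, 0
    # \b keeps the match on whole words, so "tlie" inside a longer word is left alone.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in corrections) + r")\b"
    )
    return pattern.subn(lambda match: corrections[match.group(1)], text)


# Hypothetical example: the OCR has read "h" as "li" throughout.
sample = "Tlie scanner reads tlie page, and tlie OCR guesses."
fixed, count = apply_corrections(sample, {"Tlie": "The", "tlie": "the"})
print(fixed)
print(count, "replacements")
```

A table like this is only safe for misreadings that are never valid words on their own; anything ambiguous still needs a human eye and the spellchecker.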
I normally scan everything on my printer (with network transfer to my PC). Holding the book and operating the PC at the same time will be challenging.
OK, my current scanner (HP PageWide) is not as good as yours, but thanks to your advice, I have managed to get a good OCR result.
So Gary, you currently use an Epson scanner with SilverFast as the software (which version of SilverFast and which Epson scanner?).
And you, Abambo, what are your scanner and software?
Hi, @pierret18811376, when considering scanners, there is one aspect you can't get past: which kind of sensor they have. There are two kinds of sensors: CCD and CIS (based on CMOS).
CCD sensors (similar to camera sensors) are more expensive to manufacture and require more power to function, but they provide much higher-quality images with less noise.
CIS sensors are much cheaper to manufacture, but their sensitivity to light is lower (thus more noise), and they are limited to lower resolution. [Removing the noise digitally makes the images slightly soft, just as happens in (say) Photoshop when you remove noise digitally.]
So, if it's an inexpensive scanner, it will be a CIS sensor. Plus, if you have any type of combo scanner/printer/fax/etc., it will also be a CIS sensor.
The good news is that any kind of high-end scanner you get will also have higher-end mechanics.
However, if all you are ever doing is scanning documents for OCR, either is fine, and you are wasting your money with a CCD scanner. If you need to scan images, negatives, slides, etc., you are wasting your money with a CIS sensor.
Also, if you are scanning documents for OCR, then SilverFast is a waste of money. If you are scanning images, negatives, slides, etc., you might be wasting money without it if you consider your time. Besides having higher scanning capabilities, I think the thing you are paying for with SilverFast is efficiency, which I do not have time to get into in a message, but I can give a simple example: if you are scanning (say) 35mm negatives on my scanner, my template can hold three across and up to six or seven down. If you have things set up to do (say) 18 images, with non-SilverFast software there would be 18 separate scans. If you are scanning at high resolution, that can take a long, long time. With SilverFast, it would be just six scans because it scans the three across concurrently. Stated otherwise, that's one-third the scanning time. That means a lot.
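The arithmetic behind that example can be sketched as below. The function name is hypothetical, and the model (one sweep per row of frames, all sweeps taking equal time) is a simplifying assumption rather than a description of how SilverFast works internally.

```python
import math


def passes_needed(frames, frames_per_pass):
    """Number of scanner sweeps needed to cover all frames."""
    return math.ceil(frames / frames_per_pass)


frames = 18  # the 18 negatives from the example above

one_at_a_time = passes_needed(frames, frames_per_pass=1)
three_across = passes_needed(frames, frames_per_pass=3)

print(one_at_a_time, "passes when each frame is scanned separately")
print(three_across, "passes when three frames across share one sweep")
print(f"time ratio: {three_across / one_at_a_time:.2f}")
```

The rounding up matters when the frame count is not a multiple of three: 19 frames at three across would still need seven sweeps.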
Oh, one other ability: if I am using my Epson scanning software (which otherwise is good scanning software) and I have two photos I wish to scan, let's say one color and one B&W, I can only scan both as color or both as B&W. With SilverFast, I can scan one as color and one as B&W, and if they are aligned next to each other (left & right), that would be done in one scan. [If they were aligned one above the other, that would still have to be done as two scans, but I would not have to first do the color and then do the B&W; it would still be done in one sweep of the scanner.]
BTW, ALL scanners, regardless of manufacturer, lie about their highest resolution capability (even Epson).
Oh, and for the record, my Epson is the V800, now sold as the V850.

