I don't want to seem overly negative, as I really like Adobe products in general... but I've just paid for a full subscription to Adobe Acrobat Pro, hoping that the OCR would do a good job, and it's terrible. It is no more accurate than the OCR scanning I used 20 years ago. Are there no settings to adjust the scanning quality, or the contrast? There seems to be one option and that's it.
I was hoping to convert this PDF document (1981 PDF Document) to maintain the original 1981 look, but be possible for blind people to use with a screen reader, without having to almost re-type the entire document. I can't see any options to tweak the AI / method used to try to get a better result. Am I barking up the wrong tree with Acrobat?
Thank you for supplying the document that you were working on, it was very helpful.
I did download it and I did run it through my copy of Acrobat Pro DC, and like you, did get dreadful results.
To be honest, I found the results no worse than I was expecting when I saw the quality of the copy that you were working on. Now please read all of this, because first I'm going to disagree with you and then mostly agree with you.
First off, the quality of the original scan was fairly dreadful. I've seen worse, much worse, but this was not good to begin with. Achieving good OCR is like taking great photos: the more you do in camera and the less you do in Photoshop, the better the image is going to be. It is absolutely no different when scanning: the more you do at the time of scanning, the better the OCR results will be. On a scale of 1-10, I'd call this a 6+ and the results are about the same.
The scan appears to have been done at about 200-225 ppi (300 ppi is considered the minimum and 600 ppi is ideal). However, there is a fuzziness to the scan that makes me think this was a photocopy before it was scanned, and that is a big problem. The size of the font is OK, but it's been my experience that Acrobat's OCR has problems with Courier. I don't know why, but it just seems to be an issue for it.
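As a rough way to check a scan against the 300 ppi minimum and 600 ppi ideal mentioned above, the sketch below divides the image's pixel width by the physical page width. It assumes an A4 page (8.27 inches wide), which is a guess for a UK booklet and not something confirmed in this thread; the function names are made up for illustration.

```python
# Assumption: the original page is A4, 8.27 inches wide.
A4_WIDTH_IN = 8.27


def effective_ppi(pixel_width: int, page_width_in: float) -> float:
    """Pixels per inch implied by the image width and the paper width."""
    return pixel_width / page_width_in


def rate_for_ocr(ppi: float) -> str:
    """Classify a resolution against the 300 / 600 ppi guidance above."""
    if ppi >= 600:
        return "ideal"
    if ppi >= 300:
        return "minimum met"
    return "below minimum"


if __name__ == "__main__":
    # Example pixel widths for three hypothetical scans of the same page.
    for pixel_width in (1754, 2500, 5000):
        ppi = effective_ppi(pixel_width, A4_WIDTH_IN)
        print(f"{pixel_width} px wide: about {ppi:.0f} ppi, {rate_for_ocr(ppi)}")
```

Note that this only measures pixel count. A blurry photocopy scanned at 600 ppi is still a blurry photocopy, so a passing number here does not guarantee good OCR.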
Another issue with the scan is the bleeding from the back (or ghosting of text on the backside). That can screw up an OCR process as well.
So, in a nutshell, what I'm seeing with the results are about on par with the quality of the original scan for this PDF.
Now, on the other hand, why isn't it any better? What I have to think about is the current software trend of using AI to better work out what could or should be taking place. I do not know if you follow Photoshop at all, but they are investing a lot of time and money in using AI for image enhancement. It's at an early stage right now but does show a lot of promise.
I do know that Adobe does not make its own OCR engine; they license it from another company (at this moment I can't remember which one, sorry). But I do wonder if ANY company is starting to utilize AI to increase the quality of OCR. If not, I'd be astounded, but it all depends upon someone high up saying "Hey, we should look into this." Until that time, we have what we have.
A number of years ago I did a blog for Adobe on how to get a cleaner scan and wrote the following. It might give you some ideas to work with to get a better quality end result.
One thing I can suggest is that you take the end result of what you're getting with this, export it to a Word document, and do the text correction in Word. Word has a variety of features that are significantly better than what Acrobat has for correcting an OCR document. The one big advantage you have with this document is that the formatting is very straightforward and will not be affected to any degree by exporting to Word. FWIW, several years ago I found a family history that my mom wrote MANY years ago, scanned it, OCRed it, and then brought it into Word for correction. On my previous scale of 1-10, I'd have given that scan about a 3: the platen on the typewriter was slipping so some text was at an angle, and there were lots of pencil scribbles and corrections. It was a mess. But it got done.
Anyhow, here's the blog, I hope you get something from it.
I strongly endorse the response from @gary_sc.
It goes under GIGO, garbage in, garbage out! The original document appears to have been printed on a daisywheel, dot matrix, or low resolution inkjet printer typical of the time period (1981) and then photocopied!
On further analysis of the PDF file provided, and to make matters worse, it appears to have been created by placing images into a Microsoft Word document and using Microsoft's own PDF creation, which is notoriously problematic. That is probably the source of the images being 200-225 dpi and in fuzzy-wuzzy JPEG format. Microsoft Word has preferences for the resolution at which placed images are stored. Always use the High fidelity resolution setting.
Furthermore, use Acrobat's Save as Adobe PDF PDFMaker facility to create the PDF from Word, not Microsoft's! Create special options so that images are not downsampled and are ZIP-compressed within the PDF file. You absolutely don't want JPEG or even JPEG2000 for this purpose.
However, if there is a way for you to get the original scan images and ascertain whether they are significantly higher resolution (and preferably not JPEG), I would suggest creating a PDF file directly from such images and trying OCR in Acrobat from there. Even better, if you have the original paper, I would suggest totally rescanning at 600 dpi into lossless TIFF format and for pages with issues, doing some edits in Photoshop.
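If it helps when sorting through recovered scan files, here is a minimal sketch for telling JPEG from TIFF or PNG by the first few bytes of the file, without trusting the file extension. The byte signatures are the standard ones for those formats; the function names are hypothetical, and nothing here is part of Acrobat.

```python
# Leading byte signatures for common scan formats.
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"II*\x00": "TIFF",          # little-endian TIFF
    b"MM\x00*": "TIFF",          # big-endian TIFF
    b"\x89PNG\r\n\x1a\n": "PNG",
}


def sniff_image_format(header: bytes) -> str:
    """Return the format name implied by the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file to identify its format."""
    with open(path, "rb") as handle:
        return sniff_image_format(handle.read(8))


if __name__ == "__main__":
    print(sniff_image_format(b"\xff\xd8\xff\xe0"))  # a JPEG header
    print(sniff_image_format(b"II*\x00\x08\x00"))   # a TIFF header
```

One caveat: a TIFF container can itself hold JPEG-compressed data, so a "TIFF" result is a good sign but not proof that the image is lossless.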
Thank you both for your replies. It's correct that this is a photocopy. The Blue File project was a means of sharing information across the UK with teachers and parents about the use of IT in education. The booklet would definitely have been a photocopy of some kind.
I suppose I overestimate the power of computers. If I can read it easily (well, if I use a magnifier, as my eyesight is a bit shot), I'd expect a computer in 2021 to do a far better job than it did.
I'll do a 1 or 2 page experiment with better settings as close to what is recommended here, and see if that makes much difference. Thanks again.
Actually, I don't think that you are overestimating the power of computers, but rather underestimating the power of the human brain to compensate for anomalies in what we see and to make decisions based upon our experiences over time.
Plenty of work is being done in terms of applying artificial intelligence to recognition and interpretation of text. Ultimately, OCR should improve significantly, but in the meantime ...
Maybe another 20 years. Humans at their best are a wonderful thing 🙂
Certainly, OCR in Acrobat has worked far better with other, cleaner documents I've tried since. I was initially using 300 dpi JPGs at the highest setting in the Word document. This time I scanned a page at 600 dpi, saved it as TIFF, and used Photoshop to reduce the reverse print that was showing through. Still poor, but a fair bit better.
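For anyone curious what "reducing the reverse print" amounts to in principle, the sketch below pushes every light-grey pixel to pure white and leaves the darker foreground text alone. It is a simple threshold clean-up under stated assumptions, not what Photoshop or Acrobat actually does; the cutoff of 170 and the function name are made up for illustration and would need tuning per page.

```python
def suppress_bleed_through(pixels, cutoff=170):
    """Whiten faint marks in a grayscale image.

    pixels: rows of values from 0 (black) to 255 (white).
    cutoff: values at or above this are treated as paper or show-through
            and set to white (a hypothetical default, tune per page).
    """
    return [[255 if value >= cutoff else value for value in row]
            for row in pixels]


if __name__ == "__main__":
    # One row: paper, faint show-through, then two dark foreground pixels.
    page = [[250, 190, 60, 20]]
    print(suppress_bleed_through(page))
```

This only works when the show-through is clearly lighter than the real text. On a faded photocopy where the foreground strokes are themselves grey, the same cutoff will erase parts of the letters, which is one reason fixing bleed-through at scan time tends to give better results.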
I may be able to track down an original copy of the 1981 document, so that might be the way to go, to avoid a massive job of tweaking. Thanks again both for your help and thoughts. Certainly very helpful.
Please DO read my blog here:
If you are trying to fix the image AFTER the scan in PS, you will get poorer results than trying to fix the image at the time of the scan.
That's not to say you can't help yourself by removing specks and artifacts in PS, but most, if not all, of the bleed-through can be removed at the time of scanning.
I just looked at your document and to be very honest, it doesn't look like you read any of the preceding comments from Dov Isaacs or myself, nor did you read the linked reference to the blog I wrote for Adobe.
The scan you show in your attached PDF looks like a low-quality photocopy of typewritten content with a lot of the parts of the characters missing (loops of letters not closed, regions of ascenders and descenders missing, etc.). This is a nightmare for any OCR application. Plus, you also have a lot of bleed-through from the other side of the page, a specific issue my blog explains how to guard against. But since the photocopy is blurry, there's not much you can gain by scanning at a higher resolution; the document is already starting from a bad place.
If you can find software that can do a better job than this, by all means, use it. But you cannot expect to take a worn, well-used pallet to a cabinet maker and have them make fine furniture from it.
Hi gary_sc, please compare this file with the result of Acrobat's OCR. Do you see a difference?
You didn't attach any file to this message.
Didn't you comment on the file earlier?
That was a single 51-page document. I was assuming that you'd provide a one-page sample of each version for comparison. I have no way to look at a 51-page document and guess what the other version would look like.