OCR seems very poor

Report · Jan 17, 2021

I don't want to seem overly negative, as I really like Adobe products in general... but I've just paid full subscription for Adobe Acrobat Pro, hoping that the OCR would do a good job, and it's terrible. It is no more accurate than the OCR scanning I used 20 years ago. Are there no settings to adjust the scanning quality, or to adjust the contrast? There seems to be one option and that's it.

I was hoping to convert this PDF document (1981 PDF Document) to maintain the original 1981 look, but be possible for blind people to use with a screen reader, without having to almost re-type the entire document. I can't see any options to tweak the AI / method used to try to get a better result. Am I barking up the wrong tree with Acrobat?

Report · Jan 17, 2021

Hi OneSwitch,

Thank you for supplying the document that you were working on, it was very helpful.

I did download it and I did run it through my copy of Acrobat Pro DC, and like you, did get dreadful results.

To be honest, I found the results no worse than I was expecting when I saw the quality of the copy that you were working on. Now please read all of this because first I'm going to disagree with you and then mostly agree with you.

First off the quality of the original scan was fairly dreadful. I've seen worse, much worse, but this was not good to begin with. Achieving good OCR is like taking great photos: the more you do in camera and the less you do in Photoshop, the better the image is going to be. It is absolutely no different when scanning: the more you do at the time of scanning the better the OCR results will be. On a scale of 1-10, I'd call this a 6+ and the results are about the same.

The scan appears to have been done at about 200-225 ppi (300 ppi is considered the minimum and 600 ppi is ideal). However, there is a fuzziness to the quality of the scan that makes me think this was a photocopy before it was scanned; that's a big trench. The size of the font is OK, but for some reason it's been my experience that Acrobat's OCR has problems with Courier. I don't know why, but it just seems to be an issue for it.
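The resolution rule of thumb above (300 ppi minimum, 600 ppi ideal) can be checked before running OCR. The following is only a minimal sketch: the function names `effective_ppi` and `rate_scan` are hypothetical, and the pixel dimensions are an invented example, not measurements taken from the 1981 document.

```python
def effective_ppi(pixel_width: int, pixel_height: int,
                  page_width_in: float, page_height_in: float) -> float:
    """Return the lower of the horizontal and vertical pixels per inch."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


def rate_scan(ppi: float) -> str:
    """Classify a scan with the 300 ppi minimum / 600 ppi ideal rule of thumb."""
    if ppi >= 600:
        return "ideal"
    if ppi >= 300:
        return "acceptable"
    return "below minimum"


# Hypothetical example: an A4 page (8.27 x 11.69 in) stored as 1700 x 2400 px.
ppi = effective_ppi(1700, 2400, 8.27, 11.69)
print(f"{ppi:.0f} ppi: {rate_scan(ppi)}")
```

A page image embedded at these dimensions falls in the 200-225 ppi range described above, so it would be flagged as below the minimum before any OCR is attempted.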

Another issue with the scan is the bleeding from the back (or ghosting of text on the backside). That can screw up an OCR process as well.

So, in a nutshell, what I'm seeing with the results are about on par with the quality of the original scan for this PDF.

Now, on the other hand, why isn't it any better? What I have to think about is the current software trend of using AI to better think out what could/should be taking place. I do not know if you follow Photoshop at all, but they are investing a lot of time and money in using AI to do some enhanced enhancements. It's in the early stage right now but does show a lot of promise.

I do know that Adobe does not make its own OCR engine; they rent it from another company (at this moment I can't remember which one, sorry). But I do wonder if ANY company is starting to utilize AI to increase the quality of OCR. If not, I'd be astounded, but it all depends upon someone high up to say "Hey, we should look into this." But until that time, we have what we have.

A number of years ago I did a blog for Adobe on how to get a cleaner scan and wrote the following. It might give you some ideas to work with to get a better quality end result.

One thing I can suggest is that you take the end result of what you're getting with this, export it into a Word document, and do the text correction in Word. Word has a variety of features that are significantly better than what Acrobat has for correcting an OCR document. The one big advantage you have with this document is that the formatting is very straightforward and will not be affected by the exporting to Word by any degree. FWIW, several years ago I found a family history that my mom wrote MANY years ago, scanned it, OCRed it, and then brought it into Word for correction. On my previous scale of 1-10, I'd have given that scan about a 3 because the original had pencil scribbles, the platen on the typewriter was causing slipping so some text was at an angle, there were lots of pencil corrections, it was a mess. But it got done.

Anyhow, here's the blog, I hope you get something from it.

http://photosbycoyne.com/Gary's_Help/Scanning/clean-scanning.html

Report · Nov 28, 2021

Hi
I am not sure what kind of OCR technology Google uses but it gives editable (Word document) results where Adobe gives a bunch of glyphs and symbols only. Not a single word, not ONE word recognized by the OCR engine. I used the same file and the result is impressive on Google and Zero on Adobe.
I understand that Bad info In doesn't help but Non-sense Out means nothing; it is wasteful.
Adobe promotes tools for this job that are giving the same results that I was getting more than 10 years ago.

Try Google instead.

Report · Jan 17, 2021

I strongly endorse the response from @gary_sc.

It goes under GIGO, garbage in, garbage out! The original document appears to have been printed on a daisywheel, dot matrix, or low resolution inkjet printer typical of the time period (1981) and then photocopied!

Further analyzing the PDF file provided, to make matters worse, it appears to be a PDF file created by placing images into a Microsoft Word document and using Microsoft's own PDF creation, which is notoriously problematic. That is probably the source of the images being 200-225 dpi and in fuzzy-wuzzy JPEG format. Microsoft Word has preferences as to what resolution to store placed images at. Always use the High fidelity resolution setting.

Furthermore, use Acrobat's Save as Adobe PDF PDFMaker facility to create PDF from Word, not Microsoft's! Create special options that result in images not being downsampled and ZIP-compressed within the PDF file. You absolutely don't want JPEG or even JPEG2000 for this purpose.

However, if there is a way for you to get the original scan images and ascertain whether they are significantly higher resolution (and preferably not JPEG), I would suggest creating a PDF file directly from such images and trying OCR in Acrobat from there. Even better, if you have the original paper, I would suggest totally rescanning at 600 dpi into lossless TIFF format and for pages with issues, doing some edits in Photoshop.

Good luck!

- Dov Isaacs, former Adobe Principal Scientist (April 30, 1990 - May 30, 2021)

Report · Jan 17, 2021

Thank you both for your replies. It's correct that this is a photocopy. The Blue File project was a means of sharing information across the UK with teachers and parents around IT in educational use. The booklet would have definitely been a photocopy of some nature.

I suppose I over estimate the power of computers. If I can read it easily (well - if I use a magnifier as my eye-sight is a bit shot), I'd expect a computer in 2021 to do a far better job than it did.

I'll do a 1 or 2 page experiment with better settings as close to what is recommended here, and see if that makes much difference. Thanks again.

Report · Jan 18, 2021

@OneSwitch

Actually, I don't think that you are overestimating the power of computers, but rather underestimating the power of the human brain to compensate for anomalies in what we see and to make decisions based upon our experiences over time.

Plenty of work is being done in terms of applying artificial intelligence to recognition and interpretation of text. Ultimately, OCR should improve significantly, but in the meantime ...

- Dov Isaacs, former Adobe Principal Scientist (April 30, 1990 - May 30, 2021)

Report · Jan 19, 2021

Maybe another 20 years. Humans at their best are a wonderful thing 🙂

Certainly, OCR in Acrobat has worked far better with other cleaner documents I've tried since. I was using 300dpi JPGs at highest setting in the Word document initially. This time I scanned a page at 600dpi, saved as TIFF, and used Photoshop to reduce the reverse print that was showing through. Still poor, but a fair bit better.

I may be able to track down an original copy of the 1981 document, so that might be the way to go, to avoid a massive job of tweaking. Thanks again both for your help and thoughts. Certainly very helpful.

Report · Jan 19, 2021

Hi OneSwitch,

Please DO read my blog here:

http://photosbycoyne.com/Gary's_Help/Scanning/clean-scanning.html

If you are trying to fix the image AFTER the scan in PS, you will get poorer results than trying to fix the image at the time of the scan.

That's not to say you can't help yourself by removing specks and artifacts in PS, but most, if not all, of the bleed-through can be removed at the time of scanning.

Good luck!

Report · Apr 25, 2021

It is beyond my understanding why Acrobat DC has not been getting better at OCR over so many years. I'm doing just about everything with Acrobat DC except OCR. Try Kofax Power PDF Advanced; the attached file is the result of its OCR. Too bad having to use two programs to produce a PDF...

Report · Apr 25, 2021

Hi V5E7A,

I just looked at your document and to be very honest, it doesn't look like you read any of the preceding comments from Dov Isaacs or myself, nor did you read the linked reference to the blog I wrote for Adobe.

The quality of the scan you show in your attached PDF looks like a low quality photocopy of typewritten content with a lot of the parts of the characters missing (loops of letters not closed, ascenders and descenders missing regions, etc.). This is a nightmare for any OCR application. Plus, you also have a lot of bleed-through from the other side of the page, a specific issue I raised in my blog to guard against. But since the photocopying is blurry, there's not much you can gain by scanning at a higher resolution; the document is already starting from a bad place.

If you can find software that can do a better job than this, by all means, use it. But you cannot expect to take a worn, well used pallet to a cabinet maker and have them make fine furniture from it.

Report · Apr 25, 2021

Hi gary_sc, please compare this file with the result of Acrobat's OCR. Do you see a difference?

Report · Apr 25, 2021

You didn't attach any file to this message.

Report · Apr 25, 2021

Didn't you comment on the file earlier?

Report · Apr 25, 2021

That was a single 51 page document. I was assuming that you'd provide a single one page (each) comparison. I have no way to look at a 51 page document and wonder what the other version would be.

Sorry

Report · Feb 03, 2022

gary_sc's article on how to scan is really good and helpful if you are digitizing paper. What is not so great are the various apologies made for Acrobat's long-standing crummy OCR results. Is the original scan less than ideal? Certainly. Do other comparably priced PDF editors' OCR engines do a way better job despite those deficiencies? If V5E7A's Kofax link is any indication, then definitely.

From the first paragraph of that document Kofax results looked like this copied to text file:

Over the last few years, the toy market has been increasingly flooded with -chip-based toys so that now they are a part of everyday life for moat children and many adults. Initially, most of the electronic games were packages utilising the home television sec but now all kinds of toys are appearing including mobile toys and learning games. With such a growth industry, one wonders what the next few years will produce but there is no doubt that it will be more -chip" based entertainment.

Acrobat DC yields this:

Over the last few years, th11 toy 111arket hau been increasingly flooded
with "chip" based toys so that now th&y are a part of evuryday life for
moat children and many adults. Initially, moat of the 11lectronic games
were packages utilising the homa television set but now all kinds of toys
are appearing including mobile toya and l11arning games. With such a
growth industry, one wond11rs what the next few years will produce but
there is no doubt that it will be more "chip" busttd antertailllllenc .

Sure they both have swampy kids and Acrobat did a better job capturing the quotes around "chip", but as far as general readability goes Kofax did a way better job than Acrobat's love of 1's and l's. Your mileage may vary with the usage of line breaks - Acrobat is more 'verbatim' and captures the look of the formatting a little better, but Kofax is overall more functional with less cleanup to make it presentable. If the goal was to make the existing PDF more accessible via screenreader then Kofax appears to do a better job of this out of the box.
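One rough way to put a number on the "1's and l's" problem, without needing a corrected reference text, is to count tokens that mix letters with digits or symbols. This is only a sketch of a heuristic: the function name `suspicious_tokens` is hypothetical, and the two sample strings are the first line of each output quoted above.

```python
import string


def suspicious_tokens(text: str) -> list[str]:
    """Return tokens that mix letters with digits or symbols (likely OCR noise)."""
    flagged = []
    for raw in text.split():
        # Ignore punctuation attached to the start or end of a word.
        token = raw.strip(string.punctuation)
        if not token:
            continue
        has_letter = any(ch.isalpha() for ch in token)
        # Apostrophes and hyphens inside a word are treated as normal.
        has_other = any(not (ch.isalpha() or ch in "'-") for ch in token)
        if has_letter and has_other:
            flagged.append(token)
    return flagged


kofax_line = "Over the last few years, the toy market has been increasingly flooded"
acrobat_line = "Over the last few years, th11 toy 111arket hau been increasingly flooded"

print("Kofax:", suspicious_tokens(kofax_line))
print("Acrobat:", suspicious_tokens(acrobat_line))
```

The heuristic only catches mixed tokens such as "th11"; letter-only misreads such as "hau" or "moat" pass through unnoticed, so it understates the real error count in both outputs.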

Sadly the answer to OneSwitch's question seems to be: There is nothing you can do to improve Acrobat's OCR with the document as-is, this has been an issue for a long time, and you would be better off dropping some more money on something else.

Report · Dec 22, 2021

Astounding results vs… well… the Acrobat OCR engine 😕

Report · Mar 10, 2023

Apparently Gary and Dov Isaacs have been stuck in time for at least two decades. I don't know whether they are professionals or not, but if they are going to give an opinion, they should at least inform themselves about the technologies of our time. Here is a brief free refresher for Gary and Dov: latent semantics, deep learning, recurrent neural networks, pretrained models, transformers, self-supervised learning, large language models, generative pretrained models based on transformers... welcome to our time!

Indeed, the text recognition method Adobe uses is outdated and does not apply the large language models in use today, and therefore it cannot predict anything at all.

Report · Mar 10, 2023

Hi, @institucion288134099in1. Thank you! To put me in the same sentence as Dov, I find a great honor. I have tremendous respect for him.

Actually, I very much agree with you. I've been wondering myself WHEN (not if), AI will be trained on OCR, I think it will be a tremendous feature and is absolutely needed.

Fortunately or unfortunately, Adobe does not create the OCR engine that they use. They license it from the creator. I'm very sorry, but as I sit here, I cannot remember who it is. I did use to know, but I was told that years ago and it has slipped my mind. I say fortunately because that means that this third party can, and maybe already has, started on that. But I have no idea and will probably hear about this technology at the same time that you do. Remember, I do not work for Adobe (Dov does).

If ChatGPT can create whole, logical sentences out of whole thin electrons, I could not imagine how working with "mostly correct" text from the first pass of OCR could not be dramatically improved. The one catch to all of this is that all AI generated "stuff" needs an internet link for the main brain to do its thing. I can only imagine how companies and other institutions would not want their documents to be available to the net.
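The idea of cleaning up "mostly correct" first-pass OCR text can be tried offline with something far simpler than a language model. The sketch below swaps in plain fuzzy matching against a word list, using `difflib` from the Python standard library. The tiny `VOCABULARY`, the function names, and the 0.7 cutoff are all assumptions chosen for illustration; this is not how Acrobat or any other OCR product works.

```python
import difflib

# Assumed, deliberately tiny word list; a real one would be a full dictionary.
VOCABULARY = ["learning", "wonders", "electronic", "market", "games", "toys"]


def correct_token(token: str, cutoff: float = 0.7) -> str:
    """Replace a token containing digits with its closest vocabulary word."""
    if not any(ch.isdigit() for ch in token):
        return token  # leave ordinary words alone
    matches = difflib.get_close_matches(token.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line: str) -> str:
    """Apply correct_token to every whitespace-separated token in a line."""
    return " ".join(correct_token(tok) for tok in line.split())


print(correct_line("mobile toya and l11arning games"))
```

Only tokens that contain digits are touched, so a letter-only misread like "toya" is left as it is. Catching those would need context, which is where a language model could plausibly help.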

But, besides that one issue, I really hope we see this soon!

Report · Mar 10, 2023

Oh, I just had one other thought to add to this: If you've ever worked with Grammarly, you know the frustration of being more correct than that AI.

I recently wrote an article on different kinds of wood (my hobby), and Grammarly could not get past that there is a difference between hard woods and hardwoods. The former is that there are woods that are hard, and the latter are deciduous trees. Throughout the entire time I was writing that article, I constantly had to correct and re-correct the error that Grammarly was making. An AI-OCR could fall into the same hole. (Yes, the same with soft woods and softwoods.)

Report · Feb 17, 2024

I realise this thread is a few years old, but I have found a fantastic PDF app for only £5.49/month:

https://apps.apple.com/gb/app/scan-shot-document-scanner-pdf/id1575194801

In comparison to Adobe, this is an amazing piece of OCR software. It puts Adobe OCR to shame.
I wanted to use Google OCR, but it is only an API service, which seems a bit bananas :banana:

I will show you what it did to the following image:

> ERROR Error: NG0203: takeUntilDestroyed () can only be used main.ts:6

within an injection context such as a constructor, a factory function, a

field initializer, or a function used with `runInInjectionContext` . Find

more at https://angular.io/errors/NG0203

at assertInInjection Context (core.mjs:10367:15)

at takeUntilDestroyed (rxjs-interop.mjs:23:33)

at

7:27)

TakeUntilDestroyComponent.ngOnInit (takeUntilDestroy.component.ts:2

at callHookInternal (core.mjs:4024:14)

at callHook (core.mjs:4051:13)

at callHooks (core.mjs:4006:17)

at executeInitAnd CheckHooks (core.mjs:3956:9)

at refreshView (core.mjs:13513:21)

at detectChangesInView (core.mjs:13663:9)

at detectChangesInEmbedded Views (core.mjs:13606:13)

When I tried this with Adobe, it couldn't extract any text at all and when Adobe OCR does extract text, it adds it to multiple text boxes, littered all over the place, whereas the Scanshot Document Scanner app just creates a single string of text, with paragraph breaks in the correct places.
