Horrendous, inaccurate OCR conversion

January 7, 2021

I took screenshots of a public domain book that had already been digitized with real fonts but was not searchable. The resulting images were better than any scan of paper would have produced. When I exported from PDF to Word, I was horrified to see the OCR results, which were probably worse than the software I used a decade ago. We're talking about maybe 90% accuracy.

I used Nuance AdvancedPDF on the same document. That software is several years old and no longer available, so it's not a competitive product any more. The conversion accuracy was in the high 99% range, near 100%. The only thing it missed was a consistent typeface, so some pages came out as bold text rather than plain text, which is fine and easily corrected. Some pages had a slightly wrong font. Also easily corrected. But the character recognition was near 100%, as it should have been.

So after spending all this money on the Adobe version because it's supposed to be the standard setter, why is the OCR conversion engine so horrible to the point of being unusable?

Bevi Chagnon - PubCom.com

Legend

I work in the accessibility field where we convert scanned and printed documents to accessible PDFs for those who use screen readers, so we OCR a lot. Some of our blind testers do this every day, in fact.

Our recommendation is to use ABBYY FineReader https://pdf.abbyy.com/. It's the easiest and most accurate OCR program on the market.

Bevi Chagnon | Designer, Trainer, & Technologist for Accessible Documents | PubCom | Classes & Books for Accessible InDesign, PDFs & MS Office

val1d

Participant

Over 4 years later... I want to thank you from the bottom of my heart for this comment. You have saved me so many headaches!

gary_sc

Community Expert

Hi Dick,

Your original supposition is not quite right.

The OCR process needs input that is as good as it can get to achieve any level of accuracy. By definition, any screenshot will have a resolution of anywhere from 90-ish ppi to 120-ish ppi. To achieve good OCR results you need to start with around 300 ppi, and the more the merrier.

Why is this? Think of letter combinations like "ri" being seen as "n" or "ni" being seen as "m", for quick examples off the top of my head. Then you can add the size of the fonts to the equation (bigger is better).
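To put the resolution point in pixel terms, here is a rough sketch. It assumes 10 pt body text (the thread does not state the font size) and uses the standard definition of 1 point = 1/72 inch; the function names are only illustrative.

```python
def em_height_px(point_size: float, ppi: float) -> float:
    """Height of the font's em square in pixels (1 pt = 1/72 inch)."""
    return point_size * ppi / 72.0


def zoom_needed(capture_ppi: float, target_ppi: float = 300.0) -> float:
    """Zoom factor that makes a screen capture equivalent to target_ppi."""
    return target_ppi / capture_ppi


if __name__ == "__main__":
    # 10 pt is an assumed body size, not a figure from the thread.
    for ppi in (96, 120, 300):
        print(f"{ppi:>3} ppi: 10 pt text has an em height of "
              f"{em_height_px(10, ppi):.1f} px")
    print(f"Zoom needed from a 96 ppi capture: {zoom_needed(96):.2f}x")
```

Under these assumptions, 10 pt text has an em height of roughly 13 px at 96 ppi and roughly 42 px at 300 ppi, so the difference between "ri" and "n" rests on far fewer pixels in a screenshot.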

I've been using a variety of OCR software for over 20 years, and I know the problems and I know the solutions. Most people do not even realize that Acrobat cannot do any scanning at all. Rather, it uses TWAIN to link to your scanner's software to do the work. (That's on a PC, where TWAIN is still available. On the Mac, TWAIN was dropped because it's a good access route for viruses, so the Mac uses Image Capture, which is written by Apple and is the worst scanning software in existence. But that's a different issue.)

Yes, you can perform OCR on screenshots (I've done it), but you can't seriously expect great results. However, if you must get the stuff on the screen to be accurate and want to spend the time, zoom into the page, take that screenshot, scroll, repeat, scroll, repeat, etc., and you will get greater accuracy. But that's going to take time.

As far as your other software goes, I can only ask how reliable it is. On how many different screenshots does it reach that level of accuracy?

Look, I do not work for Adobe; I try to help people with their problems in these forums. Do you have an actual question that I can try to help you with?

Dick Hertz (Author)

Participating Frequently

I appreciate your response but, respectfully, my suspicions are absolutely accurate. I'm making an apples to apples comparison of OCR capabilities between Adobe Acrobat 2020 and old software (Nuance Advanced PDF both versions 1 and 2) which beats the pants off Acrobat with the same source material. In fact, the OCR is so poor (less than 90% on text that is clearly sufficient) that I've decided I need to purchase dedicated OCR software and eat the big loss of Acrobat 2020 unless I can understand why it's so terrible at character accuracy. What Acrobat was very good with was formatting - such as bold/normal characters and layout. And quite frankly, that's so peripheral to actual character recognition accuracy that it makes me wonder whether Adobe spent much time at all improving its OCR engine in the past decade.

Dick Hertz (Author)

Participating Frequently

I can't seem to edit my post, and was going to say that I agree with you regarding 300 DPI, but only for scanned text. This isn't scanned text; it's pure screenshots of clean text. There isn't a great deal of fuzz, which is why the old competitor software is over 99% accurate and virtually perfect on standard pages.

But as I mentioned, all three software packages are using the same input source. Acrobat clearly fails the OCR part by a substantial margin. With industry leading software like this, there shouldn't be such poor results. Thanks for your insights, which I do appreciate.

Examples:

strategic thinking and planning

Acrobat: srr:1teg ic thi nki ng and plann ing

scenario planning

Acrobat: .ft'ennrio pln1111i11g

Advanced PDF picked it up perfectly. Acrobat 2020 failed miserably. The italics are more challenging but still fine for most OCR software, and both versions of the old software had no problem. I'm at a loss as to why the first words failed. While I'd wonder whether Acrobat is converting images into a poor format before OCR, I used the same file and got identical results. Adobe Acrobat 2020 OCR is just performing much worse than the competition. Hopefully there is some identifiable fix.
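For anyone who wants to put a number on examples like these, here is a minimal sketch that scores character accuracy using Levenshtein edit distance. The metric and the function names are assumptions chosen for illustration; the thread does not say how the 90% and 99% figures were estimated.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """1 minus the character error rate, clamped so it never goes below 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(truth, ocr) / len(truth))


if __name__ == "__main__":
    # The two source/output pairs quoted in the post above.
    samples = [
        ("strategic thinking and planning",
         "srr:1teg ic thi nki ng and plann ing"),
        ("scenario planning", ".ft'ennrio pln1111i11g"),
    ]
    for truth, ocr in samples:
        print(f"{truth!r}: {char_accuracy(truth, ocr):.0%}")
```

Running the same function over the output of each OCR package, against a hand-checked transcription of a few pages, would make the comparison less dependent on cherry-picked phrases.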
