OCR not recognising words not in dictionary

Question

Acrobat XI Pro. I ran the OCR on some scanned old documents. The documents are typed and well scanned (clear images).

When I search for text it finds most words and numbers but not surnames or place names (probably other words too).

I assumed that if it didn't find a word in the dictionary it would just copy the letters, but that doesn't seem to be the case.

Example is Lieutenant Bartlett, Hinaidi, Iraq. A search will find Lieutenant but not Bartlett. It will find Iraq but not Hinaidi (a place in Iraq) although it will find words with "hin" in them.

As a test I typed a sentence in a Word doc, scanned it and converted it to PDF. This is a nice crisp scan, with clear, large letters. I did the OCR and tried searching. Same result: it finds my first name (as that is common) but not my surname, and it will find the name of the city I live in but not the name of the road.

I don't expect it to "know" my surname, or the name of the road, but surely it should just copy the individual letters to make up the word?

gary_sc · Answer

Hi, @Tim33471290e2k8. As someone who's done OCR processing since the early 90s, I've found its progress interesting, fun, and frustrating.

About the best improvement has been the OCR's ability to see what needs to be OCRed. Early on, one needed to zone out each section of a page manually to let the OCR software work on "THAT" region. Now, all that manual work has been removed. Ironically, a number of users would like to have that functionality back, especially when there is an image on a page and its text is sideways. In that case the OCR software does not know which way it should rotate the page: to accommodate the document's text or the image's text.

Unfortunately, a lot of the other aspects of text recognition remain bogged down, and the software still fails to recognize some words. Confusing a "1" (one) with an "l" (ell) is one common mistake, so "world" may be seen as "wor1d." Or the letter pair "in" may be seen as "m." Another area of OCR frustration is words hyphenated at the end of a line. The word "difference," for example, might be cut into two pieces that each look like a misspelled word ("dif-" and "ference").
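To make those failure modes concrete, here is a minimal Python sketch. It is not anything Acrobat does internally; it assumes you have already exported the OCR text layer from the PDF as plain text. The confusion table and the function names `ocr_variants` and `join_hyphenated` are my own, for illustration only.

```python
import re
from itertools import product

# Assumed confusion pairs, taken only from the examples above; extend as needed.
CONFUSIONS = {"l": ["l", "1"], "in": ["in", "m"]}

def ocr_variants(word):
    """Return the spellings an OCR engine might have produced for `word`."""
    # Tokenize so that the pair "in" is kept together; everything else is one character.
    tokens = re.findall(r"in|.", word)
    choices = [CONFUSIONS.get(tok, [tok]) for tok in tokens]
    return {"".join(combo) for combo in product(*choices)}

def join_hyphenated(text):
    """Rejoin a word that was split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print(sorted(ocr_variants("world")))      # ['wor1d', 'world']
print(sorted(ocr_variants("Hinaidi")))    # ['Hinaidi', 'Hmaidi']
print(join_hyphenated("dif-\nference"))   # difference
```

If a search for "Hinaidi" finds nothing, searching for each variant in turn may show what the OCR actually stored. Note that `join_hyphenated` will also merge genuinely hyphenated compounds that happen to break at a line end, so treat its output as a search aid rather than corrected text.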

These issues are mostly caused by poor-quality scans, small fonts, or scans at too low a resolution. I know you say your documents are "…well scanned," but there are such things as "good scans" and "great scans."
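Font size and resolution work together, and a rough calculation shows why. The sketch below relies only on the fact that a point is 1/72 of an inch; the function name `em_height_px` is hypothetical, and the result is the nominal height of the font's em box, so the letters themselves are somewhat smaller.

```python
def em_height_px(font_size_pt, scan_dpi):
    """Nominal em-box height in pixels for a given font size and scan resolution."""
    return font_size_pt / 72 * scan_dpi

for dpi in (150, 300, 600):
    print(dpi, round(em_height_px(10, dpi), 1))
# 150 20.8
# 300 41.7
# 600 83.3
```

Doubling the resolution doubles the number of pixels the OCR engine has to work with for each character, which is why small type benefits most from a higher-resolution scan.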

I go into these issues in a blog post I wrote for Adobe a number of years ago. It's here if you wish to peruse it:

https://community.adobe.com/t5/adobe-community-professionals/scanning-clean-searchable-pdfs/m-p/4785435?page=1#M89

I write all this to let you know you are probably not doing anything wrong, and the OCR software that Adobe licenses is probably doing the best it can. What REALLY is necessary, and I'm amazed it is not being done yet, is to combine AI and OCR. My copy of Grammarly seems to do a fairly good job of understanding the context of a whole sentence, and it could easily deal with letters that don't make sense together as a word, but only up to a point. (It fails to understand, for example, that "balsa wood is a soft wood but it's not a softwood." [Note: Balsa wood is considered a hardwood because it's deciduous. All deciduous trees are hardwoods.])
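Until something like that exists, a fuzzy search over the extracted text is a much cruder stand-in that can still turn up near-misses. This is only a sketch using Python's standard `difflib` module. It assumes the OCR text has been exported from the PDF as plain text, and the function name `fuzzy_find` and the 0.7 cutoff are my own choices, not anything from Acrobat.

```python
import difflib
import re

def fuzzy_find(query, ocr_text, cutoff=0.7):
    """Return words in `ocr_text` that closely resemble `query`, best match first."""
    # Map lowercased words back to their original spelling in the text.
    lowered = {w.lower(): w for w in re.findall(r"\w+", ocr_text)}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=5, cutoff=cutoff
    )
    return [lowered[m] for m in matches]

# Hypothetical OCR output with the kinds of errors described above.
text = "Lieutenant Bart1ett, Hmaidi, Iraq"
print(fuzzy_find("Bartlett", text))   # ['Bart1ett']
print(fuzzy_find("Hinaidi", text))    # ['Hmaidi']
```

A lower cutoff finds more damaged words but also returns more false matches, so it is worth tuning against a page you have checked by eye.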

Hopefully, I've stated something that gives you a better handle on the limitations of OCR. You are certainly welcome to try other OCR packages and you might find some of them do a better job of OCR-ing for your needs. I do wish you luck!
