OCR not recognising words not in dictionary
Acrobat XI Pro. I ran the OCR on some scanned old documents. The documents are typed and well scanned (clear images).
When I search for text it finds most words and numbers but not surnames or place names (probably other words too).
I assumed that if it didn't find a word in the dictionary it would just copy the letters, but that doesn't seem to be the case.
An example is "Lieutenant Bartlett, Hinaidi, Iraq". A search will find Lieutenant but not Bartlett. It will find Iraq but not Hinaidi (a place in Iraq), although it will find words with "hin" in them.
As a test I typed a sentence in a Word doc, scanned it and converted it to PDF. This is a nice crisp scan, with clear, large letters. I did the OCR and tried searching. Same result: it finds my first name (as that is common) but not my surname, and it will find the name of the city I live in but not the name of the road.
I don't expect it to "know" my surname, or the name of the road, but surely it should just copy the individual letters to make up the word?
Hi, @Tim33471290e2k8, as someone who's done OCR processing since the early 90s, I've found its progress interesting, fun, and frustrating.
About the best improvement has been the OCR's ability to see what needs to be OCRed. Early on, one needed to zone out each section of a page manually to let the OCR software work on "THAT" region. Now, all that manual work has been removed. Ironically, a number of users would like to have that functionality back, especially when there is an image on a page and its text is sideways. In that case the OCR software does not know which way it should rotate the page: to accommodate the document's text or the image's text.
Unfortunately, a lot of the other dynamics of text recognition remain bogged down, and the software is still unable to recognize some words. Confusing a "1" (one) with an "l" (ell) is one common mistake, so "world" may be seen as "wor1d." Or the letter pair "in" may be seen as "m." Another area of OCR frustration is words hyphenated at the end of a line. The word "difference," for example, might be cut into two pieces that are treated as two misspelled words ("dif-" and "ference").
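To make those failure modes concrete, here is a minimal Python sketch of how one might work around them when searching OCR output. It is not how Acrobat's OCR or search works; the function names `ocr_variants` and `rejoin_hyphenated` and the `CONFUSIONS` table are hypothetical, and the table contains only the two confusions mentioned above.

```python
import itertools
import re

# Assumed confusion table, limited to the examples above:
# "l" (ell) read as "1" (one), and the pair "in" read as "m".
CONFUSIONS = {"l": ["1"], "in": ["m"]}


def ocr_variants(word, confusions=CONFUSIONS):
    """Return the set of spellings OCR might have produced for `word`."""
    keys = sorted(confusions, key=len, reverse=True)  # try longer keys first
    options = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                # Keep the correct reading plus each possible misreading.
                options.append([key] + confusions[key])
                i += len(key)
                break
        else:
            options.append([word[i]])
            i += 1
    # Every combination of correct and misread pieces is a candidate.
    return {"".join(combo) for combo in itertools.product(*options)}


def rejoin_hyphenated(text):
    """Join words split by a hyphen at a line end ("dif-\\nference")."""
    # Caveat: this also joins genuinely hyphenated compounds that
    # happen to break across a line.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


print(sorted(ocr_variants("world")))    # ['wor1d', 'world']
print(sorted(ocr_variants("Hinaidi")))  # ['Hinaidi', 'Hmaidi']
print(rejoin_hyphenated("the dif-\nference"))  # the difference
```

Searching for each variant in turn is a crude way to check whether a missing surname or place name was actually recognized under a slightly wrong spelling. The number of variants grows quickly for long words, so this only suits short search terms.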
Mostly, these issues are caused by poor-quality scans, small fonts, or scans at too low a resolution. I know you say your documents were "…well scanned," but there are such things as "good scans" and "great scans."
I go into these issues in a blog post I wrote for Adobe a number of years ago; it's here if you wish to peruse it:
I write all this to let you know you are probably not doing anything wrong, and the OCR software that Adobe licenses is probably doing the best it can. What REALLY is necessary, and I'm amazed it is not being done yet, is to combine AI and OCR. My copy of Grammarly seems to do a fairly good job of understanding the context of a whole sentence, and could easily deal with letters that it doesn't understand together as a word, but only up to a point. (It fails to understand, for example, that "balsa wood is a soft wood but it's not a softwood." [Note: balsa is classed as a hardwood because it comes from a broad-leaved, flowering tree rather than a conifer.])
Hopefully, I've stated something that gives you a better handle on the limitations of OCR. You are certainly welcome to try other OCR packages and you might find some of them do a better job of OCR-ing for your needs. I do wish you luck!
Thanks @gary_sc for the comprehensive reply, much appreciated. To clarify: when I typed a sentence in Word, I actually then pasted it into Paint so that I could save it as a JPEG. I then converted that to PDF, so the text is pristine, perfect. It does seem very odd behaviour for it not to piece together a word from individual letters from a perfect source.
My version of Acrobat Pro is pretty old so maybe I will see if I can have better luck with newer OCR software.
Hi, @Tim33471290e2k8, first, a quick comment: you will do much better with a TIF document than a JPG. The JPG format is lossy and introduces degradation. The amount depends upon how much compression is applied at the time of saving, but if you save one document multiple times, the degradation adds up on each save. That does not happen with a TIF document, because TIF is normally saved with lossless compression (or none).
But I do have a question for you: why are you writing in Word and then pasting the text into Paint? Why not just convert the Word document itself into a PDF?
If there are images or such in your Paint document that you want your text to work around, you'd be better off placing them in the Word document and letting the text wrap around the images in Word.
I know it can be a touchy issue when questioning someone's workflow, but I am curious about this one.
Because when I converted the Word doc to PDF, the text was already readable (searchable). I was trying to replicate running OCR on a picture containing text, since the project documents where the problem surfaces are scanned. This proved to me that the problem is not with the quality of the old scanned documents.
OK, I think I see where you are going, but now I have to ask if the text was on top of the images or by their side.
Do you mind sharing with me privately (via DM) one of these pages? You really have me intrigued.