What to do when OCR doesn’t work?

Report · Dec 20, 2023

Dear all,

As part of a music research project (on a 19th-century composer), I am analysing about 100 PDF files of a magazine that was published between about 1780 and 1880. These PDFs were made from scans of the original copies held in a library.

My idea is to search each document for the name of the composer, extract the articles, and continue my research. This is crucial because each PDF is 500+ pages long, and in German, which I don't understand comfortably.

On a first attempt, Cmd-F > composer-name > Return gives no result, but neither does any other word, a sign, I believe, that the PDF is not searchable and the text has not been recognised. I then ran the Recognise Text feature from the Scan & OCR tool and tried again. Given the general success I had with a few PDFs, I assumed this was working, and assumed that PDFs showing zero results had no mention of this composer (since the number of PDFs with no results was less than or equal to the number of PDFs with results).

Today, while Recognise Text was running, I saw the composer's name pass by, but when I searched for it, Acrobat returned no results. Searching for the first 2 or the first 4 letters of the surname failed as well, while the first 3 letters succeeded in returning some entries, highlighting the full name!!

It is crucial that I do not have to go page by page looking for this name; it could take 30 years!

Is there something I can do to set up Recognise Text so that it is actually reliable? I am setting the output to "Searchable Image" (which is the default for me). If not, is there any other reliable tool to achieve this?

Sometimes Recognise Text turned a 200MB document into a 1.6GB PDF without, in the end, improving the OCR at all.

Any help is greatly appreciated.

I am using macOS, and while I have access to a 2016 MBP with a 6th-gen Intel i7 running Monterey and a new 2023 MBP with an M3 Max (14/30), apart from the speed of the analysis I am not seeing different results in the end.

Thank you

Report · Dec 20, 2023

Hi, @Inélsòre, interesting project.

Can you share one of the pages that you're working with? If you want, you can DM with the page. I've been scanning and OCRing for over 25 years. Before I make some suggestions, I'd like to see what you're working with.

I gather you are receiving these pages already pre-PDFed and you are not doing any of the scanning?

Let me know if you can send me a representative PDF.

Report · Dec 20, 2023

I gladly will, since these PDFs are stored in a digital library and so are basically accessible to anyone.

Just, I will do it tomorrow when I'm back at the Mac, 1AM here 🙂

Thanks!

Report · Dec 21, 2023

Here is the file (bigger than 47MB): https://www.dropbox.com/scl/fi/9pilofuu4xdlt4367osiu/AMZ-Vol-21-1819.pdf?rlkey=de3b8hmkld0vmyukgx4jb...

The word to look for is "Dotzauer". In my experience with this document, this yields 0 results after text recognition is run, while looking for "Dot" gives me 2 results. I wonder if there are any others.

Report · Dec 21, 2023

Hi, @Inélsòre, I have some sad news for you: this is not going to happen.

Let me explain: as amazing as OCR is, there are limitations, and your text is hitting each and every limitation there is. The original documents are not clean (there are speckles and irregularities everywhere), they are not consistent in shading (one area may be lighter or darker than other areas), the scans are of insufficient resolution, and the printing has poor kerning between some letter groups.

Before I show you what I'm referring to above, let me explain some of the limitations of OCR. Simple letter pairs can confuse it: "in" can be read as "m," or "ri" as "n." A higher-resolution scan can improve this. Normally, if the copy is good, 300 ppi is sufficient. When the text is small or the letters are set close together (tight kerning), 600 ppi can help, but beyond a certain point, more resolution won't make a difference.
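To give a rough feel for why resolution matters, here is a small Python sketch of the arithmetic. It only assumes that one point is 1/72 of an inch; the 8 pt type size is a made-up example, not a measurement from this magazine, and the function name is just for illustration.

```python
def glyph_height_px(point_size: float, ppi: int) -> float:
    """Approximate height in pixels of type set at `point_size` when scanned at `ppi`.

    Point size measures the full body of the type, so individual
    lowercase letters will be noticeably smaller than this figure.
    """
    points_per_inch = 72  # 1 pt = 1/72 inch
    return point_size / points_per_inch * ppi


# Hypothetical example: small 8 pt type at three common scan resolutions.
for ppi in (150, 300, 600):
    height = glyph_height_px(8, ppi)
    print(f"{ppi} ppi: 8 pt type is about {height:.0f} px tall")
```

Doubling the ppi doubles the number of pixels available to separate an "in" from an "m," which is why 600 ppi can help with small or tightly set type. It cannot recover detail that was never on the page.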

Here is a screenshot of a sample of one page I worked with (for the record, this is page 15 of the total manuscript):

When I saw the quality of the document, I knew there would be issues, but I wanted to try and see what could be done with it. I extracted the single page and opened it in Photoshop. I then converted it to a grayscale image. Next, I opened the Adobe Camera Raw filter and used an ACR linear gradient mask to darken the bottom right of the page, as it was lighter than the rest. Then, I opened a "Levels" adjustment to make the non-text regions white and the text regions black. This gave me the following result:
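For anyone who would rather script this than use Photoshop, the grayscale and Levels steps can be sketched in plain Python. This is only an approximation of the idea, not what Photoshop does internally: the function names, the luminance weights, and the black and white points (80 and 180) are assumptions chosen for illustration, and the gradient correction for uneven shading is left out.

```python
def to_gray(rgb):
    """Convert one (R, G, B) pixel to a single 0-255 brightness value."""
    r, g, b = rgb
    # Common luminance weights; an assumption, other weightings exist.
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def apply_levels(gray, black_point, white_point):
    """Push values at or below black_point to 0 and at or above
    white_point to 255, stretching everything in between linearly."""
    if white_point <= black_point:
        raise ValueError("white_point must be greater than black_point")
    scaled = (gray - black_point) * 255 / (white_point - black_point)
    return max(0, min(255, round(scaled)))


def clean_page(pixels, black_point=80, white_point=180):
    """Apply grayscale conversion and levels to a 2D grid of RGB pixels."""
    return [
        [apply_levels(to_gray(p), black_point, white_point) for p in row]
        for row in pixels
    ]


# Tiny made-up "page": yellowed paper next to faded ink.
page = [[(200, 190, 170), (60, 55, 50)]]
print(clean_page(page))
```

In practice you would load the scan with an image library and pick the black and white points by looking at the page's histogram, exactly as you would in the Levels dialog.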

Looking at this, it's easy to see that many letters encroach upon the adjacent letters. Also, you can see where the quality of the print itself is, well, poor. If I select an arbitrary word, such as Jungfrauen (3rd line down), and copy and paste what the OCR pulled out, you get "Jungf 11auen." So, if you were searching for Jungfrauen, you'd never find it.

BTW, if I OCRed the page as it was without any Photoshop enhancement, I got "Jungf1·nue1."
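One partial workaround, if you can export the recognised text, is to search it approximately instead of exactly. The sketch below is my own illustration using Python's standard `difflib` module; it is not something Acrobat offers, and the function name, the 0.75 threshold, and the choice to join up to two neighbouring tokens (because OCR often inserts stray spaces) are all assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_find(text, target, threshold=0.75, max_tokens=2):
    """Return (snippet, score) pairs for runs of tokens that resemble `target`.

    Neighbouring tokens are joined without spaces before comparing,
    so a word that OCR split in two can still be matched.
    """
    target = target.lower()
    tokens = text.split()
    hits = []
    for i in range(len(tokens)):
        for span in range(1, max_tokens + 1):
            if i + span > len(tokens):
                break
            candidate = "".join(tokens[i:i + span]).lower()
            score = SequenceMatcher(None, target, candidate).ratio()
            if score >= threshold:
                hits.append((" ".join(tokens[i:i + span]), round(score, 3)))
    return hits


# The two OCR readings quoted above, embedded in made-up surrounding words.
print(fuzzy_find("die Jungf 11auen sangen", "Jungfrauen"))
print(fuzzy_find("die Jungf1·nue1. sangen", "Jungfrauen", threshold=0.6))
```

Lowering the threshold catches worse OCR output but also returns more false hits, so every result still has to be checked by eye against the page image.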

To put this into perspective, in my mom's later years, she decided she'd write the family history for my sister and me. She had an electric typewriter, but the platen was old and didn't grab the paper as well as one would hope, so occasionally, the lines were not parallel and were tipped at an angle a bit. The ribbon was old, so the letters were not as dark as they could have been. Suffice it to say that while the OCR did OK on more of this document than I thought it would, there were whole paragraphs that were easier to retype than to try and fix the results of the OCR.

Sadly, as this document stands, there is no hope for Acrobat or any OCR product to do what you want. If you had access to the original documents to do a proper scan*, you'd succeed better. But the best you can do here is read it with your eyes. This whole thing shows how marvelously our eyes and brain interpret what we see, so much better than a computer can. And no, I know of no AI work being done on OCR (and oh, how much I wish it were).

* Some of the "softness" seen in the letters in my Photoshopped version is caused by the processes I did to whiten the page. If you did those same things in the scanning software, the letters would have been cleaner and sharper, providing better success in the OCR. How much so, it's impossible to tell.

Report · Dec 21, 2023

Thank you so much Gary, this is such a great explanation!

I wonder where I could learn about these techniques for improving an image to prepare it for OCR, since I doubt it is something one finds in the Photoshop handbook!

On the bright side of life, the library holding these books replied this morning, saying that, if I made one mistake, it was to download the PDFs instead of using their online search tool. The online viewer shows the TIFFs at their full resolution and they've already been OCR'd, which, at least, is returning much better results!

I should be able to find already quite a lot of info there for now.

Concerning the AI tools, while I would not feel comfortable following any of these tutorials, as I'm not comfortable with coding (though it seems I should be, sooner or later), I found these while browsing:

- https://ocr-d.de/en/setup

- https://gitlab.com/scripta/escriptorium/-/wikis/full-install

- https://www.ocr4all.org/guide/setup-guide/macos

- https://readcoop.eu/transkribus

- https://programminghistorian.org/en/lessons/ocr-with-google-vision-and-tesseract#:~:text=Google%20Cl...

Since you're an expert in the field, you may want to enlighten us on whether any of this is good!

PS: maybe I should also try to download a single image at full res and try the OCR then, to see if it is indeed better.

Report · Dec 21, 2023

Hi, @Inélsòre, if you want to get into scanning, you might find some information on how to get better quality scans from this blog post I wrote for Adobe several years ago. I'd point out that since I wrote it, scanner manufacturers have been "simplifying" their software and dumbing it down. The controls I point out in this article are not always available nowadays. So, if you're looking to buy a new scanner, I'd download the manual of the scanner you're considering and look to see what controls it has. Also, sad to say, but you get what you pay for. There IS a big difference between a $100 scanner and a $600 scanner. I have an Epson V800 that isn't cheap, but I can't blame my scanner for any mistakes I make! :>)

Also, it's better to get a less expensive scanner with a CCD sensor than the most expensive CIS sensor scanner. (Although any CCD scanner will be more expensive than a CIS scanner. What I'm trying to say here is that a top end CIS is not as good as a low end CCD scanner.)

I've not had a chance to read your links yet; give me some time, and I'll get back to you on them.

https://community.adobe.com/t5/adobe-community-professionals/scanning-clean-searchable-pdfs/m-p/4785...