Inspiring
December 20, 2023
Question

What to do when OCR doesn’t work?

  • 1 reply
  • 6499 views

Dear all,

As part of a music research project (on a 19th-century composer), I am analysing about 100 PDF files of a magazine that was published between 1780 and c. 1880. These PDFs are made from scans of the original copies held in a library.

My idea is to search each document for the name of the composer, extract the articles, and continue my research. This is crucial because each PDF is 500+ pages long, and in German, which I don't understand comfortably. 

On a first attempt, Cmd-F > composer-name > Return gave no result, but neither did any other word, which I take as a sign that the PDF is not searchable and its text is not recognised. I then ran the Recognise Text feature from the Scan & OCR tool and tried again. Given the general success I had with a few PDFs, I assumed this was working, and that PDFs showing zero results had no mention of this composer (since there were no more PDFs without results than PDFs with results).

Today, while Recognise Text was running, I saw the name of the composer passing by, but when I searched for it, Acrobat returned no result. Searching for the first 2 or the first 4 letters of the surname failed as well, while the first 3 letters did return some entries, and the highlight covered the full name!
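A search that matches on three letters but not on the full name suggests that the hidden text layer contains recognition errors, so an exact search misses words that look correct on the page. One way around this is to search the extracted text approximately instead of exactly. The following is only a minimal sketch, assuming the OCR text of each page has already been exported to a plain string. The `pages` list, the `fuzzy_find` helper, the 0.75 threshold, and the composer name "Beethoven" are all hypothetical placeholders for illustration, not anything taken from Acrobat or from the magazine.

```python
from difflib import SequenceMatcher


def fuzzy_find(pages, name, threshold=0.75):
    """Return (page_number, word, score) for words that resemble `name`.

    `pages` is a list of strings, one per page of exported OCR text
    (an assumption: how that text is exported is up to the OCR tool).
    """
    target = name.lower()
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for word in text.split():
            # Drop surrounding punctuation before comparing.
            candidate = word.strip(".,;:!?()\"'").lower()
            if not candidate:
                continue
            # ratio() is 1.0 for identical strings and falls towards 0.0
            # as the two strings share fewer characters in order.
            score = SequenceMatcher(None, target, candidate).ratio()
            if score >= threshold:
                hits.append((page_number, word, round(score, 2)))
    return hits


# Hypothetical sample text; the second page contains a misread letter.
pages = [
    "Herr Beethoven spielte gestern",
    "ein Concert von Beethouen in Wien",
    "Nachrichten aus Leipzig",
]

for hit in fuzzy_find(pages, "Beethoven"):
    print(hit)
```

Lowering the threshold catches more badly garbled spellings at the cost of more false positives, so the results would still need a quick check against the page image.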

It is crucial that I do not have to go page by page looking for this name; it could take 30 years!

Is there something I can do to set up Recognise Text so that it is actually reliable? I am setting the output to "Searchable Image" (which is the default for me). If not, is there any other reliable tool to achieve this? 

Sometimes Recognise Text turned a 200 MB document into a 1.6 GB PDF, without in the end improving the OCR at all.

Any help is greatly appreciated. 

I am using macOS. I have access to a 2016 MBP with a 6th-gen Intel i7 running Monterey and a new 2023 MBP with an M3 Max (14/30), but apart from the speed of the analysis I am not seeing different results between them.

Thank you

1 reply

gary_sc
Community Expert
December 20, 2023

Hi, @Inélsòre, interesting project.

Can you share one of the pages that you're working with? If you want, you can DM with the page. I've been scanning and OCRing for over 25 years. Before I make some suggestions, I'd like to see what you're working with.

I gather you are receiving these pages already pre-PDFed and you are not doing any of the scanning?

Let me know if you can send me a representative PDF.


Inspiring
December 20, 2023

I gladly will, since these PDFs are stored in a digital library and so are accessible to anyone.

I will just do it tomorrow when I'm back at the Mac; it's 1 AM here 🙂

Thanks!