What to do when OCR doesn’t work?
Dear all,
As part of a musical research project (on a 19th-century composer), I am analysing about 100 PDF files of a magazine that was published between 1780 and ca. 1880. These PDFs are made from scans of the original copies held in a library.
My idea is to search each document for the composer's name, extract the relevant articles, and continue my research from there. This is crucial because each PDF is 500+ pages long and in German, which I do not understand comfortably.
On a first attempt, Cmd-F > composer-name > Return gives no results, but neither does any other word, a sign, I believe, that the PDF is not searchable and the text is not recognisable. I then ran the Recognise Text feature from the Scan & OCR tool and tried again. Since this worked on a few PDFs, I assumed it was working in general, and that PDFs showing zero results simply did not mention this composer (the number of PDFs with no results was no greater than the number of PDFs with results).
Today, while Recognise Text was running, I saw the composer's name pass by on screen, but when I searched for it, Acrobat returned no results. Searching for the first 2 or the first 4 letters of the surname failed as well, while searching for the first 3 letters returned some entries, highlighting the full name!
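I do not know what causes this, but to illustrate the kind of tolerant search I am after: a minimal Python sketch, assuming the recognised text of each page could be exported to plain text (I have not confirmed the best way to do that from Acrobat). It uses the standard library's `difflib` to flag words that merely resemble the surname, so a single misread letter would not hide a mention. The function name `fuzzy_hits`, the threshold of 0.8, the sample sentences and the surname "Schumann" are all hypothetical placeholders, not taken from my actual files.

```python
import re
from difflib import SequenceMatcher


def fuzzy_hits(pages, surname, threshold=0.8):
    """Return (page_number, word, score) for every word resembling `surname`.

    `pages` is a list of plain-text strings, one per page (assumed to have
    been exported from the OCR layer beforehand).
    """
    target = surname.lower()
    hits = []
    for page_number, text in enumerate(pages, start=1):
        # \w+ matches runs of letters/digits, including umlauts and ß.
        for word in re.findall(r"\w+", text):
            score = SequenceMatcher(None, target, word.lower()).ratio()
            if score >= threshold:
                hits.append((page_number, word, round(score, 2)))
    return hits


# Hypothetical sample: page 2 contains an OCR-damaged spelling ("rn" for "m").
pages = [
    "Nachrichten aus Leipzig und Wien.",
    "Ein Werk von Herrn Schurnann wurde gegeben.",
]

for page_number, word, score in fuzzy_hits(pages, "Schumann"):
    print(page_number, word, score)
```

With a lower threshold more misreadings would be caught, at the cost of more false positives to check by hand, so the value would need tuning on a few pages where the name is known to appear.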
It is crucial that I do not have to go page by page looking for this name; that could take 30 years!
Is there anything I can do to set up Recognise Text so that it is actually reliable? I am setting the output to "Searchable Image" (which is the default for me). If not, is there any other reliable tool to achieve this?
Sometimes Recognise Text turned a 200 MB document into a 1.6 GB PDF without improving the OCR at all in the end.
Any help is greatly appreciated.
I am using macOS. I have access to a 2016 MBP (6th-gen Intel i7) running Monterey and a new 2023 MBP with an M3 Max (14/30), but apart from the speed of the analysis I am not seeing any difference in the results.
Thank you
