Manually entering and deleting OCR.

Dec 16, 2023

I am in the process of attempting to add OCR to several books of scanned sheet music.

The OCR software tries to make sense of the manuscript itself, and the result is gobbledygook. This presents a nightmare when trying to then use the 'correct recognised text' function. It also reads every instance of the word 'swing' as 'swin9', for example. On many PDFs it completely misses the title of the music, which is the entire reason I'm adding OCR to the books in the first place.

So, my wishlist:

- that the OCR software can be assigned to look only at the top, say, 3cm of a scan, or alternatively that OCR text can be mass deleted from a page.

- that OCR 'correct recognised text' can be assigned to change ALL instances of 'swin9' to 'swing' etc.

- that OCR text can be manually entered where the software has missed it.

Am I asking too much for 2023? How would one with more experience using Acrobat achieve what I'm after?

All help would be much appreciated!

Dec 17, 2023

Wow, I'd call that funny if it weren't so sad.

To try and help you, I do need a bit of information. Your comment about "swing" being turned into "swin9" is not uncommon. Things that can cause that are "arty fonts," small fonts, or scanning at too low a resolution. Also, poor-quality scanners can contribute to this. Lastly, you mention these are from books; how are you dealing with page curl near the book's binding?
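In the meantime, a recurring misreading like "swin9" can be cleaned up in bulk once the recognised text has been exported from the PDF. Here is a minimal Python sketch of that idea. It assumes the text is already available as a plain string, it does not touch Acrobat itself, and the correction table and function name are made up for illustration:

```python
import re

# Hypothetical correction table: OCR misreading -> intended word.
CORRECTIONS = {"swin9": "swing"}


def fix_misreads(text, corrections=CORRECTIONS):
    """Replace every whole-word occurrence of each known misreading."""
    for wrong, right in corrections.items():
        # \b keeps the match to whole words; IGNORECASE also catches "Swin9".
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)

        def keep_case(match, right=right):
            # Preserve a leading capital so titles stay capitalised.
            return right.capitalize() if match.group(0)[0].isupper() else right

        text = pattern.sub(keep_case, text)
    return text


print(fix_misreads("Medium Swin9, with a swin9 feel"))
```

More misreadings can be added to the table as they turn up, so one pass fixes all of them across a whole book.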

As far as the issue of the notes being turned into text, that's a tough one. Many years (25?) ago, I used to use some OCR software where you could pre-map out what was an image and what was text. The ability of OCR software to recognize what was text and what wasn't pretty much eliminated the need for that (well, mostly, as seen in your situation).
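That old zoning idea can still be approximated by hand outside Acrobat: crop each scanned page to the strip that holds the title and run OCR on that strip only. The sketch below covers just the arithmetic for the crop box. The page size in pixels is an assumed example, the function name is hypothetical, and the box uses the (left, upper, right, lower) order that image libraries such as Pillow expect for cropping:

```python
CM_PER_INCH = 2.54


def top_strip_box(page_width_px, page_height_px, ppi=600, strip_cm=3.0):
    """Return a (left, upper, right, lower) box covering the top strip of a page."""
    # Convert the strip height from centimetres to pixels at the scan resolution.
    strip_px = round(strip_cm / CM_PER_INCH * ppi)
    # Never let the strip extend past the bottom of the page.
    return (0, 0, page_width_px, min(strip_px, page_height_px))


# Assumed example: a page of 4960 x 7016 pixels scanned at 600 ppi.
print(top_strip_box(4960, 7016))
```

Only the cropped strip would then be handed to whatever OCR tool is in use, so the staves below it never get a chance to be misread as text.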

Please test scanning at 600 ppi and see how that works. Also, please scan using your scanner's software and do not try to go through Acrobat. (Acrobat, by itself, cannot scan at all. Rather, it uses a driver interface called TWAIN to gain access to your scanner's software. So, let's cut out the middleman in this.) When you scan, scan to the TIF format. The reason for that is that after collecting the pages in your storage place (Desktop, special folder, wherever), if you drag the TIF files onto the Acrobat icon, Acrobat's OCR process will begin automatically. If you scan in PNG or JPG, you must manually tell Acrobat to OCR the currently open file, which is an extra step.

If you wish to send me one page of your scan in any format by DM-ing me, I'd appreciate seeing it. I am very curious as to what the page looks like before you OCRed the page. I'd like to see what I could do with it.

Oh, for the record, can you tell me which version of Acrobat you are using, and what your OS is (and what release)?

Thank you and good luck!