How to output OCR files in clear text?

Aug 24, 2022

I am a novice user with Adobe Acrobat Pro 2020, and I would like some help.

I am a very old retired home user, and so nothing I produce needs to conform with any publishing, etc, requirements. Everything is for my own use.

Let me explain the issue I can't master. I am often dealing with 100 - 200 year old text. I am very impressed with the quality of Adobe's OCR. From what I see, the OCR produces letters that look very similar to the ones that were scanned, i.e. the text is readable but imperfect, with no crispness, yet still correctly recognised. Where a bit of the original is too distorted for the OCR to recognise, the OCR outputs that bit as it looked after scanning, i.e. not OCR'd. Seems very clever to me.

I want to output my pdf with the recognised characters present as clear, crisp letters (should be possible as the OCR has correctly recognised it all), with the OCR still using the occasional substitution of a sort of facsimile of bits it can't recognise.

I have tried exporting to Word, but that introduces multiple errors which are not apparent in Adobe's output. Exporting to text, text (accessible), or RTF also introduces extra errors.

Now, what I am asking is: can I get Adobe OCR to output crisp text (I'm not fussy about the font), while still substituting what I call a facsimile for the bits the OCR can't recognise?

I have attached an image to show what I see when I OCR an old text, and below that is the same file exported to Word. What I am hoping to achieve is for Adobe's OCR to output crisp letters like those in the bottom image.

Take care in these dangerous times,

Doug

Aug 24, 2022

After running OCR, you can correct the recognized text directly in Acrobat Pro.

Aug 24, 2022

Otherwise you can use this Preflight fixup, but as suggested it will show you a layer containing "invisible" text only, so it can only be copied and pasted.

Aug 24, 2022

Thanks for your reply JR. I do appreciate it. I can see that I was unable to describe my issue.

Your reply centres around using Acrobat to find and highlight suspect text. Detecting and correcting these errors is not a concern for me. I will try to be clearer.

When Acrobat OCRs a file it detects letters (characters) that it recognises. What I have trouble understanding is why, when it recognises a letter, Acrobat outputs a match to what is often a poorly printed character.

One of the other replies attached here, from "Dave__M", gives an example of this. He shows this image:

The top scan is a pixelated scan as expected, and Acrobat outputs editable text that is far from a crisp font.

The program has correctly identified the scanned block as "1952", and therefore it seems logical to me that it could, and should, output 1952 as a clear crisp font, as other OCR programs I have do.

As an example, using a basic OCR that I own, using the first line of the example I gave in my original posting, here is a comparison of Acrobat OCR with that different (basic) OCR program. (This is an image)

(Number 2 did not like uppercase DESPAIR being in a sentence)

Number 2 is the output I want from Acrobat. In my original posting the only reason I showed the exported Word output was to point out that this produces many extra errors. But errors aren't my question. Acrobat OCR is very good.

Summing up, as shown in the 1952 example above, Acrobat OCR recognised "1952" but still outputted it in a ... how do I describe it ... a "distorted" 1952. Surely(?) there must be an option to output in a real font. After all, the numerals 1952 have been recognised.
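To make the request concrete, here is a minimal sketch in Python of the behaviour I am asking for. It is not how Acrobat works internally and it calls no Adobe API; the `Word` record, the confidence scores, the crop names, and the 0.80 threshold are all hypothetical, made up only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str          # what the OCR engine thinks the word is
    confidence: float  # hypothetical 0.0-1.0 score from the OCR engine
    image_id: str      # hypothetical reference to the scanned crop of the word


def render(words, threshold=0.80):
    """Return (kind, value) pairs: crisp text, or the scanned facsimile."""
    out = []
    for w in words:
        if w.confidence >= threshold:
            # Recognised well enough: set the word in a clean, real font.
            out.append(("text", w.text))
        else:
            # Not recognised: keep the scanned crop as a facsimile.
            out.append(("image", w.image_id))
    return out


# Invented sample values, for illustration only.
words = [
    Word("1952", 0.97, "crop-001"),
    Word("DESPAIR", 0.42, "crop-002"),
]
result = render(words)
```

With these invented values, "1952" would come out as ordinary crisp text, and the low-scoring word would stay as its scanned image.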

Hoping this describes my question better.

Taking my meds, putting on slippers, and getting my lap rug on my knees, 😔 😊

Doug

Australia

Adobe Community
