How to Edit/Fix OCR errors made by Acrobat Pro DC?

Report · Feb 13, 2022

Two problems, really, and both relate to the text as recognized (or not) in scanned images of text documents (or in text and image documents) by Acrobat Pro DC's OCR capability:

(1) How can I access the full (hidden) OCR text so I can correct the inevitable OCR errors that the program does not flag as "suspects" and therefore does not make available for editing via the "Correct Recognized Text" feature? (I.e., the program "thinks" it has correctly identified and spelled a word, so it does not present that word as a "suspect" for possible correction. But after the text in that document is copied and pasted into a separate document, the human eye can easily see that the OCR software misspelled or misinterpreted the image of that text, so the user wants to correct that error in the OCR text to assure later search accuracy.)

(2) The other part of the problem occurs when the OCR fails to identify (at all) some portions of the text image. In those cases, I would have thought I could use the Edit PDF | Edit feature to insert the missing text, but I could not. I found the "missing text" located outside the border of one of the "text boxes" (where text was recognized and can be edited), but I couldn't find a way to move the boundaries of that text box to "bring in" the unrecognized text and, worse, trying to do so distorted the text characters that were initially in that "text box". So this problem is really how to add to the "searchable text" the items that the OCR failed to identify as text. (I've attached a pdf with 2 cropped screenshots. The first is of a portion of the pdf before OCR and before trying to edit the OCR'd text. The second is of that same portion of the pdf while in Edit mode, and it shows the text boxes surrounding the text the software recognized [which I can manually edit] as well as the text in the original that the software DID NOT recognize and therefore did not include in a text box, none of which I can edit.)

Adobe must know both of these issues exist, so I presume there must be some way to address them so I can end up with a correct and complete text file that can be searched. However, I just cannot find tools that seem able to make these types of corrections to the text.

I've previously used other standalone OCR software that easily permits making these sorts of corrections (including adding missing text) to the underlying "searchable text" of an OCR'd image, but I just can't figure out where Adobe has hidden these capabilities within Acrobat Pro DC. Or, if these capabilities aren't present in Acrobat Pro DC, why in the world would Adobe not have included them?

I will deeply appreciate any help on how best to deal with these two problems.

Report · Feb 14, 2022

It's not hidden anywhere, it's just not available in Acrobat. What you see under the Text Recognition panel is what you can use. For more advanced OCR capabilities use a dedicated OCR tool.

Report · May 09, 2022

Thanks for responding. It's good (though disappointing) to learn Adobe doesn't include such tools in Acrobat.

Because I'm guessing Adobe is not willing to address this long-standing problem, perhaps you can offer some suggestions: As you probably know, in the legal community, PDF has become - by far - the dominant file format for exchanging large populations of documents between parties in lawsuits and I believe most law firms use Adobe's tools for handling PDF files. These document populations easily can aggregate tens or hundreds of thousands of pages of PDF files, essentially all of which must be reviewed by both sides in the lawsuit.

But these document populations comprise too many pages for individual lawyers or paralegals (or teams of them) to visually read the actual image of each and every page in every one of those PDF files to determine whether that PDF file contains information that may be relevant for use in the case or whether it contains privileged information and thus should be withheld from disclosure to the opposing side. So each side must use text searching tools that can "read" the OCR'd text in each of those PDF files to identify which of them contain any of the specific words that are of interest to that party. However, if the OCR software "misreads" and then misspells a word that is a relevant term in the lawsuit when those PDF files are created and OCR'd, then the true word will not be found by text searching. Some documents that do contain relevant words are missed and (usually) don't get identified by either side, OR documents containing terms that may make them privileged end up being inadvertently produced to the other side, creating a possibility that the opposing side may end up "seeing" something they were not supposed to have seen at all.

To my knowledge, no OCR program is 100% accurate, and I don't think anyone reasonably expects any to be . . . except, apparently, Adobe. When Acrobat is used to OCR a document image into a text searchable PDF file, its algorithms somehow watch for words (shapes, etc.) that it is "not sure of" and it allows a human to review those to make a final determination of whether and how those marks should appear as text in the translation. But if Acrobat is "sure" of its "reading", then it's all over. No one gets to "double check" the OCR engine, so that -- even when its translation is wrong -- that interpretation by the program is treated as immutably correct. As an example, if Acrobat's OCR engine incorrectly "reads" the original text's word "Tarnar" and records it in the text translation as "Farmer", then in a lawsuit involving a party named "Tarnar", that documentary evidence likely will never be found unless someone just happens to see the image itself and notice it.

I don't doubt at all that Adobe doesn't include in Acrobat any tools to allow users to correct its OCR program's unrecognized errors, but I would greatly appreciate it if someone can suggest other available tools that can assist in finding and correcting in a PDF file the unrecognized errors made by Adobe's OCR engine when it has created a PDF file.

I believe there are some tools that can extract and combine the OCR'd text from multiple PDF files and put that into a database that can be sorted such that a proofer can look at and sort together all the words in a population of documents to see there are, for example, 4,267 instances of "Tarnar" in the entire population of documents along with 7 instances of "Farmer", which can make it practical to examine those 7 instances of "Farmer" to determine which, if any, are misreadings of "Tarnar", but I'm not aware of any that then can make the appropriate corrections to the text files of those specific PDF files. Are you?
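To make the word-tally idea concrete, here is a minimal Python sketch of the "find the 7 Farmers among the 4,267 Tarnars" step. It assumes the OCR text has already been pulled out of each PDF by some other tool (that extraction is not shown), and it only flags candidates for a human proofer; it does not write any correction back into a PDF. The function name `near_matches`, the sample data, and the `cutoff` value are all hypothetical, and the similarity score from Python's standard `difflib` module is just a stand-in heuristic for "looks like a possible misreading".

```python
import re
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def near_matches(texts, term, cutoff=0.45):
    """Return words similar to `term`, with counts and the files they occur in.

    `texts` maps a file name to OCR text already extracted from that PDF
    (the extraction step itself is assumed and not shown here).
    """
    counts = Counter()
    files = defaultdict(set)
    for name, text in texts.items():
        for word in re.findall(r"[A-Za-z]+", text):
            key = word.lower()
            counts[key] += 1
            files[key].add(name)

    target = term.lower()
    hits = []
    for word, n in counts.items():
        if word == target:
            continue  # correctly read instances are not suspects
        score = SequenceMatcher(None, target, word).ratio()
        if score >= cutoff:  # cutoff is an arbitrary, tunable assumption
            hits.append((word, n, sorted(files[word]), round(score, 2)))
    # Rarest words first: a handful of look-alikes is cheap to check by eye.
    return sorted(hits, key=lambda h: (h[1], h[0]))


# Hypothetical sample data standing in for text extracted from two PDFs.
sample = {
    "a.pdf": "Tarnar signed. Tarnar agreed.",
    "b.pdf": "Farmer signed the lease.",
}
for hit in near_matches(sample, "Tarnar"):
    print(hit)
```

Each result names the look-alike word, how often it occurs, and which files to open, so the proofer can compare those few pages against the page images.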

Thanks

Report · May 09, 2022

What options do you use for OCR?