Acrobat Pro - scan/OCR, text recognition, text correction, text retrieval, text as exported to Word

Question

I've been using various versions of Acrobat Pro for decades. Admittedly, it hasn't been "day in, day out" use, so I don't claim any expertise with it at all. But I also feel it shouldn't be as difficult as I've always found it to be with respect to handling and accessing the text in a PDF file. Acrobat Pro is a product I've always hoped could be genuinely useful for working effectively with text that's been "extracted from" scanned images of hardcopy pages of raw text and then stored within a pdf file.

But I've never been able to figure out how to make Acrobat Pro work as I expect it should with respect to just the text it "acquires" from scanned images of ordinary text documents. Forget all the cool image editing tools in Acrobat; I'm 100% concerned first with just the text in the PDF file: Acquiring it; correcting it for accuracy; and then being able to access and work with the "recognized" text after it has been manually corrected to conform to what a human being interprets from looking at the original image.

Very few of the documents I scan to pdf ever include any photos or other visual material. Essentially straight text only, so being able to deal with accurate text in pdf files is really the ONLY reason I use Acrobat Pro at all.

Of course, I never expect any OCR software will be 100% accurate at text recognition, but - more often than not - Acrobat Pro's "text recognition" is much poorer than I'd hoped for: Many mis-recognized words and nonsense combinations of characters even from "clean" pages. And, of course, when the images of text are less clear (old pages, smudges, tears, etc.), I expect the OCR accuracy will be worse.

So the first thing I want to know is what steps I can take to make my copy of Acrobat Pro always recognize text in any image file as accurately as it possibly can. Keep in mind that I am infinitely more concerned about the accuracy of the recognized text than with the many tools Acrobat Pro includes to alter the image itself or make it look good in the final PDF file.

The second issue is my frustration with trying to correct the "recognized" text in the PDF file. I know OCR is never 100% accurate and -- particularly with many old documents -- when I go to "correct" the recognized text, I expect it will look like it's been up to 98% "mis-read". Nevertheless, I've spent hours using Acrobat Pro's clumsy "Correct recognized text" tool to fix just the instances where the program itself admits it might have "misread" something. [Unfortunately, there seems to be no tool to "fix" the "misreads" that the program doesn't "believe" might be wrong but that it genuinely did not recognize correctly.] But even when I fix just what I'm allowed to correct, the effort ends up seeming to have been a complete waste of time, because I've been unable to find a way to "retrieve" the corrected text from the saved PDF file. After saving that "corrected" file, if I then export that PDF to a Word document, the exported Word document doesn't get the corrected text; instead, Acrobat Pro gives the Word document only the text as it was originally misread. WHY??? How can the corrected text be retrieved from these PDF files?

Though it's probably not directly related to helping me learn ways to make my Acrobat Pro better help me work with accurate text, the most galling thing about PDF overall is that, long ago, Adobe (with the help of others) convinced the U.S. Courts to accept and use PDF/A files as an accurate, secure electronic standard. Yet, after hundreds of thousands (or millions) of pages of case files and authorities have been scanned and made into "text accessible" PDF files, many of those PDF files have "baked in" uncorrectable text errors that can make reliable retrieval of many of those documents essentially impossible when the search is for a word that the PDF creation software's OCR engine "misread". That hurts everyone in one way or another.

Why isn't it a priority for Adobe to provide tools to make it easier to improve text accuracy in PDF files?

gary_sc · Answer

Hi, @pdaltonlaw, I do hear your pain.

From the beginning, Acrobat has never had a global correction. What I mean by that is if a word was misread multiple times, all the same way, there was nothing to click on to "repair all." Multiple requests for this have fallen on deaf ears. And, with all of the new AI features, none of that work seems to be aimed at auto-fixing the OCR process. Thus, simple things like hyphenated words are not joined. One of the potential risks of having AI fix Acrobat's OCR is that it would also "fix" the original writing's grammar and spelling.
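
To illustrate what a "repair all" pass and hyphen joining would look like, here is a minimal sketch in Python that works on text after it has been exported out of the PDF (for example, to a plain-text file). This is not an Acrobat feature and does not touch the PDF's text layer; the function names and the sample misreads ("tbe", "c0urt") are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

import re


def join_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at a line break.

    Caution: this also joins genuine compounds that happened to break
    at the hyphen (e.g. "well-" / "known"), so review the result.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def repair_all(text: str, fixes: dict[str, str]) -> str:
    """Replace every whole-word occurrence of each known misread."""
    for wrong, right in fixes.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        # A function replacement keeps `right` literal (no backslash escapes).
        text = re.sub(pattern, lambda _m, r=right: r, text)
    return text


if __name__ == "__main__":
    sample = "tbe c0urt found tbe recog-\nnition adequate"
    fixes = {"tbe": "the", "c0urt": "court"}  # hypothetical misreads
    print(repair_all(join_hyphenated(sample), fixes))
```

The corrections are applied only to an explicit list of misreads that a person has supplied, so nothing in the original author's grammar or spelling is changed unless you ask for it.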

As far as fixing all of the previously OCRed documents, that would require an extensive amount of assisted work. As good as AI can be, it's still "stupid as s#!t." A recent study showed that AI continued to be flummoxed by grade school word problems ("train A leaves the station at …"). Meanwhile, as a user of Grammarly, I am always frustrated that it continues to want to fix my writing about the fact that I live in a green house, not a greenhouse. In addition, my word choices are often specific, and Grammarly always wants to rewrite my words into its "style." As an attorney, I'm sure you can appreciate the potential calamity if such "fixing" were left on its own with many legal documents.

Nonetheless, the biggest issue that I've seen screw up OCR is poor original content and/or poor-quality scanning. For the former, if you have a document that is multiple generations of Xeroxing a faxed document, you will never get a good quality OCR. Likewise, if you have a clean document and do not do a good-quality scan, you will also add to your problems.

I wrote a bunch of scanning tips for Adobe a number of years ago. Perhaps some of these might give you some ideas to help your scanning. If you have any further questions, please ask.

https://community.adobe.com/t5/adobe-community-professionals/scanning-clean-searchable-pdfs/m-p/4785435?page=1#M89

Good luck
