Acrobat Pro - scan/OCR, text recognition, text correction, text retrieval, text as exported to Word
I've been using various versions of Acrobat Pro for decades. Admittedly, it hasn't been "day in, day out" use, so I don't claim any expertise with it at all. But I also feel it shouldn't be as difficult as I've always found it to be with respect to handling and accessing the text in a PDF file. Acrobat Pro is a product I've always hoped could be genuinely useful for working effectively with text that's been "extracted from" scanned images of hardcopy pages of raw text and then stored within a pdf file.
But I've never been able to figure out how to make Acrobat Pro work as I expect it should with respect to just the text it "acquires" from scanned images of ordinary text documents. Forget all the cool image editing tools in Acrobat; I'm 100% concerned first with just the text in the PDF file: Acquiring it; correcting it for accuracy; and then being able to access and work with the "recognized" text after it has been manually corrected to conform to what a human being interprets from looking at the original image.
Very few of the documents I scan to PDF ever include any photos or other visual material. They are essentially straight text only, so being able to deal with accurate text in PDF files is really the ONLY reason I use Acrobat Pro at all.
Of course, I never expect any OCR software will be 100% accurate at text recognition, but - more often than not - Acrobat Pro's "text recognition" is much poorer than I'd hoped for: Many mis-recognized words and nonsense combinations of characters even from "clean" pages. And, of course, when the images of text are less clear (old pages, smudges, tears, etc.), I expect the OCR accuracy will be worse.
So the first thing I want to know is: what steps can I take to make my copy of Acrobat Pro recognize text in any image file as accurately as it possibly can? Keep in mind that I am infinitely more concerned about the accuracy of the recognized text than with the many tools Acrobat Pro includes to alter the image itself or make it look good in the final PDF file.
The second issue is my frustration with trying to correct the "recognized" text in the PDF file. I know OCR is never 100% accurate and -- particularly with many old documents -- when I go to "correct" the recognized text, I expect it will look like it's been up to 98% "mis-read". Nevertheless, I've spent hours using Acrobat Pro's clumsy "Correct recognized text" tool to fix just the instances where the program itself admits it might have "misread" something. [Unfortunately, there seems to be no tool to "fix" the "misreads" that the program doesn't "believe" might be wrong but that it genuinely did not recognize correctly.] But even when I fix just what I'm allowed to correct, the effort ends up seeming like a complete waste of time, because I've been unable to find a way to "retrieve" the corrected text from the saved PDF file. After saving that "corrected" file, if I then export that PDF to a Word document, the exported Word document doesn't get the corrected text; instead, Acrobat Pro gives the Word document only the text as it was originally mis-read. WHY??? How can the corrected text be retrieved from these PDF files?
Though it's probably not directly related to helping me learn ways to make Acrobat Pro work better with accurate text, the most galling thing about PDF overall is that, long ago, Adobe (with the help of others) convinced the U.S. Courts to accept and use PDF/A files as an accurate, secure electronic standard. Yet, after hundreds of thousands (or millions) of pages of case files and authorities have been scanned and made into "text accessible" PDF files, many of those PDF files have "baked in" uncorrectable text errors that can make reliable retrieval of many of those documents essentially impossible when the search is for a word that the OCR engine of the PDF creation software "misread". That hurts everyone in one way or another.
Why isn't it a priority for Adobe to provide tools to make it easier to improve text accuracy in PDF files?
