When I run OCR on a scanned PDF, Acrobat diligently recognizes all the text, which is good, and also "recognizes" some bits of images as text, which is not so good. Removing these spurious text elements leaves holes in the image. Is there a way to remove them that restores the original image or should I resign myself to using copied images from the scan to replace the OCR-mangled elements?
You can specify a page range when performing OCR ("Text Recognition") in Acrobat, but if the pages you wish to ignore are scattered through the file, that might not be very useful. The only other option I can think of is to extract those pages, delete them from the original file, run OCR on what remains, and then import the extracted pages back in.
This can be done using a custom-made script (except for the OCR part, which you'll need to run manually).
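The bookkeeping such a script would need can be sketched as follows. This is only an illustration in Python, not an Acrobat script: pages are stood in for by plain labels, the function names `split_pages` and `merge_pages` are hypothetical, and the OCR step is faked with a string suffix because it has to be run manually in Acrobat anyway.

```python
def split_pages(pages, skip):
    """Separate the pages OCR should not touch from the rest.

    pages: page objects (plain labels here) in document order.
    skip:  set of 0-based page indexes to hold back from OCR.
    Returns (to_ocr, held_back); held_back maps original index -> page.
    """
    to_ocr = [p for i, p in enumerate(pages) if i not in skip]
    held_back = {i: p for i, p in enumerate(pages) if i in skip}
    return to_ocr, held_back


def merge_pages(ocr_pages, held_back):
    """Reinsert the held-back pages at their original positions."""
    merged = list(ocr_pages)
    # Ascending order matters: each insert restores one original index,
    # so every later index is valid by the time it is reached.
    for index in sorted(held_back):
        merged.insert(index, held_back[index])
    return merged


pages = ["p0", "p1", "p2", "p3", "p4", "p5"]
to_ocr, held_back = split_pages(pages, skip={1, 4})
ocr_done = [p + "+text" for p in to_ocr]  # stand-in for the manual OCR step
result = merge_pages(ocr_done, held_back)
print(result)
```

The point to carry over to a real script is the ordering: once pages are deleted, the remaining page numbers shift, so the extracted pages should be put back from the lowest original position upward.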
Unfortunately, most pages are a mix of text, equations, and diagrams, with occasional photos for variety. The most vexing document is a bill of materials in which washers, mounting holes, brackets, and even the texture of knurled knobs have wrongly become text, while the part names, scales, descriptions, and measurements are properly recognized. This particular document would be solved with a rule such as "don't look for text in column 1 of the table," though other documents are less structured.
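To make the "column 1" rule concrete, here is a minimal sketch of what it amounts to, assuming the OCR results were available as words with bounding boxes. Nothing in this thread establishes that Acrobat exposes such a setting; the `Word` type, the `drop_words_in_region` helper, and all coordinates are hypothetical and for illustration only.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One recognized word and its bounding box (hypothetical layout)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def outside_region(word, region):
    """True if the word's center lies outside region (x0, y0, x1, y1)."""
    cx = (word.x0 + word.x1) / 2
    cy = (word.y0 + word.y1) / 2
    rx0, ry0, rx1, ry1 = region
    return not (rx0 <= cx <= rx1 and ry0 <= cy <= ry1)


def drop_words_in_region(words, region):
    """Keep only the words whose centers fall outside the excluded region."""
    return [w for w in words if outside_region(w, region)]


# Made-up numbers: column 1 of the table occupies x from 0 to 100.
column_1 = (0, 0, 100, 800)
words = [
    Word("O", 40, 120, 60, 140),         # a washer drawing misread as a letter
    Word("Washer", 120, 120, 200, 140),  # the real part name in column 2
    Word("M6", 220, 120, 250, 140),      # the real measurement in column 3
]
kept = drop_words_in_region(words, column_1)
print([w.text for w in kept])
```

A rule like this only works when the drawings sit in a predictable region, which is why it would suit the bill of materials but not the less structured documents.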