I am currently digitizing a collection of paleontology publications so that the literature for specific genera can be found quickly. My problem is that much of the literature is in very bad condition due to its age (some of it is from the 1800s). I would like to improve the text's readability in Photoshop before importing it into Acrobat to be OCR'd. The biggest issue is that many letters were not completely printed (e.g. an "a" is read as "ct" by Acrobat because of gaps at the top and bottom of the "a", or an "e" is read as a "c"). Any suggestions on how to make badly printed letters more "whole" (especially italicized characters) would be greatly appreciated. My current process for digitizing publications involves these steps:
1) Scan the document, either by ADF or on a flatbed, as grayscale JPEGs at 600 dpi (scanning at this resolution is not strictly necessary, but it greatly improves results).
2) Open the images in Photoshop and apply "Auto Levels" (black and white clipping at zero), apply an "Unsharp Mask" (Amount: 100%; Radius: 250 pixels; Threshold: 0 levels), save, and exit.
3) Combine the JPEGs into a PDF and OCR the document in its corresponding language.
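For readers who want to see what the levels adjustment in step 2 does numerically, here is a minimal sketch in plain Python. It assumes that Auto Levels with zero clipping on a grayscale image behaves like a simple linear stretch of the darkest value to 0 and the lightest to 255. The function name `stretch_levels` is made up for illustration and is not a Photoshop API.

```python
def stretch_levels(pixels):
    """Linearly rescale 8-bit grayscale values so min -> 0 and max -> 255.

    Hypothetical helper: approximates a levels stretch with no clipping.
    """
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    scale = 255 / (hi - lo)
    return [round((p - lo) * scale) for p in pixels]


# A faded scan: "black" ink is only mid-gray, paper is light gray.
faded = [60, 90, 120, 200]
print(stretch_levels(faded))  # [0, 55, 109, 255]
```

A stretch like this increases contrast, but it cannot restore ink that never reached the paper, which is why gaps in letters survive it.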
This works really well if the publication is in good condition; however, the accuracy on other documents is easily below 75%. Again, any suggestions on how to make badly printed letters more "whole" would be greatly appreciated.
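To make the idea of "whole" letters concrete: one standard image-processing technique for bridging small breaks in strokes is morphological closing (a dilation followed by an erosion). The sketch below is plain Python on a tiny binarized image where 1 means ink. It is an illustration under those assumptions, not a Photoshop or Acrobat feature, and the function names (`dilate`, `erode`, `close_gaps`) are hypothetical.

```python
def _neighbours(grid, r, c):
    """Yield the in-bounds values in the 3x3 window centred on (r, c)."""
    for rr in range(max(r - 1, 0), min(r + 2, len(grid))):
        for cc in range(max(c - 1, 0), min(c + 2, len(grid[0]))):
            yield grid[rr][cc]


def dilate(grid):
    """A pixel becomes ink if any pixel in its 3x3 window is ink."""
    return [[int(any(_neighbours(grid, r, c))) for c in range(len(grid[0]))]
            for r in range(len(grid))]


def erode(grid):
    """A pixel stays ink only if every pixel in its 3x3 window is ink."""
    return [[int(all(_neighbours(grid, r, c))) for c in range(len(grid[0]))]
            for r in range(len(grid))]


def close_gaps(grid):
    """Morphological closing: dilate to bridge gaps, then erode back."""
    return erode(dilate(grid))


# A horizontal stroke with a one-pixel break, surrounded by a blank margin
# so that edge handling does not affect the result.
stroke = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

for row in close_gaps(stroke):
    print("".join("#" if v else "." for v in row))
```

The break in the middle row is filled while the stroke keeps its original thickness. The trade-off is that closing also fills any legitimate opening narrower than the window, such as the gap that distinguishes a "c" from an "e", so the window has to stay small relative to the letter size.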
Look into using other OCR solutions that can be trained, like FineReader or Cooliris. Acrobat's OCR is really only meant for contemporary tasks like recognizing form data, not for book restoration. It's not the image quality or anything; you need a way to teach the program to interpret specific gaps and artifacts differently, and you can't do that with Acrobat.
Mylenium