"cleaning up" pdf's of old, scan generated, scientific documents

July 7, 2020

Is there a process for "cleaning up" PDFs created from scanned documents? In this example, it's an old scientific document with symbols, Latin names, etc.

When I select a section and then save it as a PDF, symbols, Latin names, etc. are sometimes interpreted incorrectly.

Any suggestions?


try67

Community Expert

I would recommend taking the original scans to something like Photoshop in order to clean them up, sharpen them, etc.

When done, convert them to a PDF file and then run Text Recognition on them. Acrobat is not really the tool to do the cleaning-up. It's not an image editor.
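If there are too many pages to fix by hand, part of that clean-up can be scripted. Below is a minimal sketch, not a substitute for a real image editor, of one common step: turning a faint grayscale scan into pure black and white using Otsu's method to pick the cut-off. The function names `otsu_threshold` and `binarize` are hypothetical, and the pixel values are a made-up sample; in practice you would load the pixels with an image library of your choice.

```python
def otsu_threshold(pixels):
    """Return the 0-255 level that best separates dark ink from light paper."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (scaled); larger means a cleaner split.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(rows):
    """Map every pixel to black (0) or white (255) around the Otsu threshold."""
    flat = [p for row in rows for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in rows]


# Made-up sample: faint gray ink (around 115-120) on off-white paper (around 200).
faint_scan = [
    [200, 210, 120, 205],
    [198, 115, 118, 202],
]
print(binarize(faint_scan))
```

The faint strokes come out solid black and the paper solid white, which is usually easier for Text Recognition to read than light gray on off-white.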

Bevi Chagnon - PubCom.com

Legend

Before bringing the pages into Photoshop for clean-up (that's a lot of work for so many pages!), I'd try these 2 options first:

Adjust the OCR settings within Acrobat. Right now it seems to miss some of the lighter characters in the original scan, and that's not unusual for a document printed 50-60 years ago and scanned who-knows-when. Play around with the settings and see if you can improve its accuracy.
Try other OCR software. Although Acrobat's is decent, other brands do a better job for certain types of scans. My firm's top 2 recommendations are:
1. ABBYY FineReader https://www.abbyy.com/
2. OmniPage https://www.kofax.com/Products/omnipage

Because of the complexity of your content, I recommend the "Pro" versions of these programs rather than the cheaper versions. They have better recognition of unusual symbols, STEM characters, and languages, as well as controls for cleaning up the background crud that gets caught into a scan.
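When comparing settings or programs, it helps to measure the result rather than eyeball it. Here is a rough sketch, assuming you hand-type one short passage as a reference and copy the recognized text for the same passage out of each OCR run. It uses Python's standard `difflib` module; the helper names `ocr_accuracy` and `misread_words` and the sample strings are made up for illustration.

```python
import difflib


def ocr_accuracy(reference, ocr_text):
    """Character-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    matcher = difflib.SequenceMatcher(None, reference, ocr_text, autojunk=False)
    return matcher.ratio()


def misread_words(reference, ocr_text):
    """List (expected, recognized) pairs wherever the word sequences differ."""
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    return [
        (" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical passage: hand-typed reference vs. what an OCR run produced.
reference = "Quercus alba grows at 25 °C"
ocr_run = "Quercus a1ba grows at 25 oC"

print(round(ocr_accuracy(reference, ocr_run), 3))
print(misread_words(reference, ocr_run))
```

Run it once per setting or program on the same passage; the higher score wins, and the list of misread words shows whether the symbols and Latin names are what is going wrong.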

Bevi Chagnon | Designer, Trainer, & Technologist for Accessible Documents | PubCom | Classes & Books for Accessible InDesign, PDFs & MS Office

Bevi Chagnon - PubCom.com

Legend

This might be a font issue: older TrueType or PostScript fonts used the ASCII character set (https://www.asciitable.com/), whereas today's OpenType fonts are based on the Unicode character set (https://www.unicode.org). The computer industry broadly adopted Unicode around 2000. Although older TrueType and PostScript fonts can still be used, they're missing many of the characters Unicode covers, such as foreign-language glyphs, math/science symbols, and dingbats.

If you look at the Fonts tab in File / Properties, tell us what fonts are listed.
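To see which characters in a passage fall outside plain ASCII, and so depend on Unicode support, here is a small sketch using Python's standard `unicodedata` module. It only inspects text you paste in (for example, text copied out of the PDF); it does not read the fonts embedded in the file. The function name `non_ascii_characters` and the sample string are made up for illustration.

```python
import unicodedata


def non_ascii_characters(text):
    """Map each character outside 7-bit ASCII to its official Unicode name."""
    return {
        ch: unicodedata.name(ch, "UNKNOWN")
        for ch in text
        if ord(ch) > 127
    }


# Hypothetical passage with a ligature, a unit symbol, and a biology symbol.
sample = "\u00c6sculus, 5 \u00b5m, \u2640"
for ch, name in non_ascii_characters(sample).items():
    print(f"U+{ord(ch):04X}  {name}")
```

Anything this lists is a character that an ASCII-only font cannot represent, so those are the spots to check first in the recognized text.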
