What is the most efficient workflow for converting scanned typewritten pages into a text document? The destination output is EPUB and similar file formats. Any hints, tips, apps, or tricks you could suggest would be very much appreciated. I have the full Creative Suite, so I can convert the underlying document to any other file format if necessary.

The situation: I have the PDF of a 771-page typewritten manuscript. I opened it in Acrobat X, ran the OCR function, and exported the text to three formats to test which one handled it best: Word doc, RTF, and plain text. Acrobat X did a great job 95 percent of the time; however, the remaining 5 percent is killing me. All three formats had the following problems to varying degrees.

Problem 1: Acrobat has a tendency to separate parts of pages into what look like tables or separate text boxes. It also sometimes reads dust on the pages as punctuation, which may be affecting the flow of the text as well. The result is huge problems when trying to flow the text, because the text boxes overlap each other and hide behind other elements on the page. It is also randomly adding tab stops between words and paragraph breaks at the ends of lines.

Problem 2: In the original manuscript, every page is numbered at the bottom. These numbers show up as random text elements and also cause problems in the export by interrupting the text flow.

Problem 3: Too many special characters are being inserted. Acrobat is being too smart in how it does the OCR: it interprets dust and smudges as letters and splits up words with various special characters.

What processing steps could I handle differently? Here is a typical page from the manuscript: