What is the most efficient workflow for converting scanned typewritten pages into a text document? The destination output is an ePub and similar file formats. Any hints, tips, apps, or tricks you could suggest would be very much appreciated. I have the full Creative Suite, so I can convert the underlying document to any other file format if necessary.
The situation: I have the PDF of a 771-page typewritten manuscript. I opened it in Acrobat X, ran the OCR function, and exported the text to three formats to test which handled it best. The formats were Word doc, RTF, and plain text.
Acrobat X did a great job 95 percent of the time; however, the remaining 5 percent is killing me. All three formats had the following problems to various degrees.
What processing steps could I handle differently?
Here is a typical page from the manuscript:
 
Unless you want your OCR'd PDF to be the final deliverable, don't go near PDF. It seems that you want text flow, perhaps a different kind of ePub entirely, and perhaps without keeping an exact facsimile of the original, which is the great strength of PDF but not desirable in most ePubs. So... I suggest you look for OCR to Word, OCR to plain text, or even OCR to ePub.
Didn't even know those existed. Thanks!
I encountered a similar situation when my sister and I found our mom's "Family History." [Note: it is our observation that whoever writes these histories is 3/4 storyteller and 1/4 historian, but that's another story.] It was about 24 pages and had many of the same issues as yours. On top of that, her typewriter would occasionally start slipping, so that toward the bottom of a page the lines would start to tilt from left to right. That confused the blazes out of Acrobat.
But to answer your question, TSN's suggestion of getting a dedicated OCR package is not a bad idea but it will only solve the default page issues that both of us had.
The bad news is that there is NO magic wand here. Whatever tool you use, there will be problems; that's the nature of OCR, especially when working with text such as you and I had (although you have considerably more text than I did).
However: I ended up saving to RTF format, opening that in Word, and then resaving as plain text (.txt). That got rid of all formatting, including the page breaks.
To deal with the paragraph marks at the end of each line (I didn't have this issue, FWIW), you can use Word's powerful Find and Replace. First turn all real paragraph breaks (two paragraph marks in a row) into something unique; I used "zzzz". In "Find" put ^p^p (that's the code for two paragraph marks) and in "Replace" put zzzz. Then convert all remaining single paragraph marks (^p) into spaces. Then go back and convert every "zzzz" into a single paragraph mark. The only thing that may trip this up is a space before a paragraph mark. If you have those, then first do a find for space-^p and replace it with ^p. This can be a very fast way of dealing with these things.
Also, be aware that OCR does not know that a partial word ending in a hyphen at the end of a line is continued on the next line.
However, whether you choose to get an OCR package or go via Acrobat to Word, just get ready to sit and work for some time cleaning this up. There's no way around that.
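If you would rather script this on the plain-text export than click through Word, the same placeholder trick can be sketched in Python. This is only a rough sketch under a few assumptions: the function name `unwrap_lines` is made up for illustration, the placeholder "zzzz" must not occur anywhere in the manuscript, and a run of two or more line breaks is treated as one paragraph break (slightly more forgiving than a strict ^p^p search).

```python
import re

PLACEHOLDER = "zzzz"  # assumption: this string never appears in the real text


def unwrap_lines(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Join hard-wrapped lines into paragraphs, keeping real paragraph breaks."""
    if placeholder in text:
        raise ValueError("placeholder occurs in the text; pick another one")
    # Normalize Windows line endings so every break is a single "\n".
    text = text.replace("\r\n", "\n")
    # Equivalent of replacing space-^p with ^p: drop spaces before a break.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Equivalent of ^p^p -> zzzz: mark the real paragraph breaks.
    text = re.sub(r"\n{2,}", placeholder, text)
    # Equivalent of ^p -> space: remaining single breaks are just line wraps.
    text = text.replace("\n", " ")
    # Equivalent of zzzz -> ^p: restore one break per paragraph.
    return text.replace(placeholder, "\n")


if __name__ == "__main__":
    sample = "The quick \nbrown fox\n\nNew para\nhere"
    print(unwrap_lines(sample))
```

Try it on a copy of a few pages first; odd spacing in the OCR output (for example, indented first lines instead of blank lines between paragraphs) would need a different rule.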
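For the hyphen problem, a small regular expression can rejoin words that were split across lines. Again, just a sketch with a made-up function name (`join_hyphenated`), and it has a known weakness: it cannot tell a line-break hyphen from a genuine one, so a compound like "well-known" that happens to break at the hyphen would come out as "wellknown". It also needs to run while the line breaks are still in place, before they are turned into spaces.

```python
import re


def join_hyphenated(text: str) -> str:
    """Rejoin words split by a hyphen at the end of a line.

    Assumption: every hyphen directly before a line break is a
    word-continuation hyphen, which is not always true.
    """
    # A word character, a hyphen, optional spaces, a line break, optional
    # spaces, then another word character: keep only the two characters.
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


if __name__ == "__main__":
    print(join_hyphenated("a 771-page manu-\nscript"))
```

Hyphens in the middle of a line (like "771-page") are left alone because the pattern only matches when a line break follows the hyphen. A spell-check pass afterwards should catch most of the wrongly joined compounds.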
Good luck!
Thank you. It's always the little things that take the most time; your advice will be very helpful.