Best Workflow To Turn Long Typewritten Doc Into E-Pub?

Report · Jul 09, 2020

What is the most efficient workflow for converting scanned typewritten pages into a text document? The destination output is an e-pub and similar file formats. Any hints, tips, apps, or tricks you could suggest would be very much appreciated. I have the full Creative Suite, so I can convert the underlying document to any other file format if necessary.

The situation: I have the PDF of a 771-page typewritten manuscript. I opened it in Acrobat X, ran the OCR function, and exported the text to three formats to test which handled it best. The formats were Word doc, RTF, and plain text.

Acrobat X did a great job 95 percent of the time; however, the remaining 5 percent is killing me. All three formats had the following problems to varying degrees.

Problem 1: Acrobat has a tendency to separate parts of pages into what look like tables or separate text boxes.
It also sometimes reads dust on the pages as punctuation, and that may also be affecting the flow of the text.
The result is huge problems when trying to flow the text, because the text boxes overlap each other and hide behind other elements on the page. It is also randomly adding tab stops between words and paragraph breaks at the ends of lines.
Problem 2: In the original manuscript every page is numbered at the bottom. These numbers show up as random text elements and also give Acrobat problems in the export by interrupting the text flow.
Problem 3: Too many special characters are being inserted. Acrobat is being too smart in how it does the OCR: it reads dust and smudges as letters and ends up splitting words with various special characters.

What processing steps could I handle differently?

Here is a typical page from the manuscript:

Report · Jul 09, 2020

Unless you want your OCR'd PDF to be the final deliverable, don't go near PDF. It seems that you want text flow, perhaps a different kind of ePub entirely. Perhaps without keeping the original exact facsimile - which is the great strength of PDF, but not desirable in most ePubs. So... I suggest you look for OCR to Word, OCR to plain text or even OCR to ePub.

Report · Jul 10, 2020

Didn't even know those existed. Thanks!

Report · Jul 09, 2020

I encountered a similar situation when my sister and I found our mom's "Family History." [Note: it is our observation that whoever writes these histories is 3/4 storyteller and 1/4 historian, but that's another story.] It was about 24 pages and had many of the same issues as yours. On top of that, her typewriter would occasionally start slipping, so that toward the bottom of a page the lines would start to tilt from left to right. That confused the blazes out of Acrobat.

But to answer your question, TSN's suggestion of getting a dedicated OCR package is not a bad idea but it will only solve the default page issues that both of us had.

The bad news is that there is NO magic wand here. Whatever tool you use, there will be problems. That's the nature of OCR, especially when working with text such as you and I had (although you have considerably more text than I did).

However: I ended up saving to RTF format, opening it in Word, and then resaving in TXT format. That got rid of all formatting, including the page breaks.

To deal with the paragraph marks at the ends of lines (I didn't have this issue, FWIW), first turn all real paragraph breaks (two paragraph marks in a row) into something unique (I used "zzzz") using Word's powerful Find and Replace. In the "Find" field enter ^p^p (that's the code for two paragraph marks) and in the "Replace" field enter zzzz. Then convert all single paragraph marks (^p) into spaces. Then go back and convert every "zzzz" into a single paragraph mark. The only thing that may trip this up is a space before a paragraph mark. If you have that, then first do a find for space-^p and replace it with ^p. This can be a very fast way of dealing with these things.
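For anyone who would rather script this than click through Word, the same placeholder trick can be sketched in Python on the plain-text export. This is only a sketch under assumptions: the function name `unwrap_paragraphs` is made up for illustration, a line break stands in for Word's ^p, and any run of two or more line breaks is treated as one paragraph break (a bit more lenient than an exact ^p^p match).

```python
import re


def unwrap_paragraphs(text: str, placeholder: str = "zzzz") -> str:
    """Join hard-wrapped lines into paragraphs using a temporary placeholder.

    Hypothetical helper: mirrors the Find and Replace steps described above.
    """
    # The placeholder must not already occur in the text, or the last step
    # would insert paragraph breaks in the wrong places.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text; pick another")

    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # The "space-^p" case: drop spaces and tabs sitting just before a line break.
    text = re.sub(r"[ \t]+\n", "\n", text)

    # Step 1: mark real paragraph breaks (two or more line breaks in a row).
    text = re.sub(r"\n{2,}", placeholder, text)

    # Step 2: turn every remaining single line break into a space.
    text = text.replace("\n", " ")

    # Step 3: turn each placeholder back into a single paragraph break.
    return text.replace(placeholder, "\n")


if __name__ == "__main__":
    sample = "First line \nsecond line.\n\nNext para\nends here."
    print(unwrap_paragraphs(sample))
```

As in Word, check first that the placeholder string really is unique in the manuscript; the sketch refuses to run if it is not.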

Also, be aware that OCR does not know that a word split with a hyphen at the end of a line continues on the next line.
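If the text ends up in a plain-text file, a rough first pass at rejoining those split words could look like the Python sketch below. The function name `join_hyphenated` and the rule it applies (a letter, a hyphen, a line break, then a lowercase letter) are assumptions for illustration, not anything Acrobat or Word provides. It will also wrongly join genuine compounds that happened to break at the hyphen (for example "well-" / "known"), so the result still needs proofreading.

```python
import re

# A letter, a hyphen, optional trailing spaces, a line break, optional
# indentation, then a lowercase letter that continues the word.
_SPLIT_WORD = re.compile(r"([A-Za-z])-[ \t]*\n[ \t]*([a-z])")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were hyphenated across a line break (heuristic)."""
    # Keep the two letters around the break; drop the hyphen and the break.
    return _SPLIT_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "The manu-\nscript was long."
    print(join_hyphenated(sample))
```

This has to run before single line breaks are turned into spaces; afterwards a split word looks like "manu- script" and the line-break pattern no longer matches.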

However, whether you choose to get an OCR package or go via Acrobat to Word, just get ready to sit and work for some time cleaning this up. There's no way around that.

Good luck!

Report · Jul 10, 2020

Thank you. It's always the little things that take the most time; your advice will be very helpful.

Adobe Community
