OCR Book with Embedded Documents Brand N

Report · Jan 01, 2022

Happy New Year!

I am OCRing a scanned book that has many embedded documents. I will be exporting the scanned documents to Word for further editing, but I want the embedded documents to be left alone (I want them saved as images, not as scanned text). Thus, problem 1 is that I don't want Acrobat to try to OCR these scans. Problem 2 is that when they are scanned, they result in lots of queries in the "Correct Recognized Text". Right now, I am just spending a lot of time deleting the suggestion and hitting <cr> for "Accept". This is very time-consuming.

The only solution I see right now is to just do this repetitive task for every image, export it and hope that it saves it as an image on the page, and then cut and paste an image for those it doesn't. (Sample page below–one of the easier ones).

I hope there is a better way!

Thanks.

Report · Jan 01, 2022

Hi DancingJewel,

I'm glad you are planning on moving things into Word because I think that will help you on several counts.

First off, any time I want to move an OCRed document into Word, I never have Acrobat do the correcting. As you've discovered, it's very painful. Word, on the other hand, offers automatic fixes of the "teh" into "the" type, which can considerably speed things up. So I say that you should just OCR each page and not try to fix things in Acrobat.

On your first issue, unless the software that came with your scanner provides regions to not OCR (some software lets you break a page into regions, and since Acrobat itself doesn't scan, it uses TWAIN to use other software on your computer), you cannot do exactly what you want. What you do not say is how many of these images there are on a page. However, with the example you provided, there's not much you can do unless you re-scan these to provide the best image of these sections. Anyhow, at least with ignoring the Acrobat-initiated corrections, things will go much faster.

[To better explain what I'm talking about see this blog I wrote for Adobe some time back:

http://photosbycoyne.com/Gary's_Help/Scanning/clean-scanning.html]

BTW, I'm not trying to sell anything here, but if you are going to recreate this book, you will have significantly fewer frustrations if you use InDesign rather than Word. I've used both and there is no comparison, especially if you have inserts within the text. Good luck!

Report · Jan 01, 2022

There are pages with no images, pages with 1 image, pages that are all 1 image, and pages with many images. Simple, right?

I have InDesign, and would consider it, but I am scanning this for the family of a deceased author who can't find his original files! Not sure they would have InDesign.

But the underlying problem would be wanting to have the Word export have the "text" be editable and leave the images alone (well, I might try to clean them up).

Best I can envision right now would be an amalgam, where I cut and save images and then import them into the final Word document. I thought of using the (exact) option, but I don't want an image of the original text!

Anyway, thanks, I will try your suggestion and see what I get. And I will follow up.

Report · Jan 01, 2022

Ah, you didn't mention wanting the text to be editable. Yeah, InDesign does have that limitation.

And yes, this is a labor of love, they will appreciate what you are doing more than you could be paid.

For the text that might be OCRed within an image, one option that might be possible is to go into "Edit" mode, select that text and delete it. But to do this you need to select the OCR option of using "Searchable Image," not "Searchable Image (Exact)" or "Editable Text & Images." The "Searchable Image" option leaves the text and image alone and creates an invisible text layer over the page's content. If you delete that text, you will not affect the underlying image.

Does that make sense?

Report · Jan 01, 2022

Actually, it does. I was temporarily in the mindset of not wanting to do that because I didn't want the image of the original text, and then realized that (slaps forehead) when I export to Word it becomes text. Sigh...

Btw, I am kind of a newbie when it comes to this. I have been creating all my documents in some version of LaTeX since the late 1980s. Word has always seemed very clunky to me, and is what I call WYGIWWWTGY (What You Get Is What We Want To Give You), whereas LaTeX was WYGIWYW (What You Get Is What You Want). With the advent of some real-ish time LaTeX engines, you can pretty much see what you are doing in real time. But this isn't the most "shareable" format!

Looking forward to trying your great suggestion!

Report · Jan 01, 2022

Thank you for that "WYGIWWWTGY."

I will be keeping that one.

Report · Jan 02, 2022

Sadly, that didn't work. I couldn't select the OCR text and selectively delete it. However, it did lead to a rather convoluted process that gets me 90% of the way there. I just need help to finish it off.

1) Set the OCR to Searchable Image (Exact), and Recognize Text

2) Close, save, and open Preflight

3) Select the "Acrobat Pro DC 2015 Profiles" and search for OCR in the Preflight search

4) Select Make OCR Text Visible then click Analyze and Fix

5) Save and close Preflight

6) Open Layers panel

7) Command-Mouse Drag to select all OCR fields under images and delete.

Great, it now is exactly what I want, but I can't successfully export it. I am not sure if I have tried all the combinations of toggling the invisible OCR text layer on/off, and whether I flatten, merge, or leave it alone and then export to Word. But every attempt has either resulted in just the original page images, or the text overlaying the original page images. I did turn off the "recognize text where necessary (?)" option in the export, as it seemed reasonable.

I should mention that I am doing the post in Mac Pages if that matters (I suppose I could get Word - urp), and with great difficulty I can select the underlying images and leave the text, but this is not easy.

Thoughts? But in any case, thanks for the help so far. It definitely pointed me in the right direction.

Report · Jan 02, 2022

I have Pages but since I have Word, I just use that. So my knowledge of working in Pages is very weak. [I do some writing for a company where the layout artist uses Pages but the Boss uses Word. So, after I prepare some copy I send the Word version to the Boss and then open the Word version in Pages, save that out, and send that to the layout artist. I try and oblige when I can. ;>)]

I can suggest one other nuclear option, and it may seem like a lot of work but in the end may be the simplest:

1) Scan all the documents, make a copy of the whole thing and then OCR any way that gives you the best copy
2) Export that copy as plain text (.txt)
3) Bring that text into InDesign
4) Take the non-OCRed pages and select the ones with the other pages, and open them in Photoshop. In the crop feature, uncheck the box that says "Delete Cropped Pixels."
5) As you find things that need to be images, crop to the image, then do a "Save as..." to keep just the image (name it to the page number, if there are multiple images, then hyphen the number: e.g., 43-1, 43-2, etc.). If there are multiple images on a page, if you go back to the original page, they are still there. If you didn't uncheck that box in step 4, they'd be gone.
6) Now, reassemble the book in InDesign. You can reformat the book any way you want (if you do not know how to use Styles in InDesign, this is a great chance to learn how valuable they are).
7) Once done, save this as a PDF
8) Take the PDF, open it in Acrobat and save it out as a Word document so it can be updated by the person getting this.
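
For anyone who would rather script the bookkeeping, the page-number naming from the cropping step could be sketched roughly like this in Python. This is only an illustration: the function names `crop_filename` and `next_crop_path` and the default `png` extension are assumptions, not anything Photoshop or Acrobat provides, and for simplicity it always appends the hyphenated number, even when a page has a single image.

```python
from pathlib import Path


def crop_filename(page: int, index: int, ext: str = "png") -> str:
    """Build a name such as '43-1.png' for the index-th image cropped from a page."""
    if page < 1 or index < 1:
        raise ValueError("page and index are 1-based")
    return f"{page}-{index}.{ext}"


def next_crop_path(folder: Path, page: int, ext: str = "png") -> Path:
    """Return the first unused 'page-N' path in folder, so earlier crops are not overwritten."""
    index = 1
    while (folder / crop_filename(page, index, ext)).exists():
        index += 1
    return folder / crop_filename(page, index, ext)
```

Calling `next_crop_path(Path("crops"), 43)` repeatedly as crops are saved would hand out 43-1, 43-2, and so on, which keeps the images sortable by page when the book is reassembled.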

Yes, this is a lot of work but it will give you exactly what you want, it will look the best, and keep a copy of the original while allowing for the updating.

Now, one other thing. IF you're going to do this nuclear option, do try and get the best possible scans of the images. There is a rule in photography that you need to get the best possible image in the camera and then make it better in Photoshop. You cannot make a bad picture excellent in Photoshop but you can make any image "better" in Photoshop. When it comes to scanning, people overlook the fact that if you do all of the tricks and options to make a picture look good in the scanning software, any later enhancements in Photoshop will sometimes not even be necessary and the result from a proper scan is often much better than "just scan and fix in Photoshop" can possibly do. I've been doing this a long time and I know this to be true.

So if you want to be really really picky about this whole project, scan it for the text and then scan it for the pictures. However, since these were already scanned before you got there, it may not be possible to make them that much better, but they could be worse if you do not try and enhance at the time of capture.

HOWEVER, unless you have a scanner with software that provides for that, you may not have too many options. I've seen the scanning software from a number of newer scanner/printer/copier machines, and the options you have are "lighter," "darker," "more contrast," etc.; they do not have any Levels, Curves, or other features that are necessary to fix an image. So you may not have any options in this regard.

Either way, good luck! (and Happy New Year).

Report · Jan 02, 2022

I scanned the whole document at 600 dpi, grey-scale. Given that the source material in the book was not pristine, even this might be overkill! I am going to try to at least get better originals of newspaper articles and maybe some documents.

I might have found one other way, a continuation of what I started above. By turning off the visible layer, and on for the invisible text, I can select and delete the OCR for the images. By reversing that, I can delete the images for the text. Before I go nuclear, I think I am going to try a limited, conventional approach. But if that doesn't work out... we're going hot!

Thanks for the work flow. I am likely to try that also, at least for the experience!

Report · Jan 02, 2022

Your idea "I might have found one other way, a continuation of what I started above. By turning off the visible layer, and on for the invisible text, I can select and delete the OCR for the images. By reversing that, I can delete the images for the text."

Sounds like an interesting plan. Let me know how that works out.

Good luck!

Report · Jan 02, 2022

Good evening,

How about making backup copies?

In my early days, the text and images going from PDF to Word also got jumbled like this at one point...

Once and never again.

Since then I make backup copies in every format.

How about using the nicer Office instead of Word? But not the one from Microsoft.

Something else you can try is uploading the book to

https://calibre-ebook.com/

There I was able to rescue my example from above somewhat more easily.
Wishing you lots of luck.

Report · Jan 02, 2022

Hi @dancingjewel, I've done several of these types of jobs. Under Tools (more tools) find "Export your PDF to any format." I highly recommend exporting to Word first to clean up the text (or try rich text). Turning on invisible characters in Word and using find/replace is a big help.
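
If you end up with a plain-text export instead, the same kind of find/replace cleanup could be sketched in Python. This is a rough sketch under assumptions, not a recipe from Acrobat or Word: `COMMON_FIXES` is a made-up starter list, and `clean_ocr_text` is a hypothetical helper name.

```python
import re

# Hypothetical starter list of whole-word slips; extend it for the book at hand.
COMMON_FIXES = {"teh": "the", "adn": "and"}


def clean_ocr_text(text: str) -> str:
    """Apply a few mechanical fixes to OCR output exported as plain text."""
    # Rejoin words hyphenated across a line break: "scan-\nned" -> "scanned".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces; keep blank lines.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Replace whole words only, so "teh" inside a longer word is left alone.
    for wrong, right in COMMON_FIXES.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text
```

One caveat: the hyphen rule also joins genuine compounds that happen to break at a line end (for example "well-" followed by "known"), and the word list is case-sensitive, so the result still needs a read-through.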

In Acrobat, you have the option to export to Word with or without the images. Depending on the document, if you export with images, you can use those as placeholders. Then, export the image files separately. You can set a size to exclude as well. Then, you have a folder of image files you can retouch and convert in Photoshop, finally placing them into an InDesign file properly after mapping the Word document text in.

For documents and letters that are embedded in the PDF that won't extract directly as images, you can drag and highlight the area, right click, and save image as. Sometimes, the only option is BMP, which can bring on its own set of problems when trying to get them to look right. Either way, you will probably have to batch convert these types of images if there are a lot of them. Yes, these projects can become labors of love with many repetitive tasks, but there are some automations to take advantage of. I find them very satisfying! If I can help, please let me know.
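
As one possible way to do the batch conversion outside Photoshop, here is a minimal Python sketch. It assumes the third-party Pillow library is installed for the actual conversion; the function names, the PNG target, and the folder layout are illustrative assumptions only.

```python
from pathlib import Path


def plan_conversions(folder: Path, target_ext: str = ".png") -> list[tuple[Path, Path]]:
    """Pair every .bmp file in folder with the output path it would be converted to."""
    pairs = []
    for src in sorted(folder.iterdir()):
        if src.is_file() and src.suffix.lower() == ".bmp":
            pairs.append((src, src.with_suffix(target_ext)))
    return pairs


def convert_all(folder: Path, target_ext: str = ".png") -> int:
    """Convert each planned pair with Pillow, skipping outputs that already exist."""
    from PIL import Image  # third-party dependency (assumed installed): pip install Pillow

    done = 0
    for src, dest in plan_conversions(folder, target_ext):
        if dest.exists():
            continue  # never overwrite an image that was already retouched
        with Image.open(src) as img:
            img.save(dest)  # Pillow infers the output format from the extension
        done += 1
    return done
```

Running `plan_conversions` first shows what would be written before anything is touched, and the original BMP files are left in place.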

Report · Jan 02, 2022

I appreciate your advice! I am hoping it doesn't come to that. As I mentioned above, I made the OCR text visible and deleted any image where OCR had been done (thankfully, with my first Chapter, there were only about six images that caused the problem; the other images in this Chapter were seen as images).

So, I am left with a file with 2 layers: one with the page images, one with the now visible OCR text. The problem is now in the exporting, where I have either exported just the page images and no OCR, or the text overlaying the page images. When I export the original file before opening the layers, I get the text and images (but probably NOT the images that were OCR'd). Hadn't thought of that possibility.

Still, the question is can I now "combine" the layers some way to get back to the original format. I have tried both flatten and merge, and it hasn't worked. Trying to contact Adobe Chat, and it tells me it's a 5-10 minute wait. That was 95 minutes ago. Might just try calling tomorrow, not much else to do since we are expecting 6" of snow tonight.

Otherwise, perhaps I can try to NOT delete the OCR under images, but just export as is, and then I would only need to capture those images that weren't seen as images in the first go. That might save some time!

Again, thanks.