Does not OCR all pages

Report · Dec 13, 2018

I have a 200+ page document in which Adobe will not OCR a number of pages, sometimes skipping pages within the same appendix where it OCRs others. It is absolutely essential that the whole document be OCR'd. https://www.lp.org/wp-content/uploads/2018/09/2018_09_29-30_LNC_Minutes-approved.pdf

See specifically page 1 of Appendix A-1 and all but page 1 of Appendix F. There are more but these are some samples.

Report · Dec 14, 2018

Hi Carynd,

First off, thanks for supplying the document, it helped.

You do not say how this was scanned, which version of Acrobat you used, on what kind of computer, or with which scanner.

Nonetheless, I downloaded the document and verified that yes, not all of the document was converted into searchable text. So, I opened the document on my Mac and ran Text Recognition from the "Enhanced Scans" tab of tools. It took about 10 minutes but I now have the complete document fully searchable.
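For anyone who wants to check a long PDF for unsearchable pages without clicking through it, here is a minimal Python sketch. It is not what was done in this thread, just one possible way to do the check. The function names and the 50-character threshold are made up for illustration. The threshold is there because a page with only a short typed heading over a scanned body would otherwise count as searchable. The optional loader assumes the third-party pypdf library is installed.

```python
def pages_missing_text(page_texts, min_chars=50):
    """Return 1-based page numbers whose extracted text is shorter than min_chars.

    page_texts: one string (or None) per page, in page order.
    min_chars:  illustrative threshold, not a value taken from Acrobat.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Read the text layer of every page. Assumes pypdf is installed."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

Calling `pages_missing_text(extract_page_texts("minutes.pdf"))` (the file name is a placeholder) would list the pages that probably still need text recognition.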

I am curious as to what your process was for making this document, because the things you have at the top of a page (e.g., Appendix F Secretary's Report) are obviously digital text, while the text below that is scanned text that has been heavily JPG-compressed, with lots of degradation. If you look at the screenshot below and look at the black-on-white text, you'll see a lot of gray splotches all over the place. That is degradation caused by JPG's lossy compression. It will also degrade the quality of the OCR, as it can confuse the optical reading of the text.

In short, generally never use JPG as a format for saving unless you are using zero compression. But this was heavy, heavy compression, perhaps set as low as 30%. FWIW, the size of the document during scanning will have NO BEARING on the size of the final PDF. When I scan my documents and save as TIFs, each page is typically around 8-9 MB, but after saving them as PDFs and creating searchable documents, the page sizes are around 40-50 KB.
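To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Python. It only uses the per-page figures quoted above and the 273-page count of the document in this thread. It assumes 1 MB = 1000 KB, and the 40-50 KB figure comes from other scans, so the real result for this document may differ.

```python
PAGES = 273                 # page count of the document discussed in this thread
TIF_MB_PER_PAGE = (8, 9)    # scan size per page as TIF, from the post above
PDF_KB_PER_PAGE = (40, 50)  # size per page in the finished searchable PDF

# Total size of the intermediate TIF scans, in GB (low and high estimate).
tif_total_gb = tuple(PAGES * mb / 1000 for mb in TIF_MB_PER_PAGE)

# Total size of the finished PDF, in MB (low and high estimate).
pdf_total_mb = tuple(PAGES * kb / 1000 for kb in PDF_KB_PER_PAGE)

# How many times smaller a page gets: worst case and best case.
shrink_min = TIF_MB_PER_PAGE[0] * 1000 / PDF_KB_PER_PAGE[1]
shrink_max = TIF_MB_PER_PAGE[1] * 1000 / PDF_KB_PER_PAGE[0]

print(f"TIF scans: {tif_total_gb[0]:.1f} to {tif_total_gb[1]:.1f} GB")
print(f"Final PDF: {pdf_total_mb[0]:.1f} to {pdf_total_mb[1]:.1f} MB")
print(f"Each page shrinks roughly {shrink_min:.0f}x to {shrink_max:.0f}x")
```

So scanning to a large lossless format costs disk space only while you work; the finished PDF stays small.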

Most of the above has nothing to do with your original question, but it might point to the cause of some of the issues you may be having. I would appreciate your letting us know what your scanning and processing approach is, and we might be able to help you avoid these issues in the future.

If you wish to see your file after I processed it, here it is: Dropbox - 2018_09_29-30_LNC_Minutes-approved2.pdf

Lastly, here is a blog I wrote on how to get clean scanned documents using Acrobat:

https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs

I hope some of this helps.

Report · Dec 15, 2018

I am using the most current version of Adobe Pro but I am using a program called PDF edit as well that I think is causing the issue.

Let me explain my issue and perhaps you can recommend a better workflow. I have multiple reports given to me that I have to insert into minutes while keeping the original Word section breaks and page numbers, so I use PDF edit to break the PDFs apart into individual pages to insert as a group. It must be that program that is degrading the text.

I need a quick workflow to be able to put hundreds of appendixes into these minutes yet keep them decent quality. I appreciate the assistance.

Report · Dec 15, 2018

And I am using all Mac computers (I use several - on the version just prior to Mojave). No physical scanner used, it is all converted to PDF using either Adobe or Nuance - or sometimes the committees send me their reports already in PDF but I try to insist on getting the original Word versions.

Report · Dec 15, 2018

I read your blog and it was fascinating - I am chair of a historical committee as well and we do work with historical documents and that will help. Is the VueScan software decent? That is what I used when scanning historical documents because the native software on my Epson scanner is no longer supported on Mac (and since VueScan supports nearly everything I use it on all scanners).

Report · Dec 15, 2018

Hi Carynd,

Sorry for any delay in getting back to you, I was on a bike ride all morning.

VueScan is good software and for the price it's excellent. It's been a LONG time since I've looked at it as I've used SilverFast for many years. SilverFast is probably one of the best 3rd party scanning packages out there but it's not cheap AND it has a bit of a learning curve AND the support documents are not really as good as they should be.

However, that would be more critical if you were doing a lot of image scanning but since your primary focus is on documents, the nuances that SilverFast provides are not that critical.

Also, just about ANY software is better than Apple's "Image Capture." I've been using Macs since '85 so I obviously like them but Image Capture is bad on so many accounts I do not wish to waste any time with it.

If my memory serves me well, one thing about VueScan that I was not pleased with was that it didn't have the Levels adjustment that I prefer when working with printed documents: it's fast and easy. I think you had to work with contrast, which is a bit trickier. I can check on that if you'd like (no time at this moment).

If others are supplying the PDFs that you are inserting into the master document, please share with them the issue of JPG degradation. Unfortunately there is no going back on these operations. [BTW, this is why when working with images you never want to take your original JPG image, make changes, save as JPG, make more changes, again save as JPG, etc. Each time you re-save a JPG as a JPG, it gets degraded. If you want to make changes to your original images, first save out a copy in either PSD or TIF format and make changes up and down the wazoo; there will be no JPG degradation, and once you're done, THEN you can save it as a JPG to send to folks or whatever. But always keep your original image untouched. That way, if you learn new techniques or there are other ways to do what you wanted to do, you can always go back to the original image. Sorry for the digression.]
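The "edit a lossless copy, save lossy once at the end" advice can be illustrated with a toy model in Python. To be clear, this is NOT real JPG compression (JPG works on frequency blocks, not single values). Here `lossy_save` simply rounds each pixel value to a multiple of 16, and every name and number is an assumption chosen to make the rounding effect visible.

```python
def lossy_save(pixels, step=16):
    """Toy stand-in for a lossy save: snap each value to a multiple of `step`."""
    return [min(255, ((p + step // 2) // step) * step) for p in pixels]


def brighten(pixels, amount=5):
    """A small edit: raise every value by `amount`, clamped to 0..255."""
    return [max(0, min(255, p + amount)) for p in pixels]


def mean_error(a, b):
    """Average absolute difference between two pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


original = list(range(0, 256, 5))  # a simple gray ramp
EDITS = 4

# What the image should look like after all edits, with no loss at all.
ideal = original
for _ in range(EDITS):
    ideal = brighten(ideal)

# Workflow A: lossy save after every edit (the JPG-of-a-JPG habit).
resaved = original
for _ in range(EDITS):
    resaved = lossy_save(brighten(resaved))

# Workflow B: edit a lossless copy, lossy save once at the very end.
saved_once = lossy_save(ideal)

error_resaved = mean_error(resaved, ideal)
error_saved_once = mean_error(saved_once, ideal)
print("re-saved every time:", round(error_resaved, 1))
print("saved once at the end:", round(error_saved_once, 1))
```

In this toy model the small edits are rounded away at every intermediate save, so workflow A ends up further from the intended result than workflow B.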

Oh, I share your issues with Epson on legacy scanners but occasionally they do update things. It's worth a check to see if they have updated your scanning software for your scanner. FWIW, I do like Epson scanning software. It's not as good as SilverFast but for company software, it's not bad at all.

Oh, did you download my version of your document? If so I'll be deleting it from my computer's Dropbox.

Let me know if there are other things I can help with.

Report · Dec 16, 2018

Yes I downloaded - thank you. Do you know a program that can take a PDF and export it in one step into separate TIFFs for each page?

Report · Dec 16, 2018

It looks like PDFElement will do this - I wonder if you have any experience with that.

Report · Dec 16, 2018

Hi Carynd,

You already have it: Acrobat DC Pro. One BIG suggestion: be sure to export this into an empty folder. Otherwise you'll have (in the case of the document you sent me) 273 separate TIF documents. (continue below)

And if you want to save them as PDFs, select the Organize Pages tool and select the following.

But just out of curiosity, why do you wish to save them as TIFs? What workflow are you considering?
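If the export is ever done from a script rather than by hand, the empty-folder suggestion above can be enforced up front. This is a small Python sketch. The function name is made up, and ignoring hidden files such as the `.DS_Store` that macOS creates is my assumption about what "empty" should mean here.

```python
from pathlib import Path


def prepare_export_dir(path):
    """Create `path` if needed and refuse to use it unless it is empty.

    Hidden files (names starting with ".") are ignored, so a stray
    .DS_Store on a Mac does not count as content.
    """
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    leftovers = [p.name for p in folder.iterdir() if not p.name.startswith(".")]
    if leftovers:
        raise FileExistsError(
            f"{folder} already contains {len(leftovers)} item(s); "
            "pick an empty folder for the exported pages"
        )
    return folder
```

Point the export at the folder this returns, and the hundreds of page images cannot end up mixed in with other files.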

Report · Dec 17, 2018

I wish to put them into separate TIFFs as I need to insert each page (image) into an existing Word document without losing the headers and footers .... (i.e. the appendixes in the sample document we have been working with) (Word will only bulk insert image files and not PDF files - and of course, it won't bulk insert anything on a Mac)

Report · Dec 17, 2018

Ah, OK,

While I use Word often, any time I need to be able to format things within a document I go to InDesign. For just straight text that I do not need to worry about formatting, Word is fine. As such, I was unaware that you could not place a PDF into Word.

One thing you may wish to test: since the documents you are getting are overly compressed JPG documents, a lot of the damage has already been done to the text. If you save the documents as TIFs, no further damage will take place, which is obviously good.

What I'm not sure about is whether, if you were to save the JPGs that you receive as JPGs but with no compression and place them into your Word documents, (1) the appearance of the JPGs would be any worse, and (2) the final PDF would be the same size as or smaller than the ones you saved as TIFs.

The big question in this whole thing is whether the size would make any difference to your needs.

Thanks for putting up with all my questions.

Report · Dec 19, 2018

I will have to check into InDesign - as you can tell I am Secretary for a national political organization so this is not a one-off thing. BTW, I redid the document with TIFFs and then converted the whole thing into PDF and it OCR'd fine but NOT in Adobe. I got a good OCR through Nuance (which I think uses the ABBYY engine).

And for anyone reading, I had issues inserting multiple TIFFs into Word, but it could be because I was working over a workspace connection - I need to test on a direct desktop.

Report · Dec 19, 2018

Not fully sure what you mean by having "issues inserting multiple TIFFs into Word," beyond Word being a PIA when inserting anything into it (and trying to keep the formatting). Any time I have anything that requires that kind of thing I automatically go to ID.

Also, I'm not sure I understand what you are referring to by "a workspace connection." Do you mean a server?

Either way, when I received your document, I ran it through Acrobat on my desktop Mac with NO issue. Took a bit of time (it was a long document) but it came out just fine.

But whatever you work out that works for you is the best.