Participant

Question

Is this feature a speedy one? I need to quote a client for scanning a 275 page book.

January 26, 2019
1 reply
566 views

I have never used this feature - how long would you estimate it would take to scan a 275 page book?

I have no idea how to begin estimating my time!

Any input is appreciated!

Lynette

Scan documents and OCR


gary_sc

Community Expert

Hi Lynette,

Whew, that can vary all over the place because there are a number of issues.

My first question is: can you destroy the book? The reason I ask is that the binding will prevent you from opening the book far enough to get flat pages. If the book is softcover, you can probably flatten it on the scanner sufficiently, but again, you'll probably be doing damage to the book. If you cannot destroy the book, the quality of the scanning, and especially of the OCR (Optical Character Recognition), will be impaired.

Are you using a flatbed scanner, or do you have (or have access to) a sheet-fed scanner where you can place a stack of pages in the feeder and click a button? (These can scan both sides at the same time; flatbed scanners can only do one page at a time, and that takes a LOT longer.)

If you are looking at scanner specs, they will often talk about how fast they can scan a page. Please do not pay much attention to this number because it will always be based on a low-resolution scan. Resolution is an image's way of talking about information: the greater the resolution, the more information in the image. The more resolution, the better quality the OCR will be. So for quality scans, slower is better. (Plan on 300 ppi minimum, 600 ppi is a good max.) [ppi: pixels per inch]

But, assuming you can destroy the book: if you have a flatbed scanner, plan on about 1 minute per page once you have things set up. If you have a sheet-feeder scanner (like a FujiScan), plan on about 10 pages per minute (plus extra time for Acrobat to do the OCR).
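As a rough sketch of that arithmetic for a 275-page book: the per-page rates are just the estimates above, the function name and the optional setup allowance are made up for illustration, and OCR time is not included.

```python
def scan_minutes(pages, pages_per_minute, setup_minutes=0.0):
    """Estimate scanning time in minutes (scanning only, no OCR)."""
    return setup_minutes + pages / pages_per_minute


PAGES = 275  # size of the book in the question

# Rates taken from the estimates above; real scanners will differ.
flatbed = scan_minutes(PAGES, pages_per_minute=1)
feeder = scan_minutes(PAGES, pages_per_minute=10)

print(f"Flatbed: roughly {flatbed / 60:.1f} hours")
print(f"Sheet feeder: roughly {feeder:.0f} minutes")
```

For a quote, you would probably want to pad these numbers for cutting the binding, test scans, and re-feeding jammed pages.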

And one other question (because I have to ask), will you end up breaking any copyright laws with the end product of this scanning?

Does this help?

lynetteinakAuthor

Participant

Thank you. I CAN destroy the book, and I do have a multi-page feeder (guessing up to 20 pages, possibly?).

How much longer does the OCR take in relation to scanning?

The author's foundation is actually requesting to have the book digitized, so no worries there.

Lynette

gary_sc

Community Expert

Hi Lynette,

Sorry for the delay but I had some important commitments this morning. Free finally.

You haven't said if you are on a PC or Mac but if you are on the PC, do not bother to use Acrobat yet. If you are on a Mac, DO NOT bother to use Acrobat yet. (I'm a Mac user and the simple issue is that you cannot use a scanner's software on a Mac within Acrobat. I'll leave the story at that.)

There's a lot here and I advise you to read this all the way through before you start. This is NOT hard, it's just that I tend to cover the small points to make sure that you understand what to look for and expect. It's what I do. ;>)

Using your multi-feed scanner, set everything to go into one folder on your desktop. 20 pages per run is probably just fine. RTFM and verify, but you might be able to get more; it's your call to look and test.

Using their software, set it up for both sides scanning, set it up for at least 300, preferably 600 ppi, set it up to save as TIF format*, and set it up for auto naming/numbering so that you get "mybook-001.tif," "mybook-002.tif", etc.

Before you actually start, run a test group of about 10 pages to make sure you're facing the pages in the correct position and order. Also check the quality of the scan. I have no idea which bulk scanner you are using, nor can I know what its software is like, but if you can make any adjustments to the Levels as shown in this blog I wrote for Adobe, please do.

https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs

Once you are happy with the results, start from the beginning, reset any counters and do the whole book.

This may sound weird, but do not do (say) 50 pages and start processing; rather, do the entire book. The reason for this is that it's possible the counter in your software may base "the next number" on what's in the folder. If you removed the first 50, then the next group will start with -001, and that can screw up combining the whole group in the right order later on. [Trust me on this, I wasted a lot of time on this before I learned my lesson.]
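If you want to double-check the numbering before you combine anything, a small script can do it. This is only a sketch: it assumes the "mybook-001.tif" naming pattern from above, and the function name and the folder path in the usage lines are hypothetical.

```python
import re
from pathlib import Path


def find_numbering_problems(names, prefix="mybook"):
    """Return (missing, duplicated) page numbers among names like mybook-001.tif."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)\.tiff?$", re.IGNORECASE)
    # Keep only the files that follow the naming pattern; ignore everything else.
    numbers = sorted(int(m.group(1)) for n in names if (m := pattern.match(n)))
    if not numbers:
        return [], []
    seen = set(numbers)
    # Numbers that should exist between 1 and the highest number found.
    missing = [n for n in range(1, numbers[-1] + 1) if n not in seen]
    # Numbers that appear more than once (e.g. -001 written by two separate runs).
    duplicated = sorted({n for n in numbers if numbers.count(n) > 1})
    return missing, duplicated


# Hypothetical folder name; point this at wherever your scans actually go.
folder = Path.home() / "Desktop" / "mybook-scans"
if folder.is_dir():
    missing, duplicated = find_numbering_problems(p.name for p in folder.iterdir())
    print("Missing:", missing)
    print("Duplicated:", duplicated)
```

Two empty lists mean the sequence runs from 001 to the last page with no gaps or repeats.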

Now it's probable that your scanner's software will OCR the pages as it operates. That's fine but unnecessary, as I found that on the FujiScan the OCR was not only of poor quality but the size of the resulting documents was outrageous. Solving this is easy: simply open up Acrobat and select the "Enhance Scan" tool. Once that's open, select the "Or recognize text in multiple files" option.

That will pop up a new window and from the dropdown menu in the upper left, select "Add Folders."

Now locate the folder that has all of the images of the book's pages. If it asks whether you want to combine all of the pages into a single document, click OK, and if it asks to OCR the document, OK that as well.

Now go off to lunch, as this might take a while (an hour or two; I'm not sure how long, as your computer is probably different from mine, etc. YMMV).

You may wish to do a test run of this latter part on the test pages you scanned earlier (10-ish pages or so) just to make sure things are good.

Oh, one last warning/comment. OCR is an amazing piece of software capability. But keep in mind that it's also very dumb. It does NOT know that a hyphen at the end of a line means that the word with the hyphen and the word at the beginning of the next line are parts of the same word. So if you have "........ docu-" and the next line starts with "ment...........", those will end up as two unknown words. Just be aware.
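If you ever export the recognized text to a plain text file, the split words can be rejoined afterwards with a few lines of script. This is a sketch only, and not something Acrobat does for you: the function name is made up, and it will also join genuine hyphenated compounds that happen to break at a line end (so "well-" + "known" becomes "wellknown"), which means the result still needs a proofread.

```python
import re


def join_line_end_hyphens(text):
    """Join words split by a hyphen at a line break: 'docu-' + 'ment' -> 'document'."""
    # A word character, a hyphen, optional trailing spaces, a line break,
    # optional leading spaces, then another word character.
    return re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", text)


sample = "This is a scanned docu-\nment with a split word."
print(join_line_end_hyphens(sample))
```

Hyphens in the middle of a line are left alone, since the pattern only matches a hyphen followed by a line break.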

* There are several reasons for saving the documents as TIF documents. The advantage is that they are very good for storing all of the images' data. This is also the bad thing: they can be fairly large. A single page can easily be 7-8 MB in size BUT, once it's converted into a PDF (that's been OCRed), the size will end up being about 40 KB. While JPG documents will be smaller in initial size, they can have a lot of degradation caused by their compression, which is acceptable for photographs but can be dreadful for text documents.
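To get a feel for the disk space involved, here is a back-of-the-envelope sketch. The per-page sizes are the rough figures from the footnote above; the 8.5 x 11 inch page, the 8-bit grayscale setting, and the function name are assumptions for illustration, and real TIF sizes depend on the scanner's settings and compression.

```python
def uncompressed_megabytes(width_in, height_in, ppi, bytes_per_pixel=1):
    """Raw image size in MB (1 MB = 1,000,000 bytes) for one scanned page."""
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return pixels * bytes_per_pixel / 1_000_000


PAGES = 275

# Assumed page: 8.5 x 11 inches, 300 ppi, 8-bit grayscale (1 byte per pixel).
print(f"One raw page: about {uncompressed_megabytes(8.5, 11, 300):.1f} MB")

# Totals using the rough per-page figures from the footnote.
tif_total_mb = PAGES * 7.5       # midpoint of 7-8 MB per TIF page
pdf_total_mb = PAGES * 40 / 1000  # about 40 KB per OCRed PDF page
print(f"All TIFs: about {tif_total_mb / 1000:.1f} GB")
print(f"Final PDF: about {pdf_total_mb:.0f} MB")
```

Doubling the resolution to 600 ppi quadruples the pixel count, so make sure the scan folder has a few gigabytes free before you start.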

Let me know if this makes sense and how the whole thing works out.
