
I need to scan 300 file boxes of hardcopy files and convert each one into searchable .pdfs. I'm looking for any hardware and software recommendations

New Here,
Oct 17, 2018

I need to scan 300 file boxes of hardcopy files and convert each one into searchable PDFs. I'm looking for any hardware and software recommendations. We usually use a Fujitsu ScanSnap iX500 for documents and a ScanSnap SV600 overhead scanner for plans and drawings. This method automatically combines the images into one PDF.

That works fine for a file here and there, but the volume of this upcoming job (300 legal-size file boxes) will require a more robust production-type scanner, which will not work in conjunction with the SV600 overhead scanner. I am told that the separate PDFs must be sent to the desktop and assembled/combined there. What are the hardware recommendations for the desktop necessary to support this? Is there an easier way? Any recommendations would be greatly appreciated.

Thanks!

TOPICS
Scan documents and OCR
Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
Community Expert,
Oct 18, 2018

Hi James,

Several years ago I also had a gob of items to scan, and I was lent a FujiScan scanner (I don't remember which one right now). For my purposes, it did fine. Actually, it did spectacularly, because before I was lent that scanner I was doing this on a flatbed. ('Nuff said.)

What I did want to add to your information was one particular discovery: the FujiScan would both scan AND OCR each page (both sides) all at once, very quickly. However, I found that the quality of the OCR was not very good and the storage size of the documents was very large. So what I did was scan away like crazy, and at the end of the day I'd point Acrobat at "that folder" of images to OCR and leave for home. When I got back they were all done, and I then started the next day's activities.

To get high-quality OCR, there are two things to consider. One is scanning: please be aware that faster is not always better. Faster typically means lower resolution, and for quality OCR, higher resolution will give better results. If you have letter combinations such as "ri," a lower resolution may cause this to show up as "n."

The other thing to consider is the quality of the OCR software and Acrobat's is very good. FujiScan not so much.

So yes, you have a gazillion boxes to scan, but having a machine do them at top top top speed may not give you the results you want. FWIW, I found that running the scanner at 600 ppi did make it run a bit slower but increased the accuracy of the OCR significantly. But, as I mentioned, Acrobat is not fast when it processes the pages, and it has a VERY ANNOYING habit of popping to the top of your screen after each page. So trying to do something like reading email became almost impossible, and that's why I'd save it for the end of the day and go home to let the machine do what it needed to do.

If money is no object and speed is important, buy several of these machines and hire some part-time folks to crank away on them. Then, if you no longer need all of the scanners, donate some of them to a local charity. They will be grateful.

Hope that helps!

New Here,
Jul 30, 2019

Oh, how I would love to hear what became of your quest!

I am associated with a group of WWII army veterans and have volunteered to take on a project that will require a lot of scanning and subsequent OCR. Our 95-year-old Historian has loaned me about six storage boxes of priceless information -- I estimate between 5 to 10 thousand pages -- diaries that have never been published, After Action Reports, rosters, unit histories, battlefield maps, etc.  My goal is to scan, convert using OCR, index all of the names and places, then add it to our website for anyone to use.

A few years back, I volunteered to OCR and index over 70 years of our magazine, but, thankfully, someone else had already made '.jpg' files of all of the page images.

What scanner did you settle on? I have an 'all-in-one' HP, but realize that will take me long past my expiration date! I'm just now starting to look for a scanner (e.g., the Fujitsu SP-1425), but I don't trust the posted scan speeds (and I did see Gary's note to you re "faster is not always better").

I also have OmniPage Ultimate, Power PDF Advanced, Adobe Photoshop, and some software that I have written. Am also curious if you are able to have each scanned page saved as an individual .jpg?

Once I get the desired process working and documented, I want to hire someone to wade through these boxes.

Any other suggestions or links to resources would be greatly appreciated!

Thank you!  

Community Expert,
Jul 30, 2019

Hi Wayne,

What a GREAT and IMPORTANT project. I do wish you well.

A couple of points and thoughts:

Your 70 years of your magazine: In my organization, someone had likewise volunteered to take care of our journal and had paid a "professional" company to do the work. I only saw the result long after the work was done, and I was very disappointed: it had been scanned to JPG at low resolution, so zooming in showed not only immediate pixelation but also the JPG degradation caused by scanning straight to JPG.

Here's what I'm talking about:

[Image: 2019-07-30_13-13-57.png]

You've probably seen this when a co-worker makes a flyer about an upcoming social event and sends it out as a JPG as opposed to a PDF.

JPG is fine for the last stop for images but when you have a dark color against a light color, such as what you have with text, the jpg degradation is very ugly.

When you're doing this scanning, it is very important to scan to the TIF format. Yes, you will be blown away by the size of the images, but do not worry. My regular scanner is a flatbed Epson (V800 Photo), and when I scan a standard letter-sized page at 100% at 300 ppi, the file size is around 8+ MB. But after converting to a PDF via Acrobat Pro, a full page of text (no images) comes to about 70-150 KB. As you add images, the size will go up, but not tremendously.
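To put numbers on those TIF sizes, here is a rough back-of-the-envelope sketch. It assumes an uncompressed scan and ignores file headers; the function name and the bit depths are illustrative, since the post does not say which color mode was used. The "8+ MB" figure happens to match an 8-bit grayscale letter page at 300 ppi.

```python
def uncompressed_scan_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Approximate size of an uncompressed raster scan, ignoring file headers."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bits_per_pixel // 8


LETTER = (8.5, 11.0)  # page size in inches

# Assumed color modes; the thread does not state which one was used.
MODES = (("1-bit black and white", 1), ("8-bit grayscale", 8), ("24-bit color", 24))

for ppi in (300, 600):
    for label, bits in MODES:
        size = uncompressed_scan_bytes(*LETTER, ppi, bits)
        print(f"{ppi} ppi, {label}: {size / 1_000_000:.1f} MB")
```

Doubling the resolution from 300 to 600 ppi quadruples the pixel count, so plan disk space accordingly if you keep the TIF masters.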

I appreciated your comments about using your scanner to scan all that's there taking you beyond your expiration point. Over my years of photography, I had some 10,000 slides that I wanted to digitize. To get a good quality, high-resolution scan can easily take about 5 minutes each. That also would have taken me past my expiration point. What I ended up doing was to photograph each of the slides. I wrote a complete description of this process and Adobe posted this here:

https://forums.adobe.com/community/creativepipeline/blog/2017/06/30/digitizing-your-slides-by-photog...

The important issue about scanning is that it is not all that different from photography: the better the quality of the original capture, the less "photoshopping" you need to do, and the higher the quality of the end result.

I was delighted to see that you appreciated my comment "faster is not always better." That must always be weighed against the realities of time. It's a compromise that only you (or whoever is doing the work) can balance.

With that in mind, I do suggest that you set aside (say) 10 pages, scan these exact same pages at various levels of quality, and then take them all the way through the OCR process. When you are done, convert them to a Word document in Acrobat and look for those telltale red underlines showing questionable spelling. Those will be either (1) proper nouns it doesn't recognize, (2) hyphenated words (e.g., var-ious), or (3) actual OCR mistakes (e.g., "ri" being seen as an "n"). In a Word document it's much easier to see and review the mistakes.
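If you would rather script that comparison than eyeball it in Word, the same triage can be sketched in a few lines of Python. This is not what Word's spell checker does internally; it is a simple stand-in that assumes you have the OCR output as plain text and some word list of your own (`known_words` and the function name are hypothetical).

```python
import re


def flag_suspect_words(ocr_text, known_words):
    """Group words missing from known_words into three rough buckets."""
    suspects = {"capitalized": [], "hyphenated": [], "other": []}
    # Match alphabetic words, keeping internal hyphens (e.g. "var-ious").
    for token in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", ocr_text):
        if token.lower() in known_words:
            continue
        if "-" in token:
            suspects["hyphenated"].append(token)   # likely line-break hyphens
        elif token[0].isupper():
            suspects["capitalized"].append(token)  # likely names and places
        else:
            suspects["other"].append(token)        # likely real OCR errors
    return suspects


# Tiny made-up word list and sample line, for illustration only.
known_words = {"the", "various", "reports", "were", "written", "by", "in", "morning"}
sample = "The var-ious reports were wntten by Kowalski in the morning"
print(flag_suspect_words(sample, known_words))
```

Run the same pages through at each scan setting and compare the size of the "other" bucket; the setting with the fewest entries there is probably giving you the cleanest OCR.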

I am also going to suggest you look at this other blog I wrote that Adobe published on how to create a clean scan to get a clean PDF:

https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs

One thing that photography has taught me is that EVERYTHING in life is a compromise. If you want to take a photo of your kids in front of the mountain you have a choice as to whether you take a photo of the mountain with your tiny kids in front of it or a photo of your kids with a part of the mountain in the background. You cannot have both. (Well, you can if you make a billboard-sized print out of the photo...)

You know the amount of time you have, you know your knowledge and experience. The good news is that by the time you finish this project you'll be on these boards helping others. (And I look forward to the help!)

Please share your progress on this project and thank you for doing it!

New Here,
Jul 30, 2019

Wow, GARY!  I feel like I won the lottery with the wealth of valuable information and the details provided in the links!

Either you have ESP, or you have my house bugged! Later on I was going to ask about digitizing photos and negatives, yet another item on my radar. Yes, I used my Nikon DSLR with a macro lens to 'copy' old family photos, some from the 1880s. But there are several storage bins of old photos, some from WWII, lots from my earlier years, that I sometimes hear screaming at me to take action! Your links are priceless!

Now, back to the scanning process: I certainly made many of the mistakes that you mentioned when processing newspaper clippings, but I am eager to try again with the hundreds of yellowed WWII-related clippings I now have.

As I document my procedures, I will condense all of your suggestions into a checklist.  

Thank you, thank you, thank you!

Community Expert,
Jul 30, 2019

You are VERY welcome.

I am delighted to know that I am helping folks; it helps to make this worthwhile.

Now for some disappointing news: at this point in time, neither Photoshop, Adobe Camera Raw, nor Lightroom (which is ACR with a database) knows how to properly deal with a photograph of a negative. It might sometime in the future, because just about every photographer I know has bugged Adobe about adding that capability to these applications. It "can" be done now, but you need to use Curves backwards, which means everything you've learned about working with Curves has to be unlearned, flipped, and learned again.
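To see why the Curves adjustments feel backwards, here is a deliberately simplified sketch using single 8-bit values. It assumes a plain inversion (positive = 255 - negative); the function names are made up for illustration, and real color negatives need more than this, for example correcting for the orange mask on the film.

```python
def invert(value):
    """Turn an 8-bit negative value into its positive counterpart."""
    return 255 - value


def brighten(value, amount):
    """Raise a value on the negative, clamped to the 8-bit range."""
    return min(255, value + amount)


negative_value = 100
print(invert(negative_value))                # 155
print(invert(brighten(negative_value, 40)))  # 115
```

Raising a value on the negative lowers it on the positive, so every move you would normally make to lighten an image darkens it instead, and vice versa.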

In addition, the higher-priced scanning software (I'm thinking SilverFast here) has built in tonality to give you the same "look" that 25 ASA Kodachrome provided which will be different than 100 ASA Fujichrome.

At this point in time, there is no indication that native processing of photographed negatives in Photoshop will have that level of depth. So, for now, scanning your negatives will give you better results. However, if you're like me and have ginormous amounts of negatives (and few if any positives from those negatives), you'd be better off photographing them just to get them digitized, and then scanning the unique images that warrant the time as you come across them.

You can always PM me if you want to keep me up to date on your progress.

Best,

Gary
