I scanned old internal reports to hundreds of PNGs. I need to turn each batch of PNGs into a single OCRed PDF using Acrobat. I am wondering which of the following two options is better:
1. Combine and OCR: combine the PNGs into a single PDF, then OCR that PDF.
2. OCR and Combine: convert each PNG to its own OCRed PDF, then combine those PDFs into one.
Preliminary testing (with 10 pages) seems to show that option 1 (Combine and OCR) yields slightly smaller PDFs, but I am not sure why, or whether I can expect the same at a larger scale.
I assume that if I combine the PNGs to a single PDF and I OCR this PDF, the OCR engine can perhaps better optimize the fonts, the document overhead, etc. than when combining already OCRed PDFs into one larger PDF.
Can anyone with more experience offer advice or thoughts?
In the grand scheme of things it doesn't make a bit of difference.
However, let me give you an interesting tip: if you save your documents as PNGs or JPGs and drag them onto Acrobat, they will be converted into PDFs, and the process stops there. However, if your documents are saved as TIFs and you drag them onto Acrobat, they will be converted into PDFs AND OCRed in the same process.
And, if you drag a bunch of TIFs onto Acrobat a window will pop up and ask if you want all of these saved into one document or separate documents — and you can do either one.
Now, I state that you can drag them onto the Acrobat icon but you can get the same result if you select either of these options here:
Now, one last hint: if you make sure the names of your documents are in the proper order before you do the PDF-ing, you will not have to worry about their order in the final document. That is, if the names of your files are Joe.tif, Sam.tif, Mary.tif, there's no telling how they will land in your new single PDF. But if you number them (01-Sam.tif, 02-Joe.tif, etc.), then their order is secured. Oh, one more on this issue: if you scan into a folder, typically the auto-numbering starts with the 2nd item. That is, noname.tif, noname-2.tif, noname-3.tif, etc. If you remember, it's important to fix the first one and call it noname-1.tif. Otherwise that first one will be at the end of your combined document.
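For anyone who would rather script the renaming than do it by hand, here is a minimal Python sketch of the idea. It assumes the scanner's "name.tif, name-2.tif, name-3.tif" pattern described above; the function name numbered_names and the rule that an unnumbered file counts as page 1 are my own assumptions, not anything Acrobat does. It only computes the new names, so you can review them before renaming anything.

```python
import re
from pathlib import Path

def numbered_names(names):
    """Return (old, new) pairs with a zero-padded page suffix on every file.

    Assumption: a file with no "-<number>" suffix is the first page.
    """
    if not names:
        return []
    pattern = re.compile(r"^(?P<stem>.*?)(?:-(?P<num>\d+))?$")
    parsed = []
    for name in names:
        path = Path(name)
        match = pattern.match(path.stem)
        num = int(match.group("num")) if match.group("num") else 1
        parsed.append((match.group("stem"), num, path.suffix, name))
    # Pad to at least two digits so that 2 sorts before 10 alphabetically.
    width = max(2, len(str(max(num for _, num, _, _ in parsed))))
    parsed.sort(key=lambda item: (item[0], item[1]))
    return [(old, f"{stem}-{num:0{width}d}{suffix}")
            for stem, num, suffix, old in parsed]

if __name__ == "__main__":
    for old, new in numbered_names(["noname-10.tif", "noname.tif", "noname-2.tif"]):
        print(old, "->", new)
```

To apply the plan, you could call Path(old).rename(new) for each pair inside the scan folder. Try it on a copy of the folder first, and note that the sketch does not check for a folder that already contains both noname.tif and noname-1.tif.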
Oh, one last last thing: here's a blog I wrote for Adobe years ago, you might find some scanning tips:
Hi gary_sc, thank you very much for all your tips. I did not know about TIF auto-OCR vs PNG... that's interesting.
I did indeed name my files in the right order.
I completed a 400-page report following both ways ("Combine and OCR" and "OCR and Combine") and the difference was significant. The former ("Combine and OCR") was 20% smaller than the latter. Furthermore, the latter gave a "The font 'Times New Roman-Italic-110869' contains bad /Widths" error when going through the PDF, which did not occur with the former.
It probably depends on the type of document (ratio of images vs text, quality of the individual files, pre-preparation of the files, etc.), but it would seem that combining into one large PDF and OCRing afterwards may not only result in a somewhat smaller file for the same final quality, but also avoid strange font errors.
Oh, and regarding your blog, it was very interesting but I remember having read it a few years ago...
Do you have any recommendations on the ideal resolution? I experimented with 300 dpi versus 600 dpi, black and white versus grey scale versus colour, and so far, 600 dpi grey scale seems to give me the best results provided that I adjust the scanner's histogram to eliminate the greyish background, pretty much like you explain it in your blog.
First off, a BIG THANKS for your observation on the process order. That might have something to do with the size of your document. I've never processed a document so large, and the few tests I've done in the past were probably not large enough to display such a difference. Oh, and a big thanks for appreciating my blog.
Again, the TIF versus PNG (or JPG) behavior was a fairly recent discovery. I found it by accident and at first thought that Acrobat was broken or something (:>)).
Now, your last points: first, 600 versus 300. 600 is SUPPOSED to give a higher quality, but it's been my observation that the best result depends upon the media, that is, the original document. Because you do many-page documents, do a test run (everything is the same but the ppi) on the first page and see what happens. As I suggested in my blog, copy the text from each test and paste it into Word: which one gives you the most red underlines? Then use the one that gave you the fewest underlines. However, if the document happens to have any very small text, then 600. Period.
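If pasting into Word gets tedious over many test runs, the same comparison can be roughly approximated in a script. This is only a sketch under stated assumptions: known_words is a word list you supply yourself (any plain dictionary file read into a set would do), and count_unknown_words and best_scan are hypothetical helper names. It counts words that are not in the list, which is a crude stand-in for Word's red underlines, not the same spell checker.

```python
import re

def count_unknown_words(text, known_words):
    """Count alphabetic tokens in the OCR text that are not in known_words."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return sum(1 for token in tokens if token not in known_words)

def best_scan(ocr_texts, known_words):
    """Return the label of the test scan with the fewest unknown words.

    ocr_texts maps a label (for example "300 ppi") to the text copied
    from that test scan's OCRed PDF.
    """
    return min(ocr_texts,
               key=lambda label: count_unknown_words(ocr_texts[label], known_words))

if __name__ == "__main__":
    known = {"the", "quick", "brown", "fox"}
    tests = {
        "300 ppi": "The qu1ck brovvn fox",
        "600 ppi": "The quick brown fox",
    }
    print(best_scan(tests, known))
```

Proper names and technical terms will be counted as unknown in every test scan alike, so it is the difference between the counts that matters, not the absolute numbers.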
As far as black and white versus grayscale: use grayscale. I'm sure you've zoomed in and seen the pixelation of black and white. Any time the shape of a letter is imperfect, issues may happen. (Although 600 ppi will give you smaller pixels and therefore better results. But again, it's worth testing to see how that document will perform.) Grayscale introduces antialiasing into the image and that can (and should) improve OCRing. As far as color, it will increase the storage size of the document by a factor of 3 (red plus blue plus green as opposed to just black), so unless there are elements of the document whose color you need to retain, just go grayscale. To reiterate, capturing color, by itself, should have no effect on the quality of the OCR.
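To put rough numbers on the storage side, here is a small Python sketch of the arithmetic for uncompressed image data. The page size (US Letter, 8.5 x 11 inches) and the bit depths (1 bit for black and white, 8 for grayscale, 24 for color) are assumptions for illustration. Real PNG, TIF and PDF files are compressed, so actual file sizes will be smaller and the ratios will not hold exactly.

```python
def raw_page_bytes(ppi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page at the given settings."""
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return pixels * bits_per_pixel // 8

if __name__ == "__main__":
    for ppi in (300, 600):
        for label, bits in (("black and white", 1), ("grayscale", 8), ("color", 24)):
            megabytes = raw_page_bytes(ppi, bits) / 1_000_000
            print(f"{ppi} ppi {label}: {megabytes:.1f} MB")
```

Before compression, color is 3 times the size of grayscale at the same ppi, and going from 300 to 600 ppi quadruples the pixel count because both the width and the height double.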
Best and good luck!