Searchable Image vs. Searchable Image (Exact) - Quality of OCR

Report · Jan 11, 2013

I work in the legal field, and have always used the "Searchable Image (Exact)" setting when running OCR in-house on document productions. I'm currently using Acrobat X when I do my own OCR.

I have a vendor who says they use Acrobat 9 "Searchable Image" for all their document productions. They say even though the actual image of the document is altered, they get better quality OCR results than they do with "Searchable Image (Exact)." They say the document is deskewed so text can be read better by the OCR engine.

My problem is that many documents are being altered. Architectural drawings in particular are tilting drastically to the right, the edge of the drawing is being clipped off entirely, and dotted lines are being added to the image itself -- apparently where the edge of the page used to be. This is unacceptable. They say the entire drawing is being flipped so that the first few lines of text on the drawing are horizontal. So if there is a small bit of text which is slanted, the entire drawing is tilted to adjust to that one piece of text.

Is it true that OCR is that much better using "Searchable Image" vs. "Searchable Image (Exact)"? I have to select one setting for all docs, and I'm inclined not to alter our documents in any way, even at the expense of OCR quality.

They also say that they're having trouble OCRing docs that I can OCR without a problem using both Acrobat 8 and Acrobat X. Does that make any sense? I switched directly from 8 to X so I'm not familiar with Acrobat 9.

Report · Jan 11, 2013

Once files are OCR'd using "Searchable Image" and SAVED, would you need to go back and find copies of the original pre-OCR files and reOCR them using "Searchable Image (Exact)"?

If having that "accurate representation" of the original document is a critical attribute then, yes, you need to fetch the pre-OCR'd PDFs and run them through Searchable Image (Exact).

Regardless of the version of Acrobat (4.x, 5.x, 6.x, 7.x, 8.x, 9.x, X or XI) the essential characteristic of Searchable Image (Exact) has not changed.
That is, the image does not undergo the "tweaking" provided by Searchable Image.

Consequently, PDFs processed through Searchable Image (Exact) by whatever Acrobat version will be acceptable.
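To make the "tweaking" concern concrete, here is a minimal sketch of why a deskew-style rotation cannot leave a page untouched. It is not Acrobat's algorithm; the function name `rotate_in_place`, the 9x9 grid, and the 15 degree angle are all assumptions chosen for illustration. It rotates a tiny raster about its center while keeping the canvas size, which is roughly what happens when a whole drawing is tilted to level one slanted piece of text.

```python
import math

def rotate_in_place(grid, degrees, fill=0):
    """Rotate a 2D list about its center with nearest-neighbor sampling,
    keeping the original canvas size (anything rotated off-canvas is lost)."""
    h, w = len(grid), len(grid[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            dx, dy = x - cx, y - cy
            sx = round(cx + dx * cos_t + dy * sin_t)
            sy = round(cy - dx * sin_t + dy * cos_t)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = grid[sy][sx]
    return out

# A hypothetical 9x9 "drawing" with a mark in each corner,
# standing in for detail near the page edge.
page = [[0] * 9 for _ in range(9)]
for y, x in [(0, 0), (0, 8), (8, 0), (8, 8)]:
    page[y][x] = 1

deskewed = rotate_in_place(page, 15)
marks_before = sum(map(sum, page))
marks_after = sum(map(sum, deskewed))
changed = deskewed != page

print(marks_before, marks_after, changed)
```

In this toy case the corner marks do not survive the rotation, which mirrors the clipped drawing edges described in the question. A zero degree rotation returns the grid unchanged, which is the behaviour you want from an "exact" setting.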

What happens if you process a PDF containing a scanned image of textual content through Searchable Image (Exact)?
The PDF gains a second "hidden layer" of OCR output. (I've done this - inquiring minds want to know eh? <g>)
Of course, using Acrobat Pro, one could remove an existing "hidden layer" of OCR output.

[Addendum: this removal applies to *all* hidden OCR layers present. If one is present it is removed; if more than one is present, they all go.]

The "click path" to the means of accomplishing this varies with the Acrobat version.

Because you work with PDFs where accurate presentation of the original document is of import, you may want to configure Acrobat such that you've some barriers to an "oops".

An example:

In Acrobat's Preferences select the "Convert To PDF" category then select TIFF.
Edit the settings.
Monochrome Compression --- CCITT G4
Grayscale Compression -- ZIP
Color Compression -- ZIP
RGB Policy: Preserve embedded profiles
CMYK Policy: Off
Grey Policy: Off
Other Policy: Preserve embedded profiles

Another example:

Configure Acrobat's Optimization Options
--| via Acrobat's Optimize Scanned PDF
--| via Optimization Options dialog presented after clicking the "Options" button in the Optimization pane in the configuration dialog associated with Create PDF from Scanner.

Untick "Automatic" (you don't want "Aggressive" or "Adaptive")
Tick "Custom"

For Compression
--| Color/Grayscale --> select "Lossless"
--| Monochrome --> select "CCITT Group 4"

Open Acrobat's Help and go to "Scan a paper document to PDF".
A close read of that topic will be useful.

You want to avoid processing any image of the document of record with lossy compression (which compresses by irreversibly discarding image data, thus 'corrupting' that 'accurate representation').
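As a rough illustration of the lossless versus lossy distinction, the sketch below uses Python's standard `zlib` module (Deflate, the algorithm family behind the "ZIP" option). The pixel data is synthetic and the "lossy" step is a simple gray-level quantization standing in for a real lossy codec, so treat both as assumptions rather than a model of what Acrobat does internally.

```python
import zlib

# Hypothetical 8-bit grayscale data standing in for a scanned page.
pixels = bytes((x * 7 + y * 13) % 256 for y in range(64) for x in range(64))

# Lossless: a Deflate round trip restores every byte exactly.
restored = zlib.decompress(zlib.compress(pixels, 9))
lossless_ok = restored == pixels

# Lossy stand-in: quantize each pixel to 16 gray levels before compressing.
# This is NOT JPEG; it only shows that discarded detail cannot be recovered.
quantized = bytes((p // 16) * 16 for p in pixels)
lossy_restored = zlib.decompress(zlib.compress(quantized, 9))
lossy_ok = lossy_restored == pixels

print(lossless_ok, lossy_ok)  # prints: True False
```

The lossless round trip gives back the original bytes, while the quantized version can never be turned back into the original, however it is stored afterwards.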

You want to avoid filtering that alters the image.

Be well...

Message was edited by: CtDave


Report · Jan 11, 2013

Searchable Image *does* alter the image.
That is why PDF content that is a scanned image and falls into the "legal record" bin needs to be processed for OCR only with Searchable Image (Exact).

Again, Searchable Image alters the image (the "record"); that is how the scanned image gets made pretty (better as-viewed presentation).

From Acrobat 5.x Full through Acrobat X, the only significant change in Acrobat's OCR capabilities has been in OCR recognition accuracy.

Given that your documents fall in the "legal record" bin, what your vendor is doing results in documents that are not accurate representations of the original document.

Passing on such documents may incur some measure of legal liability.

Be well...

Report · Jan 11, 2013

Thanks so much for the quick response!!

Once files are OCR'd using "Searchable Image" and SAVED, would you need to go back and find copies of the original pre-OCR files and reOCR them using "Searchable Image (Exact)"?

If a set of documents has been OCR'd using "Searchable Image (Exact)" in Acrobat 8 or 9, does rerunning OCR on those same files using Acrobat X improve the quality of the OCR? I've got lots of productions that were OCR'd in Acrobat 8 before I upgraded to Acrobat X. And someone just sent me a large production OCR'd in Acrobat 9.
