Searchable Image vs. Searchable Image (Exact) - Quality of OCR

Community Beginner ,
Jan 11, 2013 Jan 11, 2013

I work in the legal field, and have always used the "Searchable Image (Exact)" setting when running OCR in-house on document productions.  I'm currently using Acrobat X when I do my own OCR. 

I have a vendor who says they use Acrobat 9 "Searchable Image" for all their document productions.  They say even though the actual image of the document is altered, they get better quality OCR results than they do with "Searchable Image (Exact)."  They say the document is deskewed so text can be read better by the OCR engine. 

My problem is that many documents are being altered.  Architectural drawings especially are tilting drastically to the right, the edge of the drawing is being clipped off entirely from the document, and dotted lines are being added to the image itself -- apparently where the edge of the page used to be.  This is unacceptable.  They say the entire drawing is being rotated so that the first few lines of text on the drawing are horizontal.  So if there is a small bit of text which is slanted, the entire drawing is tilted to adjust to that one piece of text. 

Is it true that OCR is that much better using "Searchable Image" vs. "Searchable Image (Exact)"?  I have to select one setting for all docs, and I'm inclined not to alter our documents in any way at the expense of OCR. 

They also say that they're having trouble OCRing docs that I can OCR without a problem using both Acrobat 8 and Acrobat X.  Does that make any sense?  I switched directly from 8 to X so I'm not familiar with Acrobat 9.

TOPICS
Scan documents and OCR
LEGEND ,
Jan 11, 2013 Jan 11, 2013

Searchable Image *does* alter the image.
That is why PDF content that is a scanned image and falls into the "legal record" bin needs to be processed for OCR only with Searchable Image (Exact).

Again, Searchable Image alters the image (the "record"); that is how the scanned image gets made pretty (better as-viewed presentation).

Across Acrobat's OCR capabilities from Acrobat 5.x Full through Acrobat X, the only significant change has been in OCR recognition accuracy.

Given that your documents fall in the "legal record" bin, what your vendor is doing results in documents that are not accurate representations of the original document.

Passing on such may incur some measure of legal liability.

Be well...

Community Beginner ,
Jan 11, 2013 Jan 11, 2013

Thanks so much for the quick response!!

Once files are OCR'd using "Searchable Image" and SAVED, would you need to go back and find copies of the original pre-OCR files and reOCR them using "Searchable Image (Exact)"?

If a set of documents has been OCR'd using "Searchable Image (Exact)" in Acrobat 8 or 9, does rerunning OCR on those same files using Acrobat X improve the quality of the OCR?  I've got lots of productions that were OCR'd in Acrobat 8 before I upgraded to Acrobat X.  And someone just sent me a large production OCR'd in Acrobat 9.

LEGEND ,
Jan 11, 2013 Jan 11, 2013

Once files are OCR'd using "Searchable Image" and SAVED, would you need to go back and
find copies of the original pre-OCR files and reOCR them using "Searchable Image (Exact)"?

If having that "accurate representation" of the original document is a critical attribute then, yes, you need to fetch the pre-OCR'd PDFs and run them through Searchable Image (Exact).

Regardless of the version of Acrobat (4.x, 5.x, 6.x, 7.x, 8.x, 9.x, X or XI) the essential characteristic of Searchable Image (Exact) has not changed.
That is, the image does not undergo the "tweaking" provided by Searchable Image.

Consequently, PDFs processed through Searchable Image (Exact) by whatever Acrobat version will be acceptable.

What happens if you process a PDF containing a scanned image of textual content through Searchable Image (Exact)?
The PDF gains a second "hidden layer" of OCR output. (I've done this - inquiring minds want to know eh? <g>)
Of course, using Acrobat Pro, one could remove an existing "hidden layer" of OCR output.

[addendum: this removal is of *all* present - if  one present it is gone, if more than one then all go]

The "click path" to the means of accomplishing this varies with the Acrobat version.

Because you work with PDFs where accurate presentation of the original document is of import, you may want to configure Acrobat such that you've some barriers to an "oops".

An example:

In Acrobat's Preferences select the "Convert To PDF" category then select TIFF.
Edit the settings.
Monochrome Compression -- CCITT G4
Grayscale Compression -- ZIP
Color Compression -- ZIP
RGB Policy: Preserve embedded profiles
CMYK Policy: Off
Grey Policy: Off
Other Policy: Preserve embedded profiles

Another example:

Configure Acrobat's Optimization Options
--| via Acrobat's Optimize Scanned PDF
--| via Optimization Options dialog presented after clicking the "Options" button in the Optimization pane in the configuration dialog associated with Create PDF from Scanner.

Untick "Automatic" (you don't want "Aggressive" or "Adaptive")
Tick "Custom"

For Compression
--| Color/Grayscale --> select "Lossless"
--| Monochrome --> select "CCITT Group 4"

For Filtering
--| review Acrobat's Help to read the discussion on the choices available.
(Deskew | Background removal | Edge shadow removal | Despeckle | Descreen | Halo Removal  | Text Sharpening)

Open Acrobat's Help and go to "Scan a paper document to PDF".
A close read of the information will be informative and useful.

You want to avoid processing any image of the document of record with lossy compression (which compresses by destructive removal of pixels - thus 'corrupting' that 'accurate representation').
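
The difference between lossless and lossy handling can be checked with a few lines of Python. This is only a sketch: it uses zlib (the same Deflate family that PDF "ZIP"/Flate compression is based on) for the lossless round trip, and a crude grey-level quantizer as a stand-in for lossy compression. The quantizer is an illustration, not what Acrobat or JPEG actually does, and the function names are made up for this example.

```python
import hashlib
import zlib


def roundtrip_lossless(pixels: bytes) -> bytes:
    """Compress and decompress with Deflate; every byte should come back."""
    return zlib.decompress(zlib.compress(pixels, 9))


def roundtrip_lossy(pixels: bytes, step: int = 16) -> bytes:
    """Toy stand-in for lossy compression: snap each grey value down to a
    multiple of `step`, discarding the fine detail for good."""
    return bytes((p // step) * step for p in pixels)


def fingerprint(data: bytes) -> str:
    """SHA-256 digest, handy for proving two pixel buffers are identical."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    # Hypothetical 8-bit greyscale samples standing in for a scanned page.
    pixels = bytes(range(256)) * 4
    print("lossless identical:", fingerprint(roundtrip_lossless(pixels)) == fingerprint(pixels))
    print("lossy identical:   ", fingerprint(roundtrip_lossy(pixels)) == fingerprint(pixels))
```

Comparing fingerprints before and after processing is one simple way to document that an image of record came through untouched.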

You want to avoid filtering that alters the image.


Be well...

Message was edited by: CtDave

Community Beginner ,
Jan 12, 2013 Jan 12, 2013

Thanks so much CtDave.  You've been so helpful.  I asked them to change their settings to "Searchable Image (Exact)" and rerun the entire set of 22,000 original docs through OCR.  I did a couple of tests, and I didn't see that the quality of the OCR using "Searchable Image" was better than "Searchable Image (Exact)".   Is what they're saying about the results of OCR being better using "Searchable Image" incorrect? 

LEGEND ,
Jan 12, 2013 Jan 12, 2013

Does your vendor have a properly built sampling plan, in accordance with ANSI/ASQ Z1.4, Sampling Procedures and Tables for Inspection by Attributes, that objectively identifies a better OCR output?

If such has been built and performed I'd be interested in a look-see. The key is a proper "build", execution and documentation.


Failing that I'll go with the past 15 years of "empirical evidence" coupled with a chunk of time actually grubbing under the hood to see what one gets when a common set of source PDFs are processed through each.
fwiw - I've not encountered a superiority of one OCR output over the other.
I have observed that the image is "prettier" after use of Searchable Image.
This is expected once one understands what the process does.
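
For anyone who wants to run a side-by-side comparison of the two OCR modes, here is a rough Python sketch. It assumes you have exported the OCR text from each mode to plain text and have a hand-checked reference transcription for each sampled page. The function names, the seed and the sample size are placeholders of mine; the sample size is not derived from any standard, and this is no substitute for a formal sampling plan.

```python
import difflib
import random


def similarity(reference: str, ocr_text: str) -> float:
    """Return a 0..1 similarity ratio between reference text and OCR output."""
    # Collapse whitespace so line-wrap differences are not counted as errors.
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    return difflib.SequenceMatcher(None, ref, ocr, autojunk=False).ratio()


def pick_sample(doc_ids, sample_size, seed=2013):
    """Draw a reproducible random sample of document ids to review."""
    ids = list(doc_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(sample_size, len(ids)))


def compare_modes(reference: str, exact_text: str, searchable_text: str) -> dict:
    """Score both OCR outputs for one page against the same reference."""
    return {
        "exact": similarity(reference, exact_text),
        "searchable_image": similarity(reference, searchable_text),
    }


if __name__ == "__main__":
    # Hypothetical production of 22,000 documents, 50 pulled for review.
    print(pick_sample(range(1, 22001), 50)[:5])
    print(compare_modes("Add 0.5 tsp salt", "Add 0.5 tsp salt", "Add 5 tsp salt"))
```

Score the same sampled pages from both modes against the same reference; if the two sets of scores are indistinguishable, the "better OCR" claim has nothing behind it.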

Other variables have more significance / impact on OCR output accuracy.


22,000 documents - Yikes! That's a fair dinkum of documents.
Just confirms my strongly held conviction that contracts that involve "eDoc" processing by any discipline are significantly improved by having an "eDoc" wrangler available as a subject matter expert when wordsmithing the contract.
It'd save time and money, reduce aggravations/negative "vibes", and on the whole promote harmonious collaboration. Otherwise you increase the odds of getting what you asked for but not what you wanted.


Be well...

Community Beginner ,
Jan 12, 2013 Jan 12, 2013

I did not see the contract, and wasn't involved until the very end so I don't know what was involved on the front end.  I am the paralegal who was handed the final product to QC before it was finalized.  

The first batch they had reduced all 22,000 docs to 8 1/2 by 11 portrait.  All landscape docs had been flipped portrait and reduced in size to fit portrait.  So very detailed 44 x 28 inch horizontal architectural drawings had been reduced to 4 x 6 blurry drawings sitting in the middle of a mostly blank sheet of paper.  Totally illegible. 

Bates numbers were cut off in the margins or missing totally and some weren't searchable.  Everything had been converted from color to black/white -- including color photos.  Very nice quality original PDF docs had been reduced to fuzzy unsearchable docs.  Emails embedded within other emails had been shuffled to the bottom of the production -- totally out of context with their parent emails.  Original AutoCAD drawings had been printed to PDF and were basically illegible. 

Original native Word and Excel docs had been printed to PDF, greatly reduced in size for bates numbering, then OCR'd, and many words weren't searchable.  Tiff files had been printed to PDF, large sized docs had been reduced to 8 1/2 by 11, then OCR'd, so the text is so tiny that many words aren't searchable. 

Many drawings were cockeyed (I guess so that the first line of text on the drawing was horizontal) -- edges of drawings had been clipped off and dotted lines had been added to the image. 

They say they always use the same system for all PDF productions of this type, and it's totally automated, and I'm the first one to complain.  They say they print all docs (native files as well as PDF files) to PDF using an industry standard electronic PDF printer.  And the electronic PDF printer is having trouble recognizing page sizes for some original PDF files so it defaults to 8 1/2 by 11 portrait.  This is causing very large PDF files to flip portrait and print like tiny postage stamps in the middle of a blank page.  Then they use Acrobat 9 to OCR the final result -- so the text is very tiny and OCR quality not great.

It wasn't until several phone calls later that we realized that they always use "Searchable Image" instead of "Searchable Image (Exact)", which is causing some of the odd things I noticed with the docs.  Drawings especially are tilted off center, and edges completely clipped off. 

I am definitely not an expert, but it wasn't until I compared original docs to the final bates stamped version that I was able to see all the problems. 

There are also docs they say they can't OCR at all (even manually with Acrobat 9) that I've been able to OCR myself using both Acrobat 8 and X.  I was going to reOCR the final product again using Acrobat X just to try to catch all the docs they weren't able to OCR.  I can't figure out a way to find those that aren't OCR'd in the 22,000. 

I will suggest that the attorney hire an eDoc wrangler next time.  This eDiscovery firm has offices all over the world.  They do productions for the government and many law firms.  I was beginning to think I was being too picky.

Thanks so much for your input!

LEGEND ,
Jan 12, 2013 Jan 12, 2013

Finding PDFs that have / lack OCR.
You can utilize Acrobat's Preflight(s) for this.
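
If you would rather script a first pass outside Acrobat, something along these lines may help. It is a sketch, not a tested production tool: it assumes the third-party pypdf package is installed, and it flags pages from which no text can be extracted at all. Note that a page with native (non-OCR) text also counts as "has text" here, so this finds candidates that lack any searchable text rather than proving OCR was run. `pages_lacking_text` and `scan_folder` are names I made up for the example.

```python
from pathlib import Path


def pages_lacking_text(page_texts, min_chars=1):
    """Return 1-based page numbers whose extracted text is empty or too short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def scan_folder(folder):
    """Map each PDF under `folder` to its pages with no extractable text."""
    # Third-party dependency, assumed installed (pip install pypdf).
    from pypdf import PdfReader

    report = {}
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        texts = [page.extract_text() for page in reader.pages]
        missing = pages_lacking_text(texts)
        if missing:
            report[str(pdf_path)] = missing
    return report


if __name__ == "__main__":
    # "production" is a placeholder folder name.
    for path, pages in scan_folder("production").items():
        print(path, "-> pages without text:", pages)
```

Raising `min_chars` a little can also catch pages where OCR "ran" but produced only a stray character or two.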


Regarding a PDF that won't take to OCR (and has no renderable text, although text is present).
This can happen. Occasionally I have to deal with a few of these. In all cases the source PDF came out of a non-Adobe application.
Most annoying. As the PDF is all I have I resort to a "refry" to a tabloid page size with appropriate orientation and at 400 ppi.
OCR "takes". A lesson learned - for every best practice (don't refry PDF) there can be exceptions.
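
The reason page size and resolution matter so much for OCR is plain arithmetic, which can be sketched in Python. The 10-point text size and the 300 ppi figure below are assumptions picked for illustration; the 44 x 28 inch drawing and the roughly 4 x 6 inch result are the sizes reported in this thread. "Nominal text height" here is just point size converted to pixels, not a measured glyph height.

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Pixel size of a page rendered at a given resolution."""
    return round(width_in * ppi), round(height_in * ppi)


def fit_scale(src_w_in, src_h_in, dst_w_in, dst_h_in):
    """Uniform scale factor that fits the source inside the destination box."""
    return min(dst_w_in / src_w_in, dst_h_in / src_h_in)


def nominal_text_height_px(point_size, scale, ppi):
    """Nominal text height in pixels after scaling (72 points per inch)."""
    return point_size / 72 * scale * ppi


if __name__ == "__main__":
    # Tabloid (11 x 17 in) refry at 400 ppi, as described above.
    print(pixel_dimensions(11, 17, 400))

    # 44 x 28 in drawing squeezed into a 6 x 4 in area on a letter page.
    scale = fit_scale(44, 28, 6, 4)
    print(round(scale, 3))

    # Assumed 10 pt text: full size at 400 ppi versus shrunk at 300 ppi.
    print(round(nominal_text_height_px(10, 1.0, 400), 1))
    print(round(nominal_text_height_px(10, scale, 300), 1))
```

With these assumed numbers, 10 pt text is about 56 pixels tall in the tabloid refry but under 6 pixels tall once the drawing is shrunk, which is far too little for an OCR engine to work with.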

Some observations (of, in my opinion, egregious activity).

--| Reprocessing to letter, portrait
Or is this a contractually specified page size/orientation?
If so, it is driving the production of problematic deliverables.

--| Refry of good PDFs to images to support OCR (for "search") 
Many of your "source" PDFs have renderable text that will support search.
Example:
-- --| PDFs sourced from Word or Excel possess renderable text.
When done properly the fonts / font sub-sets are embedded and map to Unicode. This assures the PDFs' page content is searchable "as-is". Refry of a PDF to obtain an image so one can then OCR "to make searchable" is egregious.


--| PDFs of scanned images at the source paper's page size.
Why refry by printing to a smaller page size? For usability, as a true and accurate copy of the hardcopy one wants the appropriate page size and orientation. These days most 'discovery' work is done via shake 'n bake of the eDoc not the hardcopy (*).

(*) If one needs hardcopy then print it. One can create imprints to paper in a plethora of ways.
The imprint can be sized to fit the desired paper size. Of course going down in page size for a scanned image means what's on paper is not too usable. But, doing so with a PDF from CAD that is vector graphic is something else. Yes, a 44 x 34 from Microstation is "smaller" on a sheet of letter size paper. But, the imprint is crisp and usable. If you're a geezer like me you'll want a magnification "glass". But, oh my, that imprint is sweet and is usable.

--|  PDF output of CAD.
Reprocessing into an image for OCR.
Most CAD applications in use can output PDF. Text will present in one or more layers on the output.
Unless the CAD fonts are of some in-house branding that fails to comply with proper "font" practices all this will be renderable. With proper font selection the font families used will map to Unicode. Thus "searchable".

If one's client does not want a PDF with layers then use an Acrobat Pro Preflight to flatten them.
Typically the fonts used by the text in the PDF won't be embedded. If needed, use an Acrobat Pro Preflight to embed them.

Note that some CAD output will have vector graphics (CAD file content created with the CAD application) and raster (an image). Typically legacy drawings are scanned and this image brought into the CAD file for the drawing. A designer then redraws/remasters using vector graphics. Often this is incremental (it is a time consuming activity) to process approved drawing revisions as these come in. The output PDF will have the renderable text that is associated with text placed via the CAD application. The image of text from the source raster will not be searchable. One *does not* reprocess such a PDF to an image to OCR for "search".
More (much more) is lost.

Do the trials and review what you get compared to what you had. This will validate my statement.
(I'm confident on this because I've done it -- that "what if" itch I just gotta scratch <grin>.)

A Note: OCR of "mixed" content (say a chart, plot, scanned CAD drawing having text, lines, curves, etc.) rarely yields useful OCR output. Scan a collection of these. OCR. Export / Save As to a text file. Do the compare/contrast.  Or view the "hidden" text (Acrobat lets you do that). This can be a most informative exercise.

--| Always the same .... no one complains
One could write volumes on this. But then I suspect that's already been done.

~~~~~~~~~~~~~~~~~

Some nattering on my part.
An eDiscovery firm that has no "eDoc wranglers" is in harm's way.

(Actually, the wrangler is something of a joat, a jack of all trades: competent understanding of the core discipline and competent understanding of the workflows to which the "eDocs" are subjected.

Being a programmer/developer is not a prerequisite. -- You know, like it or not you've already joined the "wrangler" posse.)


If what you've described resides in content provided to clients it becomes "when" not "if" a client gets to hold the bag. This may be minor; it may be major. Regardless client ire will tend to be directed towards the eDiscovery firm not the firm's vendor(s).

Be well...

Community Beginner ,
Jan 12, 2013 Jan 12, 2013

WOW!!  I may need to read this over a few times to let it all sink in, but I'm determined to get a handle on this. 

No, they were not asked to reduce all docs to 8 1/2 by 11.  They were supposed to be same size pages as originals.  Hundreds of these files were actually quite large original PDF files (44 by 28) output directly out of AutoCAD.  Beautiful docs which you could zoom in and see very good details.  They just needed to be bates stamped.  Instead, they were converted to 4 by 6 inch blurry images sitting portrait on a mostly blank page -- the only thing legible is the quite large bates number.  The drawing is completely illegible. 

I'm not sure what's wrong with the PDF files they can't OCR.  They use Acrobat 9, and I can OCR each one of them just fine using Acrobat X.  I even tried Acrobat 8, and it worked fine.

Thanks for all your help with this.  Now I feel like I've not been too picky at all on this production.

You've truly been a great help!!

LEGEND ,
Jan 12, 2013 Jan 12, 2013

A "ps" -- The bottom line is that the eDocs are "legal".

A scanned image can only be an acceptable substitute for the hardcopy if it is an accurate and true representation of the hardcopy in all material regards.

Use of "Searchable Image" can be objectively demonstrated to alter the image of the textual content. Consequently, "accurate and true" are sacrificed.

If one document is "off" that puts all documents in, at a minimum, a "suspect" category.

If I were an attorney with that $500/hour billing rate I'd be licking my chops.

(Heck I shoulda gone to law school <g> - but, I do process out some PDFs that are "legal" so I've learned what's needed in terms of my "deliverable".)

For example: Use of despeckle can remove a decimal point associated with a number.

So, the bowl of egg salad that calls for ".5" teaspoons of salt gets "5" teaspoons of salt.

(A good reason to always park the "0" before that decimal.)

It does not take much imagination to consider the impact on other activities where the correct numeric value is of significance.
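
To make the despeckle point concrete, here is a toy Python sketch. It is not Acrobat's actual despeckle algorithm, just a simple "remove any blob of ink smaller than N pixels" filter applied to a tiny made-up bitmap where `#` is ink. The bitmap, the threshold and the function name are all assumptions for illustration.

```python
def despeckle(bitmap, min_pixels):
    """Erase connected ink blobs ('#') smaller than min_pixels (8-connectivity)."""
    rows, cols = len(bitmap), len(bitmap[0])
    grid = [list(row) for row in bitmap]
    seen = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "#" or (r, c) in seen:
                continue
            # Flood-fill to collect every pixel in this blob.
            component, stack = [], [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == "#"
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            # Blobs below the threshold are treated as "specks" and wiped.
            if len(component) < min_pixels:
                for y, x in component:
                    grid[y][x] = "."
    return ["".join(row) for row in grid]


# A crude ".5": one-pixel decimal point at bottom left, then the digit.
PAGE = [
    "..####",
    "..#...",
    "..####",
    ".....#",
    "#.####",
]

if __name__ == "__main__":
    for before, after in zip(PAGE, despeckle(PAGE, min_pixels=2)):
        print(before, "  ", after)
```

The digit survives, but the lone pixel that was the decimal point is gone, so the page now reads "5" instead of ".5".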

Be well...

Contributor ,
Jul 13, 2016 Jul 13, 2016

CtDave wrote:

[...]

Regardless of the version of Acrobat (4.x, 5.x, 6.x, 7.x, 8.x, 9.x, X or XI) the essential characteristic of Searchable Image (Exact) has not changed.

[...]

What happens if you process a PDF containing a scanned image of textual content through Searchable Image (Exact)?
The PDF gains a second "hidden layer" of OCR output.

[...]

If you do OCR using only "Searchable Image", the exact same thing happens as with "Searchable Image (Exact)": a hidden text layer is added in front of the bitmap image.

Hence, what IS the actual difference between "Searchable Image" and "Searchable Image (Exact)"?
