OCR Accuracy

Community Beginner, Apr 21, 2016

I work in the pharmaceutical industry, and I understood it to be a requirement that all PDF files be OCR'd before they are sent to the FDA. When I mentioned this, I was told that OCR'ing a PDF file changes the text in that file, and that once a file has been OCR'd you need to compare the two files.

Is this true?

TOPICS
Acrobat SDK and JavaScript
Community Expert, Apr 22, 2016

How did you create the PDF files?

LEGEND, Apr 23, 2016

Acrobat (Pro or Standard) offers 3 ways to OCR.

(1) Searchable Image

(2) Searchable Image (Exact)

(3) ClearScan

#1 - Provides an OCR output whose glyphs have no stroke or fill  -- so, "invisible" or "hidden".

This method also dresses up the image a wee bit. Thus, you get an altered image rather than the exact image as provided by the scanner.

Consequently #1 is typically not acceptable to a FedGov agency (or any entity with an interest in a document of record having the proper "provenance").

#2. An OCR output developed as in #1. But, the exact image remains untouched.

Typically this is what a FedGov agency requires if submitting a scanned image of text.

So, the original image out of the scanner maintains its integrity and the OCR output supports find / search.

#3 ClearScan - Introduced a few versions back. When the bitmap of a character's image is recognized, it is replaced with a font glyph (the character is visible because the glyph has fill and stroke applied). What is not recognized is left as is. And more magic...

Bottom line - That image out of the scanner that *was* the exact replica of the hardcopy and thus a valid / legal document of record is blown away, gone, dent de lion in the wind eh. Typically not acceptable for something submitted to a FedGov agency.

So - You use #2. But, there is more! What is the required resolution? Often it is 300ppi. Was lossy compression used? Typically a no-no.
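
The resolution and compression questions above can be checked with a little arithmetic. Here is a rough Python sketch, not an Acrobat feature: the function names are made up for illustration, the 300 ppi default is just the figure mentioned above (not a confirmed agency rule), and it assumes you have already read the image's pixel size, its placed size in points, and its PDF stream filter names out of the file with whatever inspection tool you use.

```python
# Hypothetical helpers for illustration only; thresholds are assumptions.

# PDF stream filters: DCTDecode is JPEG (lossy). JPXDecode (JPEG 2000) and
# JBIG2Decode can each be lossy or lossless, so they only get flagged for review.
LOSSY_FILTERS = {"DCTDecode"}
MAYBE_LOSSY_FILTERS = {"JPXDecode", "JBIG2Decode"}


def effective_ppi(pixels, placed_points):
    """Pixels per inch for one side of an image placed at `placed_points`.

    PDF user space defaults to 72 points per inch.
    """
    if placed_points <= 0:
        raise ValueError("placed size must be positive")
    return pixels / (placed_points / 72.0)


def check_scan(width_px, height_px, width_pt, height_pt, filters, min_ppi=300.0):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    ppi = min(effective_ppi(width_px, width_pt),
              effective_ppi(height_px, height_pt))
    if ppi < min_ppi:
        problems.append(f"resolution {ppi:.0f} ppi is below {min_ppi:.0f} ppi")
    for name in filters:
        if name in LOSSY_FILTERS:
            problems.append(f"{name}: lossy compression")
        elif name in MAYBE_LOSSY_FILTERS:
            problems.append(f"{name}: may be lossy, check encoder settings")
    return problems


# A US Letter page (612 x 792 pt) scanned to 2550 x 3300 pixels, Flate-compressed.
print(check_scan(2550, 3300, 612, 792, ["FlateDecode"]))
# The same page at half the pixel count, stored as JPEG.
print(check_scan(1275, 1650, 612, 792, ["DCTDecode"]))
```

Whatever numbers you plug in for `min_ppi` and the filter lists should come from the agency's own guidelines, not from this sketch.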

So gotchas may result in submittal rejections.

For a submittal to a FedGov agency never-ever rely on hearsay; talk is cheap and like as not wrong or incomplete.

It is "your" submittal eh. Fetch and become one with the agency submittal guidelines / requirements. That's your success path.

Be well...
