Participant
April 22, 2016
Question

ClearScan not encoding ligatures properly

I use the ClearScan option to OCR my PDFs. All are high-quality scans, and the OCR is very accurate with one exception: ligatures. Characters like ff, fi, fl, etc. are not encoded properly, i.e., they show up blank. They look fine on screen, but this makes it impossible to rely on the OCR for searching the PDFs.

To my bafflement, Acrobat recognizes ligatures accurately when I don't use ClearScan but keep the page as Image. I've tested multiple documents with both options, and the result is always the same. Simple words like "different" are fine with the Image option but show up as "di erent" with ClearScan from the same source.

Clearly the mapping of the ligature to the respective two code points doesn't work as it should.
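For illustration, the kind of mapping meant here (one ligature glyph to several code points) can be sketched in Python using the decompositions that Unicode defines for its Latin ligature characters. This only shows what the expected mapping looks like; it says nothing about how ClearScan builds its fonts internally, and the helper name `expand_ligatures` is made up for this example.

```python
import unicodedata

# Latin compatibility ligatures in the Alphabetic Presentation Forms block.
LIGATURES = "\ufb00\ufb01\ufb02\ufb03\ufb04"  # ff, fi, fl, ffi, ffl

# Map each ligature to the plain letters given by its compatibility decomposition.
LIGATURE_MAP = {ord(ch): unicodedata.normalize("NFKD", ch) for ch in LIGATURES}

def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its individual letters."""
    return text.translate(LIGATURE_MAP)

if __name__ == "__main__":
    for ch in LIGATURES:
        print(f"U+{ord(ch):04X} -> {expand_ligatures(ch)}")
```

A correctly encoded ClearScan glyph for "ff" would behave like the right-hand side of this table when text is copied or searched.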

How can I fix this?

And, especially: how can I fix this after the fact, given that I have already ClearScanned a large number of PDFs?

Any help greatly appreciated!

Thanks!


2 replies

jandavidhAuthor
Participant
April 27, 2016

Further testing revealed that this concerns not only ligatures but every character that overhangs the character following it, such as "f".

Even words like "force" or "farce" are copied as "rce", since the upper part of the f reaches into the space of the o or a following it. So it seems that, although the characters are not connected, they are treated as one character and not encoded as they should have been.

Does anyone have suggestions for how I could fix this in PDFs that have already been OCR'd?

I've already submitted a bug report, but if there is a way to manually fix the encoding for problematic glyphs I'd highly appreciate hints on how to go about it.

Also, could anyone with Acrobat DC test this to determine whether it's a version-specific bug?
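As a stopgap only, the gaps can sometimes be patched in the exported text rather than in the PDF itself. The sketch below is an assumption-laden workaround, not a fix for the glyph encoding: it takes text already copied out of the PDF, tries to join two neighbouring fragments with a typical dropped sequence, and accepts the join only if the result is in a word list. The function `repair_gaps`, the candidate list, and the `vocabulary` set are all hypothetical. It cannot recover cases like "rce", where no gap is left behind.

```python
# Sequences that were reported as dropped, longest first.
CANDIDATES = ("ffi", "ffl", "ff", "fi", "fl", "f")

def repair_gaps(text: str, vocabulary: set[str]) -> str:
    """Join adjacent fragments when inserting a dropped sequence yields a known word.

    Note: whitespace is normalized to single spaces in the result.
    """
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        joined = None
        if i + 1 < len(words):
            left, right = words[i], words[i + 1]
            # Leave the pair alone if both halves are already real words.
            both_known = left.lower() in vocabulary and right.lower() in vocabulary
            if not both_known:
                for filler in CANDIDATES:
                    candidate = left + filler + right
                    if candidate.lower() in vocabulary:
                        joined = candidate
                        break
        if joined is not None:
            out.append(joined)
            i += 2  # both fragments were consumed
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

if __name__ == "__main__":
    vocab = {"a", "different", "view", "of", "the", "office"}
    print(repair_gaps("a di erent view of the o ce", vocab))
```

The quality of the result depends entirely on the word list, so any output should be spot-checked against the page image.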

Lovekesh Garg
Adobe Employee
May 24, 2016

Hi David,

Sorry for the issue you are facing.

Can you please share a sample file (via Dropbox or https://cloud.acrobat.com/send) where you are facing this issue? What I am seeing is that a character sometimes looks cropped but is recognized correctly.

You can also try the Acrobat DC trial version from the "Download Adobe Acrobat free trial | Acrobat Pro DC" page.

Thanks.

Participant
December 20, 2016

I have this same problem with ligatures, especially involving the "fi" combination.

CtDave
Participating Frequently
April 24, 2016

Acrobat's other two methods for OCR (Searchable Image and Searchable Image Exact) provide a "hidden" output. The glyphs have no stroke and no fill, so you don't "see" them. What you see is the image of the characters.

ClearScan replaces a recognized character's image with a glyph that has fill and stroke, so you can see it.

After doing OCR via Searchable Image or Searchable Image Exact, export the PDF text content (all of that OCR output) to a TXT file.

What is there?

Be well...

jandavidhAuthor
Participant
April 24, 2016

Thank you for your reply.

Sorry for not having been precise in my original question. I meant the "searchable image" option when I wrote "keep it as Image".

I have tested this systematically with the same (not-OCR'd) PDF page. First I ran ClearScan and saved the result under a different name. Then I ran Searchable Image on the original PDF. Comparing the output (i.e., the text as copied into a text editor) shows that ligatures are encoded properly only with the Searchable Image option.

ClearScan gives "di erent"

Searchable Image gives "different"
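A comparison like this can be automated once both versions have been exported to plain text. The following is a minimal sketch, assuming the two exports are already available as strings; the function name `dropped_fragments` is hypothetical. It aligns the two word sequences with Python's `difflib` and reports each place where they disagree.

```python
import difflib

def dropped_fragments(good: str, bad: str) -> list[tuple[str, str]]:
    """Return (good_text, bad_text) pairs for every region where the exports differ."""
    good_words = good.split()
    bad_words = bad.split()
    matcher = difflib.SequenceMatcher(a=good_words, b=bad_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(good_words[i1:i2]), " ".join(bad_words[j1:j2])))
    return diffs

if __name__ == "__main__":
    searchable_image = "a different kind of force here"
    clearscan = "a di erent kind of rce here"
    for good, bad in dropped_fragments(searchable_image, clearscan):
        print(f"{good!r} became {bad!r}")
```

Run over a whole document, the resulting list gives a rough inventory of which letter combinations ClearScan failed to encode.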

I know that ClearScan creates its own font on the fly, whereas Searchable Image creates a hidden text layer. My question is: if Acrobat is able to recognize ff as two f characters with the image option, why does it not also encode the ff glyph that it created with ClearScan as two f characters? Why is it able to assign the codepoint for "a" to a glyph that looks like "a", but unable to assign the two corresponding codepoints to a glyph that looks like "ff"?

I assume that for all glyphs that don't correspond to single letters, or that are other odd shapes that don't look like text, ClearScan vectorizes them properly but then assigns them some single custom codepoint instead of trying to map them to existing characters.

However, it surely "knows" what the correct characters are, since with Searchable Image it gets them right.

Hence my question: how can I fix this?
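One way to test this assumption on text copied out of a ClearScan PDF is to look for characters that are not ordinary text, such as private-use or unassigned code points. The sketch below is only a diagnostic aid; the function name `suspicious_chars` and the choice of categories are assumptions, and an empty result would suggest the glyphs are mapped to spaces or to nothing rather than to custom code points.

```python
import unicodedata

# General categories that rarely belong in running text:
# Co = private use, Cn = unassigned, Cc = control.
SUSPICIOUS = {"Co", "Cn", "Cc"}
ALLOWED_CONTROLS = {"\n", "\r", "\t"}

def suspicious_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, category) for each character that is not normal text."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in SUSPICIOUS and ch not in ALLOWED_CONTROLS:
            hits.append((index, f"U+{ord(ch):04X}", category))
    return hits

if __name__ == "__main__":
    sample = "di\ue000erent\n"  # U+E000 stands in for a hypothetical custom code point
    print(suspicious_chars(sample))
```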

CtDave
Participating Frequently
April 24, 2016

Consider a feature request.

Feature Request/Bug Report Form 

Be well...