ClearScan not encoding ligatures properly

April 22, 2016

I use the ClearScan option to OCR my PDFs. All are high-quality scans, and the OCR is very accurate with one exception: ligatures. Characters like ff, fi, fl etc. are not encoded properly, i.e., they show up blank. While they look fine on screen, this makes it impossible to rely on the OCR for searching the PDFs.

To my bafflement Acrobat recognizes ligatures accurately when I don't do ClearScan but keep it as Image. I've tested multiple documents with both options, it's always the same. Simple words like "different" are fine with the Image option but show up "di erent" with ClearScan from the same source.

Clearly the mapping of the ligature to the respective two code points doesn't work as it should.
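For reference, here is a minimal Python sketch of what a correct ligature-to-letters mapping looks like at the Unicode level. It only covers the standard presentation-form ligatures U+FB00 to U+FB04, and the helper name `expand_ligatures` is made up for illustration. It will not repair text where the glyph was extracted as a blank, which is the failure described above.

```python
import unicodedata

# Unicode presentation-form ligatures: ff, fi, fl, ffi, ffl.
LIGATURES = ["\ufb00", "\ufb01", "\ufb02", "\ufb03", "\ufb04"]


def expand_ligatures(text: str) -> str:
    """Expand compatibility ligatures into their constituent letters.

    NFKC normalization also folds other compatibility characters
    (full-width forms, for example), so apply it only where that is acceptable.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for lig in LIGATURES:
        print(f"U+{ord(lig):04X} -> {expand_ligatures(lig)}")
```

If the copied text contains these code points instead of blanks, this kind of normalization is enough to make it searchable.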

How can I fix this?

And, especially: How can I fix this after the fact, given that I already have a large number of PDFs ClearScanned.

Any help greatly appreciated!

Thanks!


jandavidh (Author)

Participant

Further testing revealed that this does not only concern ligatures, but every character that overhangs the character following it, such as "f".

Even words like "force" or "farce" are copied as "rce" since the upper part of the f reaches into the space of o or a following it. So it seems that despite the fact that the characters are not connected they are treated as one character and not encoded as they should have been.

Does anyone have suggestions how I could fix this for already OCR'd PDFs?

I've already submitted a bug report but if there is a way to manually fix the encoding for problematic glyphs I'd highly appreciate hints how to go about it.
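As a stopgap for text that has already been copied out of an affected PDF, a dictionary-based repair is one possible approach. The sketch below is only a heuristic under stated assumptions: the candidate list, the function name `repair_gaps`, and the word list you pass in are all hypothetical. It patches the extracted text, not the encoding inside the PDF, and it does not handle fragments that were lost at the start of a word (the "rce" case above is ambiguous between "force" and "farce").

```python
# Strings the OCR may have dropped (assumption based on the examples in this thread).
CANDIDATES = ["ffi", "ffl", "ff", "fi", "fl", "f"]
TRAILING = ".,;:!?\")"


def repair_gaps(text, vocabulary):
    """Rejoin 'left right' fragments when left + candidate + right is a known word.

    A pair is left alone when both fragments are already known words,
    so that "of ice" is not turned into "office".
    """
    tokens = text.split(" ")
    out = []
    i = 0
    while i < len(tokens):
        merged = None
        if i + 1 < len(tokens):
            left = tokens[i]
            core = tokens[i + 1].rstrip(TRAILING)
            tail = tokens[i + 1][len(core):]
            both_known = left.lower() in vocabulary and core.lower() in vocabulary
            if left and core and not both_known:
                for cand in CANDIDATES:
                    word = left + cand + core
                    if word.lower() in vocabulary:
                        merged = word + tail
                        break
        if merged is not None:
            out.append(merged)
            i += 2  # both fragments were consumed
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    words = {"the", "is", "different", "beneficiary"}
    print(repair_gaps("the bene ciary is di erent.", words))
```

The quality of the result depends entirely on the word list, so review the changes before relying on them.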

Also could anyone with Acrobat DC test this in order to determine if it's a version specific bug?

Lovekesh Garg

Adobe Employee

Hi David,

Sorry for the issue you are facing.

Can you please share a sample file (via Dropbox or https://cloud.acrobat.com/send) that shows this issue? What I am seeing is that a character sometimes looks cropped but is recognized correctly.

You can also try the Acrobat DC trial version: Download Adobe Acrobat free trial | Acrobat Pro DC

Thanks.

Lovekesh Garg

Adobe Employee

Hi Lovekesh,

It is a very strange thing.

I have had this problem so many times in the past.

The way I have encountered the ligature problem is the following: I receive a document from a government agency requesting information. I scan the document, and then I do OCR on it with Acrobat. I want to be able to cut and paste the entire document into a Microsoft Word file so that I can work with the text. I have been doing this for years. In the last year or so I started to encounter the ligature recognition problem.

When I went on the forum and saw that another reader used image scan instead of clear scan, I tried that for the first time. And I got a good result with the ligatures.

Then I scanned a page from the same document I have been working with again, but I chose Clear Scan. And this time the ligatures were fine.

So I thought, ok, I will find an old document that I remember having the ligature problem with and send him a page from that.

I have looked at about 5 or 6 old documents that I scanned over the past 2 years, and I reopened them.

And I chose a selection from each document, copied it, and pasted into a doc file. And the ligatures were fine every time. Just to give you an example: when I was having troubles, any time I copied text where the word “beneficiary” was included, the result would be “bene ciary” — now the word is appearing correctly.

So the problem seems to have resolved itself after I ran an image scan on one document and then went back to Clear Scan for a subsequent one. I cannot really explain why this would happen, or how it would affect all my older scanned PDF documents as well, but I am including the details in case it is something you are interested in investigating further.

Lisa M. Jacobs


This issue normally occurs with low-resolution scanned documents. Over the past few releases we have tried our best to resolve these kinds of problems, so it is possible that, thanks to those fixes, you won't face the issue again. If you do, please share the file so that we can work on improving our algorithms.

Thanks.

CtDave

Participating Frequently

Acrobat's other two methods for OCR (Searchable Image & Searchable Image Exact) provide a "hidden" output. The glyphs have no stroke and no fill, so you don't "see" them. What you see is the image of the characters.

ClearScan replaces a recognized character's image with a glyph having fill and stroke - thus you can see 'em.

After doing OCR via Searchable Image or Searchable Image Exact, export the PDF text content (all that OCR output) to a TXT file.

What is there?

Be well...

jandavidh (Author)

Participant

Thank you for your reply.

Sorry for not having been precise in my original question. I meant the "searchable image" option when I wrote "keep it as Image".

I have systematically tested it with the same (not-OCR'd) PDF page. First I did ClearScan, saved it under a different name. Then I did Searchable Image (on the original PDF). Comparing the output (i.e., the text when copied in a text editor) it turns out that ligatures have only been encoded properly with the Searchable Image option.

ClearScan gives "di erent"

Searchable Image gives "different"
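For anyone who wants to repeat this comparison on more pages, here is a rough Python sketch that lists the words where the two outputs disagree. It assumes you have already pasted or exported both versions as plain text; the function name `gap_candidates` is made up for illustration.

```python
import difflib


def gap_candidates(clearscan_text, searchable_text):
    """Return (clearscan fragment, searchable-image word) pairs that differ."""
    a = clearscan_text.split()
    b = searchable_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks runs where the two word sequences disagree.
        if tag == "replace":
            found.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return found


if __name__ == "__main__":
    pairs = gap_candidates("a di erent word", "a different word")
    for broken, intact in pairs:
        print(f"{broken!r} -> {intact!r}")
```

The resulting pairs show which letter sequences go missing, which is useful material for a bug report.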

I know that ClearScan creates its own font on the fly, whereas Searchable Image creates a hidden text layer. My question is: if Adobe is able to properly recognize ff as two f characters with the image option, why does it not also encode the ff glyph that it created with ClearScan as two f characters? Why is it able to assign the codepoint for "a" to a glyph that looks like "a", but unable to assign the two corresponding codepoints to a glyph that looks like "ff"?

I assume that for all glyphs that don't correspond to single letters or are other weird shapes that don't seem to be text, the ClearScan option vectorizes them properly but then assigns them some single custom codepoint instead of trying to create the correct correspondences with existing characters.
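To illustrate what such a correspondence would look like: in a PDF, an embedded font can carry a ToUnicode CMap whose `bfchar` entries map one glyph code to one or more characters, written as UTF-16BE hex. The Python sketch below builds such an entry. The glyph code 0x01 and the function name `bfchar_entry` are hypothetical, and whether the fonts ClearScan generates lack entries like this is only my assumption, not something I have verified.

```python
def bfchar_entry(glyph_code: int, text: str) -> str:
    """Format one ToUnicode bfchar line for a single-byte glyph code.

    The destination is the UTF-16BE encoding of the text, so a single
    glyph can map to several characters (for example "ff").
    """
    src = f"<{glyph_code:02X}>"
    dst = text.encode("utf-16-be").hex().upper()
    return f"{src} <{dst}>"


if __name__ == "__main__":
    # Hypothetical glyph code 0x01 for an "ff" ligature shape.
    print(bfchar_entry(0x01, "ff"))
```

Fonts with two-byte codes would need a four-digit source code instead of the two-digit one used here.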

However it surely "knows" what the correct characters are since with Searchable Image it gets it right.

Hence my question: how can I fix this?

CtDave

Participating Frequently

Consider a feature request.

Feature Request/Bug Report Form

Be well...
