Cannot search scanned/OCR'd document: identity-H encoding

Jul 28, 2020

I have scanned a 25-page document with Acrobat Pro DC (details below) from an HP MFP. I applied OCR to the scan, and I can highlight text but cannot search it. Copying a word and pasting it into a text editor produces unprintable characters. I find that the document uses Identity-H encoding.

I tried the steps outlined in https://community.adobe.com/t5/acrobat/copy-text-in-pdf-gives-me-gibberish-is-there-a-way-to-ocr-to-... to no avail.

I still have the original document that I can re-scan. How can I control the encoding such that Acrobat produces a searchable document (the whole point of my scanning the document)? Thanks.

What I don't understand is how Acrobat could OCR something that it cannot search itself.

Architecture: x86_64
Build: 20.9.20067.384717
AGM: 4.30.101
CoolType: 5.14.5
JP2K: 1.2.2.46033.

Jul 28, 2020

Hi DaveToo,

It's quite possible that you're not getting any search results because the scan quality is too poor for OCR to recognize the words you are searching for. For example, if you're searching for "apple" but the word in the text was converted into (say) "aple", the search would not find it, because the text no longer contains the word you typed.
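
To make the "apple" versus "aple" point concrete, here is a minimal Python sketch. It assumes the OCR text has already been copied out of the PDF as a plain string; the function name `find_words` and the 0.8 similarity cutoff are made up for illustration and have nothing to do with how Acrobat's own search works.

```python
import difflib

def find_words(ocr_text, query, cutoff=0.8):
    """Return (exact_hits, fuzzy_hits) for query among the words of ocr_text."""
    words = ocr_text.lower().split()
    query = query.lower()
    # Exact search: only finds words spelled exactly like the query.
    exact = [w for w in words if w == query]
    # Fuzzy search: also finds near misses such as OCR misreadings.
    fuzzy = difflib.get_close_matches(query, words, n=5, cutoff=cutoff)
    return exact, fuzzy

exact, fuzzy = find_words("an aple a day", "apple")
print(exact)  # no exact hit, the OCR text never contains "apple"
print(fuzzy)  # the misread word still shows up as a close match
```

If a fuzzy check like this finds near misses for words you know are in the document, the scan quality is the likely culprit. If the extracted text is unprintable to begin with, the problem is more likely the encoding.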

Alternatively, you mention "Identity-H encoding." I have to admit I know very little about this, but I did find the following, which explains a number of the dynamics very well.

https://community.adobe.com/t5/acrobat/font-encoding-settings-removing-identity-h-encoding/td-p/1060...
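
For a rough picture of why Identity-H text can come out as unprintable characters: with this encoding, the two-byte codes stored in the page are font-internal identifiers (typically glyph numbers) rather than character codes, so a viewer needs a separate code-to-Unicode map to turn them back into text. The toy Python model below illustrates only that idea. The glyph numbers, the mapping, and the fallback behaviour are all invented for the example and are not taken from any real PDF or from Acrobat.

```python
# Toy model only: glyph numbers and the mapping below are invented.

def encode_identity_h(glyph_ids):
    """Pack glyph IDs as big-endian two-byte codes, as an Identity-H string would."""
    return b"".join(g.to_bytes(2, "big") for g in glyph_ids)

def extract_text(data, to_unicode=None):
    """Decode two-byte codes into text.

    With a code-to-Unicode map, each code becomes a real character.
    Without one, this sketch falls back to treating the code as a
    character number, which is one way gibberish can arise.
    """
    codes = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    if to_unicode is None:
        return "".join(chr(c) for c in codes)
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)

glyphs = [3, 17, 17, 9, 5]                       # hypothetical glyph IDs for "apple"
to_unicode = {3: "a", 17: "p", 9: "l", 5: "e"}   # hypothetical code-to-Unicode map

data = encode_identity_h(glyphs)
with_map = extract_text(data, to_unicode)
without_map = extract_text(data)

print(repr(with_map))     # readable text
print(repr(without_map))  # control characters, not searchable as "apple"
```

In this model the page still displays correctly either way, because drawing only needs the glyph numbers. Searching and copy/paste are the parts that depend on the map.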

While you do say you are scanning the documents, you do not say how you are scanning them. If the problem is caused by a poor-quality scan, then it's hard to get good-quality OCR out of it. Perhaps the information in this blog post I wrote may be of assistance.

http://photosbycoyne.com/Gary's_Help/Scanning/clean-scanning.html

Good luck, let us know.
