Pascal Garin
Inspiring
March 20, 2019
Question

Identity-H encoding

Hi,

I have a serious problem with the encoding of Indic scripts. Here is the situation: I work for a French publisher who publishes either translations or bilingual editions of Indian literature (Hindi-French, Bengali-French, Tamil-French, for instance) in a facing-page layout, with the Indian text on the left-hand page and the French opposite it, on the even pages.

I have subscriptions to Microsoft Office 365 and Adobe Creative Cloud and always keep them up to date. I am thus using Word 2016, Acrobat DC, and the 2019 releases of InDesign, Illustrator, Dreamweaver, etc., which are the latest versions of this software as of today. I work on a PC running Windows 10, with all the Indian languages I use installed (the same goes for the Office suite, where the Hindi, Bengali and Tamil modules are installed in addition to English and French, my native language).

I usually prepare a clean version of the texts in MS Word, using styles to ease the export to InDesign, and I only use OpenType or TrueType fonts from renowned foundries (Adobe, Linotype, Monotype, Microsoft…).

Nevertheless, when I create a PDF version of these documents (which contain French but also Hindi, Bengali or Tamil text), the generated PDF contains subsets of OpenType fonts but also Identity-H encoded text. This is really annoying, as the PDF can no longer be exported properly: I tried with a very simple text in Word, exported it to a PDF file, then re-exported it from Acrobat DC to MS Word, and the result was catastrophic.

Also, when I receive PDF files from Indian publishers, they too use this Identity-H encoding and I cannot export them to MS Word, which tremendously complicates my work.

Frankly speaking, I don't really understand what this encoding means. I thought it was an old problem from the time when Indic scripts were not standardized, but this is no longer the case: the Unicode Consortium has provided a very clear encoding standard for years. So I am very surprised that this problem persists even in modern software and, once again, even when OpenType fonts are used.

I tried different PDFMaker settings for the export, for instance forcing PDFMaker to embed the whole font in the PDF document, but the problem remains (the PDF export module built into MS Word produces the same mess).

Does anybody have an explanation for this problem and a solution to offer?

Thanking you in advance,

Regards,
Pascal Garin


2 replies

ls_rbls
Community Expert
September 15, 2019

Hello,

I found the following thread useful:

Legend
March 20, 2019

Identity-H is entirely normal and common. It means that the PDF directly uses codes from the font. To extract text when this encoding is used, the PDF also needs a “ToUnicode CMap”. You cannot see if one of these is present.

Exporting from InDesign or using Acrobat PDFMaker for Word should get this right, unless non-Unicode fonts are used. Don’t use such fonts.
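
For anyone wondering what the ToUnicode CMap actually contributes, here is a minimal Python sketch of the idea. With Identity-H, the text in the page is stored as 2-byte codes that refer to glyphs in the font rather than to Unicode characters, so an extractor needs a lookup table to get back to Unicode. The CMap fragment and the glyph codes below are invented for illustration (they do not come from any real font), and the function names `parse_tounicode` and `decode_identity_h` are hypothetical helpers, not part of Acrobat or any library. A real CMap sits in a (usually compressed) stream inside the PDF and may use the array form of `bfrange`, which this sketch does not handle.

```python
import re

# Hand-written ToUnicode CMap fragment. The glyph codes are made up
# for this example; a real font would use its own glyph numbering.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0024> <0928>
<0031> <092E>
<0047> <0938094D0924>
<0050> <0947>
endbfchar
1 beginbfrange
<0060> <0062> <0915>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {code: unicode string} table from bfchar/bfrange sections."""
    mapping = {}
    # bfchar: one source code maps to one or more UTF-16BE code units.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange (simple form only): consecutive codes map to consecutive
    # characters, starting from the given destination.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = bytes.fromhex(dst).decode("utf-16-be")
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = start[:-1] + chr(ord(start[-1]) + offset)
    return mapping


def decode_identity_h(raw, mapping):
    """Split a string into 2-byte codes and look each one up.

    Codes with no entry become U+FFFD, which is roughly what a failed
    text export looks like.
    """
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    return "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    shown_text = bytes.fromhex("0024003100470050")  # four glyph codes
    print(decode_identity_h(shown_text, parse_tounicode(SAMPLE_CMAP)))
    print(decode_identity_h(shown_text, {}))  # no ToUnicode table at all
```

With the table, the four codes come back as the Hindi word नमस्ते; note that the third code is a single conjunct glyph that maps to three Unicode characters, which is common in Indic fonts. Without the table, every code is unrecoverable, which matches the "catastrophic" export described in the question.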

Participant
October 27, 2021
Quote from @Test Screen Name:

"To extract text when this encoding is used, the PDF also needs a “ToUnicode CMap”. You cannot see if one of these is present."


Actually, you can if you use this tool: http://brendandahl.github.io/pdf.js.utils/browser/