Hi,
I have some PDFs with Japanese text, mostly set in either MS-Mincho or MS-Gothic. The fonts are mostly TrueType (CID) with Identity-H encoding. I have no problem copying Japanese text from Acrobat and pasting it into Notepad. I take this as proof that the PDF contains a proper ToUnicode map.
However, when I print the PDF to the "Adobe PDF" printer driver, copy and paste no longer works in the output PDF. The Japanese text is now pasted into Notepad as missing-glyph characters (a quotation mark in a square). I guess this means the ToUnicode map was not created, most probably during distillation.
On the other hand, I have a PDF with Japanese text whose font is "MSゴシック" or "MS明朝" (these mean MS-Gothic and MS-Mincho, respectively) and, more importantly, uses 90ms-RKSJ-H encoding. I can copy and paste Japanese text from the "Adobe PDF" printed output of this PDF. One odd thing is that the 90ms-RKSJ-H encoding has been changed to Identity-H in the output PDF. To see whether the encoding is causing the issue, I printed this output PDF to "Adobe PDF" once again, but only got an error log from Distiller.
I checked a few articles, including http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/5411.ToUnicode.pdf and http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/distfont.pdf, but I am still not sure what the problem is. Do I have to configure or create ToUnicode mapping files? Or is this something that cannot be done because of the nature of Identity-H encoding? Thank you.
- eellor
1 Correct answer
1. The ability to copy the text does not prove there was a ToUnicode map. ISO 32000-1 gives the method for text extraction, and ToUnicode is only part of this. Also, where these methods do not apply, a viewer may use other methods and special knowledge. In fact, the CMap 90ms-RKSJ-H has a well-defined mapping to Unicode.
2. CMaps may not survive redistilling.
3. ToUnicode cannot survive redistilling. When a PDF is printed, the print mechanism is concerned ONLY with visible entities. NOTHING else is designed to be kept, including interactivity and searchability.
4. For these reasons and many others, redistilling (also called "refrying") is considered a very poor workflow indeed. It is not forbidden, but it is certainly not supported by Adobe, and it is no longer possible on Mac. Best to find an alternative. If you tell us why you do this, we may be able to suggest one.
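
To illustrate point 1, here is a rough Python sketch of the order in which ISO 32000-1 (section 9.10.2, as I read it) looks for a route from character codes to Unicode. It is not how Acrobat is implemented. The function name `unicode_route`, the plain-dict model of a font dictionary, and the sample fonts are all made up for illustration, the CMap list is only a partial copy of the predefined Japanese CMaps, and the Differences-array case for simple fonts is left out.

```python
# Sketch only: a PDF font dictionary is modelled as a plain Python dict.
# All names here are hypothetical; nothing is read from a real PDF file.

PREDEFINED_SIMPLE_ENCODINGS = {"MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"}

# Partial list of predefined Japanese CMaps; Identity-H/V are deliberately excluded.
SOME_PREDEFINED_CMAPS = {
    "90ms-RKSJ-H", "90ms-RKSJ-V", "90msp-RKSJ-H", "90msp-RKSJ-V",
    "UniJIS-UCS2-H", "UniJIS-UCS2-V", "UniJIS-UTF16-H", "UniJIS-UTF16-V",
}

ADOBE_CJK_ORDERINGS = {"GB1", "CNS1", "Japan1", "Korea1"}


def unicode_route(font):
    """Return a label for the first Unicode-mapping route that applies."""
    # 1. An explicit ToUnicode CMap always takes precedence.
    if "ToUnicode" in font:
        return "ToUnicode CMap"

    encoding = font.get("Encoding")

    # 2. Simple fonts: character code -> glyph name -> Unicode.
    if font.get("Subtype") != "Type0":
        if encoding in PREDEFINED_SIMPLE_ENCODINGS:
            return "glyph names via predefined encoding"
        return "no standard route"

    # 3. Composite fonts: code -> CID -> Unicode via <Registry>-<Ordering>-UCS2.
    info = font.get("CIDSystemInfo", {})
    registry = info.get("Registry")
    ordering = info.get("Ordering")
    uses_predefined_cmap = encoding in SOME_PREDEFINED_CMAPS
    uses_adobe_collection = registry == "Adobe" and ordering in ADOBE_CJK_ORDERINGS
    if (uses_predefined_cmap or uses_adobe_collection) and registry and ordering:
        return f"{registry}-{ordering}-UCS2 CMap"

    # 4. Otherwise the standard gives no way to recover Unicode.
    return "no standard route"


# Hypothetical fonts resembling the cases discussed in this thread.
rksj_font = {
    "Subtype": "Type0",
    "Encoding": "90ms-RKSJ-H",
    "CIDSystemInfo": {"Registry": "Adobe", "Ordering": "Japan1"},
}
refried_font = {
    "Subtype": "Type0",
    "Encoding": "Identity-H",
    "CIDSystemInfo": {"Registry": "Adobe", "Ordering": "Identity"},
}

print(unicode_route(rksj_font))
print(unicode_route(refried_font))
```

In this model the 90ms-RKSJ-H font is extractable without any ToUnicode entry, while an Identity-H font with an Adobe-Identity collection and no ToUnicode has no standard route. Whether a redistilled file actually ends up like `refried_font` is an assumption here; inspect the font dictionaries in the real output to confirm.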

