Creating a PDF: why is the kanji U+9AD8 (高) changed to U+2FBC (⾼) with Meiryo UI?

Report · Sep 23, 2019

When I create a PDF with Adobe Acrobat Distiller, Acrobat changes the kanji U+9AD8 (高) to U+2FBC (⾼) if the font is Meiryo UI.

As a result, many documents on the internet contain U+2FBC (⾼).

Normally it is difficult to type the character U+2FBC (⾼) into a document, so these characters were probably not entered by the authors.

This behavior is inconvenient: we cannot search such documents for "高".

Could you please let Adobe know about this problem?

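Step 1: The original Word document.

Step 2: The PDF made with Acrobat. Searching for U+6587 (文) does not find the text set in Meiryo UI.

Step 3: The PDF made with Word. Searching for U+6587 (文) finds the text set in Meiryo UI. No problem.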
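The search failure described above follows from the two characters being distinct code points, even though many fonts draw them identically. Here is a minimal Python sketch using only the standard `unicodedata` module. The sample string is made up for illustration and is not taken from the attached PDFs.

```python
import unicodedata

unified = "\u9ad8"   # the CJK unified ideograph that is normally typed
radical = "\u2fbc"   # the look-alike Kangxi radical found in the Distiller PDF

# Show that the two look-alikes are different characters.
for ch in (unified, radical):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Hypothetical text as it might be extracted from an affected PDF.
extracted = "\u57fc\u7389\u770c\u65e5" + radical + "\u5e02"

# A plain substring search for the typed character does not match the radical.
found_plain = unified in extracted

# NFKC compatibility normalization folds the radical back to the ideograph.
found_normalized = unified in unicodedata.normalize("NFKC", extracted)

print(found_plain, found_normalized)
```

A search tool that normalizes both the query and the text in this way would still match, but a plain code point comparison will not.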

Report · Sep 23, 2019

What application was the original document created with before you converted it to PDF?

Report · Sep 25, 2019

Thank you, ls_rbls-san. I used Word from Office 365. The problem appears with the Meiryo UI font, but not with MS UI Gothic.

Report · Sep 23, 2019

Distiller does not understand CJK remapping; it just takes its input and makes a PDF. So we need to look closely at all the steps and settings that you use on the way to the PDF. I checked the Meiryo UI font included with Windows 8.1, and it does include U+9AD8.

An interesting point is that Chrome shows both of your code points as identical, while some pages show them differently.

(Key point for me: Is the low centre box detached?)

Report · Sep 25, 2019

Thank you, Test_Screen_Name-san. Of course, Distiller itself does not remap CJK characters.

However, some applications have a function that prefers the lower code point over the higher one when more than one kanji code point is available. This is because kanji have both simplified (current) forms and complex (old) forms. For example:

U+4E80 (亀) and U+9F9C (龜). These two kanji have the same meaning, kame (turtle). The function selects U+4E80 rather than U+9F9C, because users are expected to choose the current form. But Meiryo has an even lower code point, U+2FD4 (⿔), so this phenomenon would occur if Distiller's code includes such a function.
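The hypothesis above can be illustrated with a toy model. This is only a sketch of the behavior the poster suspects, not Distiller's actual code: the `cmap` table, the glyph names, and the functions `reverse_lowest` and `roundtrip` are all hypothetical. It assumes a font in which a CJK ideograph and its Kangxi radical share one glyph, and a converter that recovers text from glyphs by picking the lowest code point.

```python
# Hypothetical font character map: code point -> glyph name.
# Assumption: each ideograph and its Kangxi radical share a single glyph.
cmap = {
    0x2FBC: "glyph_tall",    # KANGXI RADICAL TALL
    0x9AD8: "glyph_tall",    # CJK ideograph "tall"
    0x2FD4: "glyph_turtle",  # KANGXI RADICAL TURTLE
    0x9F9C: "glyph_turtle",  # CJK ideograph "turtle" (old form)
    0x5E02: "glyph_city",    # CJK ideograph "city", no radical twin here
}


def reverse_lowest(cmap):
    """Map each glyph back to the lowest code point that uses it."""
    reverse = {}
    for code, glyph in cmap.items():
        if glyph not in reverse or code < reverse[glyph]:
            reverse[glyph] = code
    return reverse


def roundtrip(text, cmap):
    """Text -> glyphs -> text, as a converter that only sees glyphs might do."""
    reverse = reverse_lowest(cmap)
    return "".join(chr(reverse[cmap[ord(ch)]]) for ch in text)


original = "\u9ad8\u5e02"  # U+9AD8 followed by U+5E02
result = roundtrip(original, cmap)
print([f"U+{ord(c):04X}" for c in result])
```

In this model, a font table without Kangxi radical entries would return the original text unchanged, which matches the report that MS UI Gothic is not affected.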

Report · Sep 25, 2019

Step 1. I create the original document with Word from Office 365.

Report · Sep 25, 2019

Step 2. I convert the original document to PDF with Acrobat Distiller.

I open the PDF in Acrobat Reader.

I search for "文" (U+6587).

Result

The search finds the character set in MS UI Gothic.

The search does not find the character set in Meiryo UI.

Report · Sep 25, 2019

Since you have Acrobat, I assume Acrobat DC, please convert with the Acrobat ribbon in Microsoft Word. This does not use Distiller and should get much better results.

Report · Sep 25, 2019

Thank you, Test_Screen_Name-san. Of course, if I don't use Distiller, I get the best results. Only Distiller scatters these bad characters around.

Report · Sep 25, 2019

Step 3. I convert the original document to PDF with Word's "save as PDF" function. This PDF has no problem.

I open the PDF file in Acrobat Reader.

I search for "文" (U+6587).

Result

The search finds "文" in both MS UI Gothic and Meiryo UI.

There is no problem.

Report · Sep 23, 2019

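We would need information such as the original OS version and authoring application, the Distiller version, and how the conversion was performed.

That said, I created a document containing the text "高い" on Windows 10 with Word 2016, exported it to PDF through the Adobe PDF printer driver with the default settings, extracted the text from that PDF, and checked the codes. As far as I can tell, the character is U+9AD8.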

Report · Oct 01, 2019

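Thank you, assause-san. The key point is to use the Meiryo UI font. It does not happen with MS Gothic and similar fonts. The cause of this problem is related to the library that Distiller uses. When character encodings such as EUC, SJIS, and UTF-16 are converted, the result is unexpected, because in the Meiryo UI font the CJK kanji and the Kangxi radical code points are linked to the same glyph. With fonts such as MS Gothic, which have no link to the Kangxi radical code points, this problem does not occur.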

Report · Oct 01, 2019

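I tried again, and what I observed was that U+2AD8 is unified into U+9FDC when the document is converted to PDF.

I used several fonts, and the result was the same with each of them.

I am attaching the data I actually tested.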

Report · Oct 03, 2019

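Thank you, assause-san. That is the phenomenon. The code is U+2FBC, not U+2AD8, but it really ought to stay U+2FBC. However, Microsoft's library unifies it into U+9FDC. This function can backfire, so that the text ends up with the U+2FBC character instead.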

Report · Oct 03, 2019

The 高 set in MS UI Gothic in the second PDF is a different character, so it should have contained the code U+2FBC, but it changes to U+9FDC when converted to PDF.

Report · Oct 03, 2019

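Please try searching the web for 高 and for the U+2FBC character ⾼. You get different results. I cannot believe that people are typing the U+2FBC ⾼ in the normal course of things. Do you see what I mean? One of the causes is Distiller.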

Report · Oct 03, 2019

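Do you mean that there are also cases where a CJK unified ideograph changes to the Kangxi radical?
The PDF on the web that you pointed to did indeed contain the Kangxi radical, but that alone makes it hard to say for certain.

As ccc3141592-san says, U+2FBC is basically never typed.
If this is a change from a CJK unified ideograph to a Kangxi radical, I agree it is a problem, but I think it is hasty to conclude that from the resulting PDF and the conversion engine alone.
It is therefore necessary to clarify the whole sequence, starting from the authoring application, and to establish a reliable way to reproduce it. At a minimum, it is essential to check the state of the data before it goes through Distiller, for example the data in the application and the generated PS file.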

Report · Oct 03, 2019

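Thank you for acknowledging the possibility. I asked the person who made that PDF how it was made, and the problem did occur with Meiryo UI, Word, and Distiller X. As you say, it would be best if I could show you the actual file. At the very least, it is certain that nobody typed U+2FBC or similar characters, and yet PDFs containing unsearchable characters such as U+2FBC ended up on the web. There are quite a lot of Kangxi radicals, so this is a serious nuisance. Adobe should fix it.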

Report · Oct 04, 2019

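If it occurs only with version X, that version is out of support, so you would be asked to move to the current version.

So there is also the question of whether it occurs with a DC subscription combined with a supported version of Office.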

Report · Jul 13, 2020

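Hello. This problem also occurs in the current version, Acrobat Distiller 20.0 (Windows).

It is not limited to Meiryo UI; it can also be confirmed with Meiryo, Yu Mincho, Yu Gothic, and other fonts.

Steps to reproduce: In Word or a similar application, type "埼玉県日高市" and set the font to Meiryo or a similar font. Under "Print", select "Adobe PDF" as the printer and print (output to a PDF file). Open that PDF in Acrobat, select and copy the text, and check its character codes. The characters that should have been "玉", "日", and "高" have changed to the following: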

⽟ U+2F5F KANGXI RADICAL JADE
⽇ U+2F47 KANGXI RADICAL SUN
⾼ U+2FBC KANGXI RADICAL TALL
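The check described in the reproduction steps can be scripted. The sketch below assumes the text has already been copied or extracted from the PDF into a Python string. `find_kangxi_radicals` is a hypothetical helper name, and the sample string is built from the code points reported above. The function flags characters in the Unicode Kangxi Radicals block (U+2F00 to U+2FD5).

```python
import unicodedata

# Range of the Unicode "Kangxi Radicals" block.
KANGXI_FIRST, KANGXI_LAST = 0x2F00, 0x2FD5


def find_kangxi_radicals(text):
    """Return (index, character, Unicode name) for each Kangxi radical in text."""
    return [
        (i, ch, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if KANGXI_FIRST <= ord(ch) <= KANGXI_LAST
    ]


# Text as reported in the reproduction steps: the second, fourth, and fifth
# characters have been replaced by U+2F5F, U+2F47, and U+2FBC.
extracted = "\u57fc\u2f5f\u770c\u2f47\u2fbc\u5e02"

for index, ch, name in find_kangxi_radicals(extracted):
    print(f"position {index}: U+{ord(ch):04X} {name}")
```

Running the same function on the text as originally typed returns an empty list, so it can serve as a quick before-and-after comparison.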

This issue seems to be causing trouble in various places:

https://twitter.com/apricoton/status/771574863815249920

https://twitter.com/koedameiro/status/1107114209815326720

https://twitter.com/hal_sk/status/1281853581218336768

As a result, PDFs generated by Acrobat Distiller have accessibility problems (correct text cannot be extracted, searching fails, and text-to-speech does not work), and the data is hard to reuse.
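For text that has already been extracted from an affected PDF, one possible workaround on the consumer side is to fold Kangxi radicals back to their unified ideographs before searching or reusing the text. This is only a sketch and not an Adobe fix. `fold_kangxi_radicals` is a hypothetical helper, and it relies on the compatibility mappings that Unicode defines for the Kangxi Radicals block.

```python
import unicodedata


def fold_kangxi_radicals(text):
    """Replace Kangxi radicals (U+2F00..U+2FD5) with their unified ideographs.

    Only characters in that block are normalized; everything else is left
    untouched, unlike applying NFKC to the whole string.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch) if 0x2F00 <= ord(ch) <= 0x2FD5 else ch
        for ch in text
    )


# Sample text with U+2F5F, U+2F47, and U+2FBC in place of the typed kanji.
damaged = "\u57fc\u2f5f\u770c\u2f47\u2fbc\u5e02"
repaired = fold_kangxi_radicals(damaged)

print([f"U+{ord(c):04X}" for c in repaired])
```

Restricting the normalization to this block avoids altering other compatibility characters, such as full-width letters, that the author may have typed on purpose.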

I very much hope Adobe will recognize the problem and take action.

Report · Jul 13, 2020

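Murakami-san

Hello. So it still has not been fixed. I have given up, and I recommend that everyone save directly to PDF from Word, Excel, and so on. That causes no problems.

I suspect that when Distiller is used, the problem is not limited to the Meiryo fonts: any newer font that actually has glyphs at the Kangxi radical code points will run into it.

As Murakami-san's post says, this is a troublesome problem. Unless Adobe fixes it soon, a large number of junk PDF files will accumulate on the web and never go away. They cannot be searched or converted, so I think it is a very serious problem.
Adobe, or anyone involved, please respond!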