OCR in Japanese yields only garbage characters

Question

[Problem]
Downloaded PDF of a Japanese academic article. OCRed it with language set to Japanese. Result was garbage characters only, as in the example below:

Text: 1970年代アジアにおけるグローバル化の波及と日本
OCR result: 1970 蟷ｴ莉｣繧｢繧ｸ繧｢縺ｫ縺翫¢繧九げ繝ｭ繝ｼ繝舌Ν蛹悶􈆮豕｢蜿翫→譌･譛ｬ

These garbage characters are identical in other PDF readers. Same with Enhance. This is the third time this has happened recently, so I'm beginning to think it's a bug.

[Workarounds attempted]
Exported the article to TIFF and recombined into a new PDF. OCRed. Same result. Tried again with PNG. Same result. Tried embedding fonts with Preflight. Same result. Tried to export to plain text. Error: Bad PDF; could not read page structure. [13]

[What now?]
The PDF structure was created from TIFFs. If it's a "bad structure," then Acrobat is responsible as far as I can tell. Is there a workaround? Attached original PDF for reference.

Omachi · Answer

This result shows that the text was originally encoded as UTF-8, but its bytes were interpreted as Shift-JIS at the time of output or reading.

Presumably Shift-JIS, a character encoding specific to Japanese, was used because the OCR language was set to Japanese.
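The diagnosis above can be reproduced outside Acrobat. The following is a minimal Python sketch, not a fix for the PDF itself: it assumes the mismatch is exactly UTF-8 bytes read as Shift-JIS, and it uses Python's "cp932" codec (Microsoft's Shift-JIS variant) as a stand-in for whatever Acrobat uses internally. The function names are hypothetical.

```python
# Sketch of the suspected encoding mismatch: UTF-8 bytes misread as Shift-JIS.
# "cp932" is Python's codec name for Microsoft's Shift-JIS variant (an assumption
# about which variant is involved; Acrobat's internals are not known here).


def to_mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Shift-JIS."""
    # errors="replace" is needed because some UTF-8 byte pairs are not
    # valid Shift-JIS characters and would otherwise raise an error.
    return text.encode("utf-8").decode("cp932", errors="replace")


def from_mojibake(garbled: str) -> str:
    """Reverse the mistake: get the Shift-JIS bytes back and read them as UTF-8."""
    raw = garbled.encode("cp932", errors="replace")
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(to_mojibake("日本"))
    print(from_mojibake(to_mojibake("日本")))
```

Copying the garbled OCR text out of the PDF and passing it through a function like `from_mojibake` may recover most of the original, but the round trip can be lossy: wherever the UTF-8 bytes do not correspond to a valid Shift-JIS character, the original bytes are gone and that character cannot be restored.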

Unfortunately, I don't know how to deal with it.
