OCR in Japanese yields only garbage characters
- July 4, 2022
[Problem]
Downloaded PDF of a Japanese academic article. OCRed it with language set to Japanese. Result was garbage characters only, as in the example below:
Text: 1970年代アジアにおけるグローバル化の波及と日本
OCR result: 1970 蟷エ莉」繧「繧ク繧「縺ォ縺翫¢繧九げ繝ュ繝シ繝舌Ν蛹悶豕「蜿翫→譌・譛ャ
The same garbage characters appear when the PDF is opened in other PDF readers.
Same result with Enhance.
This is the third time this has happened recently, so I'm beginning to think it's a bug.
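For what it's worth, the garbage in the example looks like what you get when UTF-8 bytes are read as Shift-JIS, which would mean the OCR recognized the text correctly and the damage happens when it is stored or decoded. This is only a guess on my part, not something Acrobat reports. Below is a minimal Python sketch to check it; the function name is my own, and the NFKC step is an assumption to account for half-width characters showing up as full-width in the pasted result.

```python
import unicodedata


def simulate_mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Shift-JIS (cp932)."""
    raw = text.encode("utf-8")
    # Byte sequences with no Shift-JIS meaning become U+FFFD here;
    # a real PDF viewer may drop them or render them differently.
    misread = raw.decode("cp932", errors="replace")
    # Assumption: fold half-width katakana and brackets to full-width forms,
    # since that is how the pasted OCR result appears.
    return unicodedata.normalize("NFKC", misread)


if __name__ == "__main__":
    expected_text = "1970年代アジアにおけるグローバル化の波及と日本"
    print(simulate_mojibake(expected_text))
```

If the printed string lines up with the OCR result above (apart from characters that have no Shift-JIS equivalent), the problem is most likely an encoding mix-up in the text layer rather than bad recognition.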
[Workarounds attempted]
Exported the article to TIFF and recombined into a new PDF. OCRed. Same result.
Tried again with PNG. Same result.
Tried embedding fonts with Preflight. Same result.
Tried to export to plain text.
Error: Bad PDF; could not read page structure. [13]
[What now?]
The PDF structure was created from TIFFs. If the structure is "bad," then Acrobat is responsible, as far as I can tell.
Is there a workaround?
Attached original PDF for reference.