Participating Frequently
July 4, 2022
Question

OCR in Japanese yields only garbage characters

[Problem]

I downloaded a PDF of a Japanese academic article and ran OCR with the language set to Japanese. The result was nothing but garbage characters, as in the example below:

Text: 1970年代アジアにおけるグローバル化の波及と日本

OCR result: 1970 蟷エ莉」繧「繧ク繧「縺ォ縺翫¢繧九げ繝ュ繝シ繝舌Ν蛹悶􈆮豕「蜿翫→譌・譛ャ

 

The same garbage characters show up when the file is opened in other PDF readers.

Same result with Enhance.

 

This is the third time this has happened recently, so I'm beginning to think it's a bug.

 

[Workarounds attempted]

Exported the article to TIFF, recombined the images into a new PDF, and ran OCR. Same result.

Tried again with PNG. Same result.

Tried embedding fonts with Preflight. Same result.

Tried to export to plain text, which failed with:

Error: Bad PDF; could not read page structure. [13]

 

[What now?]

The PDF structure was created from the TIFFs in Acrobat. If it's a "bad structure," then Acrobat is responsible for it, as far as I can tell.

Is there a workaround?

Attached original PDF for reference.

2 replies

Omachi
Legend
July 6, 2022

This result is what you get when text that was encoded as UTF-8 is decoded as Shift-JIS, either on output or when it is read back.

Presumably Shift-JIS, a character encoding specific to Japanese, was used because the OCR language was set to Japanese.

Unfortunately, I don't know how to fix it.

Participating Frequently
July 5, 2022

Follow up:

This is definitely a bug, and I think it's a new one. I've been able to reproduce the same error with multiple files, both scanned ones and downloaded ones that I then OCRed. As far as I can tell, it looks like Acrobat is trying to OCR into fonts I don't have installed. That's... dumb.

[Workaround]

At least in some files, selecting Edit PDF and then highlighting any text on the page seems to fix the issue for the entire document. Not sure this applies to all files.

Participating Frequently
July 5, 2022

No, sorry, that workaround only works for each page individually. ┐(´д`)┌