April 29, 2020
Answered

Inconsistency with OCR legibility (in Japanese)


Hello, I'm seeking help with an issue regarding text that I have OCRed using Acrobat. In brief: sometimes, text that I have OCRed in Japanese can't be successfully copied out of the document.

Please allow me to explain my use case a little further. I am using DevonThink, a database program, to manage a library of PDFs and other documents. This program indexes the content of all of the files that are added to it, so that I can search my entire library all at once. My documents are a mix of English and Japanese, and over the years I have scanned them in a few different ways: sometimes with a proper Xerox machine, sometimes with an app on my phone. I've checked, and no matter the source, all of my English-language PDFs that I have OCRed (using Acrobat XI, though I am now using Pro DC) appear in my DevonThink search results. However, I recently noticed that I'm missing many results in Japanese. In other words, I search for a word that I know appears hundreds of times in document X, and document X does not show up in the search results, even when I can copy the text out of Acrobat. If I copy/paste text from this kind of PDF out of DevonThink (or Preview, for that matter), I get a blank line or three, depending on the length of the text I have selected.

When I looked more carefully, I found that my PDFs in Japanese fall into the following three categories:

  1. I can copy/paste plain text out of Acrobat, and the same works in DevonThink (the PDF is indexed correctly, great)
  2. I can copy/paste plain text out of Acrobat, but the same selection pasted from DevonThink comes out blank (the PDF is not indexed)
  3. When I copy/paste text from Acrobat I get total gibberish, and the same happens in DevonThink
     • Note: in this case, if I select "Copy with formatting" in Acrobat, I can paste the plain text with no issue, but this does not solve anything for DevonThink
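
For anyone who wants to sort a larger batch of PDFs into these three categories without pasting by hand, here is a minimal Python sketch. It assumes the text has already been pulled out of the PDF by some other tool (the extraction step is not shown), and it only looks at the resulting string. The function names, the labels, and the 0.3 threshold are made up for illustration and are not part of Acrobat or DevonThink.

```python
# Rough triage of text extracted from an OCRed Japanese PDF.
# Assumption: `text` was obtained elsewhere (copy/paste or any PDF text extractor).

def is_japanese_char(ch: str) -> bool:
    """True for Hiragana, Katakana, or common CJK ideographs (kanji)."""
    code = ord(ch)
    return (
        0x3040 <= code <= 0x309F      # Hiragana
        or 0x30A0 <= code <= 0x30FF   # Katakana
        or 0x4E00 <= code <= 0x9FFF   # CJK Unified Ideographs
    )

def classify_extracted_text(text: str, threshold: float = 0.3) -> str:
    """Return 'blank', 'japanese', or 'suspect' for a piece of extracted text.

    'blank'    - only whitespace came out (like category 2 in DevonThink)
    'japanese' - at least `threshold` of the visible characters are Japanese
    'suspect'  - visible characters, but few or none are Japanese (possibly
                 gibberish as in category 3, or simply non-Japanese text)
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return "blank"
    japanese = sum(1 for ch in visible if is_japanese_char(ch))
    if japanese / len(visible) >= threshold:
        return "japanese"
    return "suspect"

if __name__ == "__main__":
    samples = ["\n\n\n", "日本語のテキスト", "ÿþ¤¢£"]
    for sample in samples:
        print(repr(sample), classify_extracted_text(sample))
```

A "suspect" result only means the string contains little or no Japanese, so an English-language document would land there too; the check is meant for files that are known to be in Japanese.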

 

If the answer to this issue is simply "it depends on the quality of your scan," hey, I can live with that. I can guess that OCRing Japanese characters is not a simple technical task! That said, in the cases where things aren't working right, the text is still in there somewhere, because in case 2 I can copy directly out of Acrobat, and in case 3 I can copy when I use "Copy with formatting."

I know this might not be, strictly speaking, an Acrobat question, because it has to do with the interaction of another program with my PDFs, but I can't help wondering whether there is something I could do in Acrobat to fix this. I don't have any sophisticated technical knowledge of how PDFs work, which is why I'm here. I thought this might have something to do with fonts, but I'm attaching for reference the Preflight "List potential font problems" reports for all 3 cases, and there's nothing obvious in them. This issue has been bugging me for a while now, so I'll be grateful for any help!!


1 reply

dana36052552 (Author, Correct answer)
May 19, 2020

On the extreme off chance that anyone else has faced a similar problem (lol) — upgrade your Mac OS to at least Mojave and everything should sort itself out...