How do I optimize for Russian OCR?

Report · Nov 19, 2018

I was hoping to use Acrobat to scan a Russian book into an editable form to aid in the translation of the book. Many of the Cyrillic characters are not being properly converted. Is there something I could do to improve the results?

Report · Nov 19, 2018

Hi Wayne,

The big question I have is what resolution are you scanning at? The higher the resolution, the better the OCR quality will be.

I do not know Cyrillic, so please excuse me for using Western characters as an example instead: take the letter combination "ri." If the text is small and the resolution is low, it is very easily interpreted as "n."

When you scan the book, I'd try to aim for about 600 ppi.

Also, you did not state what kind of scanner or what operating system you are using. If you are using a Mac, the only way to scan from within the Mac OS is to use Apple's Image Capture. Trust me, that's about as useless as scanning software gets (and I've been a Mac user since 1985). Your best bet is to scan with the software that came with your scanner. You do have more options on the PC, but I hear about so many issues and problems with PC scanning as well that, again, I strongly suggest you just scan directly with the software that came with your scanner.

When you do the scanning, save the images as TIF documents (avoid JPG). Do not be alarmed by the size of these documents, as they could be 6-8 MB per page. When you process the pages through Acrobat they should end up at about 60-80 KB per page (and then you can toss the TIF documents).
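To get a feel for why the TIF files are so large, here is a rough back-of-the-envelope sketch in Python. It assumes a US Letter page (8.5 x 11 inches) and no compression, neither of which is stated in this thread, and the function name is made up for illustration. Real files will usually differ because most scanning software compresses TIF output.

```python
def uncompressed_scan_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Estimate the raw (uncompressed) size of one scanned page in bytes."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    # Each pixel row is padded up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


# Assumed page size: US Letter, 8.5 x 11 inches (measure the actual book).
for ppi in (300, 600):
    for label, bits in (("1-bit black and white", 1), ("8-bit grayscale", 8)):
        size_mb = uncompressed_scan_bytes(8.5, 11, ppi, bits) / 1_000_000
        print(f"{ppi} ppi, {label}: {size_mb:.1f} MB")
```

The main takeaway is that doubling the resolution quadruples the pixel count, so a 600 ppi scan holds about four times the raw data of a 300 ppi scan of the same page.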

Please let us know if this resolves any of your issues.

Report · Nov 20, 2018

Thank you for your reply. It prompted me to vary some of the parameters to see what helped, in the hope of fine-tuning the scan to get the result I need. The book was printed from microfiche and the pages have a gray background from low contrast, but overall the pages are readable and the characters are clear. I am using a PC running Windows 10 with an HP Officejet 5746 for scanning. I am scanning directly into Acrobat Pro DC using File>Create>PDF from Scanner>HP Officejet 5740 series TWAIN>Grayscale Document. The settings I varied are the Resolution and Quality. The best result was using 300 dpi, Optimize 20% High, OCR Russian Searchable. Even so, only about 20% of the words were converted correctly. I would probably need about 80% correct to make the process useful.
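For anyone who wants to put a number on "percent of words converted correctly" while trying different settings, one option is to hand-type a short passage and compare it with the OCR output. The sketch below uses Python's standard `difflib` module. The function name and the sample strings are hypothetical (the "OCR" line is invented, not real Acrobat output), and this is only one of several reasonable ways to score word accuracy.

```python
from difflib import SequenceMatcher


def word_accuracy(reference_text, ocr_text):
    """Fraction of reference words that the OCR output reproduces in order."""
    reference_words = reference_text.split()
    ocr_words = ocr_text.split()
    if not reference_words:
        return 0.0
    # autojunk=False keeps frequent words from being ignored on long texts.
    matcher = SequenceMatcher(None, reference_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference_words)


# Hypothetical sample: a hand-typed line and a made-up OCR result for it.
reference = "в начале было слово и слово было у бога"
ocr_output = "в начале бьло слово н слово было у 6ога"
print(f"{word_accuracy(reference, ocr_output):.0%} of words matched")
```

Scoring the same sample passage after each change in resolution or contrast makes it easier to tell whether a setting actually helped.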

Report · Nov 20, 2018

Hi Wayne,

Oooh, that's going to be hard to overcome (the microfiche). As far as the gray goes, check out the concepts in this blog post I wrote. You might find it helpful for the gray and some of the other artifacts.

https://forums.adobe.com/community/creativepipeline/blog/2018/01/22/scanning-clean-search-able-pdfs

Report · Nov 23, 2018

I gave it a good try. I used several resolutions, bit depths and post-processing of TIF and JPG to improve the contrast of the image. None of the OCR attempts gave an improvement over 20% of the words being converted correctly. I appreciate your help on this. Everything you said I believe is correct. I have now canceled my Acrobat trial and will move on to other ways to translate the document.
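The thread does not say which tool or method was used for the contrast post-processing. As an illustration only, here is a minimal sketch of one common approach, a percentile-based linear contrast stretch, written in plain Python on a small grid of gray levels (0 is black, 255 is white). The function name, the percentile defaults, and the sample values are all assumptions, and a real page would be loaded with an imaging library rather than typed in.

```python
def stretch_contrast(pixels, low_pct=5, high_pct=95):
    """Linearly remap gray levels so low_pct maps to 0 and high_pct to 255."""
    flat = sorted(value for row in pixels for value in row)
    low = flat[(len(flat) - 1) * low_pct // 100]
    high = flat[(len(flat) - 1) * high_pct // 100]
    if high <= low:
        # Flat image: nothing to stretch, return an unchanged copy.
        return [row[:] for row in pixels]
    scale = 255 / (high - low)
    return [
        [min(255, max(0, round((value - low) * scale))) for value in row]
        for row in pixels
    ]


# Hypothetical 3x4 patch: dark-gray text (about 90) on a gray background (about 170).
patch = [
    [170, 168, 92, 171],
    [169, 90, 88, 172],
    [171, 170, 91, 169],
]
for row in stretch_contrast(patch):
    print(row)
```

A stretch like this pushes a gray background toward white and the text toward black, but it cannot restore character detail that was already lost before scanning.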

Report · Nov 23, 2018

Hi Wayne,

Gosh, I'm sorry that this is not working for you. I'd offer to look at one of the documents, but since I do not know Cyrillic, I'd have no way to know whether it was "reading" the document correctly or not.

I do think the big problem is that the original document is from microfiche. By the time you enlarge the document to full size, the resolution has got to be poor (at least poor for OCR). I would be curious to see what this looks like. If you could send one of these to me on this thread, I'd like to see what you're working with. Again, this is just curiosity.

Best,

Report · Nov 24, 2018

Hi gary_sc-

Here is the first page of the book I was trying to OCR. The second line "Kommyhap..." was the only one that converted correctly, probably because it was in a bold font. The rest converted poorly. (This is a cropped 300 dpi JPG using default exposure, a scaled-down version that this forum allowed to pass through.)

Cheers,

Wayne