How to OCR using SDK in C#

May 10, 2016

Hi All,

Below is my requirement in detail.

I have a PDF that contains scanned documents.

I want to convert the PDF content to XML.

Can anyone help me achieve this using an SDK with C#?

Manoj K Singh

May 12, 2016

This isn't that useful, but I believe it'll end up being a two-stage process, and you may need two or more libraries to perform the various steps.

Don't get me wrong, there are OCR libs for C# that read PDFs full of images, and no doubt you saw the price was over $4k for a developer license haha. There are a few. I'm assuming you want to avoid that.

There are tools like Xpdf and others that you should find and try, just so you can extract the scanned images from the PDF. After you get those images, you might need to convert them to a different image format, and then feed them into an OCR library. Google manages a project called Tesseract OCR, which you may want to look at. I believe it's a native C++ library, but you know there are ways to use a C++ library from C#, for example through a .NET wrapper or P/Invoke against its C API.
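To make the two stages concrete, here's a rough sketch of how I'd wire it up. It is not a tested solution, and it makes several assumptions: that the `pdfimages` (from Xpdf) and `tesseract` command-line tools are installed and on your PATH, that your Tesseract build can read whatever image format `pdfimages` writes (if not, a conversion step goes between the two stages), and that a simple XML layout is good enough. The `document`/`image` element names and the `PdfOcrSketch` class are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class PdfOcrSketch
{
    // Runs an external command-line tool and waits for it to finish.
    static void Run(string exe, string arguments)
    {
        var psi = new ProcessStartInfo(exe, arguments)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var process = Process.Start(psi))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(exe + " exited with code " + process.ExitCode);
        }
    }

    // Drops characters that XML 1.0 does not allow (OCR output can
    // contain control characters such as form feeds).
    static string Clean(string text)
    {
        return new string(text
            .Where(c => XmlConvert.IsXmlChar(c) || char.IsSurrogate(c))
            .ToArray());
    }

    // Wraps the recognized text of each image in a made-up XML layout:
    // <document><image index="1">...</image>...</document>
    public static XDocument BuildXml(IEnumerable<string> texts)
    {
        return new XDocument(
            new XElement("document",
                texts.Select((t, i) =>
                    new XElement("image", new XAttribute("index", i + 1), Clean(t)))));
    }

    public static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.Error.WriteLine("usage: PdfOcrSketch <input.pdf> <output.xml>");
            return;
        }

        string workDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(workDir);

        // Stage 1: pull the scanned images out of the PDF.
        // pdfimages writes files named <root>-<number>.<ext>.
        Run("pdfimages", $"\"{args[0]}\" \"{Path.Combine(workDir, "img")}\"");

        // Stage 2: OCR each image. "tesseract <image> <outputbase>"
        // writes the recognized text to <outputbase>.txt.
        var texts = new List<string>();
        var images = Directory.GetFiles(workDir, "img-*")
                              .OrderBy(f => f, StringComparer.Ordinal);
        foreach (string image in images)
        {
            Run("tesseract", $"\"{image}\" \"{image}\"");
            texts.Add(File.ReadAllText(image + ".txt"));
        }

        BuildXml(texts).Save(args[1]);
    }
}
```

Shelling out to the command-line tools keeps the C# side free of native interop. If you'd rather call Tesseract in-process, you'd replace the second `Run` call with whichever wrapper you pick. The temp folder isn't cleaned up here, so you'd want to delete it once the XML is saved.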

It's a lot of work, but that's probably why the vendors price their direct PDF image -> OCR text plugins so high.
