manojkumsingh
Participant
May 11, 2016
Question

How to OCR using SDK in C#

  • May 11, 2016
  • 1 reply
  • 2501 views

Hi All,

Below is my requirement in detail.

I have a PDF that contains scanned documents.

I want to convert the PDF content to XML.

Can anyone help me achieve this using the SDK with C#?

Manoj K Singh



sinious
Legend
May 12, 2016

This may not be that useful, but I believe it'll end up being a two-stage process, and you may need two or more libraries to perform the various steps.

Don't get me wrong, there are a few OCR libs for C# that read PDFs full of images, and no doubt you saw that the price was over $4k for a developer license, haha. I'm assuming you want to avoid that.

There are tools like Xpdf and others that you should find and try, just so you can extract the images from the PDF. After you get those images, you might need to convert them to a different image format and then feed them into an OCR library. Google manages a project called Tesseract OCR, which you may want to look at. I believe it's a C++ library, but there are ways to use a C++ library from C#.
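The two stages above could be sketched in C# roughly like this. It's only a sketch under some assumptions: .NET 6 or later, Poppler's `pdfimages` (Poppler is an Xpdf fork; its `-png` flag writes PNG files) and the `tesseract` command-line program are both installed and on the PATH, and each page holds a single scanned image. The class name, the `Run` helper, and the `<document>`/`<page>` XML layout are made up for illustration and are not part of any SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class ScannedPdfToXml
{
    // Runs an external command-line tool and throws if it exits with a non-zero code.
    static void Run(string exe, params string[] args)
    {
        var psi = new ProcessStartInfo(exe) { UseShellExecute = false };
        foreach (var arg in args) psi.ArgumentList.Add(arg);

        using var process = Process.Start(psi)
            ?? throw new InvalidOperationException($"Could not start {exe}");
        process.WaitForExit();
        if (process.ExitCode != 0)
            throw new InvalidOperationException($"{exe} exited with code {process.ExitCode}");
    }

    // Wraps the OCR text of each page in a simple, hypothetical XML layout:
    // <document><page number="1">...</page>...</document>
    public static XDocument BuildXml(IEnumerable<string> pageTexts) =>
        new XDocument(
            new XElement("document",
                pageTexts.Select((text, i) =>
                    new XElement("page",
                        new XAttribute("number", i + 1),
                        // Trim also drops the trailing form feed that
                        // Tesseract may append to its text output.
                        text.Trim()))));

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: ScannedPdfToXml <input.pdf> <output.xml>");
            return;
        }

        // Scratch directory for the extracted images and the OCR text files.
        string workDir = Directory.CreateDirectory(
            Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;

        // Stage 1: extract the embedded scans as page-000.png, page-001.png, ...
        Run("pdfimages", "-png", args[0], Path.Combine(workDir, "page"));

        // Stage 2: OCR each image; "tesseract in.png outbase" writes outbase.txt.
        var texts = new List<string>();
        foreach (string png in Directory.GetFiles(workDir, "*.png")
                                        .OrderBy(f => f, StringComparer.Ordinal))
        {
            string outBase = Path.ChangeExtension(png, null); // path without ".png"
            Run("tesseract", png, outBase);
            texts.Add(File.ReadAllText(outBase + ".txt"));
        }

        BuildXml(texts).Save(args[1]);
    }
}
```

Shelling out to the command-line tools sidesteps the C++ interop question entirely. If you'd rather call Tesseract in-process, you'd need a .NET wrapper or P/Invoke instead, which this sketch doesn't cover.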

It's a lot of work, but that's probably why the vendors price their direct PDF image -> OCR text plugins so high.