
How to OCR using SDK in C#

New Here, May 10, 2016

Hi All,

Here is my requirement in detail.

I have a PDF that contains scanned documents.

I want to convert the PDF content to XML.

Can anyone help me achieve this using an SDK with C#?

Manoj K Singh

LEGEND, May 12, 2016

This isn't that useful, but I believe it'll end up being a two-stage process, and you may need two or more libraries to perform the various steps.

Don't get me wrong, there are OCR libraries for C# that read PDFs full of images, and no doubt you saw that the price was over $4k for a developer license, haha. There are a few. I'm assuming you want to avoid that.

There are tools like Xpdf and others that you should find and try, just so you can extract the images from the PDF. After you get those images, you might need to convert them to a different image format and then feed them into an OCR library. Google manages a project called Tesseract OCR, which you may want to look at. I believe it is only available as a C++ library, but as you know there are ways to use a C++ library from C#.
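To make the two stages concrete, here is a rough sketch of the pipeline, written in Python only to keep it short. The same two command-line calls could be launched from C# with System.Diagnostics.Process. It assumes the Poppler build of `pdfimages` (for the `-png` flag) and the `tesseract` command-line program are both installed and on the PATH. The `<document>`/`<page>` layout and the function names are made up for illustration, since the question doesn't say what XML schema is needed.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_images(pdf_path, out_dir):
    """Stage 1: dump the scanned images embedded in the PDF as PNG files.

    Assumes Poppler's `pdfimages`; it names its output <prefix>-NNN.png.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["pdfimages", "-png", str(pdf_path), str(out_dir / "page")],
        check=True,
    )
    return sorted(out_dir.glob("page-*.png"))


def ocr_image(image_path):
    """Stage 2: run the Tesseract CLI on one image and return the text.

    Passing "stdout" as the output base makes Tesseract print the result
    instead of writing a .txt file.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def pages_to_xml(page_texts):
    """Wrap the OCR text of each image in a hypothetical <page> element."""
    root = ET.Element("document")
    for number, text in enumerate(page_texts, start=1):
        page = ET.SubElement(root, "page", number=str(number))
        # strip() also drops the trailing form feed Tesseract may emit
        page.text = text.strip()
    return ET.tostring(root, encoding="unicode")


def pdf_to_xml(pdf_path, work_dir="ocr_work"):
    """Chain both stages: PDF -> images -> OCR text -> XML string."""
    images = extract_images(pdf_path, work_dir)
    return pages_to_xml(ocr_image(image) for image in images)
```

Calling `pdf_to_xml("scanned.pdf")` (the file name is a placeholder) would return the XML as a string. This treats every embedded image as one page, which holds for typical scans but not for PDFs that store a page as several image strips.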

That's a lot of work, but it's probably why the vendors made their direct PDF image -> OCR text plugins so expensive.
