Hi All,
Below is my requirement in detail.
I have a PDF that contains scanned documents.
I want to convert the PDF content to XML.
Can anyone help me achieve this using the SDK with C#?
Manoj K Singh
This isn't that useful, but I believe it'll end up being a two-stage process, and you may need two or more libraries to perform the various steps.
Don't get me wrong, there are OCR libs for C# that read PDFs full of images, and no doubt you saw the price was over $4k for a developer license, haha. There are a few. I'm assuming you want to avoid that.
There are tools like Xpdf and others that you should find and try, just so you can extract the images from the PDF. After you get those images, you might need to convert them to a different image format and then feed them into an OCR library. Google manages a project called Tesseract OCR, which you may want to look at. I believe it is written in C++, but there are ways to use a C++ library from C#.
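To make the two stages concrete, here is a rough sketch rather than a tested solution. It sidesteps the C++ interop question by shelling out to the command-line tools: it assumes the `pdfimages` tool (shipped with Xpdf) and the `tesseract` executable are both installed and on the PATH, that each scanned page is stored as a single image, and that you are on .NET 6 or later. The class name `ScannedPdfToXml` and the `<document>`/`<page>` XML layout are made up for illustration; use whatever schema you actually need.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class ScannedPdfToXml
{
    // Runs an external tool and throws if it fails to start or exits with an error.
    static void Run(string exe, params string[] args)
    {
        var psi = new ProcessStartInfo(exe) { UseShellExecute = false };
        foreach (var arg in args) psi.ArgumentList.Add(arg);
        using var process = Process.Start(psi)
            ?? throw new InvalidOperationException($"Could not start {exe}");
        process.WaitForExit();
        if (process.ExitCode != 0)
            throw new InvalidOperationException($"{exe} exited with code {process.ExitCode}");
    }

    // Wraps the OCR text of each page in a simple XML document.
    // The element and attribute names are hypothetical, not a standard schema.
    public static XDocument BuildXml(string source, IEnumerable<string> pageTexts) =>
        new XDocument(
            new XElement("document",
                new XAttribute("source", source),
                pageTexts.Select((text, index) =>
                    new XElement("page",
                        new XAttribute("number", index + 1),
                        // Trim() also drops the trailing form feed that Tesseract
                        // writes as a page separator, which is not legal in XML.
                        text.Trim()))));

    // Usage: ScannedPdfToXml <input.pdf> <output.xml>
    public static void Main(string[] args)
    {
        string pdfPath = args[0];
        string xmlPath = args[1];
        string workDir = Directory.CreateDirectory(
            Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;

        // Stage 1: pull the embedded images out of the PDF.
        // pdfimages writes files named <root>-<number>.<ext> for the given root.
        Run("pdfimages", pdfPath, Path.Combine(workDir, "img"));

        // Stage 2: OCR each image. Tesseract appends ".txt" to the output base name.
        var pageTexts = new List<string>();
        var images = Directory.GetFiles(workDir, "img-*")
                              .OrderBy(path => path, StringComparer.Ordinal);
        foreach (string image in images)
        {
            string outputBase = Path.ChangeExtension(image, null);
            Run("tesseract", image, outputBase);
            pageTexts.Add(File.ReadAllText(outputBase + ".txt"));
        }

        BuildXml(Path.GetFileName(pdfPath), pageTexts).Save(xmlPath);
    }
}
```

If plain text per page is not enough, Tesseract can also write hOCR (an HTML/XML-style format that includes word positions) when you pass `hocr` as an extra argument after the output base name, which might save you from building the XML by hand. If Tesseract refuses the image format that `pdfimages` produces, that is where the image conversion step mentioned above would slot in.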
That's a lot of work, but it's probably why the direct PDF image -> OCR text plugins are so expensive.