November 7, 2016
Answered

Which SDK do we need to buy to convert an image PDF to a text PDF?


Hi all,

Our requirements are very simple. We have numerous PDFs that are basically scanned copies of letters containing text and images. We want to convert them into searchable PDFs without losing the document layout or real image content such as logos and pictures. We want to do this programmatically using an Acrobat SDK or API, NOT the Adobe Acrobat Pro DC viewer application. Please let me know which Acrobat SDK we need to get. Old forum posts suggest SDKs like Capture, PDF Library, Acrobat SDK, or LiveCycle ES, and I am not sure which of these we need. These SDKs come with numerous additional functions that we don't need. Is there a basic SDK that we can buy for our simple requirement?

Regards,

Viral Sheth

Correct answer Joel Geraci

Thanks once again, Joel. But ABBYY is a third-party library. Doesn't Adobe provide its own SDKs or libraries to perform OCR on PDFs? I am really surprised that nobody from Adobe's sales team has even tried to approach a potential customer like me. Now I am rather skeptical about how responsive their tech support team would be. I am trying some other products as well. Those vendors have been actively in touch with me ever since they saw a potential customer in me.


Lov435:

While I understand your frustration, I can tell you from experience (I helped launch Acrobat Capture and its API) that when Adobe tried to supply PDF OCR capabilities, nearly everyone we talked to wanted to replace the built-in OCR engine with their own... and the Adobe engine was really good... but OCR is one of those things that people have strong opinions about. Because Adobe licenses the PDF Library to multiple third-party OCR developers, you get the best of both worlds: competition and choice in the actual OCR space, while still getting Adobe technology when the resulting PDF file is created.

2 replies

Joel Geraci
Community Expert
November 10, 2016

The Adobe PDF Library would be the tool you'd use to get at the images and then insert any recognized text back into the PDF, but you're pretty much on your own to find a library that will deconstruct the page and perform the OCR.

http://www.datalogics.com/products/pdf/pdflibrary/ 

J-

November 10, 2016

Thanks for the answer, Joel. Can you please explain what you mean by "deconstruct the page and perform the OCR"? What kind of output is the Adobe PDF Library capable of generating? For example, if I call its API method on a PDF file, will it produce a text file with the text extracted from the PDF? Won't it produce another PDF file in which the text portion of each image has been converted to searchable text?

Joel Geraci
Community Expert
November 10, 2016

OK. You have a document that is composed of just scanned pages. The PDF Library can give you access to these images, but it only knows that each object is an image. It doesn't know what it's an image of. You'd take this image and pass it to another library that can perform the OCR. That library would examine the image and separate out the parts that are readable text from the (your term) "real image content". The output of that library can then be used to reconstruct a PDF that has searchable text and images.

ABBYY use the Adobe PDF Library in their tools to do just that and they have an SDK version.

OCR, PDF, Text Scanning Software and Solutions - ABBYY  
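
To make the division of labour above concrete, here is a minimal Python sketch of the three steps: get the page images, hand each one to a separate OCR engine, and keep the original image together with the recognized words as a hidden text layer. Every name in it (ScannedPage, Word, SearchablePage, make_searchable, fake_ocr) is hypothetical and invented for illustration. None of it is the Adobe PDF Library or ABBYY API, and the "OCR engine" is a stand-in that simply reads the image bytes as ASCII text. Real PDF reading and writing are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# NOTE: all names here are hypothetical. This models the workflow only;
# it does not call the Adobe PDF Library, ABBYY, or any real OCR engine.


@dataclass
class Word:
    text: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class ScannedPage:
    number: int
    image: bytes  # page image as pulled out of the source PDF (step 1)


@dataclass
class SearchablePage:
    number: int
    image: bytes  # original scan, kept unchanged so layout and logos survive
    hidden_text: List[Word] = field(default_factory=list)  # searchable layer

    def contains(self, term: str) -> bool:
        """Case-insensitive search over the recognized words."""
        return any(term.lower() in w.text.lower() for w in self.hidden_text)


# Any OCR engine can be plugged in, as long as it maps image bytes to words.
OcrEngine = Callable[[bytes], List[Word]]


def make_searchable(pages: List[ScannedPage], ocr: OcrEngine) -> List[SearchablePage]:
    """Run OCR on every page image (step 2) and pair the result with the image (step 3)."""
    result = []
    for page in pages:
        words = ocr(page.image)
        result.append(SearchablePage(page.number, page.image, words))
    return result


def fake_ocr(image: bytes) -> List[Word]:
    """Stand-in engine for the demo: treats the image bytes as ASCII text."""
    words = []
    x = 0
    for token in image.decode("ascii").split():
        width = 10 * len(token)  # pretend each character is 10 pixels wide
        words.append(Word(token, (x, 0, x + width, 12)))
        x += width + 10  # 10-pixel gap between words
    return words


if __name__ == "__main__":
    scans = [ScannedPage(1, b"Dear customer"), ScannedPage(2, b"Kind regards")]
    for page in make_searchable(scans, fake_ocr):
        print(page.number, [w.text for w in page.hidden_text])
```

The point of passing the engine in as a parameter is the one made earlier in the thread: the PDF handling and the OCR are separate pieces, so the OCR part can be swapped without touching the rest.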

November 9, 2016

Does nobody know the answer to this question?