Hi all,
Our requirements are very simple. We have numerous PDFs that are scanned copies of letters containing text and images. We want to convert them into searchable PDFs without losing the document layout or real image content such as logos and pictures. We want to do this programmatically using an Acrobat SDK or API, and NOT using the Adobe Acrobat Pro DC application. Please let me know which Acrobat SDK we need. Old forum posts suggest SDKs like Capture, PDF Library, Acrobat SDK, or LiveCycle ES, and I am not sure which of these we need. These SDKs come with numerous additional features that we don't need. Is there a basic SDK that we can buy for our simple requirement?
Regards,
Viral Sheth
Does nobody know the answer to this question?
The Adobe PDF Library is the tool you'd use to get at the images and then insert any recognized text back into the PDF, but you're pretty much on your own to find a library that will deconstruct the page and perform the OCR.
http://www.datalogics.com/products/pdf/pdflibrary/
J-
Thanks Joel for the answer. Can you please explain what you mean by "deconstruct the page and perform the OCR"? What kind of output is the Adobe PDF Library capable of generating? For example, if I call its API method on a PDF file, will it produce a text file with the text extracted from the PDF? Or will it produce another PDF file in which the text portions of the scanned images are converted to searchable text?
Ok - You have a document that is composed of just scanned pages. The PDF Library can give you access to these images, but it only knows that each object is an image. It doesn't know what it's an image of. You'd take this image and pass it to another library that can perform the OCR. That library would examine the image and separate out the parts that are readable text from the (your term) "real image content". The output of that library can then be used to reconstruct a PDF that has searchable text and images.
ABBYY use the Adobe PDF Library in their tools to do just that and they have an SDK version.
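To make that data flow concrete, here is a rough sketch in plain Python. It is not the Adobe PDF Library API or ABBYY's SDK: every name in it (PageImage, Word, SearchablePage, make_searchable, find, fake_ocr) is made up for illustration, and the OCR step is a stub that returns fixed words. It also assumes the common approach of keeping the original scan untouched and attaching the recognized words at their positions so the page becomes searchable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (left, top, right, bottom) in scan pixels; a hypothetical convention.
BBox = Tuple[int, int, int, int]


@dataclass
class PageImage:
    """One scanned page as a PDF toolkit might hand it over (hypothetical type)."""
    page_number: int
    width: int
    height: int
    pixels: bytes


@dataclass
class Word:
    """One recognized word and where it sits on the scan."""
    text: str
    bbox: BBox


@dataclass
class SearchablePage:
    """The untouched scan plus the recognized words positioned over it."""
    image: PageImage
    words: List[Word]


# Any OCR engine is modeled as a function from a page image to positioned words.
OcrEngine = Callable[[PageImage], List[Word]]


def make_searchable(pages: List[PageImage], ocr: OcrEngine) -> List[SearchablePage]:
    """Run OCR on each extracted image and pair the result with the original scan."""
    return [SearchablePage(image=page, words=ocr(page)) for page in pages]


def find(doc: List[SearchablePage], term: str) -> List[Tuple[int, BBox]]:
    """Return (page number, bounding box) for every word matching term, ignoring case."""
    needle = term.lower()
    return [
        (page.image.page_number, word.bbox)
        for page in doc
        for word in page.words
        if word.text.lower() == needle
    ]


def fake_ocr(page: PageImage) -> List[Word]:
    """Stand-in for a real OCR engine: ignores the pixels and returns fixed words."""
    return [
        Word("Dear", (100, 200, 180, 230)),
        Word("Customer", (190, 200, 330, 230)),
    ]


if __name__ == "__main__":
    scans = [PageImage(page_number=1, width=1700, height=2200, pixels=b"")]
    document = make_searchable(scans, fake_ocr)
    print(find(document, "customer"))
```

In a real setup the PDF toolkit would supply the page images and write the final file, and a real OCR engine would replace fake_ocr. The scan itself is never modified, which is why the layout and logos survive.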
Thanks Joel once again. But ABBYY is a third-party library. Doesn't Adobe provide its own SDK or libraries to perform OCR on PDFs? I am really surprised that nobody from Adobe's sales team has even tried to approach a potential customer like me. Now I am somewhat skeptical about how responsive their tech support team would be. I am trying some other products as well. Those vendors got in touch with me as soon as they saw a potential customer in me.
Adobe only provides Scanning and OCR technology to hardware vendors – we don’t offer a solution for software vendors
Lov435:
While I understand your frustration, I can tell you from experience (I helped launch Acrobat Capture and its API) that when Adobe tried to supply PDF OCR capabilities, nearly everyone we talked to wanted to replace the built-in OCR engine with their own... and the Adobe engine was really good... but OCR is one of those things that people have strong opinions about. By Adobe licensing the PDF Library to multiple third-party OCR developers, you're able to get the best of both worlds. You get to have some competition and choice in the actual OCR space while still getting Adobe technology when the resulting PDF file is created.
Thanks Joel. I see what you are saying. So which third-party OCR developers has Adobe licensed the PDF Library to for building OCR solutions? If there are many, can you name a couple of the most popular ones?
Glad I could help. The only one I'm familiar with is ABBYY. Full disclosure: ABBYY is a customer of Datalogics, but I'd recommend them anyway.