How can I detect if a PDF needs to be OCRd?

Report · Mar 31, 2017

First, I have searched Google for quite some time trying to figure this out on my own and the closest I got was finding this article written by an Adobe employee named Rick Borstein: How can I detect if a PDF needs to be OCRd?

That article is EXACTLY what I want to be able to do - I need to be able to search my entire computer, or a folder with a large amount of PDF documents in it, and I need to know which of those thousands of PDF documents have already been OCRd, and which ones have not.

The problem I'm running into is twofold. First, the link to a Preflight file that Rick included in his blog no longer appears to be working. Second, this blog post describes what to do using Adobe Pro, and I am using Adobe DC. I thought the overall process would be about the same, and I attempted to create my own Preflight profile, but when I got to "Step 3" in his blog, it told me to go to "Batch Processing", and I am unable to find where that is.

So my question has a couple of different parts. The first is: is this still the ONLY way to go about doing this type of search? The blog post was written back in 2010, so I'm guessing Adobe may have created a simpler solution for what I need done, but I have been unable to find anything to suggest that. If this is still the only solution, then I would ask for a step by step guide on how to create exactly what Rick is talking about in his blog, using Adobe DC instead of Pro. And finally, I'm not sure if the file Rick linked to on his blog is all that important, or if I can just create my own Preflight profile - so I would ask for either a new/updated link that will allow me to download that file, or again, a step by step guide on how to create the exact Preflight profile I need to do this.

Thank you in advance!

Jonathan

Report · Mar 31, 2017

Let's get the basics out of the way first. You write "Second, this blog post describes what to do using Adobe Pro, and I am using Adobe DC." For Adobe Pro, read Acrobat Pro. But what do you mean by "Adobe DC"? Acrobat Pro, Acrobat Standard or Acrobat Reader?

Report · Mar 31, 2017

I'm using Adobe Acrobat Pro DC version 15.006.30280.

That blog post is from 2010, so I have no idea what version he was using, but Acrobat has changed quite a bit in the last 7 years, so the instructions he posts in the blog are no longer relevant to the version I'm using.

Report · Mar 31, 2017

Fine, since you have Acrobat Pro, you have preflight. You now need to know that batch processing has changed to "actions" and the setup is somewhat different, but it can do most of the same things.

Report · Mar 31, 2017

Right - I understand I already have Preflight. That's not the issue. The issue is that:

1. The link to the Preflight file on that blog post no longer works, so I have no idea what settings he used to get this to do what he says it did.

2. I have no idea what the steps are in "Actions" to run the Preflight process to do what I need it to do - which is search either my entire computer, or an entire folder, and tell me whether OCRd text is detectable in each file or not.

What I need is a step by step guide for BOTH the Preflight settings (assuming that I am creating the Preflight profile myself instead of finding a working download link), as well as how to run the process in the way I need in order to accomplish my end goal.

Report · Apr 01, 2017

An action can't scan the whole computer, but it can scan a whole folder, as the article says. That should be pretty obvious from the setup screens, though moving to two different folders as it does is unusual and advanced. Can't help you with the profile, sorry.

If I had to do this task, and I don't have to, I'd study the Acrobat SDK for how to extract text. Then I'd write a program to scan the disk. The program would examine each file, see if it's a PDF, and try to extract text, the presumption being that if it has text, it doesn't need OCR. If you're an experienced programmer this would perhaps be a few days' work, depending on how experienced you are at studying very large API documents and gluing together disparate APIs. The JavaScript method getPageNthWord is the basic glue of text extraction. Or you could try saving as TXT (doc.saveAs).
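To make the "has text, so no OCR needed" presumption concrete, here is a rough sketch of the word-counting check in Acrobat-style JavaScript. It assumes a doc object that exposes numPages, getPageNumWords and getPageNthWord the way the Acrobat JavaScript Doc object does. The function names countWords and needsOcr and the default threshold of 5 words are my own placeholders, not anything from the SDK, so treat it as a starting point rather than a finished tool.

```javascript
// Sketch only: decide whether a PDF already has a text layer by counting words.
// `doc` is assumed to behave like an Acrobat JavaScript Doc object
// (numPages, getPageNumWords, getPageNthWord).

// Count non-empty words, stopping early once `enough` have been found.
function countWords(doc, enough) {
  var found = 0;
  for (var p = 0; p < doc.numPages; p++) {
    var n = doc.getPageNumWords(p);
    for (var i = 0; i < n; i++) {
      // Third argument (bStrip) = true strips punctuation and whitespace.
      var word = doc.getPageNthWord(p, i, true);
      if (word && word.length > 0) {
        found++;
        if (found >= enough) {
          return found;
        }
      }
    }
  }
  return found;
}

// Hypothetical rule of thumb: fewer than `minWords` words in the whole
// document suggests a scanned image that still needs OCR.
function needsOcr(doc, minWords) {
  var threshold = (minWords === undefined) ? 5 : minWords;
  return countWords(doc, threshold) < threshold;
}
```

If I recall the Action Wizard correctly, an "Execute JavaScript" step runs with `this` bound to the current document, so after pasting in the two functions a line such as `console.println(this.path + (needsOcr(this) ? " : needs OCR" : " : has text"));` should log one result per file in the folder. A small threshold rather than zero is meant to avoid being fooled by a scan that carries only a stamped page number or similar.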

Report · Apr 04, 2017

I was sort of hoping someone from Adobe would respond? Maybe they could forward my post to Rick, who from what I can tell still works at Adobe? I was trying to find contact info for him directly, but I couldn't find anything beyond his LinkedIn profile.

Report · Apr 04, 2017

If you want developer support from Adobe you can, I believe, buy a developer support case for $200. Please let us know how it goes if you go down that path.