We have many PDFs that can't seem to be OCR'd easily.
They give an error about renderable text being on the page.
We are either printing them out and rescanning them, or breaking them up into TIFFs and recombining them.
We have a project that has 6000+ poorly made PDFs from an outside company that we have to OCR.
What is the best method to OCR these documents so that we do not have to rescan or break up each of the 6000+ PDFs individually?
Thanks,
Carl
Hi, sorry for the issue you are facing. We are aware of this issue and our teams are already working on it.
We will get back to you soon once the issue is resolved.
Thanks.
With the latest release of Acrobat DC on 11th April 2017, the "Page contains renderable text" error has been resolved. Go to What's new in Adobe Acrobat DC for more details.
To get the latest product update, click on the menu Help > Check for updates.
Thanks.
This response is completely unhelpful because the link takes you to a different issue of "What's new". How do I deal with this issue on a PDF document that has renderable text in it? I can't highlight text in the PDF document.
I think the idea was to inform you about an update to Acrobat DC that removes this problem. Depending on which version of Acrobat you have, you may not benefit from this update.
Renderable text is true text in a PDF file - or more specifically in this case on a PDF page. Even if 99.9% of the text on a page cannot be selected, a single character of renderable text on that page is enough to stop the OCR routine from working. You need to remove that text - or merge it with the background image. Removing can be done by bringing up the Content navigation pane on the left side of Acrobat (View>Show/Hide>Navigation Panes>Content). You can now expand the content tree until you find objects that are text. You can select such an object and hit the Delete key to remove it from the page. This may not be what you want, because that renderable text may be important to your page (e.g. as a header or footer, or a page number).
To merge the text into the background image, there is no need to re-scan the documents. You can export them as high resolution TIFF images, and then import these images into Acrobat to create a new PDF file. This will create a file that no longer contains any renderable text, and you should be able to OCR such a file without a problem.
To find out which pages are affected, you can write a custom script that goes through a document and looks for text using the JavaScript Doc.getPageNumWords() function. Call it for every page in the document; if it reports any words for a page, you know that page has renderable text.
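As a rough sketch of such a script: the function below loops over every page and collects the ones for which getPageNumWords() reports at least one word. The function name findPagesWithRenderableText is made up for this example, and the check depends on what Acrobat counts as a word, so treat the result as a hint rather than a guarantee.

```javascript
// Hypothetical helper (the name is not part of the Acrobat API).
// `doc` is expected to be an Acrobat Doc object, which provides
// numPages and getPageNumWords(nPage) with 0-based page numbers.
function findPagesWithRenderableText(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    // Any word reported on the page means the page has renderable text.
    if (doc.getPageNumWords(p) > 0) {
      pages.push(p);
    }
  }
  return pages;
}
```

With the PDF open, you could call it from the Acrobat JavaScript console as findPagesWithRenderableText(this) and print the returned page numbers with console.println(). The page numbers are 0-based, so add 1 to match what Acrobat shows in the page navigation.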
Having said that, I would probably not OCR 6000+ documents in Acrobat: Acrobat was not designed to be run in a mode where it processes file after file without the application getting restarted every now and then. My experience is that Acrobat (like many applications) leaks resources, and when you try to process too many files in an Action or Batch Process, you will end up in a situation where Acrobat slows down to a crawl. This has gotten better over the years, but you will have to break up your 6000+ document job into smaller batches. How big these batches can be depends on what version of Acrobat you have, what exactly you are doing in your process, and on the actual documents you are processing. It may work with 100 documents at a time, but you may also have to cut down to 50 or even fewer.
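If you keep a list of the file names, planning the batches is simple to script. The sketch below is plain JavaScript and does not call Acrobat at all; splitIntoBatches is a made-up name, and the batch size of 100 is only the starting point mentioned above, not a tested limit.

```javascript
// Hypothetical helper: split a list of file names into batches of at most
// batchSize entries, so each batch can be run separately (with Acrobat
// restarted in between).
function splitIntoBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error("batchSize must be a positive integer");
  }
  var batches = [];
  for (var i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

For example, splitIntoBatches(fileNames, 100) on a list of 6000 names gives 60 batches. If Acrobat still slows down, rerun it with 50 or a smaller number.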