renderable text

New Here, Mar 23, 2017

We have many PDFs that can't seem to be OCR'd easily.

They give an error about renderable text being on the page.

We are either printing them out and rescanning them, or breaking them up into TIFFs and recombining them.

We have a project that has 6000+ poorly made PDFs from an outside company that we have to OCR.

What is the best method to OCR these documents so that we do not have to rescan or break up each of the 6000+ PDFs individually?

Thanks,

Carl

TOPICS
Scan documents and OCR

Adobe Employee, Mar 27, 2017

Hi, sorry for the issue you are facing. We are aware of this issue and our teams are already working on it.

We will get back to you soon once the issue is resolved.

Thanks.

Adobe Employee, Apr 19, 2017

With the Acrobat DC release of April 11, 2017, the "Page contains renderable text" error has been resolved. See What's new in Adobe Acrobat DC for more details.

To get the latest product update, choose Help>Check for Updates.

Thanks.

New Here, Jun 15, 2018

This response is completely unhelpful because the link takes you to a different issue of "What's new". How do I deal with this issue on a PDF document that has renderable text in it? I can't highlight text in the PDF document.

Community Expert, Jun 15, 2018


I think the idea was to inform you about an update to Acrobat DC that removes this problem. Depending on which version of Acrobat you have, you may not benefit from this update.

Renderable text is true text in a PDF file, or more specifically in this case, on a PDF page. Even if 99.9% of the text on a page cannot be selected, a single character of renderable text on that page is enough to stop the OCR routine from working. You need to remove that text or merge it with the background image. To remove it, bring up the Content navigation pane on the left side of Acrobat (View>Show/Hide>Navigation Panes>Content) and expand the content tree until you find objects that are text. Select such an object and press the Delete key to remove it from the page. This may not be what you want, because that renderable text may be important to your page (e.g. a header, a footer, or a page number).

To merge the text into the background image, there is no need to re-scan the documents. You can export them as high-resolution TIFF images, and then import these images into Acrobat to create a new PDF file. The new file no longer contains any renderable text, and you should be able to OCR it without a problem.

You can write a script that goes through a document and looks for text by using the JavaScript Doc.getPageNumWords() function. Call it for every page in the document; if it reports any words on a page, you know that page has renderable text. This does require a custom script.
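
As a rough sketch of the kind of script described above: the function name findPagesWithText is made up for illustration, and the only Acrobat JavaScript members it relies on are Doc.numPages and Doc.getPageNumWords(). It returns the zero-based numbers of the pages that report at least one word, so an empty result suggests the document has no renderable text that this check can see.

```javascript
// Hypothetical helper (not part of Acrobat): lists the pages that contain words.
// `doc` must expose `numPages` and `getPageNumWords(nPage)`, as the
// Acrobat JavaScript Doc object does. Page numbers are zero-based.
function findPagesWithText(doc) {
  var pages = [];
  for (var p = 0; p < doc.numPages; p++) {
    if (doc.getPageNumWords(p) > 0) {
      pages.push(p);
    }
  }
  return pages;
}

// Possible use from the Acrobat JavaScript console, where `this` is the
// active document:
//   var hits = findPagesWithText(this);
//   console.println(hits.length ? "Text on pages: " + hits.join(", ") : "No words found");
```

Because the function only touches those two members, it can also be tried outside Acrobat with a stand-in object before running it against real files.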

Having said that, I would probably not OCR 6000+ documents in Acrobat: Acrobat was not designed to process file after file without the application being restarted every now and then. My experience is that Acrobat (like many applications) leaks resources, and when you try to process too many files in an Action or Batch Process, Acrobat eventually slows down to a crawl. This has gotten better over the years, but you will still have to break up your 6000+ file job into smaller batches. How big these batches can be depends on what version of Acrobat you have, what exactly you are doing in your process, and on the actual documents you are processing. It may work with 100 documents at a time, but you may also have to cut down to 50 or even fewer.
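
To make the batching idea concrete, here is a minimal sketch in plain JavaScript that splits a list of file names into batches of a chosen size. The function name splitIntoBatches and the file names are hypothetical, and the batch size of 100 is just the starting point mentioned above, not a tested limit. It only prepares the lists; moving each batch into its own folder and running the Action on it is left to whatever tooling you use.

```javascript
// Hypothetical helper: splits `items` into consecutive batches of at most
// `batchSize` entries. The last batch may be smaller than the others.
function splitIntoBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  var batches = [];
  for (var i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Example with made-up file names: 6000 files in batches of 100.
var files = [];
for (var n = 1; n <= 6000; n++) {
  files.push("scan_" + n + ".pdf");
}
var batches = splitIntoBatches(files, 100);
console.log(batches.length + " batches, first batch has " + batches[0].length + " files");
```

If Acrobat still slows down partway through a batch, rerunning with a smaller batchSize (for example 50) only requires changing the second argument.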
