Batch Rasterizing header-annotated scanned PDFs

Report · Nov 13, 2017

In Acrobat Pro XI, I am trying to batch rasterize many PDF files so that I can OCR them. The files are scans to which headers containing page numbers have been added.

The batch OCR process is throwing up "This Page contains renderable text" errors. This seems to be because many files have been annotated with headers that include a page number. After encountering the error, the OCR function behaves unpredictably. Sometimes it carries on OCRing the rest of the file; at other times it bombs out midway through, leaving the rest of the file unsearchable.

In addition, headers do not seem to be subject to the flatten command (in the way that forms and comments are), so flattening does not seem to do what I need.

Hence, I am trying to batch rasterize all of the files, followed by batch OCR.

So, Q/ How can I batch rasterize header-annotated scanned PDFs in Acrobat Pro XI?

Obviously I don't want to convert each file into exported images and then re-assemble it. I have hundreds of documents, some containing hundreds of pages.

OS: Windows 7 (64 bit)

Thanks,

Report · Nov 13, 2017

You may want to switch to Acrobat DC. The latest version no longer has this limitation, and you can OCR pages that contain renderable text. Given the time it will take you to figure out how to do this in a (semi-)automated way, and then actually run the process, buying a new version of Acrobat may be cheaper than the time you would spend doing this with Acrobat XI. Another option would be to use a dedicated OCR application (I keep ABBYY's FineReader around for OCR tasks that Acrobat cannot handle).

Flattening for comments and form fields will convert interactive elements into static PDF content. Your header is already static PDF content: You have a page with an image, and text, both of them static PDF content.
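
If you do want to stay on Acrobat XI, one possible workaround is to rasterize the files outside Acrobat first and then run the batch OCR on the results. The following is only a rough sketch and not an Acrobat feature: it assumes Ghostscript is installed and that its 64-bit Windows console executable (`gswin64c`) is on the PATH, and it uses Ghostscript's `pdfimage24` output device, which writes a PDF in which every page is a single image. The function names, folder names, and the 300 dpi default are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumption: Ghostscript's 64-bit Windows console executable is on the PATH.
GS_EXE = "gswin64c"


def build_gs_command(src, dst, dpi=300):
    """Build the Ghostscript command that re-renders one PDF as page images."""
    return [
        GS_EXE,
        "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=pdfimage24",   # image-only PDF output, 24-bit colour
        "-r%d" % dpi,            # rendering resolution in dpi
        "-sOutputFile=%s" % dst,
        str(src),
    ]


def rasterize_folder(in_dir, out_dir, dpi=300, dry_run=False):
    """Rasterize every PDF in in_dir into out_dir; return the commands used."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for src in sorted(in_dir.iterdir()):
        if not src.is_file() or src.suffix.lower() != ".pdf":
            continue
        cmd = build_gs_command(src, out_dir / src.name, dpi)
        commands.append(cmd)
        if not dry_run:
            # check=True stops the batch at the first file Ghostscript rejects.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical folder names; dry_run=True only prints the commands.
    for cmd in rasterize_folder("scans", "rasterized", dry_run=True):
        print(" ".join(cmd))
```

Run it with `dry_run=True` first to check the commands, and test on a few files before committing hundreds of documents. Re-rendering resamples the original scans, so the output files may be larger, and the header text becomes part of the page image.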
