matthewjohn
Participant
May 28, 2019
Question

Taking 'Editable Text and Images' (Clearscan) further?


Adobe Acrobat Pro DC 2019 with 'Recognize Text' set to 'Editable Text and Images' (formerly 'Clearscan') does an amazing job of generating custom fonts for a scanned book, and produces a file significantly smaller than 'Searchable Image' does, with the gap widening as the book gets longer. However, it seems that the more pages a book has, the more custom fonts are created, rather than reusing 'close enough' characters from the custom fonts already generated. I would really love it if the Adobe team would look at ways of 'cleaning' and/or 'consolidating' the custom fonts generated by the 'Editable Text and Images' feature, so that file sizes can drop even further while largely preserving the look of the original printed font.

Case Study

I have a few books that I've bought, de-spined, and scanned at 1200 dpi (with a Fujitsu ScanSnap). I audited the font list and internal space usage (via File > Properties > Fonts, and File > Save as Other > Optimized PDF... > Audit space usage...) to see how the number of custom font faces, and with it the file size, grows as a book gets longer, even with the default fallback font set in the preferences. See below.

Book 1, 199 Pages

Book 2, 366 Pages

The two books are by the same publisher and author (with the same printed font). While some of the file size increase in book 2 over book 1 could be due to poorer print quality in the second book, with all of Adobe's 'font smarts,' why not add a rule that says: "this letter 'd' looks 90% like one of these 3 previously rendered 'd's, so let's just use the closest one instead of creating another custom 'd'"?
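
To make the suggestion concrete, here is a minimal Python sketch of that "close enough" rule. It is only an illustration of the idea, not how Acrobat works internally: the names `similarity` and `consolidate`, the tiny character bitmaps, and the pixel-match score are all assumptions made up for this example, and a real engine would need a far more robust shape comparison.

```python
from typing import List, Tuple

# Hypothetical glyph representation: rows of '#' (ink) and '.' (paper).
Glyph = Tuple[str, ...]


def similarity(a: Glyph, b: Glyph) -> float:
    """Fraction of matching pixels between two same-sized bitmaps (0.0 if sizes differ)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 0.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def consolidate(glyphs: List[Glyph], threshold: float = 0.9) -> Tuple[List[Glyph], List[int]]:
    """Greedy pass: reuse the closest stored glyph when it scores at least
    `threshold`, otherwise store the new glyph as another custom shape."""
    prototypes: List[Glyph] = []
    assignment: List[int] = []
    for glyph in glyphs:
        best_index, best_score = -1, 0.0
        for index, prototype in enumerate(prototypes):
            score = similarity(glyph, prototype)
            if score > best_score:
                best_index, best_score = index, score
        if best_index >= 0 and best_score >= threshold:
            assignment.append(best_index)          # close enough: reuse
        else:
            prototypes.append(glyph)               # too different: keep as new
            assignment.append(len(prototypes) - 1)
    return prototypes, assignment


# Two scans of a 'd' that differ by one pixel, plus an 'o'.
d_clean = ("...#", "...#", ".###", "#..#", ".###")
d_noisy = ("...#", "...#", ".###", "#..#", "####")
o_shape = ("....", "....", ".##.", "#..#", ".##.")

prototypes, assignment = consolidate([d_clean, d_noisy, o_shape], threshold=0.9)
print(len(prototypes), assignment)
```

Lowering `threshold` merges more aggressively (smaller file, less faithful shapes); raising it keeps more distinct shapes, which is the trade-off a user-facing setting would expose.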

After digging around the Preflight and Optimize PDF settings, I've been unable to find anything that works like the approach I'm suggesting above. And as the list in File > Properties > Fonts isn't editable, I'm at a loss for how to proceed, short of looking for other OCR software outside the Adobe family. The overall objective here is to take the PDFs mobile, allowing for highlighting and annotating on the go, and of course the smaller the file sizes, the faster the sync.

Any tips or tricks are welcome. Thanks for thoughts, M


1 reply

Inspiring
September 24, 2020

Hi John, I stumbled upon your post while looking for a solution to the exact same problem, having reached the same conclusions as you. I can't believe that only two of us are dealing with this specific problem!

My wish is that the OCR engine of Adobe Acrobat becomes more adjustable and less of a black box. It would be nice if we could set a "similarity threshold" for fonts to prevent Acrobat from generating a new font for yet another similar character, as you mentioned above. Also, I would love to be able to prevent "images" from being downscaled into big blurs when I select "Editable Text and Images".

For the font issue, I managed to find a time-consuming workaround. I OCR the scanned document with Editable Text and Images, export the result to 1200 dpi images, reimport those into Acrobat, and OCR once more with Editable Text and Images. This reduces the variation in fonts and the overall space occupied by fonts in the final (second-pass) document. Images do suffer though, so I developed another workaround that fixes the problems with images.

As soon as an image contains a little bit of text, Editable Text and Images messes with the image so badly that it often becomes blurred and unreadable. When the document is short, or when there are not too many of those blurred and unreadable images, I use Photoshop to split each page into two separate images (i.e. page001a.png and page001b.png) using the Marquee tool: on one copy I select the images and cut the pixels (Shift + Del), and on the other I select the text and cut the pixels. I then OCR the image containing the text with Editable Text and Images, and OCR the image containing the 'images' with Searchable Image (300 dpi). Then I manually overlay the image on top of the text. Very tedious and time-consuming, but hey, when you need the small size and quality of Editable Text and Images with readable images, you sometimes have to sweat a bit!
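
For anyone who would rather script the split than do it by hand, the Marquee-and-cut step can be sketched roughly as below. This is a dependency-free Python illustration under my own assumptions, not a tested production tool: the page is a plain grid of grayscale values, the picture regions are rectangles you supply yourself, and `split_layers` is a made-up name. In practice you would load and save the PNGs with an image library, and the OCR and overlay steps still happen in Acrobat.

```python
from typing import List, Tuple

Page = List[List[int]]            # grayscale pixels, 255 = white paper
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), bottom/right exclusive

WHITE = 255


def split_layers(page: Page, image_boxes: List[Box]) -> Tuple[Page, Page]:
    """Return (text_layer, image_layer) without modifying `page`.

    text_layer  = the page with every picture region cut out to white
                  (the copy to OCR as Editable Text and Images).
    image_layer = a white page holding only the picture regions
                  (the copy to OCR as Searchable Image).
    """
    text_layer = [row[:] for row in page]
    image_layer = [[WHITE] * len(row) for row in page]
    for top, left, bottom, right in image_boxes:
        for y in range(max(top, 0), min(bottom, len(page))):
            for x in range(max(left, 0), min(right, len(page[y]))):
                image_layer[y][x] = page[y][x]   # keep picture pixels here
                text_layer[y][x] = WHITE         # and cut them from the text copy
    return text_layer, image_layer


# Tiny 3x4 "page": dark text pixels (0) around a 2x2 grey picture.
page = [
    [0,   0,   255, 255],
    [255, 100, 120, 255],
    [255, 130, 140, 0],
]
text_layer, image_layer = split_layers(page, [(1, 1, 3, 3)])
print(text_layer)
print(image_layer)
```

Because both layers keep the full page dimensions, the picture layer lines up with the text layer when one is overlaid on the other.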

 

Hope this helps until Adobe decides to make things right.