Unable to extract text from PDFs containing embedded fonts

Apr 08, 2021

Hello,

We are a developer that relies on extracting text from printed PDFs via the Windows Enhanced Metafile (EMF) format, using our printer driver. We have noticed that printing from current versions of Acrobat Reader is problematic in this regard when the PDF document contains embedded fonts.

In many cases, text output from the PDF document is represented as glyph indices rather than characters. We have developed a system for converting glyph indices back into appropriately encoded text characters (Unicode). Our application also handles the embedded temporary font files that Acrobat Reader creates when printing and uses them to help decode the glyphs accordingly.
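As an illustration of the glyph-to-character step described above, here is a minimal Python sketch. It assumes the font's character-to-glyph mapping has already been read into a plain dictionary; the names `invert_cmap` and `decode_glyphs` are hypothetical and are not taken from the poster's application.

```python
from typing import Dict, Iterable

REPLACEMENT = "\ufffd"  # emitted for glyph indices with no known character


def invert_cmap(char_to_glyph: Dict[int, int]) -> Dict[int, str]:
    """Build a glyph-index -> character table from a code-point -> glyph table."""
    glyph_to_char: Dict[int, str] = {}
    # Visit code points in ascending order so that, when several code points
    # share one glyph, the lowest code point is chosen deterministically.
    for codepoint in sorted(char_to_glyph):
        glyph_to_char.setdefault(char_to_glyph[codepoint], chr(codepoint))
    return glyph_to_char


def decode_glyphs(glyph_ids: Iterable[int], glyph_to_char: Dict[int, str]) -> str:
    """Translate a run of glyph indices into text, marking unknown glyphs."""
    return "".join(glyph_to_char.get(g, REPLACEMENT) for g in glyph_ids)
```

This only works when the font's mapping table carries real character codes. If the table holds a synthetic sequence instead, the inverted table has nothing meaningful to return.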

Attached are some examples of embedded fonts that were extracted by printing a sample PDF from Adobe Reader DC v2021.001.20145. These fonts are missing appropriate character encoding indices in the Character to Glyph Index Mapping Table (cmap), typically located in the font header. Instead, the table contains a linear sequence of integers that does not map to a character encoding (see example below).

Example of missing character indices in embedded font (from FNTFD93.ttf attached)

Without these character encoding indices, it is proving impossible to convert the glyph back into the text character it represents.
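For anyone wanting to inspect such a font, the following is a rough Python sketch that lists the cmap encoding records of a single TrueType/OpenType file using only the standard library. It assumes a plain sfnt file (not a font collection) and does no bounds checking; the function names are hypothetical.

```python
import struct
from typing import List, Tuple


def list_cmap_subtables(font: bytes) -> List[Tuple[int, int, int]]:
    """Return (platformID, encodingID, format) for each cmap encoding record."""
    # sfnt header: uint32 version, uint16 numTables, then three uint16 search fields.
    num_tables = struct.unpack_from(">H", font, 4)[0]
    cmap_offset = None
    for i in range(num_tables):
        # Each 16-byte table record holds: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(">4sIII", font, 12 + 16 * i)
        if tag == b"cmap":
            cmap_offset = offset
            break
    if cmap_offset is None:
        return []
    # cmap header: uint16 version, uint16 numTables (number of encoding records).
    _version, num_records = struct.unpack_from(">HH", font, cmap_offset)
    result = []
    for i in range(num_records):
        # Encoding record: platformID, encodingID, subtable offset from cmap start.
        platform_id, encoding_id, sub_offset = struct.unpack_from(
            ">HHI", font, cmap_offset + 4 + 8 * i
        )
        # Every subtable begins with a uint16 format number.
        fmt = struct.unpack_from(">H", font, cmap_offset + sub_offset)[0]
        result.append((platform_id, encoding_id, fmt))
    return result


def has_unicode_cmap(subtables: List[Tuple[int, int, int]]) -> bool:
    """True if any record is Unicode: platform 0, or Windows platform 3 with encoding 1 or 10."""
    return any(p == 0 or (p == 3 and e in (1, 10)) for p, e, _ in subtables)
```

Running `list_cmap_subtables(open("FNTFD93.ttf", "rb").read())` on the temporary fonts from different readers would show which encoding records each one provides. Note that a Unicode-labelled record does not by itself guarantee that the codes inside it are real character codes.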

We have noticed that this issue is specific to Acrobat Reader. When we take the same sample PDF and print it through an alternative PDF reader, e.g. Foxit Reader, the issue is not present and we can extract text from the PDF without any problem. We have noticed that the temporary font files produced by other readers are different and usually contain two mapping tables, which contain the glyph->Unicode mapping.

You can replicate this issue by printing to the Microsoft XPS driver. When printing through Adobe Reader, you are unable to extract the characters from the XPS output file. However, when printing the same file through, for example, Foxit Reader, you can extract the characters from the XPS output.
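One way to compare the two XPS outputs is sketched below in Python. It relies on an XPS file being a ZIP package whose page parts (`.fpage`) contain `Glyphs` elements with a `UnicodeString` attribute. The function name is hypothetical, and reading an empty or missing `UnicodeString` as "text not extractable" is an assumption rather than something confirmed in this thread.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List


def xps_unicode_strings(xps_bytes: bytes) -> List[str]:
    """Collect the UnicodeString attribute of every Glyphs element in an XPS package."""
    found: List[str] = []
    with zipfile.ZipFile(io.BytesIO(xps_bytes)) as package:
        for name in package.namelist():
            # Page markup is stored in parts with the .fpage extension.
            if not name.lower().endswith(".fpage"):
                continue
            root = ET.fromstring(package.read(name))
            for element in root.iter():
                # Compare local names so the sketch does not depend on the namespace URI.
                if element.tag.rsplit("}", 1)[-1] == "Glyphs":
                    # A missing attribute is recorded as an empty string.
                    found.append(element.get("UnicodeString", ""))
    return found
```

Calling this on the bytes of each output file (for example `xps_unicode_strings(open("out.xps", "rb").read())`, where `out.xps` is a placeholder name) gives a quick side-by-side view of which text runs carry readable characters.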

Do you know why Acrobat Reader is producing these temporary fonts without the mapping, and is there a way to change the output so it does produce what we require (i.e. the ability to extract text from the print stream correctly)?

As Acrobat Reader is the established standard on Windows, we would be grateful for any advice on how to resolve this, as our customers are greatly impacted by this issue.

Many Thanks

Apr 08, 2021

Adobe views the only job of "Print" in Acrobat Reader to be producing something on a printer that looks like the PDF. They don't view supporting post-processing as an aim; possibly they view it as something undesirable for working with their free software. Paid-for Acrobat has various rather simplistic APIs for text extraction.

Apr 08, 2021

Per @Test Screen Name's response, the sole purpose of the “print” function of Adobe Acrobat Reader is to provide hard copy output that matches what you see on the screen, nothing more and nothing less.

It is not intended to provide a means of extracting anything or, for that matter, of regenerating PDF, a process we call “refrying a PDF.” If you wish to programmatically extract text from a PDF file, there are plenty of tools available from Adobe or third parties to achieve that, either as plug-ins to Reader or Acrobat or as standalone programs.

- Dov Isaacs, former Adobe Principal Scientist (April 30, 1990 - May 30, 2021)

Adobe Community