Missing font leads to missing characters in the Extraction API output

Report · Nov 22, 2021

When extracting text from page 59 (zero-based counting) of the following PDF, the word "2008" is extracted as unknown characters:

"Text": "WESTERN UNION  Annual Report "

PDF file

When viewing the PDF in Acrobat Reader, it looks OK.

I tried with PDFBox, and got the following error for these characters:

No Unicode mapping for twoalt (2) in font HGLLLJ+BulmerMT-ItalicAlt

Any help with that would be highly appreciated!

Report · Nov 23, 2021

I haven't dug too far into the file, but generally, if the font uses a custom encoding and the font resource has no ToUnicode map, we can't extract the text accurately.
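
To illustrate why that happens, here is a minimal Python sketch of the lookup order a text extractor typically follows. It is a toy model, not the Extraction API or PDFBox code. The glyph name `twoalt` and character code 2 come from the PDFBox error above; the names `zeroalt` and `eightalt`, the other codes, the tiny glyph list, and the `decode_code` helper are all assumptions made up for the example.

```python
# Tiny subset of a standard glyph-name list, for illustration only.
GLYPH_LIST = {"zero": "0", "two": "2", "eight": "8", "space": " "}


def decode_code(code, to_unicode, encoding):
    """Return the text for one character code, or None if it cannot be mapped."""
    # 1. A ToUnicode map, when present, gives the answer directly.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # 2. Otherwise fall back to the glyph name from the font's encoding
    #    and look that name up in the standard glyph list.
    name = encoding.get(code)
    if name is None:
        return None
    # Non-standard names such as "twoalt" are not in the list, so this is None.
    return GLYPH_LIST.get(name)


# Hypothetical custom encoding: only code 2 -> "twoalt" is taken from the error message.
custom_encoding = {2: "twoalt", 3: "zeroalt", 4: "eightalt"}
codes = [2, 3, 3, 4]  # hypothetical codes for the word drawn as "2008"

# No ToUnicode map: every code is unmappable, which shows up as unknown characters.
without_map = [decode_code(c, None, custom_encoding) for c in codes]

# With a ToUnicode map in the font resource, the same codes decode cleanly.
to_unicode = {2: "2", 3: "0", 4: "8"}
with_map = "".join(decode_code(c, to_unicode, custom_encoding) for c in codes)

print(without_map)
print(with_map)
```

Acrobat Reader can still draw the page correctly because rendering only needs the glyph outlines, not their Unicode values, so a file can look fine on screen and still extract badly.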
