Participant
November 23, 2021
Question

Missing font leads to missing characters in the Extraction API output


When extracting text from page 59 (zero-based) of the following PDF, the word "2008" is extracted as unknown characters:

"Text": "WESTERN UNION  Annual Report "

PDF file 

When viewing the PDF in Acrobat Reader, it looks fine:

 


I tried with PDFBox, and got the following error for these characters:

No Unicode mapping for twoalt (2) in font HGLLLJ+BulmerMT-ItalicAlt


Any help with this would be highly appreciated!


    1 reply

    Joel Geraci
    Community Expert
    November 23, 2021

    I haven't dug too far into the file, but generally, if the font uses a custom encoding and there is no ToUnicode map in the font resource, we can't extract the text accurately.
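
    When a ToUnicode map is missing, an extractor can only guess from the glyph names in the font's encoding. Below is a minimal Python sketch of such a guess, not what the Extraction API or PDFBox actually does. It assumes the glyph names are readable (as "twoalt" is in the PDFBox error above) and that a nonstandard suffix such as "alt" marks a variant of a standard glyph. The names AGL_SUBSET, FALLBACK_SUFFIXES and glyph_name_to_unicode are hypothetical, and the table is only a tiny hand-written subset of the Adobe Glyph List.

```python
from typing import Optional

# Tiny hand-written subset of the Adobe Glyph List (digits only).
# A real tool would load the full glyph list instead.
AGL_SUBSET = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Assumption: suffixes that a font vendor glued onto a standard name
# without the usual "." separator. This list is a guess, not a standard.
FALLBACK_SUFFIXES = ("alt",)


def glyph_name_to_unicode(name: str) -> Optional[str]:
    """Best-effort mapping from a glyph name to a Unicode string."""
    # Standard convention: anything after the first "." is a variant
    # suffix and is ignored ("two.alt" -> "two").
    base = name.split(".", 1)[0]
    if base in AGL_SUBSET:
        return AGL_SUBSET[base]

    # Heuristic fallback for names such as "twoalt": strip a known
    # suffix and retry the lookup.
    for suffix in FALLBACK_SUFFIXES:
        if base.endswith(suffix):
            stripped = base[: -len(suffix)]
            if stripped in AGL_SUBSET:
                return AGL_SUBSET[stripped]

    # Unknown glyph name: report failure rather than guess.
    return None


# Glyph names as they might appear for the word in question.
names = ["twoalt", "zeroalt", "zeroalt", "eightalt"]
text = "".join(glyph_name_to_unicode(n) or "\ufffd" for n in names)
print(text)
```

    This only helps when the glyph names carry meaning. If the font uses arbitrary names (for example "g17"), there is nothing to map from, and OCR on the rendered page is usually the remaining option.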