Missing font leads to missing characters in the Extraction API output

New Here, Nov 22, 2021


When extracting text from page 59 (zero-based counting) of the following PDF, the word "2008" is extracted as unknown characters:

"Text": "WESTERN UNION  Annual Report "

PDF file 

When viewing the PDF in Acrobat Reader, it looks OK:

Orit21877757boh2_0-1637652576119.png

 


I tried with PDFBox, and got the following error for these characters:

No Unicode mapping for twoalt (2) in font HGLLLJ+BulmerMT-ItalicAlt
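
For context, this message suggests that the extractor fell back to looking up the glyph name, and `twoalt` is not a name it knows. The sketch below only illustrates that kind of name-based lookup; it is not PDFBox's actual implementation. `AGL_SUBSET` is a tiny hand-picked subset of the Adobe Glyph List, `resolve_glyph_name` and `guess_from_suffix` are hypothetical helpers, and stripping a trailing `alt` is a guess that would need checking against the rendered page.

```python
import re

# Tiny hand-picked subset of the Adobe Glyph List (the real list is far larger).
AGL_SUBSET = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def resolve_glyph_name(name):
    """Return the Unicode string for a glyph name, or None if it is unmapped."""
    # A name such as "two.alt" carries a suffix after the first period; drop it.
    base = name.split(".", 1)[0]
    if base in AGL_SUBSET:
        return AGL_SUBSET[base]
    # Simplified handling of "uniXXXX" names, which encode the code point directly.
    match = re.fullmatch(r"uni([0-9A-F]{4})", base)
    if match:
        return chr(int(match.group(1), 16))
    return None

def guess_from_suffix(name, suffixes=("alt",)):
    """Hypothetical heuristic: retry the lookup after stripping a trailing suffix."""
    for suffix in suffixes:
        if name.endswith(suffix) and len(name) > len(suffix):
            hit = resolve_glyph_name(name[: -len(suffix)])
            if hit is not None:
                return hit
    return None
```

Here `resolve_glyph_name("twoalt")` returns `None` because the name has no period-separated suffix and is not in the table, while `guess_from_suffix("twoalt")` returns `"2"`.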


Any help with this would be highly appreciated!

Community Expert, Nov 23, 2021


I haven't dug too far into the file, but generally, if the font uses a custom encoding and there is no ToUnicode map in the font resource, we can't extract the text accurately.
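
To make that concrete, here is a minimal sketch of the usual lookup order a text extractor follows: the ToUnicode map first, then the glyph name from the font's encoding, and otherwise no mapping. It is an illustration under assumptions, not the Extraction API's implementation. `decode_text`, the character codes, and the glyph names `zeroalt` and `eightalt` are made up for the example, and using U+FFFD for unmapped characters is my choice here rather than what the API is known to emit.

```python
REPLACEMENT = "\ufffd"  # stand-in for "unknown character" in this sketch

def decode_text(codes, to_unicode=None, code_to_name=None, name_to_unicode=None):
    """Map character codes to text, preferring ToUnicode over glyph names."""
    to_unicode = to_unicode or {}
    code_to_name = code_to_name or {}
    name_to_unicode = name_to_unicode or {}
    out = []
    for code in codes:
        if code in to_unicode:
            # 1. A ToUnicode entry gives the answer directly.
            out.append(to_unicode[code])
            continue
        # 2. Otherwise go through the glyph name from the font's encoding.
        name = code_to_name.get(code)
        char = name_to_unicode.get(name)
        # 3. Nothing matched: the text cannot be recovered from the font data.
        out.append(char if char is not None else REPLACEMENT)
    return "".join(out)

# Made-up codes for the word "2008" in a font with a custom encoding.
codes = [1, 2, 2, 3]
custom_encoding = {1: "twoalt", 2: "zeroalt", 3: "eightalt"}
standard_names = {"two": "2", "zero": "0", "eight": "8"}

without_map = decode_text(codes, code_to_name=custom_encoding,
                          name_to_unicode=standard_names)
with_map = decode_text(codes, to_unicode={1: "2", 2: "0", 3: "8"})
```

In this sketch `without_map` is four replacement characters, since none of the custom names appear in the standard table, while `with_map` is `"2008"`. The page still renders correctly in Acrobat Reader either way because rendering uses the glyph outlines and does not need a Unicode mapping.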
