Missing font leads to missing characters at the Extraction API output
New Here
Nov 22, 2021
When extracting text from page 59 (zero-based counting) of the following PDF, the word "2008" is extracted as unknown characters:
"Text": "WESTERN UNION Annual Report "
PDF file
When I view the PDF in Acrobat Reader, it renders correctly.
I tried with PDFBox, and got the following error for these characters:
No Unicode mapping for twoalt (2) in font HGLLLJ+BulmerMT-ItalicAlt
Any help with this would be highly appreciated!
Community Expert
Nov 23, 2021
I haven't dug too far into the file, but generally, if the font uses a custom encoding and there is no ToUnicode map in the font resource, we can't extract the text accurately.
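For anyone post-processing such output themselves, here is a minimal Python sketch of one possible workaround: guessing the character from the glyph name when no ToUnicode map exists. It is not what the Extraction API or PDFBox does. The glyph table is a tiny hand-written subset, the suffix list is an assumption, and only `twoalt` is confirmed by the error message above; `zeroalt` and `eightalt` are hypothetical names that follow the same pattern.

```python
import re

# Tiny hand-written subset of standard glyph names (assumption: a real
# tool would load the full Adobe Glyph List instead).
BASE_GLYPHS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# Suffixes assumed to mark a stylistic variant of a base glyph,
# e.g. "twoalt" or "two.alt". This list is illustrative, not exhaustive.
VARIANT_SUFFIX = re.compile(r"(\.[A-Za-z0-9]+|alt|oldstyle|small)$")

REPLACEMENT = "\ufffd"  # Unicode replacement character for unresolved glyphs


def glyph_to_unicode(name):
    """Guess the text for a glyph name, or return None if no guess is possible."""
    # First try the name as a standard glyph name.
    if name in BASE_GLYPHS:
        return BASE_GLYPHS[name]
    # Otherwise strip one variant suffix and retry with the base name.
    base = VARIANT_SUFFIX.sub("", name)
    if base != name:
        return BASE_GLYPHS.get(base)
    return None


def decode(glyph_names):
    """Join the guessed characters, marking unresolved glyphs explicitly."""
    return "".join(glyph_to_unicode(n) or REPLACEMENT for n in glyph_names)


if __name__ == "__main__":
    # "twoalt" comes from the PDFBox error; the other names are hypothetical.
    print(decode(["twoalt", "zeroalt", "zeroalt", "eightalt"]))
```

A heuristic like this only helps when the font keeps meaningful glyph names. If the names are arbitrary (for example `g123`), the remaining options are OCR on the rendered page or asking the document's producer to embed a ToUnicode map.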

