Discrepancy between Adobe Extract PDF and the PDF content
I am new to PDF and would appreciate an explanation of how the Adobe SDK API works, based on the Adobe extractPDF sample and its class ExtractTextInfoFromPDF.java.
I have a source PDF that contains this definition:
7 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
/Encoding /WinAnsiEncoding
>>
endobj
Including the following text sequence:
BT
3 Tr
0.00 Tc
/F3 10.5 Tf
1 0 0 1 302.16 776.64 Tm
<i,/ILLENEUVE > Tj
ET
And when I run the extractPDF sample via the Adobe API to get the TEXT info, I get this:
"Font": {
"alt_family_name": "* Titlingmes New Roman",
"embedded": true,
"encoding": "Identity-H",
"family_name": "* Titlingmes New Roman",
"font_type": "CIDFontType0",
"italic": false,
"monospaced": false,
"name": "*Times New Roman-Bold-3921",
"subset": false,
"weight": 700
},
"HasClip": false,
"Lang": "fr",
"Page": 0,
"Path": "//Document/Sect/P",
"Text": "VILLENEUVE ",
"TextSize": 10.0
As you can see, the API has correctly translated "i,/" (3 characters, unless '/' has a special meaning in this sequence) into the "V" character.
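To double-check my reading of the bytes, here is a minimal sketch of how I would expect that string to decode under the font's declared encoding. It assumes that Windows-1252 is a fair stand-in for WinAnsiEncoding (the two agree on printable ASCII) and that the string operand holds exactly the bytes shown in the content stream. The class WinAnsiCheck and its decode helper are my own illustration, not part of the Adobe SDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WinAnsiCheck {
    // Assumption: Windows-1252 approximates PDF WinAnsiEncoding for these bytes.
    static final Charset WIN_ANSI_APPROX = Charset.forName("windows-1252");

    // Decode raw string bytes as a simple one-byte-per-character font would.
    static String decode(byte[] raw) {
        return new String(raw, WIN_ANSI_APPROX);
    }

    public static void main(String[] args) {
        // Assumption: these are the bytes of the string operand passed to Tj.
        byte[] raw = "i,/ILLENEUVE ".getBytes(StandardCharsets.US_ASCII);
        String decoded = decode(raw);

        // Print each decoded character with its code to show that the
        // first three bytes stay three separate characters.
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            System.out.printf("%2d: 0x%02X '%c'%n", i, (int) c, c);
        }
    }
}
```

If this sketch is right, a plain WinAnsi decode keeps "i", "," and "/" as three distinct characters and never produces "V", so the "VILLENEUVE " returned by the API would have to come from somewhere other than this string.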
The PDF was generated by a Canon scanner with OCR/tagging, since search works on this document, except when searching for "VILLENEUVE".
Note that when the PDF is opened for display, the letter "V" is not displayed clearly.
Can someone explain this mystery (the text is extracted correctly by the Adobe ExtractPDF API)?
Thanks, Eric
