CMap from CIDFont with Identity-H encoding
Hi,
I'm working on a PDF parser for content extraction.
Everything worked well until I tested a certain PDF that contains these hex strings:
<0014001300130014> displays as "1001" and <00290028003100280037003500280036> displays as "FENETRES" in viewers.
Each 2-byte code is 0x1D lower than the ASCII code of the character it displays as.
I don't know how to retrieve the correct character code from the hex string.
I can't hard-code a 0x1D offset because this is one PDF among thousands.
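For reference, the offset can be reproduced with a small Python sketch. Under /Identity-H every character code is two bytes, so the hex string is split into 16-bit big-endian values and compared with what the viewer displays. The names `split_identity_h` and `offsets` are hypothetical helpers for this illustration, not part of any library.

```python
def split_identity_h(hex_string: str) -> list[int]:
    """Split a PDF hex string into 2-byte codes (Identity-H uses 16-bit codes)."""
    digits = "".join(hex_string.strip("<>").split())  # drop delimiters and whitespace
    if len(digits) % 2:
        digits += "0"  # PDF pads an odd number of hex digits with a trailing 0
    data = bytes.fromhex(digits)
    if len(data) % 2:
        raise ValueError("expected a whole number of 2-byte codes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def offsets(hex_string: str, displayed: str) -> list[int]:
    """Difference between each displayed character's code point and its 2-byte code."""
    codes = split_identity_h(hex_string)
    return [ord(ch) - code for ch, code in zip(displayed, codes)]


print(offsets("<0014001300130014>", "1001"))
print(offsets("<00290028003100280037003500280036>", "FENETRES"))
```

Every offset comes out as 29 (0x1D) for both strings, but that only describes this one font; it says nothing about other files.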
Looking up the font in use gives me:
<< /Type /Font /Subtype /Type0 /Encoding /Identity-H /DescendantFonts [31 0 R]
/BaseFont /QBYAWA+F0_CIDFont /ToUnicode 32 0 R >>
The ToUnicode object is:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
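Note that this stream declares only a codespace range; it has no `beginbfchar` or `beginbfrange` section, which is where a ToUnicode CMap normally carries its code-to-Unicode entries. A minimal Python sketch of extracting those entries, to confirm that nothing is defined here, could look like the following. `parse_tounicode` is a hypothetical helper, the `sample` entries are made up for illustration, and the parser assumes well-formed UTF-16BE destinations with no whitespace inside the hex tokens.

```python
import re

_HEX = r"<([0-9A-Fa-f]+)>"


def _utf16(hex_digits: str) -> str:
    """Decode a destination hex token as UTF-16BE text."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_tounicode(cmap_text: str) -> dict[int, str]:
    """Collect code -> Unicode entries from bfchar and bfrange sections."""
    mapping: dict[int, str] = {}
    # bfchar: pairs of <source code> <destination string>
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(_HEX + r"\s*" + _HEX, block):
            mapping[int(src, 16)] = _utf16(dst)
    # bfrange: <lo> <hi> <dst>  or  <lo> <hi> [<dst1> <dst2> ...]
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        tokens = re.findall(r"<[0-9A-Fa-f]+>|\[|\]", block)
        i = 0
        while i + 2 < len(tokens):
            lo = int(tokens[i][1:-1], 16)
            hi = int(tokens[i + 1][1:-1], 16)
            if tokens[i + 2] == "[":
                end = tokens.index("]", i + 2)
                for code, dst in zip(range(lo, hi + 1), tokens[i + 3:end]):
                    mapping[code] = _utf16(dst[1:-1])
                i = end + 1
            else:
                # Single destination: the last code unit is incremented per code
                base = _utf16(tokens[i + 2][1:-1])
                for n, code in enumerate(range(lo, hi + 1)):
                    mapping[code] = base[:-1] + chr(ord(base[-1]) + n)
                i += 3
    return mapping


# The only body section of the stream shown above: a codespace range, no mappings.
posted = "1 begincodespacerange\n<0000><FFFF>\nendcodespacerange\n"
print(parse_tounicode(posted))

# Hypothetical entries, showing what a populated CMap would yield.
sample = "2 beginbfchar\n<0013> <0030>\n<0014> <0031>\nendbfchar\n"
print(parse_tounicode(sample))
```

For the stream in this PDF the result is an empty dictionary, so the ToUnicode object gives the parser nothing to work with.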
And DescendantFonts leads to:
<< /Type /Font /Subtype /CIDFontType2 /BaseFont /QBYAWA+F0_CIDFont /FontDescriptor 35 0 R >>
And the FontDescriptor has the key /FontFile2.
But even inside the embedded font I don't have any cmap, only these tables:
"cvt ", "fpgm", "glyf", "head", "hhea", "hmtx", "loca", "maxp", "name", "prep"
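For anyone who wants to check the same thing on their own file, a table list like this can be read from the decoded /FontFile2 stream with a short Python sketch. It assumes `font` already holds the decompressed TrueType bytes; `list_sfnt_tables` is a hypothetical helper name, and the sketch does not handle TrueType collections.

```python
import struct


def list_sfnt_tables(font: bytes) -> list[str]:
    """Return the 4-character table tags from a TrueType (sfnt) table directory."""
    # Offset table: 4-byte sfnt version, then numTables (uint16);
    # searchRange, entrySelector and rangeShift follow and are skipped here.
    _version, num_tables = struct.unpack_from(">4sH", font, 0)
    tags = []
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length
        tag, _checksum, _offset, _length = struct.unpack_from(">4sIII", font, 12 + 16 * i)
        tags.append(tag.decode("latin-1"))
    return tags


def has_cmap(font: bytes) -> bool:
    """True when the embedded font carries its own 'cmap' table."""
    return "cmap" in list_sfnt_tables(font)
```

With the font from this PDF, `has_cmap` would return False, which matches the table list above.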
Does anyone have a clue how to retrieve the character code mapping, or what I'm missing here?
