PDF String literal Parsing Issue
I have the following contents on the same PDF page, in different XObjects:
First:
[(some text)] TJ ET Q
[(some other text)] TJ ET Q
Very simple and basic so far...
Second:
[( H T M L E x a m p l e)] TJ ET Q
[( S o m e s p e c i a l c h a r a c t e r s : < ¬ ¬ ¬ & ט ט © > \\ s l a s h \\ \\ d o u b l e - s l a s h \\ \\ \\ t r i p l e - s l a s h )] TJ ET Q
NOTE: It is not noticeable in the text above, but:
'H T M L E x a m p l e' is actually 0H0T0M0L0[32]0E0x0a0m0p0l0e, where each 0 is the literal byte value 0 == ((char)0). If I ignore all the 0 values, this reads just like the first example...
Some Bytes:
htmlexample == [0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69, 0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101]
<content> == [0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 0, 38, 0, 32, 0, -24, 0, 32, 0, -24, 0, 32, 0, -87, 0, 32, 0]
But in the next line I need to combine every two bytes into one char, because of the following:
< ¬ ¬ ¬...> is actually <0[32][32]¬0[32][32]¬0[32][32]¬...>, where the combination [32]¬ (bytes 0x20 0xAC) is €
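A minimal Java sketch of the pairing described above. It assumes the string bytes have already been unescaped, and that every code is a big-endian 16-bit value equal to its Unicode value. That holds for the bytes quoted in this post, but it is not true of PDFs in general. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

final class TwoByteCodes {
    // Reads each pair of bytes as one big-endian 16-bit code unit.
    // Assumption: code == Unicode value, as in the bytes quoted in this post.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] htmlExample = {0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69,
                              0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101};
        System.out.println(decode(htmlExample)); // HTML Example

        // [32][-84] is 0x20 0xAC, i.e. the single code 0x20AC (the euro sign).
        byte[] euro = {32, -84};
        System.out.println(Integer.toHexString(decode(euro).charAt(0))); // 20ac
    }
}
```

Dropping the 0 bytes only appears to work for the first string because every high byte there is 0. Pairing the bytes handles both strings the same way.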
The font used for the problematic Object is:
#7 0# {
'Name' : "F4"
'BaseFont' : "AAAAAE+DejaVuSans-Bold"
'Subtype' : "Type0"
'ToUnicode' : #41 0# {
'Filter' : "FlateDecode"
'Length' : 1679.0f
} + Stream(5771 bytes)
'Encoding' : "Identity-H"
'DescendantFonts' : [#42 0# {
'FontDescriptor' : #43 0# {
'MaxWidth' : 2016.0f
'AvgWidth' : 573.0f
'FontBBox' : [-1069.0f, -415.0f, 1975.0f, 1174.0f]
'MissingWidth' : 600.0f
'FontName' : "AAAAAE+DejaVuSans-Bold"
'Type' : "FontDescriptor"
'CapHeight' : 729.0f
'StemV' : 60.0f
'Leading' : 0.0f
'FontFile2' : #34 0# {
'Filter' : "FlateDecode"
'Length1' : 83036.0f
'Length' : 34117.0f
} + Stream(83036 bytes)
'Ascent' : 928.0f
'Descent' : -236.0f
'XHeight' : 547.0f
'StemH' : 26.0f
'Flags' : 32.0f
'ItalicAngle' : 0.0f
}
'Subtype' : "CIDFontType2"
'W' : [32.0f, [348.0f, 456.0f, 521.0f, 838.0f, 696.0f, 1002.0f, 872.0f, 306.0f, 457.0f, 457.0f, 523.0f, 838.0f, 380.0f, 415.0f, 380.0f, 365.0f], 48.0f, 57.0f, 696.0f, 58.0f, 59.0f, 400.0f, 60.0f, 62.0f, 838.0f, 63.0f, [580.0f, 1000.0f, 774.0f, 762.0f, 734.0f, 830.0f, 683.0f, 683.0f, 821.0f, 837.0f, 372.0f, 372.0f, 775.0f, 637.0f, 995.0f, 837.0f, 850.0f, 733.0f, 850.0f, 770.0f, 720.0f, 682.0f, 812.0f, 774.0f, 1103.0f, 771.0f, 724.0f, 725.0f, 457.0f, 365.0f, 457.0f, 838.0f, 500.0f, 500.0f, 675.0f, 716.0f, 593.0f, 716.0f, 678.0f, 435.0f, 716.0f, 712.0f, 343.0f, 343.0f, 665.0f, 343.0f, 1042.0f, 712.0f, 687.0f, 716.0f, 716.0f, 493.0f, 595.0f, 478.0f, 712.0f, 652.0f, 924.0f, 645.0f, 652.0f, 582.0f, 712.0f, 365.0f, 712.0f, 838.0f], 160.0f, [348.0f, 456.0f, 696.0f, 696.0f, 636.0f, 696.0f, 365.0f, 500.0f, 500.0f, 1000.0f, 564.0f, 646.0f, 838.0f, 415.0f, 1000.0f, 500.0f, 500.0f, 838.0f, 438.0f, 438.0f, 500.0f, 736.0f, 636.0f, 380.0f, 500.0f, 438.0f, 564.0f, 646.0f], 188.0f, 190.0f, 1035.0f, 191.0f, 191.0f, 580.0f, 192.0f, 197.0f, 774.0f, 198.0f, [1085.0f, 734.0f], 200.0f, 203.0f, 683.0f, 204.0f, 207.0f, 372.0f, 208.0f, [838.0f, 837.0f], 210.0f, 214.0f, 850.0f, 215.0f, [838.0f, 850.0f], 217.0f, 220.0f, 812.0f, 221.0f, [724.0f, 738.0f, 719.0f], 224.0f, 229.0f, 675.0f, 230.0f, [1048.0f, 593.0f], 232.0f, 235.0f, 678.0f, 236.0f, 239.0f, 343.0f, 240.0f, [687.0f, 712.0f, 687.0f, 687.0f, 687.0f, 687.0f, 687.0f], 247.0f, [838.0f, 687.0f], 249.0f, 252.0f, 712.0f, 253.0f, [652.0f, 716.0f]]
'Type' : "Font"
'BaseFont' : "AAAAAE+DejaVuSans-Bold"
'CIDSystemInfo' : {
'Supplement' : 0.0f
'Ordering' : "Identity" + Stream(8 bytes)
'Registry' : "Adobe" + Stream(5 bytes)
}
'DW' : 600.0f
'CIDToGIDMap' : #44 0# {
'Filter' : "FlateDecode"
'Length' : 10200.0f
} + Stream(131072 bytes)
}]
'Type' : "Font"
}
There is no indication of the encoding type of the font.
As for the ToUnicode object, in the case of this font it is unnecessary: it should have been Identity-H, but instead it is an X == X mapping. Here are some examples of the ranges, which go from 0000 up to FFFF:
<0000> <00ff> <0000>
<0100> <01ff> <0100>
<0200> <02ff> <0200>
<0300> <03ff> <0300>
<0400> <04ff> <0400>
<0500> <05ff> <0500>
<0600> <06ff> <0600>
<0700> <07ff> <0700>
<0800> <08ff> <0800>
<0900> <09ff> <0900>
<0a00> <0aff> <0a00>
<0b00> <0bff> <0b00>
<0c00> <0cff> <0c00>
<0d00> <0dff> <0d00>
<0e00> <0eff> <0e00>
<0f00> <0fff> <0f00>
<1000> <10ff> <1000>
<1100> <11ff> <1100>
....
....
....
<fc00> <fcff> <fc00>
<fd00> <fdff> <fd00>
<fe00> <feff> <fe00>
<ff00> <ffff> <ff00>
So the mapping is not in the ToUnicode object, but other renderers still render it correctly!
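For illustration, here is a rough sketch of how rows like these could be applied. It assumes they are bfrange entries of the ToUnicode CMap, where `<lo> <hi> <dst>` maps each code in lo..hi to dst + (code - lo). The parser below is hypothetical and handles only the four-hex-digit form shown in this post, not array destinations or longer values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BfRangeSketch {
    // Matches one "<lo> <hi> <dst>" row with four hex digits per field.
    private static final Pattern ROW = Pattern.compile(
        "<([0-9a-fA-F]{4})>\\s*<([0-9a-fA-F]{4})>\\s*<([0-9a-fA-F]{4})>");

    // Returns {lo, hi, dst} triples in the order they appear.
    static List<int[]> parse(String cmapBody) {
        List<int[]> ranges = new ArrayList<>();
        Matcher m = ROW.matcher(cmapBody);
        while (m.find()) {
            ranges.add(new int[] {
                Integer.parseInt(m.group(1), 16),
                Integer.parseInt(m.group(2), 16),
                Integer.parseInt(m.group(3), 16)});
        }
        return ranges;
    }

    // Maps a two-byte code to its Unicode value, or -1 if no range covers it.
    static int toUnicode(List<int[]> ranges, int code) {
        for (int[] r : ranges) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }
}
```

With rows of the X == X form, `toUnicode` returns the code unchanged, so the two-byte code 0x20AC would come out as U+20AC.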
The problem I'm facing is not the conversion itself. I use the following, which works well for the F4 (see above) string literals but messes up all the other strings on the page:
new String(sb.toString().getBytes("UTF-8"),"UTF-16BE")
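A side note on that round trip, as a hedged sketch: assuming `sb` holds one char per raw byte (a guess about the surrounding code, which is not shown), `getBytes("UTF-8")` only reproduces the original bytes for values below 0x80. A byte such as 0xAC is expanded to two bytes and the pairs shift. ISO-8859-1 keeps a one-to-one mapping for 0x00 to 0xFF. The method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class RoundTripSketch {
    // The conversion from the post: chars >= 0x80 become two UTF-8 bytes each.
    static String viaUtf8(String oneCharPerByte) {
        return new String(oneCharPerByte.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.UTF_16BE);
    }

    // Alternative: ISO-8859-1 turns each char 0x00..0xFF back into one byte.
    static String viaLatin1(String oneCharPerByte) {
        return new String(oneCharPerByte.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_16BE);
    }
}
```

For the input "\u0000H" both methods give "H". For "\u0020\u00AC" only `viaLatin1` gives the euro sign.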
The problem is knowing when to read the bytes as UTF-8 and when as some other encoding. Where do the parameters for the string literal reside?
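One place to look, offered as a sketch rather than a definitive answer: a string shown by TJ is interpreted with the font selected by the most recent Tf operator, so the font dictionary can decide the code width. The example assumes the dictionary is available as a plain `Map<String, String>` (a hypothetical shape, not any particular library's API). It also assumes the codes equal Unicode values, as in this file, and that single-byte strings can be read as ISO-8859-1, which is a simplification.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class FontAwareDecoder {
    // Type0 fonts with an Identity-H or Identity-V encoding use two-byte codes.
    static boolean usesTwoByteCodes(Map<String, String> font) {
        String encoding = font.get("Encoding");
        return "Type0".equals(font.get("Subtype"))
            && ("Identity-H".equals(encoding) || "Identity-V".equals(encoding));
    }

    // Chooses the code width from the font that was current when the string was shown.
    // Assumption: code == Unicode value for two-byte fonts, Latin-1 otherwise.
    static String decode(byte[] raw, Map<String, String> font) {
        return usesTwoByteCodes(font)
            ? new String(raw, StandardCharsets.UTF_16BE)
            : new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```

With the F4 dictionary above ('Subtype' Type0, 'Encoding' Identity-H), the bytes are paired. A string shown with a simple font would be read one byte per character.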
Message was edited by: Adam Zehavi