October 14, 2014
Question

PDF String literal Parsing Issue


I have the following content on the same PDF page, in different XObjects:

link to original file

First:

    [(some text)] TJ ET Q

    [(some other text)] TJ ET Q

Very simple and basic so far...

**The second**:

    [( H T M L   E x a m p l e)] TJ ET Q

    [( S o m e   s p e c i a l   c h a r a c t e r s :   <   ¬   ¬   ¬   &   ט   ט   ©   >   \\ s l a s h   \\ \\ d o u b l e - s l a s h   \\ \\ \\ t r i p l e - s l a s h  )] TJ ET Q

NOTE: It is not noticeable in the text above, but 'H T M L   E x a m p l e' is actually 0H0T0M0L0[32]0E0x0a0m0p0l0e, where each 0 is the literal byte value 0, i.e. ((char)0). If I ignore all the 0 bytes, this string turns out to be just like the first example.

Some Bytes:

 

    htmlexample == [0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69, 0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101]

    <content>  == [0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 0, 38, 0, 32, 0, -24, 0, 32, 0, -24, 0, 32, 0, -87, 0, 32, 0]

In the next line, however, I need to combine every two bytes into one char, because of the following:

<   ¬   ¬   ¬...> is actually <0[32][32]¬0[32][32]¬0[32][32]¬...>, where the byte pair [32]¬ (0x20 0xAC) combines into U+20AC, the € sign.
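To make the byte pairing concrete, here is a minimal Java sketch that joins every two bytes into one 16-bit code, high byte first. The class and method names are made up for illustration, and treating each code directly as a Java char is only an assumption that happens to hold for the bytes quoted in this post.

```java
final class TwoByteCodes {
    // Combines each pair of bytes into one 16-bit code, high byte first.
    // Assumption: the resulting code can be used directly as a Java char.
    static String decode(byte[] raw) {
        if (raw.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of bytes");
        }
        StringBuilder sb = new StringBuilder(raw.length / 2);
        for (int i = 0; i < raw.length; i += 2) {
            int hi = raw[i] & 0xFF;      // mask because Java bytes are signed
            int lo = raw[i + 1] & 0xFF;
            sb.append((char) ((hi << 8) | lo));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes copied from the "htmlexample" dump above.
        byte[] html = {0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69,
                       0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101};
        System.out.println(decode(html));

        // The pair [32][-84] is 0x20 0xAC, i.e. U+20AC.
        byte[] euro = {32, -84};
        System.out.println(decode(euro).equals("\u20AC"));
    }
}
```

Masking with 0xFF matters here: without it, the signed byte -84 would sign-extend and corrupt the high byte of the code.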

The font used for the problematic Object is:

    #7 0# {

        'Name' : "F4"

        'BaseFont' : "AAAAAE+DejaVuSans-Bold"

        'Subtype' : "Type0"

        'ToUnicode' : #41 0# {

            'Filter' : "FlateDecode"

            'Length' : 1679.0f

        } + Stream(5771 bytes)

        'Encoding' : "Identity-H"

        'DescendantFonts' : [#42 0# {

            'FontDescriptor' : #43 0# {

                'MaxWidth' : 2016.0f

                'AvgWidth' : 573.0f

                'FontBBox' : [-1069.0f, -415.0f, 1975.0f, 1174.0f]

                'MissingWidth' : 600.0f

                'FontName' : "AAAAAE+DejaVuSans-Bold"

                'Type' : "FontDescriptor"

                'CapHeight' : 729.0f

                'StemV' : 60.0f

                'Leading' : 0.0f

                'FontFile2' : #34 0# {

                    'Filter' : "FlateDecode"

                    'Length1' : 83036.0f

                    'Length' : 34117.0f

                } + Stream(83036 bytes)

                'Ascent' : 928.0f

                'Descent' : -236.0f

                'XHeight' : 547.0f

                'StemH' : 26.0f

                'Flags' : 32.0f

                'ItalicAngle' : 0.0f

            }

            'Subtype' : "CIDFontType2"

            'W' : [32.0f, [348.0f, 456.0f, 521.0f, 838.0f, 696.0f, 1002.0f, 872.0f, 306.0f, 457.0f, 457.0f, 523.0f, 838.0f, 380.0f, 415.0f, 380.0f, 365.0f], 48.0f, 57.0f, 696.0f, 58.0f, 59.0f, 400.0f, 60.0f, 62.0f, 838.0f, 63.0f, [580.0f, 1000.0f, 774.0f, 762.0f, 734.0f, 830.0f, 683.0f, 683.0f, 821.0f, 837.0f, 372.0f, 372.0f, 775.0f, 637.0f, 995.0f, 837.0f, 850.0f, 733.0f, 850.0f, 770.0f, 720.0f, 682.0f, 812.0f, 774.0f, 1103.0f, 771.0f, 724.0f, 725.0f, 457.0f, 365.0f, 457.0f, 838.0f, 500.0f, 500.0f, 675.0f, 716.0f, 593.0f, 716.0f, 678.0f, 435.0f, 716.0f, 712.0f, 343.0f, 343.0f, 665.0f, 343.0f, 1042.0f, 712.0f, 687.0f, 716.0f, 716.0f, 493.0f, 595.0f, 478.0f, 712.0f, 652.0f, 924.0f, 645.0f, 652.0f, 582.0f, 712.0f, 365.0f, 712.0f, 838.0f], 160.0f, [348.0f, 456.0f, 696.0f, 696.0f, 636.0f, 696.0f, 365.0f, 500.0f, 500.0f, 1000.0f, 564.0f, 646.0f, 838.0f, 415.0f, 1000.0f, 500.0f, 500.0f, 838.0f, 438.0f, 438.0f, 500.0f, 736.0f, 636.0f, 380.0f, 500.0f, 438.0f, 564.0f, 646.0f], 188.0f, 190.0f, 1035.0f, 191.0f, 191.0f, 580.0f, 192.0f, 197.0f, 774.0f, 198.0f, [1085.0f, 734.0f], 200.0f, 203.0f, 683.0f, 204.0f, 207.0f, 372.0f, 208.0f, [838.0f, 837.0f], 210.0f, 214.0f, 850.0f, 215.0f, [838.0f, 850.0f], 217.0f, 220.0f, 812.0f, 221.0f, [724.0f, 738.0f, 719.0f], 224.0f, 229.0f, 675.0f, 230.0f, [1048.0f, 593.0f], 232.0f, 235.0f, 678.0f, 236.0f, 239.0f, 343.0f, 240.0f, [687.0f, 712.0f, 687.0f, 687.0f, 687.0f, 687.0f, 687.0f], 247.0f, [838.0f, 687.0f], 249.0f, 252.0f, 712.0f, 253.0f, [652.0f, 716.0f]]

            'Type' : "Font"

            'BaseFont' : "AAAAAE+DejaVuSans-Bold"

            'CIDSystemInfo' : {

                'Supplement' : 0.0f

                'Ordering' : "Identity" + Stream(8 bytes)

                'Registry' : "Adobe" + Stream(5 bytes)

            }

            'DW' : 600.0f

            'CIDToGIDMap' : #44 0# {

                'Filter' : "FlateDecode"

                'Length' : 10200.0f

            } + Stream(131072 bytes)

        }]

        'Type' : "Font"

    }

There is no indication of the font's encoding type.

As for the ToUnicode object, in the case of this font it is unnecessary: it should have been Identity-H, but instead it is an X == X mapping. Here are some of its entries, which run from 0000 to FFFF:

    <0000> <00ff> <0000>

    <0100> <01ff> <0100>

    <0200> <02ff> <0200>

    <0300> <03ff> <0300>

    <0400> <04ff> <0400>

    <0500> <05ff> <0500>

    <0600> <06ff> <0600>

    <0700> <07ff> <0700>

    <0800> <08ff> <0800>

    <0900> <09ff> <0900>

    <0a00> <0aff> <0a00>

    <0b00> <0bff> <0b00>

    <0c00> <0cff> <0c00>

    <0d00> <0dff> <0d00>

    <0e00> <0eff> <0e00>

    <0f00> <0fff> <0f00>

    <1000> <10ff> <1000>

    <1100> <11ff> <1100>

    ....

    ....

    ....

    <fc00> <fcff> <fc00>

    <fd00> <fdff> <fd00>

    <fe00> <feff> <fe00>

    <ff00> <ffff> <ff00>

So the real mapping is not in the ToUnicode object, yet other renderers still render it correctly!
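For reference, three-column entries like the ones listed above follow the layout of a bfrange section in a ToUnicode CMap: first code, last code, and the value the first code maps to. The sketch below is a rough Java illustration of how such entries resolve a code. The class name is invented, it handles only the 4-hex-digit single-target form shown in this post, and the second entry in main is hypothetical, added only to show a non-identity range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BfRangeLookup {
    // Matches "<first> <last> <target>" with 4 hex digits each.
    private static final Pattern ENTRY = Pattern.compile(
        "<([0-9A-Fa-f]{4})>\\s*<([0-9A-Fa-f]{4})>\\s*<([0-9A-Fa-f]{4})>");

    // Each element is {first, last, target}.
    private final List<int[]> ranges = new ArrayList<>();

    void add(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a range entry: " + line);
        }
        ranges.add(new int[] {
            Integer.parseInt(m.group(1), 16),
            Integer.parseInt(m.group(2), 16),
            Integer.parseInt(m.group(3), 16)});
    }

    // Returns the mapped value, or -1 when no range covers the code.
    int toUnicode(int code) {
        for (int[] r : ranges) {
            if (code >= r[0] && code <= r[1]) {
                return r[2] + (code - r[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        BfRangeLookup map = new BfRangeLookup();
        map.add("<0000> <00ff> <0000>"); // entry from the post: X == X
        System.out.println(Integer.toHexString(map.toUnicode(0x0048)));

        BfRangeLookup other = new BfRangeLookup();
        other.add("<0041> <005a> <0061>"); // hypothetical non-identity entry
        System.out.println(Integer.toHexString(other.toUnicode(0x0041)));
    }
}
```

With the X == X entries from this file, every lookup simply returns the code it was given, which is why the table adds no information here.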

The problem I'm facing is not the conversion itself. I use the following, which works well for the F4 (see above) string literals but messes up all the other strings on the page.

    new String(sb.toString().getBytes("UTF-8"), "UTF-16BE")

The problem is knowing when to read the bytes as UTF-8 and when to use some other encoding. Where do the parameters for a string literal reside?
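The effect described above can be reproduced in isolation. The following Java sketch applies the same conversion to a two-byte string and to a plain single-byte string. It assumes the buffer holds one char per raw byte, which is a guess about how sb is filled, and the class name is made up.

```java
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    // Same conversion as in the post, using charset constants
    // instead of charset names.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // One char per raw byte: 0, 'H', 0, 'T', 0, 'M', 0, 'L'
        String twoByte = "\u0000H\u0000T\u0000M\u0000L";
        System.out.println(roundTrip(twoByte)); // pairs collapse into letters

        // A plain single-byte string: 's','o' becomes 0x736F, 'm','e' becomes 0x6D65
        String oneByte = "some";
        System.out.println(roundTrip(oneByte).length()); // 2 chars, not 4
    }
}
```

The conversion always pairs bytes, so it only suits strings that were written as two-byte codes in the first place.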

Message was edited by: Adam Zehavi



1 reply

October 24, 2014

I have read your post multiple times and am still trying to make complete sense of what you're trying to accomplish. I think that you are approaching the problem in the wrong way, and my instincts tell me that you're trying to fabricate input data for what you see encapsulated in the PDF in terms of data structures. The null (0x00) bytes specified in the input data are what lead me to this conclusion.

Keep in mind that the sole purpose of the Identity-H encoding (which is also instantiated as a CMap resource) is to map GIDs to their hexadecimal equivalent, which can look a lot like a 16-bit Unicode value because a GID is represented using 16 bits. The ToUnicode object can be an entirely different beast, because it maps GIDs to genuine Unicode values.
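One way to put this into practice is to let the font in effect for a string decide how its bytes are split into codes, rather than guessing from the bytes. The Java sketch below is an illustration under stated assumptions, not a full implementation: the names are hypothetical, only the Identity-H and Identity-V cases (two bytes per code) and simple fonts (one byte per code) are covered, and the code is cast straight to a char, which fits this file only because its ToUnicode table is X == X.

```java
final class CodeWidthByFont {
    // Returns how many bytes form one character code for the given font.
    // subtype and encoding are the values of the font dictionary's
    // Subtype and Encoding entries (encoding may be null).
    static int bytesPerCode(String subtype, String encoding) {
        if (!"Type0".equals(subtype)) {
            return 1; // simple fonts use single-byte codes
        }
        if ("Identity-H".equals(encoding) || "Identity-V".equals(encoding)) {
            return 2; // identity CMaps use fixed two-byte codes
        }
        // Other CMaps can define different code lengths; not handled here.
        throw new UnsupportedOperationException("CMap not handled: " + encoding);
    }

    // Splits raw string bytes into codes of the width the font calls for.
    // Assumption: each code is used directly as a char; a real reader
    // would map the code through ToUnicode or the font's encoding.
    static String decode(byte[] raw, String subtype, String encoding) {
        int width = bytesPerCode(subtype, encoding);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + width <= raw.length; i += width) {
            int code = 0;
            for (int k = 0; k < width; k++) {
                code = (code << 8) | (raw[i + k] & 0xFF);
            }
            sb.append((char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] f4 = {0, 72, 0, 84, 0, 77, 0, 76};
        System.out.println(decode(f4, "Type0", "Identity-H"));

        byte[] plain = {115, 111, 109, 101};
        System.out.println(decode(plain, "Type1", null));
    }
}
```

The same byte array is therefore read differently depending on the font dictionary, which is where the decoding parameters for a string come from in this sketch.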