Hi,
I'm working on a PDF parser for content extraction.
Everything worked well until I tested a certain PDF that contains these hex strings:
<0014001300130014> displays as "1001" and <00290028003100280037003500280036> displays as "FENETRES" in viewers.
Each 2-byte code is 0x1D lower than the ASCII code of the displayed character.
I don't know how to retrieve the correct character codes from the hex string.
I can't hard-code a 0x1D offset because this is just one PDF among thousands.
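To make the observation concrete, here is a minimal Python sketch that splits a hex string into 2-byte codes and measures the gap to the text shown by the viewer. The helper names `split_codes` and `offsets` are made up for illustration, and the fixed 2-byte code width is an assumption that happens to fit this font.

```python
def split_codes(hex_string: str, width: int = 2) -> list[int]:
    """Split a PDF hex string such as '<00140013>' into fixed-width integer codes."""
    digits = hex_string.strip().strip("<>")
    step = width * 2  # two hex digits per byte
    return [int(digits[i:i + step], 16) for i in range(0, len(digits), step)]


def offsets(hex_string: str, displayed: str) -> set[int]:
    """Return the set of (displayed character code - string code) differences."""
    codes = split_codes(hex_string)
    return {ord(ch) - code for code, ch in zip(codes, displayed)}


# Both samples from this PDF give a single, constant difference of 0x1D (29).
print(offsets("<0014001300130014>", "1001"))  # {29}
print(offsets("<00290028003100280037003500280036>", "FENETRES"))  # {29}
```

A constant difference like this is only a property of this particular subset font, so it says nothing about other files.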
The font dictionary in use is:
<< /Type /Font /Subtype /Type0 /Encoding /Identity-H /DescendantFonts [31 0 R]
/BaseFont /QBYAWA+F0_CIDFont /ToUnicode 32 0 R >>
The ToUnicode object is:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
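For comparison, a ToUnicode CMap that could drive text extraction would contain `beginbfchar` or `beginbfrange` sections, and the one above has neither. Below is a rough sketch of a parser. The helper name `parse_tounicode` is hypothetical, it uses regular expressions rather than a real PostScript tokenizer, and the `bfchar` / `bfrange` entries in `USABLE_CMAP` are invented purely to show what a usable mapping would look like.

```python
import re


def _utf16(hex_digits: str) -> str:
    """Decode a ToUnicode destination (big-endian UTF-16 written as hex)."""
    return bytes.fromhex(hex_digits).decode("utf-16-be")


def parse_tounicode(cmap_text: str) -> dict[int, str]:
    """Collect code -> text entries from bfchar and bfrange sections."""
    mapping: dict[int, str] = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        tokens = re.findall(r"<([0-9A-Fa-f]+)>", block)
        for src, dst in zip(tokens[0::2], tokens[1::2]):
            mapping[int(src, 16)] = _utf16(dst)
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        tokens = re.findall(r"<[0-9A-Fa-f]+>|\[[^\]]*\]", block)
        for lo, hi, dst in zip(tokens[0::3], tokens[1::3], tokens[2::3]):
            first, last = int(lo[1:-1], 16), int(hi[1:-1], 16)
            if dst.startswith("["):
                # Array form: one destination per code in the range.
                targets = [_utf16(h) for h in re.findall(r"<([0-9A-Fa-f]+)>", dst)]
            else:
                # Single destination: the last UTF-16 unit is incremented.
                digits = dst[1:-1]
                prefix, tail = digits[:-4], int(digits[-4:], 16)
                targets = [_utf16(prefix + f"{tail + i:04X}")
                           for i in range(last - first + 1)]
            for code, text in zip(range(first, last + 1), targets):
                mapping[code] = text
    return mapping


# The CMap posted above: only a codespace range, so nothing can be mapped.
POSTED_CMAP = "1 begincodespacerange\n<0000><FFFF>\nendcodespacerange\nendcmap"
print(parse_tounicode(POSTED_CMAP))  # {}

# Invented example of entries a usable CMap for this font would need.
USABLE_CMAP = ("2 beginbfchar\n<0014> <0031>\n<0013> <0030>\nendbfchar\n"
               "1 beginbfrange\n<0028> <0029> <0045>\nendbfrange")
print(parse_tounicode(USABLE_CMAP))
```

An empty result for a font whose strings are clearly text is the signal that the ToUnicode stream cannot help here.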
And /DescendantFonts leads to:
<< /Type /Font /Subtype /CIDFontType2 /BaseFont /QBYAWA+F0_CIDFont /FontDescriptor 35 0 R >>
The FontDescriptor has a /FontFile2 key.
But even in the embedded font, I don't have any cmap table, only these tables:
"cvt ", "fpgm", "glyf", "head", "hhea", "hmtx", "loca", "maxp", "name", "prep"
Does anyone have a clue how to retrieve the character code mapping, or what I'm missing here?
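The table check on the /FontFile2 stream can be sketched as follows. This assumes the stream has already been decompressed into raw TrueType bytes. `list_sfnt_tables` and `fake_directory` are illustrative names, and the demo builds a synthetic table directory from the ten tags listed above instead of using the real font.

```python
import struct


def list_sfnt_tables(font_bytes: bytes) -> list[str]:
    """Read the table tags from a TrueType (sfnt) table directory."""
    # Header: sfntVersion (4 bytes), numTables (uint16), then three uint16 search fields.
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    tags = []
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length.
        tag, _checksum, _offset, _length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i)
        tags.append(tag.decode("latin-1"))
    return tags


def fake_directory(tags: list[str]) -> bytes:
    """Build a header plus empty table records, for demonstration only."""
    header = struct.pack(">IHHHH", 0x00010000, len(tags), 0, 0, 0)
    records = b"".join(
        struct.pack(">4sIII", t.encode("latin-1"), 0, 0, 0) for t in tags)
    return header + records


posted_tags = ["cvt ", "fpgm", "glyf", "head", "hhea",
               "hmtx", "loca", "maxp", "name", "prep"]
tables = list_sfnt_tables(fake_directory(posted_tags))
print("cmap" in tables)  # False
```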
This forum is for questions about Acrobat. Have you tested whether Acrobat can extract text as you wish?
This is not a question about Acrobat but about PDF parsing.
Sorry about that, I didn't find the correct forum.
Extracting text from Acrobat gives me this:
As you can (or can't) see, the codes are 0014, 0013 as in the text editor.
I'm starting to think that the PDF doesn't display text but the "glyf" and "fpgm" data from the partial embedded TTF font file.
What do you mean "this"? You cannot email a picture. If you want to post a picture you must return to the forum to post it.
Oh, sorry, I'm new here and didn't think the text would be transcoded to spaces by the forum (or my browser).
I copied the text "1001 FENETRES" from the PDF viewer in Acrobat.
In the screenshot below, the PDF viewer is on the right, and the "squares" on the left are the copied text. Below that is the string converted from ASCII to hex in Notepad++.
Those characters do not match the ASCII or Unicode tables.
That looks like a viewer screen shot. PDF viewers do not use text extraction. They MUST use embedded fonts, accessed via CMap or Encoding. For embedded TrueType fonts with Identity-H, the character string contains GID values. No other possibility exists.
But you are talking about text extraction, so please compare with Acrobat's text extraction. For example, copy/paste text. Pay close attention to the encoding when you paste.
This is an Acrobat Reader screenshot.
Let's look at another example, searching for the text "1001" in the PDF (FENETRES gives the same result):
I did my own text extraction, and here is the output for the text "1001 FENETRES" in debug mode (the last column is CID + 0x1D):
TJ(<00140013001300140003>, 1, <0029>, -9, <0028>, 10, <0031>, 10.8, <00280037>, 1, <00350028>, 10, <0036>)
0014 => " " => "1"
0013 => " " => "0"
0013 => " " => "0"
0014 => " " => "1"
0003 => " " => " "
0029 => ")" => "F"
0028 => "(" => "E"
0031 => "1" => "N"
0028 => "(" => "E"
0037 => "7" => "T"
0035 => "5" => "R"
0028 => "(" => "E"
0036 => "6" => "S"
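The debug table above can be reproduced with a short sketch like the one below. It takes the TJ operands as plain text, keeps the hex strings, ignores the kerning numbers, and applies the 0x1D shift. The function names are illustrative, and the shift is only the empirical offset seen in this file, not a general rule.

```python
import re


def tj_codes(tj_operands: str) -> list[int]:
    """Extract 2-byte codes from the hex strings of a TJ array, skipping kerning values."""
    codes: list[int] = []
    for hex_run in re.findall(r"<([0-9A-Fa-f]+)>", tj_operands):
        codes += [int(hex_run[i:i + 4], 16) for i in range(0, len(hex_run), 4)]
    return codes


def shift_decode(codes: list[int], offset: int = 0x1D) -> str:
    """Apply the offset observed in this one PDF (an assumption, not a PDF rule)."""
    return "".join(chr(code + offset) for code in codes)


operands = ("<00140013001300140003>, 1, <0029>, -9, <0028>, 10, <0031>, "
            "10.8, <00280037>, 1, <00350028>, 10, <0036>")
decoded = shift_decode(tj_codes(operands))
print(decoded)  # 1001 FENETRES
```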
Or the end of the screenshot text (below the dialog):
TJ(<00240033>, 2, <0028>, -16, <0003>, -3, <0017>, -16, <001a>, -16, <0018>, -16, <0015>, -16, <0025>, -27, <0003>, -3, <0010>, -9, <0003>, -3, <0037>, -14, <0039>, -39, <00240003>, 17, <0029>, -5, <0035>, -3, <001a>, -16, <0017>, -16, <0003>, -3, <0018>, -16, <0014>, -16, <0018>, -16, <0003>, -3, <0013>, -27, <001b>, -16, <001c>, -16, <0003>, 17, <0019>, -16, <001a>)
0024 => "$" => "A"
0033 => "3" => "P"
0028 => "(" => "E"
0003 => " " => " "
0017 => " " => "4"
001a => " " => "7"
0018 => " " => "5"
0015 => " " => "2"
0025 => "%" => "B"
0003 => " " => " "
0010 => " " => "-"
0003 => " " => " "
0037 => "7" => "T"
0039 => "9" => "V"
0024 => "$" => "A"
0003 => " " => " "
0029 => ")" => "F"
0035 => "5" => "R"
001a => " " => "7"
0017 => " " => "4"
0003 => " " => " "
0018 => " " => "5"
0014 => " " => "1"
0018 => " " => "5"
0003 => " " => " "
0013 => " " => "0"
001b => " " => "8"
001c => " " => "9"
0003 => " " => " "
0019 => " " => "6"
001a => " " => "7"
All the text on the page corresponds to the CIDs in the hex strings + 0x1D.
You told me :
"For embedded TrueType fonts with Identity-H, the character string contains GID values. No other possibility exists."
I want to believe that. But then how does this PDF work?
I can upload it if necessary.
Did you try the method I described: display characters from the embedded font using the GID? You must use the embedded font; it probably uses different GIDs than the original.
Your FIND shows Reader cannot extract the text. This is common and should be your first test. If Reader cannot extract the text, NOR CAN YOU. Adobe has 20 years of experience in this.
When you say "display characters from the embedded font using GID", do you mean looking for the "cmap" inside the font, or the "glyf", or something else? The embedded font is a partial (subset) one.
I didn't try this before because this is the first time I've encountered this kind of embedded partial font with no mapping.
The only way I see to figure this out is to use an external OCR program (or to parse the 'glyf' and 'fpgm' instructions from the font and OCR the result).
I tested several online OCR tools and they seem to work well. With that, I just have to find a way to map their output to the embedded font's CIDs, and then I can extract the correct ASCII / Unicode character codes.
You must be very clear whether you want to extract text or display text. These are completely different logic, with different results.
To display pages you must use the glyphs in the font. The order of characters could be random.
Yes, OCR of glyphs is a theoretical possibility for extraction. But how will you know it is necessary?
Of course, I'm aware of the difference, don't worry.
For now, I "just" want to extract the text from PDF files. Later, I'd like to convert PDF files into another document format.
I think I need to OCR the glyphs when I don't have any cmap, whether in the PDF or in the font file.
Maybe the fastest way would be to create another document containing each glyph of the font and call an external OCR program.
That would give me the order of the glyphs and therefore their CIDs... To be tested...
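A rough sketch of how that plan could be wired together, under several assumptions: the OCR step itself is left out, `ocr_result` is a made-up stand-in for whatever an external OCR tool would return per glyph, and CID is taken to equal GID because the CIDFont dictionary above shows no /CIDToGIDMap entry. The function names are hypothetical.

```python
def needs_glyph_ocr(tounicode: dict[int, str], font_tables: list[str]) -> bool:
    """OCR is the fallback only when neither the PDF nor the font offers a mapping."""
    return not tounicode and "cmap" not in font_tables


def decode_with_glyph_map(codes: list[int], glyph_to_text: dict[int, str],
                          missing: str = "\ufffd") -> str:
    """Translate codes through an OCR-derived glyph map, marking unknown glyphs."""
    return "".join(glyph_to_text.get(code, missing) for code in codes)


font_tables = ["cvt ", "fpgm", "glyf", "head", "hhea",
               "hmtx", "loca", "maxp", "name", "prep"]
print(needs_glyph_ocr({}, font_tables))  # True

# Hypothetical OCR output for three glyphs of the subset font.
ocr_result = {0x14: "1", 0x13: "0", 0x03: " "}
text = decode_with_glyph_map([0x14, 0x13, 0x13, 0x14, 0x03, 0x29], ocr_result)
print(repr(text))
```

Glyphs the OCR step did not recognise come out as U+FFFD, which makes gaps in the mapping easy to spot in the extracted text.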
Thank you for everything