Tagged PDF with ActualText: erroneous text extraction?
I generated a tagged PDF in which each glyph cluster is wrapped in a Span marked-content sequence carrying an ActualText entry (see the code below). When I copy text from it, I get the first output shown below, with duplicated characters. However, when the PDF is not tagged (no StructTreeRoot key in the catalog), I get the second output. In that case the only problem is one extra space, which should not be there, since I specify all spaces (word breaks) explicitly through ActualText.
Any help or advice would be appreciated!
******copied text from tagged PDF************
أَأََأ
لْلْ
فَفََف
ا
فًفًًف
ا
******copied text from normal PDF ************
أَ لْفَافًا
************PDF code***************
BT
1 0 0 -1 33200 3200 Tm
/P << /MCID 0 >>
BDC
/Span << /ActualText <FEFF0623064E> >>
BDC
/F150 2000 Tf
-456 0 Td <00> Tj
-10 1556 Td <01> Tj
-130 466 Td <02> Tj
EMC
/Span << /ActualText <FEFF06440652> >>
BDC
-647 -2022 Td <03> Tj
-70 1648 Td <04> Tj
EMC
/Span << /ActualText <FEFF0641064E> >>
BDC
-788 -1648 Td <05> Tj
318 908 Td <06> Tj
-204 380 Td <02> Tj
EMC
/Span << /ActualText <FEFF0627> >>
BDC
-772 -1288 Td <07> Tj
EMC
/Span << /ActualText <FEFF0641064B> >>
BDC
-552 0 Td <08> Tj
166 992 Td <06> Tj
-204 380 Td <09> Tj
EMC
/Span << /ActualText <FEFF0627> >>
BDC
-620 -1372 Td <07> Tj
EMC
EMC
ET
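
For reference, here is a minimal Python sketch (separate from the PDF itself) that decodes the six ActualText strings used above, to show the text I expect a viewer to copy. It assumes every string is UTF-16BE with a leading FEFF byte order mark, as in my content stream. The helper name decode_actual_text is only illustrative.

```python
import codecs

def decode_actual_text(hex_string: str) -> str:
    """Decode a PDF hex string such as 'FEFF0623064E' into Python text.

    Assumption: the string starts with the UTF-16BE byte order mark (FE FF).
    Strings without it are rejected rather than guessed at.
    """
    raw = bytes.fromhex(hex_string)
    if not raw.startswith(codecs.BOM_UTF16_BE):
        raise ValueError("expected a UTF-16BE string starting with FEFF")
    return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")

# The ActualText values of the six Span sequences, in content-stream order.
SPANS = [
    "FEFF0623064E",
    "FEFF06440652",
    "FEFF0641064E",
    "FEFF0627",
    "FEFF0641064B",
    "FEFF0627",
]

expected = "".join(decode_actual_text(s) for s in SPANS)

# Print the code points so the result is readable regardless of font support.
print(" ".join(f"U+{ord(c):04X}" for c in expected))
```

The concatenation contains ten code points and no space, so neither the duplicated characters in the tagged case nor the extra space in the untagged case come from the ActualText values themselves.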
