December 25, 2017
Question

Tagged PDF with ActualText: Erroneous text extraction?


I generated a tagged PDF in which the ActualText of each glyph cluster is defined on a Span marked-content sequence (see the code below). When I copy the text from this PDF, I get duplicated characters (first sample below). However, when the PDF is not tagged (no StructTreeRoot key in the catalog), I get the second sample below. In that case the only problem is an extra space, which should not be present, since I specify all the spaces (word breaks) explicitly using ActualText.

Any help or advice would be appreciated!

******copied text from tagged PDF************

أَأََأ

لْلْ

فَفََف

ا

فًفًًف

ا

******copied text from normal PDF ************

أَ لْفَافًا

************PDF code***************

BT

1 0 0 -1 33200 3200 Tm

/P << /MCID 0 >>

BDC

/Span << /ActualText <FEFF0623064E> >>

BDC

/F150 2000 Tf

-456 0 Td <00> Tj

-10 1556 Td <01> Tj

-130 466 Td <02> Tj

EMC

/Span << /ActualText <FEFF06440652> >>

BDC

-647 -2022 Td <03> Tj

-70 1648 Td <04> Tj

EMC

/Span << /ActualText <FEFF0641064E> >>

BDC

-788 -1648 Td <05> Tj

318 908 Td <06> Tj

-204 380 Td <02> Tj

EMC

/Span << /ActualText <FEFF0627> >>

BDC

-772 -1288 Td <07> Tj

EMC

/Span << /ActualText <FEFF0641064B> >>

BDC

-552 0 Td <08> Tj

166 992 Td <06> Tj

-204 380 Td <09> Tj

EMC

/Span << /ActualText <FEFF0627> >>

BDC

-620 -1372 Td <07> Tj

EMC

EMC

ET
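
For reference, the text a viewer would be expected to produce from the ActualText entries alone can be computed directly from the hex strings above. The following is only a minimal sketch, assuming Python 3; the helper names `decode_actual_text` and `extract_actual_texts` are hypothetical and not part of any PDF library. It treats each hex string as UTF-16BE with a leading FEFF byte order mark, which is how the strings in the content stream are written.

```python
import re

# Span property lists copied verbatim from the content stream in the post.
SPANS = """
/Span << /ActualText <FEFF0623064E> >>
/Span << /ActualText <FEFF06440652> >>
/Span << /ActualText <FEFF0641064E> >>
/Span << /ActualText <FEFF0627> >>
/Span << /ActualText <FEFF0641064B> >>
/Span << /ActualText <FEFF0627> >>
"""


def decode_actual_text(hex_digits):
    """Decode one hex string, assuming UTF-16BE with a FEFF byte order mark."""
    raw = bytes.fromhex(hex_digits)
    if not raw.startswith(b"\xfe\xff"):
        # Strings without a BOM use a different encoding; not handled in this sketch.
        raise ValueError("expected a UTF-16BE string starting with FEFF")
    return raw[2:].decode("utf-16-be")


def extract_actual_texts(stream):
    """Return the decoded ActualText of every hex-string entry, in stream order."""
    hex_strings = re.findall(r"/ActualText\s*<([0-9A-Fa-f]+)>", stream)
    return [decode_actual_text(h) for h in hex_strings]


pieces = extract_actual_texts(SPANS)
expected = "".join(pieces)

# Print the code points so the result can be compared with the copied text.
print(" ".join(f"U+{ord(ch):04X}" for ch in expected))
print(expected)
```

The concatenation contains ten code points and no space character, so both the duplicated characters in the tagged case and the extra space in the untagged case appear to come from the viewer's extraction logic rather than from the ActualText values themselves.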
