Tagged PDF with ActualText: Erroneous text extraction?

December 25, 2017

I generated a tagged PDF in which ActualText is specified on Span marked-content sequences (see the code below). When I copy the text from this PDF, I get the first result shown below, with duplicated characters. However, when the PDF is not tagged (no StructTreeRoot key in the catalog), I get the second result shown below. In that case the only problem is an extra space, which should not be present since I specify all the spaces (word breaks) explicitly using ActualText.

Any help or advice would be appreciated!

******copied text from tagged PDF************

أَأََأ

لْلْ

فَفََف

ا

فًفًًف

ا

******copied text from normal PDF ************

أَ لْفَافًا

************PDF code***************

BT
1 0 0 -1 33200 3200 Tm
/P << /MCID 0 >> BDC
/Span << /ActualText <FEFF0623064E> >> BDC
/F150 2000 Tf
-456 0 Td <00> Tj
-10 1556 Td <01> Tj
-130 466 Td <02> Tj
EMC
/Span << /ActualText <FEFF06440652> >> BDC
-647 -2022 Td <03> Tj
-70 1648 Td <04> Tj
EMC
/Span << /ActualText <FEFF0641064E> >> BDC
-788 -1648 Td <05> Tj
318 908 Td <06> Tj
-204 380 Td <02> Tj
EMC
/Span << /ActualText <FEFF0627> >> BDC
-772 -1288 Td <07> Tj
EMC
/Span << /ActualText <FEFF0641064B> >> BDC
-552 0 Td <08> Tj
166 992 Td <06> Tj
-204 380 Td <09> Tj
EMC
/Span << /ActualText <FEFF0627> >> BDC
-620 -1372 Td <07> Tj
EMC
ET
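
************expected text (decoded ActualText)***************

For reference, here is a minimal Python sketch that decodes the six ActualText hex strings from the content stream above, assuming they are UTF-16BE text strings introduced by the FEFF byte-order mark. It only shows the text I would expect a viewer to copy; it does not reproduce any viewer's actual extraction logic. The helper name decode_actual_text is just illustrative.

```python
def decode_actual_text(hex_string: str) -> str:
    """Decode a PDF hex string that holds UTF-16BE text with a FEFF BOM."""
    raw = bytes.fromhex(hex_string)
    if not raw.startswith(b"\xfe\xff"):
        # Strings without the BOM use a different encoding; not handled in this sketch.
        raise ValueError("expected a UTF-16BE string starting with FEFF")
    return raw[2:].decode("utf-16-be")


# The ActualText values, in the order they appear in the content stream.
actual_text_spans = [
    "FEFF0623064E",
    "FEFF06440652",
    "FEFF0641064E",
    "FEFF0627",
    "FEFF0641064B",
    "FEFF0627",
]

expected = "".join(decode_actual_text(s) for s in actual_text_spans)
code_points = [f"U+{ord(ch):04X}" for ch in expected]

print(expected)
print(" ".join(code_points))
```

The concatenated result has ten code points and contains no U+0020, so neither the duplicated characters nor the extra space comes from the ActualText values themselves.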
