PDTextSelect object not returning special characters such as (≤ Ω Β ∞ ≠ ≥)

July 20, 2018
11 replies
1866 views

When I extract the selected text, some special characters such as (≤ Ω Β ∞ ≠ ≥) are displayed as junk characters.

Please suggest how I can get all the characters.

Thanks.

Acrobat SDK and JavaScript

Test Screen Name

Legend

So you are using an API after all. cBMCTreeCtrl appears to be a custom C++ class. You must refer to the documentation or code of that class to see whether you can pass it a Unicode string. You will NOT be able to use the same interface you used to set single-byte text. I think you still have not studied Unicode or the concept of encodings. This is vital for you.

However, I do have two suggestions on your next step.

1. Try adding the constant string "≤ Ω Β ∞ ≠ ≥" directly to your tree control, without using Acrobat. You may find this complex or impossible to solve, but once it is solved you will be ready to work with strings from Acrobat.

2. Also, copy and paste text from this same PDF into Word. Examine the Word document and make sure the characters ≤ Ω Β ∞ ≠ ≥ appear as you wish. If they do not appear correctly in Word, you certainly cannot extract them from the PDF.

Vijaykumar L S (Author)

Known Participant

In both cases, I got the expected result.

I need to work on the Acrobat string.

Test Screen Name

Legend

I do not understand what you mean by "no API". Please help us understand in detail what you are doing. I know the tree view control as something used through the .NET API and also via the MS Access API.

Vijaykumar L S (Author)

Known Participant

The following code is used to create the tree view control.

cBMCTreeCtrl* m_BKMCTreeCtrl = new cBMCTreeCtrl();
m_BKMCTreeCtrl->Create(nTreeStyles, crTreeView, this, IDC_BKMCreatorTreeView);
m_BKMCTreeCtrl->ShowWindow(SW_SHOW);

But the junk characters already appear when I get the text from the ASText object.

Test Screen Name

Legend

OK, which API are you using for the tree view control? Some of them are not Unicode-aware, so this is impossible with them. Others may accept a Unicode string, but you must change your code.

Vijaykumar L S (Author)

Known Participant

I extracted all the words through PDWordFinder and added them to a list,

then inserted the words from the list directly into the tree view control.

I have not used any API to insert words into the tree view control.

Vijaykumar L S (Author)

Known Participant

First I get an ASText object from the PDWord.

Then, when I get the text from the ASText object, it already shows junk characters.

I think I need to do something while extracting the string from the ASText object.

Test Screen Name

Legend

Please answer my question about intended use.

Vijaykumar L S (Author)

Known Participant

I want to extract words from the PDF page using the word finder and display them in the tree view control.

During extraction, all words come out correctly except for the special characters they contain.

In place of the special characters, junk characters are displayed, like (.,8,O).

I want to know how to display words with special characters.

Test Screen Name

Legend

So, what is your host encoding? Does your host encoding include the characters ≤ Ω Β ∞ ≠ ≥? Tip: mine does not. I would have to work in Unicode. I think you are assuming something impossible will work.

What is your aim for these characters? Please list all the ways you need them to work (for example: only in one message that is popped up using the Windows MessageBox function). Please be detailed.

Test Screen Name

Legend

Does it give the expected answer?

If not, what encoding do you choose when you convert the ASText?

Vijaykumar L S (Author)

Known Participant

It shows some other format...

I used the following code to encode the ASText.

ASText nextWordASText = ASTextNew();
PDWordGetASText(nextWord, 0, nextWordASText);
CString TextObjectTxt = ASTextGetEncoded(nextWordASText, ASTextGetBestEncoding(nextWordASText, (ASHostEncoding)PDGetHostEncoding()));

Thanks..

lrosenth

Adobe Employee

Your problem is that you're trying to convert from what is most likely Unicode (UTF-8 or UTF-16) to the host encoding (probably ISO 8859-1), which may not include those characters. And even if it does, are you then trying to display them in a font that doesn't include those glyphs?

Instead of using HostEncoding, use UTF8Encoding, as that will give you back a byte string in which the ASCII characters are unchanged and the extra characters show up as multi-byte UTF-8 sequences, so you can see where they are.
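To make the difference concrete, here is a minimal standalone C++ sketch. It does not use the Acrobat SDK: toLatin1 and toUtf8 are hypothetical helpers written only for illustration, and the input is assumed to be a list of valid Unicode code points. It shows that a single-byte encoding such as ISO 8859-1 has no byte for ≤ (U+2264) or Ω (U+03A9), so a substitute has to be emitted, while UTF-8 keeps every character as a multi-byte sequence.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: encode code points as ISO 8859-1 (Latin-1).
// Anything above U+00FF has no Latin-1 byte, so '?' is written instead.
// This kind of substitution is what ends up looking like junk text.
std::string toLatin1(const std::vector<char32_t>& codePoints) {
    std::string out;
    for (char32_t cp : codePoints) {
        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    }
    return out;
}

// Hypothetical helper: encode code points as UTF-8.
// Assumes the input holds valid Unicode scalar values (no validation).
std::string toUtf8(const std::vector<char32_t>& codePoints) {
    std::string out;
    for (char32_t cp : codePoints) {
        if (cp < 0x80) {                      // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With these helpers, toLatin1({0x2264}) yields "?", whereas toUtf8({0x2264}) yields the three bytes E2 89 A4. Whatever displays the result still has to interpret those bytes as UTF-8 and use a font that contains the glyphs.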

Vijaykumar L S (Author)

Known Participant

Hi Test Screen,

I am using PDDocCreateWordFinderUCS to create the word finder,

then I get the ASText from each PDWord,

then I encode the ASText to get all the characters.

Is this the correct approach?

Vijaykumar L S (Author)

Known Participant

Thank you for the reply...

I will look into the concept of Unicode encoding...

Vijaykumar L S (Author)

Known Participant

Thanks for the reply...

How do I extract special characters using a PDWordFinder object?

When I extract "≤ Ω Β ∞ ≠ ≥", it is displayed as " = O . 8 . = ".

I am using the following code...

ACCB1 ASBool ACCB2 SearchTextBasedonSelectedFont(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void* clientData)
{
    CString csText;
    CString csColor, csBlueText;
    COLORREF wordColor = 0;
    char buf[256];
    bool NonAlphaNum = false, LeadingPunc = false, LeadingSpace = false;
    ASInt32 liStyleIndex = 0;
    bool bcolorValue = false;
    static int FirstOccurence = 0;
    PDStyle pdWordStyle;
    PDColorValueRec color;
    color.space = PDDeviceRGB;
    color.value[0] = color.value[1] = color.value[2] = color.value[3] = fixedZero;
    ASFixedQuad quad;
    ASInt16 llAttr, liNumQuads;
    long llCurSequence;
    llCurSequence = ++(*(long*)clientData);

    try
    {
        // Get the word color and font, but only when a style is available
        if ((pdWordStyle = PDWordGetNthCharStyle(wObj, wInfo, liStyleIndex)) != NULL)
        {
            PDStyleGetColor(pdWordStyle, &color);
            PDStyleGetFont(pdWordStyle);
        }

        // Get the word into the byte buffer
        PDWordGetString(wInfo, buf, 256);
        csText = buf;

        PDStyle aoPDWordStyle = PDWordGetNthCharStyle(wObj, wInfo, 0);
        liNumQuads = PDWordGetNumQuads(wInfo);
    }
    catch (...)
    {
        return false; // stop the enumeration on error
    }
    return true; // continue with the next word
}

Thanks..

Test Screen Name

Legend

If you use the UCS word finder you will get a Unicode word string. This must be treated as an array of WCHAR. You NEED to understand Unicode encoding.
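As a rough illustration of treating the result as 16-bit units rather than single bytes, here is a standalone C++ sketch that does not call the Acrobat SDK. The helper name utf16FromBigEndianBytes is hypothetical, and the big-endian byte order is an assumption that should be checked against the SDK documentation for the call actually used.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the Acrobat SDK): rebuild 16-bit code
// units from a raw byte buffer.
// ASSUMPTION: the bytes are UTF-16 in big-endian order (high byte first).
// A trailing odd byte, if any, is ignored.
std::u16string utf16FromBigEndianBytes(const char* bytes, std::size_t byteLen) {
    std::u16string out;
    out.reserve(byteLen / 2);
    for (std::size_t i = 0; i + 1 < byteLen; i += 2) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned hi = static_cast<unsigned char>(bytes[i]);
        const unsigned lo = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>((hi << 8) | lo));
    }
    return out;
}
```

For example, the four bytes 22 64 03 A9 become the two code units U+2264 (≤) and U+03A9 (Ω). Assigning the same buffer to a narrow string would instead produce four unrelated single-byte characters, and would stop early at any zero byte.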

Vijaykumar L S (Author)

Known Participant

TextSelect = AVPageViewTrackText(pageView, xHit, yHit, NULL);
PDDoc pdDoc = AVDocGetPDDoc(AVAppGetActiveDoc());
PDPage pdPage = AVPageViewGetPage(pageView);
int iPage = PDPageGetNumber(pdPage);
BKMCRot = PDPageGetRotate(pdPage);

if (TextSelect != NULL)
{
    PDTextSelectEnumText(TextSelect, ASCallbackCreateProto(PDTextSelectEnumTextProc, BKMCTextEnumProc), NULL);
    PDTextSelectEnumQuads(TextSelect, ASCallbackCreateProto(PDTextSelectEnumQuadProc, BKMCTextEnumQuadProc), NULL);
    AVPageViewHighlightText(pageView, TextSelect);
    ASBool bselection = AVDocSetSelection(AVAppGetActiveDoc(), ASAtomFromString("BMCreatorText"), TextSelect, true);
}

ACCB1 ASBool ACCB2 BKMCTextEnumProc(void* procObj, PDFont pdFont, ASFixed size, PDColorValue Color, char *buff, ASInt32 asLen)
{
    // Font size
    int iFontSize = FixedRoundToInt16(size);
    csFontSize.Format(L"%d", iFontSize);

    // Text color
    long Textcolor = CPDFLink::GetRGBFromPDColor(*Color);
    long lRValue, lGValue, lBValue;
    CString csRVal, csGVal, csBVal;
    CColor::COLORREFToRGB(Textcolor, lRValue, lGValue, lBValue);
    csRVal.Format(L"%ld", lRValue);
    csGVal.Format(L"%ld", lGValue);
    csBVal.Format(L"%ld", lBValue);
    csColorValue = L"R=" + csRVal + L" G=" + csGVal + L" B=" + csBVal;

    // Font name
    char fontNameBuf[PSNAMESIZE];
    PDFontGetName(pdFont, fontNameBuf, PSNAMESIZE);
    csFontname = (CString)fontNameBuf;

    // For multiple words we need to append each time, skipping CR and LF.
    CString csChar;
    for (int iIndex = 0; iIndex < asLen; iIndex++)
    {
        char cBuff = buff[iIndex];
        if (cBuff != 13 && cBuff != 10)
        {
            csChar = cBuff;
            csBKMCKeyword += csChar;
        }
    }
    return true;
}

lrosenth

Adobe Employee

Two things…

1 – Use PDTextSelectEnumTextUCS as that will return UCS (aka Unicode) encoded information so that you will be sure to get all text in a standardized fashion

2 – Be careful with CString as (IIRC) it's not great for arbitrary encodings.
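One practical consequence for a callback like the one posted above: once the text arrives as 16-bit units, the CR/LF filter has to compare whole code units, not individual bytes, because a byte such as 0x0A can be half of another character (U+220A, for instance). The following standalone C++ sketch illustrates this. It does not use the Acrobat SDK; appendWithoutLineBreaks is a hypothetical helper, and both the big-endian byte order and the length being a byte count are assumptions to verify against the SDK documentation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the accumulation loop of a UCS text callback.
// ASSUMPTIONS: `buff` holds big-endian 16-bit code units and `byteLen`
// counts bytes, not characters.
void appendWithoutLineBreaks(const char* buff, std::size_t byteLen,
                             std::u16string& keyword) {
    for (std::size_t i = 0; i + 1 < byteLen; i += 2) {
        const unsigned hi = static_cast<unsigned char>(buff[i]);
        const unsigned lo = static_cast<unsigned char>(buff[i + 1]);
        const char16_t unit = static_cast<char16_t>((hi << 8) | lo);
        // Compare the whole code unit so that only real CR/LF are dropped.
        if (unit != u'\r' && unit != u'\n') {
            keyword.push_back(unit);
        }
    }
}
```

std::u16string is used here only because it holds 16-bit units regardless of project settings. Whether a given CString can hold them depends on how the project is built, which is one reason to be careful with it.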
