Vijaykumar L S
Known Participant
July 20, 2018
Question

PDTextSelect object not returning special characters such as ≤ Ω Β ∞ ≠ ≥

When I extract the selected text, some special characters such as ≤ Ω Β ∞ ≠ ≥ are displayed as junk characters.

Please suggest how I can get all of the characters.

Thanks.

11 replies

Legend
July 26, 2018

So, that is an API you are using! CBMCTreeCtrl appears to be a custom C++ API. You must refer to the documentation or code of that API to see whether you can pass it a Unicode string. You will NOT be able to use the same interface you used to set one-byte text. I think you still have not studied Unicode or the concept of encodings. This is vital for you.

However, I do have two suggestions on your next step.

1. Try adding the constant string "≤ Ω Β ∞ ≠ ≥" directly to your tree control, without using Acrobat. You may find this complex or impossible to solve, but once it is solved you will be ready to work with strings from Acrobat.

2. Also, copy and paste text from this same PDF into Word. Examine the Word document and make sure the characters ≤ Ω Β ∞ ≠ ≥ appear as you wish. If they do not appear in Word, you certainly cannot extract them from the PDF.
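Suggestion 1 can be tried without any Acrobat code. The sketch below is portable C++ and only builds the constant string and dumps its code units. The function names are hypothetical, and it assumes the third character is the Greek capital beta (U+0392). Passing the string to the tree control is left out, because it depends on whether the control's class accepts wide strings.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// The six characters from the question, written as universal character
// names so the source file's own encoding cannot corrupt them.
// Assumption: the third character is GREEK CAPITAL LETTER BETA (U+0392).
inline std::u16string sampleSpecialChars() {
    return u"\u2264 \u03A9 \u0392 \u221E \u2260 \u2265";
}

// Renders each UTF-16 code unit as "U+XXXX" so the content can be checked
// even when the display font or control cannot draw the characters.
inline std::string describeCodeUnits(const std::u16string& text) {
    std::string out;
    char buf[8];
    for (char16_t unit : text) {
        std::snprintf(buf, sizeof buf, "U+%04X", static_cast<unsigned>(unit));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

inline void sampleUsage() {
    const std::u16string text = sampleSpecialChars();
    assert(text.size() == 11);  // 6 symbols + 5 separating spaces
    assert(describeCodeUnits(u"\u2264 \u03A9") == "U+2264 U+0020 U+03A9");
}
```

If the dump shows the expected code points but the tree control still shows junk, the problem lies in how the string is handed to the control or in the font, not in the string itself.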

Vijaykumar L S
Known Participant
July 26, 2018

In both cases, I got the expected result.

I need to work with the Acrobat string.

Legend
July 26, 2018

I do not understand what you mean by no API. Please help us understand in detail what you are doing. I know the tree view controller as something used through the .NET API and also via the MS Access API.

Vijaykumar L S
Known Participant
July 26, 2018

The following code is used to create the tree view control.

cBMCTreeCtrl* m_BKMCTreeCtrl = new cBMCTreeCtrl();
UINT nTreeStyles = WS_CHILD | WS_VISIBLE | WS_TABSTOP | WS_BORDER | TVS_LINESATROOT | TVS_HASLINES | TVS_HASBUTTONS | TVS_EDITLABELS | TVS_EX_MULTISELECT;
m_BKMCTreeCtrl->Create(nTreeStyles, crTreeView, this, IDC_BKMCreatorTreeView);
m_BKMCTreeCtrl->ShowWindow(SW_SHOW);

But when I get the text from the ASText object, it already shows junk characters at that point.

Legend
July 26, 2018

OK, which API are you using for the tree view control? Some of them are not Unicode-aware, so this is impossible for them. Others may accept a Unicode string, but you must change your code.

Vijaykumar L S
Known Participant
July 26, 2018

I extracted all the words through PDWordFinder and added them to a list,

then inserted the words from the list directly into the tree view control.

I have not used any API to insert the words into the tree view control.

Vijaykumar L S
Known Participant
July 26, 2018

First I get the ASText object from the PDWord.

Then, when I get the text from the ASText object, it already shows junk characters at that point.

I think I need to do something while extracting the string from the ASText object.

Legend
July 26, 2018

Please answer my question about intended use.

Vijaykumar L S
Known Participant
July 26, 2018

I want to extract words from the PDF page using the word finder and display them in the tree view control.

While extracting, all words come out correctly except for the special characters included in a word.

In place of the special characters, junk characters such as (., 8, O) are displayed.

I want to know how to display words with special characters.

Legend
July 25, 2018

So, what is your host encoding? Does your host encoding include the characters ≤ Ω Β ∞ ≠ ≥? Tip: mine does not. I would have to work in Unicode. I think you are assuming something impossible will work.

What is your aim for these characters: please list all the ways you need them to work? (For example: only in one message that is popped up using the Windows MessageBox function) Please be detailed.
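The host-encoding question can be made concrete with a small sketch. It assumes a Latin-1 style one-byte host encoding purely for illustration (the thread never states the actual host encoding), and `toLatin1Lossy` is a hypothetical helper, not an Acrobat or Windows call. Any character above U+00FF has no byte in such an encoding and is replaced.

```cpp
#include <cassert>
#include <string>

// Converts UTF-16 text to a one-byte Latin-1 string.
// Code units above U+00FF have no Latin-1 byte and become '?'.
// `lostData` (optional) reports whether any character was replaced.
inline std::string toLatin1Lossy(const std::u16string& text, bool* lostData) {
    std::string out;
    bool lost = false;
    for (char16_t unit : text) {
        if (unit <= 0xFF) {
            out.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
        } else {
            out.push_back('?');
            lost = true;
        }
    }
    if (lostData != nullptr) *lostData = lost;
    return out;
}

inline void lossyConversionExample() {
    bool lost = false;
    // 'x', LESS-THAN OR EQUAL TO (U+2264), 'y'
    const std::string narrow = toLatin1Lossy(u"x\u2264y", &lost);
    assert(narrow == "x?y");
    assert(lost);
}
```

The "= O . 8 . =" output reported elsewhere in the thread looks like a similar lossy mapping that substitutes look-alike characters instead of '?', though that is an inference and not something the thread confirms.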

Legend
July 25, 2018

Does it give the expected answer?

If not, what encoding do you choose when you convert the ASText?

Vijaykumar L S
Known Participant
July 25, 2018

It shows some other format.

I used the following code to encode the ASText.

ASText nextWordASText = ASTextNew();
PDWordGetASText(nextWord, 0, nextWordASText);
CString TextObjectTxt = ASTextGetEncoded(nextWordASText, ASTextGetBestEncoding(nextWordASText, (ASHostEncoding)PDGetHostEncoding()));

Thanks..

lrosenth
Adobe Employee
July 25, 2018

Your problem is that you are trying to convert from what is most likely Unicode (UTF-8 or UTF-16) to the host encoding (probably ISO 8859-1), which may not include those characters. And even if it does, are you then trying to display them in a font that doesn’t include those glyphs?

Instead of using HostEncoding, use UTF8Encoding, as that will give you back an ASCII string which will show you where the extra characters are (via the UTF8 escaping).
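To illustrate what a UTF-8 result looks like, here is a minimal sketch in portable C++. It does not call the Acrobat SDK. Both helper names are hypothetical, and the encoder assumes the input contains no surrogate pairs, which holds for the six symbols in this thread. ASCII characters pass through unchanged, while each other character becomes a multi-byte sequence that is easy to spot once escaped.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Encodes BMP-only UTF-16 text as UTF-8 (1 to 3 bytes per code unit).
// Assumption: the input has no surrogate pairs.
inline std::string bmpToUtf8(const std::u16string& text) {
    std::string out;
    for (char16_t unit : text) {
        const unsigned cp = unit;
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Renders UTF-8 bytes as printable ASCII: bytes below 0x80 are copied,
// every other byte is written as \xNN.
inline std::string escapeNonAscii(const std::string& utf8) {
    std::string out;
    char buf[8];
    for (unsigned char byte : utf8) {
        if (byte < 0x80) {
            out.push_back(static_cast<char>(byte));
        } else {
            std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(byte));
            out += buf;
        }
    }
    return out;
}

inline void utf8Example() {
    // U+2264 encodes as E2 89 A4 and U+03A9 as CE A9.
    assert(escapeNonAscii(bmpToUtf8(u"a\u2264")) == "a\\xE2\\x89\\xA4");
    assert(escapeNonAscii(bmpToUtf8(u"\u03A9")) == "\\xCE\\xA9");
}
```

Dumping the extracted word this way shows whether the special characters survived extraction, independently of the control or font used to display them.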

Vijaykumar L S
Known Participant
July 25, 2018

Hi Test Screen,

I am using PDDocCreateWordFinderUCS for the word finder.

Then I get the ASText from each PDWord.

Then I encode the ASText to get all of the characters.

Is this the correct approach?

Vijaykumar L S
Known Participant
July 24, 2018

Thank you for the reply.

I will look into the concept of Unicode encoding.

Vijaykumar L S
Known Participant
July 24, 2018

Thanks for the reply.

How do I extract special characters using the PDWordFinder object?

When I extract "≤ Ω Β ∞ ≠ ≥", it is displayed as "= O . 8 . =".

I am using the following code:

ACCB1 ASBool ACCB2 SearchTextBasedonSelectedFont(PDWordFinder wObj, PDWord wInfo, ASInt32 pgNum, void* clientData)
{
    CString csText;// = "";
    CString csColor, csBlueText;
    COLORREF wordColor = NULL;
    char buf[256];
    bool NonAlphaNum = false, LeadingPunc = false, LeadingSpace = false;
    ASInt32 liStyleIndex;
    bool bcolorValue = false;
    static int FirstOccurence = 0;
    PDStyle pdWordStyle;
    PDColorValueRec color;
    color.space = PDDeviceRGB;
    color.value[0] = color.value[1] = color.value[2] = color.value[3] = fixedZero;
    liStyleIndex = 0;
    ASFixedQuad quad;
    ASInt16 llAttr, liNumQuads;
    long llCurSequence;
    llCurSequence = ++(*(long*)clientData);
    try
    {
        //Get the word color
        if ((pdWordStyle = PDWordGetNthCharStyle(wObj, wInfo, liStyleIndex)) != NULL)
        {
            PDStyleGetColor(pdWordStyle, &color);
        }
        // To get the word in buffer
        PDWordGetString(wInfo, buf, 256);
        csText = buf;
        PDStyleGetFont(pdWordStyle);
        PDStyle aoPDWordStyle = PDWordGetNthCharStyle(wObj, wInfo, 0);
        liNumQuads = PDWordGetNumQuads(wInfo);
    }
}

Thanks..

Legend
July 24, 2018

If you use the UCS word finder, you will get a Unicode word string. This must be treated as an array of WCHAR. You NEED to understand Unicode encoding.
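The reason a Unicode word string cannot go through `char`-based code can be shown in a few lines of portable C++. This sketch uses `char16_t` as a stand-in for a 16-bit WCHAR, and `utf16Length` is a hypothetical helper. Two-byte text contains zero bytes inside ordinary characters, so byte-oriented functions such as `strlen` stop almost immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts 16-bit code units up to (not including) the 16-bit zero terminator.
inline std::size_t utf16Length(const char16_t* text) {
    std::size_t n = 0;
    while (text[n] != 0) ++n;
    return n;
}

inline void wideStringExample() {
    const char16_t word[] = u"A\u2264B";  // 3 code units + terminator
    assert(utf16Length(word) == 3);

    // Viewed as bytes, the same buffer has a zero byte inside 'A'
    // (41 00 or 00 41 depending on byte order), so strlen stops early.
    const char* bytes = reinterpret_cast<const char*>(word);
    assert(std::strlen(bytes) <= 1);
}
```

The same effect applies to assigning such a buffer to a narrow string type: the text is either cut short or each byte is misread as its own character.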

Vijaykumar L S
Known Participant
July 20, 2018

TextSelect = AVPageViewTrackText(pageView, xHit, yHit, NULL);
PDDoc pdDoc = AVDocGetPDDoc(AVAppGetActiveDoc());
PDPage pdPage = AVPageViewGetPage(pageView);
int iPage = PDPageGetNumber(pdPage);
BKMCRot = PDPageGetRotate(pdPage);
if (TextSelect != NULL)
{
    PDTextSelectEnumText(TextSelect, ASCallbackCreateProto(PDTextSelectEnumTextProc, BKMCTextEnumProc), NULL);
    PDTextSelectEnumQuads(TextSelect, ASCallbackCreateProto(PDTextSelectEnumQuadProc, BKMCTextEnumQuadProc), NULL);
    AVPageViewHighlightText(pageView, TextSelect);
    ASBool bselection = AVDocSetSelection(AVAppGetActiveDoc(), ASAtomFromString("BMCreatorText"), TextSelect, true);
}

ACCB1 ASBool ACCB2  BKMCTextEnumProc(void* procObj, PDFont pdFont, ASFixed size, PDColorValue Color, char *buff, ASInt32 asLen)
{
    //for getting Font Size
    int iFontSize = FixedRoundToInt16(size);
    csFontSize.Format(L"%d", iFontSize);

    //For Getting Text Color
    long Textcolor = CPDFLink::GetRGBFromPDColor(*Color);
    long lRValue, lGValue, lBValue;
    CString csRVal, csGVal, csBVal;
    CColor::COLORREFToRGB(Textcolor, lRValue, lGValue, lBValue);
    csRVal.Format(L"%d", lRValue);
    csGVal.Format(L"%d", lGValue);
    csBVal.Format(L"%d", lBValue);
    csColorValue = (L"R=") + csRVal + (" G=") + csGVal + (" B=") + csBVal;

    //for Getting Font name
    char fontNameBuf[PSNAMESIZE];
    PDFontGetName(pdFont, fontNameBuf, PSNAMESIZE);
    csFontname = (CString)fontNameBuf;

    //For multiple words we need to add each time.
    CString csChar;
    for (int iIndex = 0; iIndex < asLen; iIndex++)
    {
        char cBuff = buff[iIndex];
        if (cBuff != 13 && cBuff != 10)
        {
            csChar = cBuff;
            csBKMCKeyword += csChar;
        }
    }
    buff = "";
    return true;
}

lrosenth
Adobe Employee
July 20, 2018

Two things…

1 – Use PDTextSelectEnumTextUCS, as that will return UCS (aka Unicode) encoded information, so you can be sure to get all text in a standardized fashion.

2 – Be careful with CString as (IIRC) it’s not great for arbitrary encodings.
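Taken together, the two points suggest replacing the byte-by-byte loop in the original callback with one that consumes two bytes per character. The sketch below is portable C++, not SDK code. `appendUcsChunk` is a hypothetical helper, and it assumes that the buffer holds big-endian UTF-16 code units and that the length is given in bytes. Neither assumption is confirmed in this thread, so check the SDK documentation for the UCS callback before relying on them.

```cpp
#include <cassert>
#include <string>

// Appends a chunk of 2-byte text to `out`, skipping CR and LF.
// Assumptions (not confirmed by the thread): the buffer holds big-endian
// UTF-16 code units and `lenBytes` is its length in bytes.
inline void appendUcsChunk(std::u16string& out, const char* buff, int lenBytes) {
    assert(lenBytes >= 0 && lenBytes % 2 == 0);
    for (int i = 0; i + 1 < lenBytes; i += 2) {
        const unsigned hi = static_cast<unsigned char>(buff[i]);
        const unsigned lo = static_cast<unsigned char>(buff[i + 1]);
        const char16_t unit = static_cast<char16_t>((hi << 8) | lo);
        if (unit != 0x000D && unit != 0x000A) {
            out.push_back(unit);
        }
    }
}

inline void appendExample() {
    // U+2264, CR, U+03A9 as big-endian UTF-16 bytes
    const char chunk[] = { '\x22', '\x64', '\x00', '\x0D', '\x03', '\xA9' };
    std::u16string keyword;
    appendUcsChunk(keyword, chunk, 6);
    assert(keyword == u"\u2264\u03A9");
}
```

Accumulating into a 16-bit string type keeps each character intact. In a Unicode MFC build, a wide CString could play the same role, provided it is filled with whole code units and not with single bytes.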