
Problem with searching special characters

New Here, Jan 19, 2017

Hello,

I have written a plugin to search text from the PDF document using search plugin HFT functions.

I can very well search the plain text without any special characters in it.

But when the search text contains special characters, it cannot find the whole search text.

Here is my code snippet:

ASText searchKey = ASTextFromScriptText(searchText, kASEUnicodeScript);

SearchQueryDataRec srcQuery;
memset(&srcQuery, 0, sizeof(SearchQueryDataRec));
srcQuery.size = sizeof(SearchQueryDataRec);
srcQuery.query = searchKey;
srcQuery.type = kSearchActiveDoc;
srcQuery.match = kMatchAllWords;
srcQuery.options = kWordOptionWholeWord | kSearchEveryWhere;
srcQuery.scope = kSearchDocumentText;
srcQuery.path = NULL;
srcQuery.fs = NULL;
srcQuery.maxDocs = 1;

ASBool ret = SearchExecuteQueryEx(&srcQuery);

ASTextDestroy(searchKey);

For example:

If I want to search for the highlighted text below, it is not found.

However, I can search for the text below without any problem:

Please let me know where I am going wrong.

Thanks.

TOPICS
Acrobat SDK and JavaScript

LEGEND, Jan 19, 2017

Three dots may be very troublesome. It might be three separate dot characters, or it might be a single ellipsis character. It is not clear what encoding you'd use in the second case.
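
The two spellings can be told apart at the byte level. A minimal sketch, assuming the search text is UTF-8 held in a std::string; classifyEllipsis and EllipsisForm are hypothetical names for illustration, not part of the Acrobat SDK:

```cpp
#include <string>

// Which spelling of an ellipsis a UTF-8 string contains.
enum class EllipsisForm { None, ThreeDots, SingleCharacter };

// Assumes `utf8` is UTF-8 encoded. The single-character form is checked first.
EllipsisForm classifyEllipsis(const std::string& utf8) {
    // U+2026 HORIZONTAL ELLIPSIS is the three bytes E2 80 A6 in UTF-8.
    if (utf8.find("\xE2\x80\xA6") != std::string::npos) {
        return EllipsisForm::SingleCharacter;
    }
    // Three U+002E FULL STOP characters are the bytes 2E 2E 2E.
    if (utf8.find("...") != std::string::npos) {
        return EllipsisForm::ThreeDots;
    }
    return EllipsisForm::None;
}
```

Both forms happen to occupy three bytes in UTF-8, so comparing lengths will not distinguish them; only the byte values do.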

New Here, Jan 19, 2017

What would be the encoding in both the cases?

I will try both.

Thanks.

LEGEND, Jan 19, 2017

In the first case it's clear what to use for a single dot.

New Here, Jan 20, 2017

Please tell me the correct encoding.

I am reading the search text from a file and then encoding it as below:

ASText searchKey = ASTextFromEncoded(searchText, kASEUnicodeScript);

While searching, Acrobat does not search correctly if there are special characters such as an ellipsis (...) or double quotes ("").

It does not encode them properly.

New Here, Jan 20, 2017

Any updates on this please?

LEGEND, Jan 20, 2017

Have you checked whether a normal manual search works for these cases?

LEGEND, Jan 20, 2017

And then, whether advanced search from the UI works? What specific characters do you need to use to make this work?

New Here, Jan 20, 2017

I tried manually with the text containing special characters. It searches properly.

But when I use the search plugin HFT functions to search for the same text, it opens a search panel on the left side and does not search.

The search panel is different from the one I normally use when searching manually.

I searched manually with Ctrl+F.

And what do you mean by advanced search? Is it the same as the left panel that I get when I run the plugin?

LEGEND, Jan 20, 2017

[corrected] Find and Search used to be completely different. But the difference has become more blurred. I have not used this API directly, but you say it works for other strings, is that right?  If so...

What text do you pass to the UI? Three dots or ellipsis?

If ellipsis, what Unicode value is in the string passed?

New Here, Jan 20, 2017

I just copied the text from PDF and searched the same.

I don't know about encoding.

LEGEND, Jan 20, 2017

You need to be concerned about these details if you are not working with low ASCII. Every character has multiple different ways to be represented; you have to take control. So, what is the Unicode character you are using for this position? Take a hex dump.

LEGEND, Jan 20, 2017

Hmm, exact contents of searchText as hex, that is.

New Here, Jan 23, 2017

You mean you want hex conversion of all the search text?

New Here, Jan 23, 2017

"So, what is the Unicode character you are using for this position? "

Which Unicode are you talking about? I am giving the input as a search text containing special characters to the search plugin and let it search. I did not get you in this regard. Can you please elaborate?

LEGEND, Jan 23, 2017

You are filling the data area called searchText. Please tell us the hex value of the entire data area. You really must be concerned with this detail; encodings are vital. End users can think copy/paste, programmers must go further!
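
Producing such a dump takes only a few lines. A minimal sketch, assuming the contents of searchText have been copied into a std::string; hexDump is a hypothetical helper name, not an SDK function:

```cpp
#include <cstdio>
#include <string>

// Returns the bytes of `data` as space-separated, lowercase, two-digit hex values.
std::string hexDump(const std::string& data) {
    std::string out;
    char pair[3];  // two hex digits plus the terminating NUL
    for (unsigned char byte : data) {
        if (!out.empty()) out += ' ';
        std::snprintf(pair, sizeof pair, "%02x", static_cast<unsigned>(byte));
        out += pair;
    }
    return out;
}
```

The output can be logged or shown in an alert, and any byte of 80 or above signals that the text is not plain low ASCII.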

New Here, Jan 23, 2017

Okay.

Here is the hex dump of the text:

Text: The tool will be integrated in the following environment : […]

Hex code : 54 68 65 20 74 6f 6f 6c 20 77 69 6c 6c 20 62 65 20 69 6e 74 65 67 72 61 74 65 64 20 69 6e 20 74 68 65 20 66 6f 6c 6c 6f 77 69 6e 67 20 65 6e 76 69 72 6f 6e 6d 65 6e 74 20 3a 20 5b e2 80 a6 5d

One more text with special characters:

Text: The tool shall be able to capture semi-automatically the requirements included in a document and/or in a model. “Semi-automatic” means the text has to be formalized beforehand by the user or another dedicated tool.

Hex code:

54 68 65 20 74 6f 6f 6c 20 73 68 61 6c 6c 20 62 65 20 61 62 6c 65 20 74 6f 20 63 61 70 74 75 72 65 20 73 65 6d 69 2d 61 75 74 6f 6d 61 74 69 63 61 6c 6c 79 20 74 68 65 20 72 65 71 75 69 72 65 6d 65 6e 74 73 20 69 6e 63 6c 75 64 65 64 20 69 6e 20 61 20 64 6f 63 75 6d 65 6e 74 20 61 6e 64 2f 6f 72 20 69 6e 20 61 20 6d 6f 64 65 6c 2e 20 e2 80 9c 53 65 6d 69 2d 61 75 74 6f 6d 61 74 69 63 e2 80 9d 20 6d 65 61 6e 73 20 74 68 65 20 74 65 78 74 20 68 61 73 20 74 6f 20 62 65 20 66 6f 72 6d 61 6c 69 7a 65 64 20 62 65 66 6f 72 65 68 61 6e 64 20 62 79 20 74 68 65 20 75 73 65 72 20 6f 72 20 61 6e 6f 74 68 65 72 20 64 65 64 69 63 61 74 65 64 20 74 6f 6f 6c 2e

Thanks.

LEGEND, Jan 23, 2017

The programmer must be prepared to answer the question "what encoding is this text" at any time. Otherwise you end up with mojibake and unexpected results. Ideally you do this by tracking the text through from a known encoding, but sometimes analysis is the only way. Here we can see it starts out as a single-byte encoding: 54 68 65 20 74 is "The t" in many encodings. If there were no non-ASCII characters we might survive not knowing the encoding.

Starting at the "t" in "environment" we have 74 20 3a 20 5b e2 80 a6 5d. The interesting part is 5b e2 80 a6 5d, since 5b is "[" and 5d is "]". Between them we have e2 80 a6. After looking in several tables I recognise this as UTF-8 for the Unicode character U+2026, horizontal ellipsis. So you have a valid UTF-8 string for the required text.
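
The table lookup can also be done mechanically by decoding the bytes. A minimal sketch, assuming the text is held in a std::string; decodeUtf8At is a hypothetical helper, not part of the Acrobat SDK, and it does not reject overlong forms or surrogates:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the UTF-8 sequence starting at byte offset `pos` and returns its
// code point, or 0xFFFFFFFF if the bytes are truncated or malformed.
std::uint32_t decodeUtf8At(const std::string& s, std::size_t pos) {
    const std::uint32_t kInvalid = 0xFFFFFFFFu;
    if (pos >= s.size()) return kInvalid;

    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t extra = 0;  // number of continuation bytes expected
    std::uint32_t cp = 0;   // code point being assembled
    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return kInvalid;  // a continuation byte cannot start a sequence

    if (pos + extra >= s.size()) return kInvalid;  // sequence is cut short
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) return kInvalid;  // must look like 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

Decoding e2 80 a6 this way gives 0x2026, and the e2 80 9c and e2 80 9d sequences in the second dump give 0x201C and 0x201D, the left and right curly double quotes.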

So the next question is: is this the correct encoding for the API you use?


Your code calls ASTextFromScriptText(searchText, kASEUnicodeScript);
What does kASEUnicodeScript mean? I don't know. I mean, I really don't know. The documentation does not say what it means. It is passed to some unknown platform API to be interpreted. Windows APIs don't usually accept UTF-8, so this is very unlikely to work.

Happily you can go directly from UTF-8 to an ASText with a well documented method. Use ASTextFromUnicode with an encoding type indicating your input is UTF-8.

Bear in mind that if this is a hard-coded test value, your UI will certainly return the input text in a different encoding, and you must handle this.

New Here, Jan 24, 2017

Hello,

As you said, I tried ASTextFromUnicode(reinterpret_cast <ASUTF16Val *> (searchText), kUTF8).

It's now skipping the special characters and passing only the plain text to the search plugin.

New Here, Jan 24, 2017

Any updates on this please?

LEGEND, Jan 25, 2017

I thought you said it was working now? Ellipsis is punctuation so it won't be included in search text. If you want to test searching for actual text outside low ASCII, try accented characters, e.g. café.
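
To confirm that a test string really goes beyond low ASCII, the bytes can be checked directly. A minimal sketch, assuming UTF-8 input in a std::string; isLowAscii is a hypothetical helper name:

```cpp
#include <algorithm>
#include <string>

// True if every byte is below 0x80, i.e. the text is plain low ASCII and
// looks the same in UTF-8 and in most single-byte encodings.
bool isLowAscii(const std::string& bytes) {
    return std::all_of(bytes.begin(), bytes.end(),
                       [](unsigned char b) { return b < 0x80; });
}
```

In UTF-8, café ends in the two bytes c3 a9, so it fails this check and makes a useful test word, while "cafe" passes and would not exercise the encoding path at all.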

New Here, Jan 25, 2017

You said that the ellipsis is punctuation and will not be included in the search.

As I said previously, if I search for that text with the ellipsis manually, it is found properly.

But if I pass that text to the search plugin, it does not work. Why?

LEGEND, Jan 25, 2017

You say it works with Search. Let's clarify that you don't just mean "Find" and have specifically used it with Search, and it worked OK in the user interface.

New Here, Jan 30, 2017

Hello,

I specifically used it with Search. As per your suggestion, I used ASTextFromEncoded(searchText, kUTF8);

As you can see I have used UTF-8 encoding.

My searchText is = The tool will be integrated in the following environment : […]

And the result is shown in the image below:

As you can see, the text that the Search plugin takes is correct, but it does not actually search.

Now comes the interesting part:

If I click on the "New Search" button, it takes the same search text as input and searches for it successfully.

Here is the result:

   

This is very confusing. What are your thoughts on this?

Code snippet:

ASText searchKey = ASTextFromEncoded(searchText, kUTF8);

SearchQueryDataRec srcQuery;
memset(&srcQuery, 0, sizeof(SearchQueryDataRec));
srcQuery.size = sizeof(SearchQueryDataRec);
srcQuery.query = searchKey;
srcQuery.type = kSearchActiveDoc;
srcQuery.match = kMatchPhrase;
srcQuery.options = kWordOptionWholeWord | kSearchEveryWhere;
srcQuery.scope = kSearchDocumentText;
srcQuery.path = NULL;
srcQuery.fs = NULL;
srcQuery.maxDocs = 1;

ASBool ret = SearchExecuteQueryEx(&srcQuery);
if (!ret) {
  AVAlertNote("Search text not found");
}

ASTextDestroy(searchKey);

Thanks.

New Here, Feb 06, 2017

I managed to get it working by just removing these two lines:

  1. srcQuery.type = kSearchActiveDoc;
  2. srcQuery.match = kMatchPhrase;

I am able to search for special characters now, but only in English.

Now I am not able to find Unicode characters in other languages such as Chinese.

Any suggestions on this?
