Known Participant
January 19, 2017
Question

Problem with searching special characters


Hello,

I have written a plugin to search text from the PDF document using search plugin HFT functions.

I can very well search the plain text without any special characters in it.

But when the search text contains special characters, it does not find the whole search text.

Here is my code snippet:

ASText  searchKey = ASTextFromScriptText(searchText, kASEUnicodeScript);

SearchQueryDataRec srcQuery;

memset(&srcQuery, 0, sizeof (SearchQueryDataRec));

srcQuery.size = sizeof(SearchQueryDataRec);

srcQuery.query = searchKey;

srcQuery.type = kSearchActiveDoc;

srcQuery.match = kMatchAllWords;

srcQuery.options = kWordOptionWholeWord | kSearchEveryWhere;

srcQuery.scope = kSearchDocumentText;

srcQuery.path = NULL;

srcQuery.fs = NULL;

srcQuery.maxDocs = 1;

ASBool ret = SearchExecuteQueryEx(&srcQuery);

ASTextDestroy(searchKey);

For example:

If I search for the highlighted text below, it is not found.

However, I can search for the text below without any problem:

Please let me know where I am going wrong.

Thanks.


8 replies

Legend
January 25, 2017

You say it works with Search. Let's clarify that you don't just mean "Find" and have specifically used it with Search, and it worked OK in the user interface.

Known Participant
January 30, 2017

Hello,

I specifically used it with Search. As per your suggestion, I used ASTextFromEncoded(searchText, kUTF8);

As you can see I have used UTF-8 encoding.

My searchText is = The tool will be integrated in the following environment : […]

And result is shown in below image:

As you can see, the text that the Search plugin receives is correct, but it does not actually run the search.

Now comes the interesting part:

If I click the "New Search" button, it takes the same search text as input and finds it successfully.

Here is the result:

   

This is very confusing. What are your thoughts on this?

Code snippet:

// Assign the converted text directly; a preceding ASTextNew() would leak its object.
ASText searchKey = ASTextFromEncoded(searchText, kUTF8);

SearchQueryDataRec srcQuery;

memset(&srcQuery, 0, sizeof (SearchQueryDataRec));

srcQuery.size = sizeof(SearchQueryDataRec);

srcQuery.query = searchKey;

srcQuery.type = kSearchActiveDoc;

srcQuery.match = kMatchPhrase;

srcQuery.options = kWordOptionWholeWord | kSearchEveryWhere;

srcQuery.scope = kSearchDocumentText;

srcQuery.path = NULL;

srcQuery.fs = NULL;

srcQuery.maxDocs = 1;

ASBool ret = SearchExecuteQueryEx(&srcQuery);

if (ret == false) {

  AVAlertNote("Search text not found");

}

ASTextDestroy(searchKey);

Thanks.

Known Participant
February 6, 2017

I managed to get it done by just removing

  1. srcQuery.type = kSearchActiveDoc; 
  2. srcQuery.match = kMatchPhrase;

I am able to search for special characters now, BUT ONLY IN ENGLISH.

Now I am not able to find Unicode characters in other languages, such as Chinese!

Any suggestion on this?

Legend
January 23, 2017

You are filling the data area called searchText. Please tell us the hex value of the entire data area. You really must be concerned with this detail; encodings are vital. End users can think in terms of copy/paste, but programmers must go further!

Known Participant
January 23, 2017

Okay.

Here is the hex dump of the text:

Text: The tool will be integrated in the following environment : […]

Hex code : 54 68 65 20 74 6f 6f 6c 20 77 69 6c 6c 20 62 65 20 69 6e 74 65 67 72 61 74 65 64 20 69 6e 20 74 68 65 20 66 6f 6c 6c 6f 77 69 6e 67 20 65 6e 76 69 72 6f 6e 6d 65 6e 74 20 3a 20 5b e2 80 a6 5d

One more text with special characters:

Text: The tool shall be able to capture semi-automatically the requirements included in a document and/or in a model. “Semi-automatic” means the text has to be formalized beforehand by the user or another dedicated tool.

Hex code:

54 68 65 20 74 6f 6f 6c 20 73 68 61 6c 6c 20 62 65 20 61 62 6c 65 20 74 6f 20 63 61 70 74 75 72 65 20 73 65 6d 69 2d 61 75 74 6f 6d 61 74 69 63 61 6c 6c 79 20 74 68 65 20 72 65 71 75 69 72 65 6d 65 6e 74 73 20 69 6e 63 6c 75 64 65 64 20 69 6e 20 61 20 64 6f 63 75 6d 65 6e 74 20 61 6e 64 2f 6f 72 20 69 6e 20 61 20 6d 6f 64 65 6c 2e 20 e2 80 9c 53 65 6d 69 2d 61 75 74 6f 6d 61 74 69 63 e2 80 9d 20 6d 65 61 6e 73 20 74 68 65 20 74 65 78 74 20 68 61 73 20 74 6f 20 62 65 20 66 6f 72 6d 61 6c 69 7a 65 64 20 62 65 66 6f 72 65 68 61 6e 64 20 62 79 20 74 68 65 20 75 73 65 72 20 6f 72 20 61 6e 6f 74 68 65 72 20 64 65 64 69 63 61 74 65 64 20 74 6f 6f 6c 2e
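A dump like the ones above can be produced with a few lines of plain C. This is only a minimal sketch, independent of the Acrobat SDK; the function name hex_dump and its return convention are made up for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of any SDK): write `len` bytes of `data`
   into `out` as space-separated lowercase hex pairs, e.g. "5b e2 80 a6 5d".
   Returns the number of characters written (excluding the terminator),
   or -1 if `out` is too small. */
int hex_dump(const unsigned char *data, size_t len, char *out, size_t out_size)
{
    /* Each byte needs two hex digits plus a separator or the final '\0'. */
    size_t needed = (len == 0) ? 1 : len * 3;
    size_t pos = 0;
    size_t i;

    if (out_size < needed)
        return -1;

    for (i = 0; i < len; i++) {
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                i ? " %02x" : "%02x", data[i]);
    }
    out[pos] = '\0';
    return (int)pos;
}
```

Dumping the bytes exactly as they sit in the buffer passed to the API, rather than as they appear in an editor, is what makes the encoding visible.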

Thanks.

Legend
January 23, 2017

The programmer must be prepared to answer the question "what encoding is this text" at any time. Otherwise you end up with mojibake and unexpected results. Ideally you answer it by tracking the text through from a known encoding, but sometimes analysis is the only way. Here we can see it is a single byte encoding to start with: 54 68 65 20 74 is "The t" in many encodings. If there were no non-ASCII characters we might survive not knowing the encoding.
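As a rough illustration of that last point, a check for non-ASCII bytes takes only a few lines of plain C. The name first_non_ascii is hypothetical and nothing here depends on the Acrobat SDK.

```c
#include <stddef.h>

/* Hypothetical helper: return the offset of the first byte >= 0x80 in
   `data`, or -1 if every byte is plain ASCII. Only when this returns -1
   is the text unaffected by the choice among ASCII-compatible encodings. */
long first_non_ascii(const unsigned char *data, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (data[i] >= 0x80)
            return (long)i;
    }
    return -1;
}
```

If the function reports an offset, the bytes from that position onward are the ones whose encoding has to be pinned down.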

Starting at the "t" in "environment" we have 74 20 3a 20 5b e2 80 a6 5d. The interesting part is 5b e2 80 a6 5d, since 5b is "[" and 5d is "]". Between those we have e2 80 a6. After looking in several tables I recognise this as UTF-8 for the Unicode character U+2026, horizontal ellipsis. So you have a valid UTF-8 string for the required text.
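Instead of looking in tables, the three bytes can be decoded by hand or with a small routine. The following is a sketch of a standard UTF-8 decoder in plain C; utf8_decode is an illustrative name, not an SDK call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at `s`
   (at most `len` bytes available). On success stores the code point in
   *cp and returns the number of bytes consumed; returns 0 if the input
   is truncated, malformed, overlong, or encodes a surrogate. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c, min;

    if (len == 0)
        return 0;

    if (s[0] < 0x80) {                      /* 0xxxxxxx: ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2 bytes */
        n = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3 bytes */
        n = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4 bytes */
        n = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or invalid lead byte */
    }

    if (len < n)
        return 0;

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* continuation must be 10xxxxxx */
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }

    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return n;
}
```

For e2 80 a6 the lead byte contributes the bits 0010 and the two continuation bytes contribute 000000 and 100110, which together give 0x2026. The curly quotes in the second sample (e2 80 9c and e2 80 9d) decode the same way to U+201C and U+201D.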

So the next question is: is this the correct encoding for the API you use?


Your code calls ASTextFromScriptText(searchText, kASEUnicodeScript);
What does kASEUnicodeScript mean? I don't know. I mean, I really don't know. The documentation does not say what it means. It is passed to some unknown platform API to be interpreted. Windows APIs don't usually accept UTF-8, so this is very unlikely to work.

Happily you can go directly from UTF-8 to an ASText with a well documented method. Use ASTextFromUnicode with an encoding type indicating your input is UTF-8.

Bear in mind that if this is a hard-coded test value, your UI will almost certainly return the input text in a different encoding, and you must handle this.
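To make "a different encoding" concrete: the same ellipsis that is three bytes in UTF-8 is a single 16-bit unit in UTF-16, the form many Windows APIs work with. The sketch below is plain C with a hypothetical name, utf16_encode, and is not tied to the Acrobat SDK.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16 code
   units in host byte order. Returns the number of units written
   (1 or 2), or 0 if `cp` is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {                      /* Basic Multilingual Plane */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                           /* split into a surrogate pair */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

U+2026 becomes the single unit 0x2026, so a buffer holding e2 80 a6 and a buffer holding 26 20 (little-endian UTF-16) describe the same character. Which one an API expects has to be read from its documentation, not guessed.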

Legend
January 20, 2017

Hmm, exact contents of searchText as hex, that is.

Known Participant
January 23, 2017

You mean you want hex conversion of all the search text?

Known Participant
January 23, 2017

"So, what is the Unicode character you are using for this position? "

Which Unicode are you talking about? I am giving the input as a search text containing special characters to the search plugin and let it search. I did not get you in this regard. Can you please elaborate?

Legend
January 20, 2017

You need to be concerned about these details if you are not working with low ASCII. Every character has multiple different ways to be represented, you have to take control. So, what is the Unicode character you are using for this position? Take a hex dump.

Legend
January 20, 2017

And then, whether advanced search from the UI works? What specific characters do you need to use to make this work?

Known Participant
January 20, 2017

I tried manually with the text containing special characters. It searches properly.

But when I use the search plugin HFT functions to search for the same text, it opens a search panel on the left side and does not search.

The search panel is different from what I normally search manually.

I manually searched by command Ctrl+F.

And what do you mean by advanced search? Is it the same as the left panel that I get when I run the plugin?

Legend
January 20, 2017

[corrected] Find and Search used to be completely different. But the difference has become more blurred. I have not used this API directly, but you say it works for other strings, is that right?  If so...

What text do you pass to the UI? Three dots or an ellipsis?

If ellipsis, what Unicode value is in the string passed?

Legend
January 20, 2017

Have you checked whether a normal manual search works for these cases?

Legend
January 19, 2017

In the first case it's clear what to use for a single dot.

Known Participant
January 20, 2017

Please tell me the correct encoding.

I am reading search text from file and then doing some encoding as below:

// Assign the converted text directly; a preceding ASTextNew() would leak its object.
ASText searchKey = ASTextFromEncoded(searchText, kASEUnicodeScript);

While searching, Acrobat does not search correctly if there are special characters such as an ellipsis (...) or double quotes ("").

It does not encode properly.

Known Participant
January 20, 2017

Any updates on this please?

Legend
January 19, 2017

Three dots may be very troublesome. It might be three dot characters, or it might be a single ellipsis character. It is not clear what encoding you'd use in the second case.
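The two cases differ at the byte level: three dots are 2e 2e 2e, while a single ellipsis character in UTF-8 is e2 80 a6. One way to try both forms is to derive the three-dot variant from the ellipsis variant. This is a sketch in plain C, assuming the input is UTF-8; expand_ellipsis is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the NUL-terminated UTF-8 string `src` into
   `dst`, replacing every ellipsis character (bytes e2 80 a6) with three
   ASCII dots. Returns the length written (excluding the terminator),
   or (size_t)-1 if `dst` is too small. */
size_t expand_ellipsis(const char *src, char *dst, size_t dst_size)
{
    static const char ELLIPSIS[] = "\xe2\x80\xa6";
    size_t out = 0;

    if (dst_size == 0)
        return (size_t)-1;

    while (*src) {
        if (strncmp(src, ELLIPSIS, 3) == 0) {
            if (out + 3 >= dst_size)         /* keep room for the '\0' */
                return (size_t)-1;
            memcpy(dst + out, "...", 3);
            out += 3;
            src += 3;
        } else {
            if (out + 1 >= dst_size)
                return (size_t)-1;
            dst[out++] = *src++;
        }
    }
    dst[out] = '\0';
    return out;
}
```

Whether the document itself contains three dots or a single ellipsis still has to be checked; this only produces the alternative search string to test with.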

Known Participant
January 19, 2017

What would be the encoding in both the cases?

I will try both.

Thanks.