Problem with searching special characters

January 19, 2017

Hello,

I have written a plugin that searches for text in a PDF document using the Search plugin HFT functions.

Plain text without any special characters is found without problems.

But when the search text contains special characters, the full search text is not found.

Here is my code snippet:

ASText  searchKey = ASTextFromScriptText(searchText, kASEUnicodeScript);
SearchQueryDataRec srcQuery;
memset(&srcQuery, 0, sizeof (SearchQueryDataRec));
srcQuery.size = sizeof(SearchQueryDataRec);
srcQuery.query = searchKey;
srcQuery.type = kSearchActiveDoc;
srcQuery.match = kMatchAllWords;
srcQuery.options = kWordOptionWholeWord | kSearchEveryWhere;
srcQuery.scope = kSearchDocumentText;
srcQuery.path = NULL;
srcQuery.fs = NULL;
srcQuery.maxDocs = 1;
ASBool ret = SearchExecuteQueryEx(&srcQuery);
ASTextDestroy(searchKey);

For example:

If I search for the highlighted text below, it is not found.

However, I can search for the text below without problems:

Please let me know where I am going wrong.

Thanks.

Acrobat SDK and JavaScript

Test Screen Name

Legend

You say it works with Search. Let's clarify that you don't just mean "Find" and have specifically used it with Search, and it worked OK in the user interface.

navnathk23503600 (Author)

Known Participant

Hello,

I specifically used Search. As per your suggestion, I used ASTextFromEncoded(searchText, kUTF8);

As you can see I have used UTF-8 encoding.

My searchText is = The tool will be integrated in the following environment : […]

And result is shown in below image:

As you can see, the text that the Search plugin receives is correct, but it does not actually run the search.

Now comes the interesting part:

If I click the "New Search" button, it takes the same search text as input and finds it successfully.

Here is the result:

This is very confusing. What are your thoughts on this?

Code snippet:

ASText searchKey = ASTextNew();
searchKey = ASTextFromEncoded(searchText, kUTF8);
SearchQueryDataRec srcQuery;
memset(&srcQuery, 0, sizeof (SearchQueryDataRec));
srcQuery.size = sizeof(SearchQueryDataRec);
srcQuery.query = searchKey;
srcQuery.type = kSearchActiveDoc;
srcQuery.match = kMatchPhrase;
srcQuery.options = kWordOptionWholeWord | kSearchEveryWhere;
srcQuery.scope = kSearchDocumentText;
srcQuery.path = NULL;
srcQuery.fs = NULL;
srcQuery.maxDocs = 1;
ASBool ret = SearchExecuteQueryEx(&srcQuery);
if (ret == false) {
  AVAlertNote("Search text not found");
}
ASTextDestroy(searchKey);

Thanks.

navnathk23503600 (Author)

Known Participant

I managed to get it done by just removing

srcQuery.type = kSearchActiveDoc;
srcQuery.match = kMatchPhrase;

I am able to search for special characters now, BUT ONLY IN ENGLISH.

I am still not able to find Unicode characters in other languages, such as Chinese!

Any suggestion on this?

Test Screen Name

Legend

You are filling the data area called searchText. Please tell us the hex value of the entire data area. You really must be concerned with this detail; encodings are vital. End users can think in terms of copy/paste, but programmers must go further!
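In case it helps, here is a minimal C++ sketch of one way to produce such a dump. It assumes the search text is already held as raw bytes in a std::string; HexDump is a hypothetical helper name, not part of the Acrobat SDK.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not an Acrobat SDK call): returns the bytes of `data`
// as lowercase two-digit hex values separated by single spaces.
std::string HexDump(const std::string& data) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(data[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Dumping the buffer immediately before it is handed to the ASText constructor shows the bytes the API actually receives, rather than what an editor or debugger chooses to display.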

navnathk23503600 (Author)

Known Participant

Okay.

Here is the hex dump of the text:

Text: The tool will be integrated in the following environment : […]

Hex code : 54 68 65 20 74 6f 6f 6c 20 77 69 6c 6c 20 62 65 20 69 6e 74 65 67 72 61 74 65 64 20 69 6e 20 74 68 65 20 66 6f 6c 6c 6f 77 69 6e 67 20 65 6e 76 69 72 6f 6e 6d 65 6e 74 20 3a 20 5b e2 80 a6 5d

One more text with special characters:

Text: The tool shall be able to capture semi-automatically the requirements included in a document and/or in a model. “Semi-automatic” means the text has to be formalized beforehand by the user or another dedicated tool.

Hex code:

54 68 65 20 74 6f 6f 6c 20 73 68 61 6c 6c 20 62 65 20 61 62 6c 65 20 74 6f 20 63 61 70 74 75 72 65 20 73 65 6d 69 2d 61 75 74 6f 6d 61 74 69 63 61 6c 6c 79 20 74 68 65 20 72 65 71 75 69 72 65 6d 65 6e 74 73 20 69 6e 63 6c 75 64 65 64 20 69 6e 20 61 20 64 6f 63 75 6d 65 6e 74 20 61 6e 64 2f 6f 72 20 69 6e 20 61 20 6d 6f 64 65 6c 2e 20 e2 80 9c 53 65 6d 69 2d 61 75 74 6f 6d 61 74 69 63 e2 80 9d 20 6d 65 61 6e 73 20 74 68 65 20 74 65 78 74 20 68 61 73 20 74 6f 20 62 65 20 66 6f 72 6d 61 6c 69 7a 65 64 20 62 65 66 6f 72 65 68 61 6e 64 20 62 79 20 74 68 65 20 75 73 65 72 20 6f 72 20 61 6e 6f 74 68 65 72 20 64 65 64 69 63 61 74 65 64 20 74 6f 6f 6c 2e

Thanks.

Test Screen Name

Legend

The programmer must be prepared to answer the question "what encoding is this text?" at any time. Otherwise you end up with mojibake and unexpected results. Ideally you answer it by tracking the text through from a known encoding, but sometimes analysis is the only way. Here we can see it is a single-byte encoding to start with: 54 68 65 20 74 is "The t" in many encodings. If there were no non-ASCII characters we might survive not knowing the encoding.

Starting at the "t" in "environment" we have 74 20 3a 20 5b e2 80 a6 5d. The interesting part is 5b e2 80 a6 5d, since 5b is "[" and 5d is "]". Between them we have e2 80 a6. After looking in several tables I recognise this as UTF-8 for the Unicode character U+2026, horizontal ellipsis. So you have a valid UTF-8 string for the required text.
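The table lookup can also be done in code. The following is a minimal C++ sketch of a UTF-8 decoder, assuming well-formed input; DecodeUtf8 is a hypothetical helper name, not an SDK function, and it simply stops at the first malformed or truncated sequence instead of reporting an error.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Sketch only: decoding stops at the first malformed or truncated sequence,
// and overlong forms and surrogates are not rejected.
std::vector<std::uint32_t> DecodeUtf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) { cp = lead; extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else break;  // stray continuation byte or invalid lead byte
        if (i + extra >= s.size()) break;  // sequence runs past the end
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return out;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For the bytes 5b e2 80 a6 5d this yields the three code points U+005B, U+2026 and U+005D. The same routine applied to e2 80 9c and e2 80 9d in the second dump gives U+201C and U+201D, the curly double quotes.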

So the next question is: is this the correct encoding for the API you use?

Your code calls ASTextFromScriptText(searchText, kASEUnicodeScript);
What does kASEUnicodeScript mean? I don't know. I mean, I really don't know. The documentation does not say what it means. It is passed to some unknown platform API to be interpreted. Windows APIs don't usually accept UTF-8, so this is very unlikely to work.

Happily you can go directly from UTF-8 to an ASText with a well documented method. Use ASTextFromUnicode with an encoding type indicating your input is UTF-8.

Bear in mind that if this is a hard-coded test value, your UI will certainly return the input text in a different encoding, and you must handle this.

Test Screen Name

Legend

Hmm, exact contents of searchText as hex, that is.

navnathk23503600 (Author)

Known Participant

You mean you want hex conversion of all the search text?

navnathk23503600 (Author)

Known Participant

"So, what is the Unicode character you are using for this position? "

Which Unicode character are you talking about? I am passing search text containing special characters to the Search plugin and letting it search. I did not understand you on this point. Can you please elaborate?

Test Screen Name

Legend

You need to be concerned about these details if you are not working with low ASCII. Every character can be represented in multiple different ways, so you have to take control. So, what is the Unicode character you are using for this position? Take a hex dump.

Test Screen Name

Legend

And then, does advanced search from the UI work? What specific characters do you need to use to make this work?

navnathk23503600 (Author)

Known Participant

I tried manually with the text containing special characters. It is found correctly.

But when I use the Search plugin HFT functions to search for the same text, a search panel opens on the left side and nothing is found.

That search panel is different from the one I normally use when searching manually.

I searched manually with Ctrl+F.

And what do you mean by advanced search? Is it the same as the left panel that I get when I run the plugin?

Test Screen Name

Legend

[corrected] Find and Search used to be completely different. But the difference has become more blurred. I have not used this API directly, but you say it works for other strings, is that right? If so...

What text do you pass to the UI? Three dots or an ellipsis?

If ellipsis, what Unicode value is in the string passed?

Test Screen Name

Legend

Have you checked whether a normal manual search works for these cases?

Test Screen Name

Legend

In the first case it's clear what to use for a single dot.

navnathk23503600 (Author)

Known Participant

Please tell me the correct encoding.

I am reading the search text from a file and then converting it as below:

ASText searchKey = ASTextNew();

searchKey = ASTextFromEncoded(searchText, kASEUnicodeScript);

When searching, Acrobat does not find the text correctly if it contains special characters such as an ellipsis (...) or double quotes ("").

The text is not encoded properly.

navnathk23503600 (Author)

Known Participant

Any updates on this please?

Test Screen Name

Legend

Three dots may be very troublesome. It might be three separate dot characters, or it might be a single ellipsis character. It is not clear what encoding you'd use in the second case.
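One way to test both possibilities is to build a second query string in which every ellipsis character is written out as three dots, then try each variant in turn. A minimal C++ sketch, assuming the text is UTF-8 and the ellipsis is U+2026 (bytes E2 80 A6); EllipsisToDots is a hypothetical helper name, and whether the document text itself holds one character or three is not something this code can tell you.

```cpp
#include <string>

// Hypothetical helper: returns a copy of `utf8` in which every U+2026
// HORIZONTAL ELLIPSIS (UTF-8 bytes E2 80 A6) is replaced by three ASCII dots.
std::string EllipsisToDots(const std::string& utf8) {
    static const std::string kEllipsis = "\xE2\x80\xA6";
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t hit = utf8.find(kEllipsis, pos);
        if (hit == std::string::npos) {
            out.append(utf8, pos, std::string::npos);  // copy the remainder
            break;
        }
        out.append(utf8, pos, hit - pos);  // copy text before the ellipsis
        out += "...";
        pos = hit + kEllipsis.size();
    }
    return out;
}
```

Searching once with the original string and once with the converted one shows which form the document actually contains.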

navnathk23503600 (Author)

Known Participant

What would the encoding be in each of the two cases?

I will try both.

Thanks.
