Regexp question

Question

Gidday,

I have a regexp that eliminates characters that aren't alphanumeric or )('$!-&?,. and then eliminates consecutive spaces:

nS = nS.replace(/[^\w\s\)\('$\!\-&,\.]/gi, "");
nS = nS.replace(/[\s]{2,}/g, " ");

I was wondering how it needs to be altered to also allow non-English characters, such as accented ones like á and Asian ones such as あ?

Cheers for your help.

sinious · Accepted Answer

There is no regexp for something like that, unfortunately. For languages with huge glyph sets (tens of thousands of characters), the POSIX spec doesn't contain anything that can match all of them in a wildcard fashion. As the spec quoted in my other answer shows, about as good as you can get is the Latin-based \w.

For you to write something that can match every language, the regexp would literally need to list all of those glyphs explicitly as acceptable.

You need to target something smaller, such as working only on what you want to remove.
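As a rough illustration of what "explicitly mentioning the glyphs" could look like for a small target, here is a sketch that extends the asker's whitelist with a few Unicode ranges. It is written in plain ECMAScript style so it runs as JavaScript; carrying it over to ActionScript 3 (adding type annotations, using trace) assumes that \uXXXX escapes behave the same inside a character class there. The function name stripDisallowed and the chosen ranges are my own assumptions, and the ranges are nowhere near complete coverage of "every language".

```javascript
// Hypothetical helper, not from the thread: whitelist ASCII word characters,
// whitespace, the asker's punctuation, plus a few explicit Unicode ranges.
function stripDisallowed(s) {
    // \u00C0-\u024F : accented Latin letters (note: also lets through the
    //                 multiplication and division signs at U+00D7 and U+00F7)
    // \u3040-\u30FF : Hiragana and Katakana
    // \u4E00-\u9FFF : CJK Unified Ideographs
    // The "?" is included because the asker's description lists it.
    var disallowed = /[^\w\s)('$!\-&?,.\u00C0-\u024F\u3040-\u30FF\u4E00-\u9FFF]/g;
    var cleaned = s.replace(disallowed, "");
    // Collapse runs of two or more whitespace characters into one space.
    return cleaned.replace(/\s{2,}/g, " ");
}
```

Any script outside the listed ranges (Arabic, Hebrew, Cyrillic and so on) would still be stripped, which is the maintenance problem described above: each additional script needs its own range added by hand.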

sinious · Answer

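var nS:String = "abcdefghijklmnopqrstuvwxyz123456789 !@#$%^&*()/.'\" ar[هذا هو جزء من النص] cht[這是一些文本] heb[זה טקסט כלשהו]";
// Strip only the listed punctuation; everything else, including non-Latin text, is kept.
nS = nS.replace(/[\)\(\'\$\!\-\&\,\.]/g, "");
// Collapse runs of whitespace into a single space.
nS = nS.replace(/\s\s+/g, " ");
trace(nS);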

Trace:

abcdefghijklmnopqrstuvwxyz123456789 @#%^*/" ar[هذا هو جزء من النص] cht[這是一些文本] heb[זה טקסט כלשהו]

I escaped some extra items in your regexp, like &, but your issue was the negated \w. In the POSIX spec, \w only stands for the characters below, so [^\w...] strips every character outside that ASCII set:

The word class, of the form "\w", matches any character in the set of ASCII characters [a-zA-Z0-9_].

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1500.html

Since you're no longer looking for a-zA-Z, I took out the case-insensitivity flag (i).
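To make the \w point concrete, here is a small sketch that runs the asker's original whitelist against accented and Japanese input. It is written as plain JavaScript, whose regexp literal syntax ActionScript 3 shares; that AS3 gives identical results is an assumption on my part, and the names originalWhitelist and applyOriginal are made up for the illustration.

```javascript
// The first regexp from the question, unchanged.
var originalWhitelist = /[^\w\s\)\('$\!\-&,\.]/gi;

// Hypothetical helper: apply only the asker's first replace step.
function applyOriginal(s) {
    return s.replace(originalWhitelist, "");
}

// \w covers only [a-zA-Z0-9_], so neither of these counts as a word character.
var accentedIsWord = /\w/.test("\u00E1"); // á
var kanaIsWord = /\w/.test("\u3042");     // あ
```

Because both tests come back false, the negated class treats á and あ as disallowed and removes them, which matches the behaviour the question describes.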
