February 6, 2013

Regexp question

Gidday

I have a regexp that eliminates characters that aren't alphanumeric or )('$!-&?,. and then eliminates consecutive spaces:

nS = nS.replace(/[^\w\s\)\('$\!\-&\?,\.]/gi, "");
nS = nS.replace(/[\s]{2,}/g, " ");

I was wondering how it needs to be altered to also allow non-English characters, such as accented ones like á and Asian characters such as あ?

Cheers for your help

sinious
February 7, 2013

var nS:String = "abcdefghijklmnopqrstuvwxyz123456789      !@#$%^&*()/.'\"  ar[هذا هو جزء من النص] cht[這是一些文本] heb[זה טקסט כלשהו]";

nS = nS.replace(/[\)\(\'\$\!\-\&\,\.]/g, "");

nS = nS.replace(/\s\s+/g, " ");

trace(nS);

Trace:

abcdefghijklmnopqrstuvwxyz123456789 @#%^*/" ar[هذا هو جزء من النص] cht[這是一些文本] heb[זה טקסט כלשהו]

I escaped some extra items in your regexp, like &, but your issue was using ^\w. In the ECMAScript-style regex syntax that ActionScript uses (\w is not actually a POSIX class), \w only stands for these characters:

The word class, of the form "\w", matches any character in the set of ASCII characters [a-zA-Z0-9_].

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1500.html

Since you're no longer matching a-zA-Z, I took out the case-insensitivity flag (i).
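
A quick way to see the ASCII-only behaviour of \w described above. This is only a minimal sketch, and the `sample` string is made up for illustration:

```actionscript
var sample:String = "abc_123áあ";

// Remove everything \w matches; only characters outside [a-zA-Z0-9_] remain.
trace(sample.replace(/\w/g, "")); // áあ
```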

February 11, 2013

Thank you Sinious

I misworded my question - I meant to say...

Let through all languages and )('$!-&?,. - exclude everything else

What needs to be changed to allow that?

sinious
February 12, 2013

Unfortunately, there is no built-in regexp class for something like that. For languages with huge glyph sets (tens of thousands of characters), the regex syntax ActionScript supports doesn't offer anything that matches all of them in a wildcard fashion. As you can see in the spec above, about as good as you can get is the ASCII-only \w.

To match every language, the regexp would have to list all of those glyphs, or ranges of them, explicitly as acceptable.

A more practical approach is to target something smaller, such as listing only the characters you want to remove.
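
As a rough illustration of what listing the acceptable glyphs explicitly could look like: a character class accepts \uXXXX code-point ranges, so whole Unicode blocks can be named at once. The sketch below extends the original allow-list with three blocks picked purely as examples (accented Latin letters, Japanese kana, common CJK ideographs). The `keepAllowed` name and the choice of ranges are assumptions, not part of the thread, and a real application would have to add a range for every script it wants to let through.

```actionscript
// Sketch only: the \u ranges are an illustrative, incomplete selection.
//   \u00C0-\u024F  Latin-1 Supplement letters through Latin Extended-B (á, é, ñ ...)
//   \u3040-\u30FF  Hiragana and Katakana (あ, ア ...)
//   \u4E00-\u9FFF  CJK Unified Ideographs
// Everything NOT in the class is matched, and therefore removed.
var disallowed:RegExp = /[^\w\s\)\('$\!\-&\?,\.\u00C0-\u024F\u3040-\u30FF\u4E00-\u9FFF]/g;

// Hypothetical helper: strips disallowed characters, then collapses whitespace runs.
function keepAllowed(s:String):String {
    return s.replace(disallowed, "").replace(/\s{2,}/g, " ");
}

trace(keepAllowed("café  あ漢字 #@* ok?"));
```

With these ranges the call above should trace `café あ漢字 ok?`. Arabic or Hebrew text would still be stripped until their blocks (for example \u0600-\u06FF and \u0590-\u05FF) are added to the class, which is the enumeration cost described above.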