Tokenize xml maybe using RegEx?
I'm working on a simple search engine for dynamically loaded XML data. I have data of this form (more or less):
<sessions>
<session id=##>
<title><![CDATA[The Title]]></title>
<presenter>A person or two goes here with title</presenter>
<date>2011-2-15-10-00-a</date>
<webex><![CDATA[https://alink.com]]></webex>
<audience> <![CDATA[Various types that might be interested]]> </audience>
<desc><![CDATA[A longish description that might include some simple html tags like bold or some lists]]></desc>
<resources>
<resource>
<name><![CDATA[A slide deck, website, white paper, etc.]]></name>
<link active="true"><![CDATA[thelink to the resourcepdf]]></link>
<tip><![CDATA[A description of what is at the site or why the resource is interesting]]></tip>
</resource>
<resource>
.....
</resource>
</resources>
</session>
<session>
....
</session>
</sessions>
I need to break apart all the "useful" words and run them through my indexer. Currently I'm using e4x to pull out certain nodes and get the content as a string. Then I'm using something like this to break it all up:
var tokens:Array=[" - ","?",",","."....etc];
for(var i:int=tokens.length-1;i>=0;i--){
str=str.split(tokens[i]).join(" ");
}
Is there a quicker, more efficient, better way to do this? I'm just learning about RegEx and think it could maybe have some use here, but I'm not all that good with it. Part of the problem here is that the tokens array needs to take into account all of the possible characters that could signal divisions between words, and there are so many of them. It might be simpler to go the other route and list the things we want to keep. That list is much shorter:
remove any xml tags and remove any html tags
keep A-Za-z (including accented letters such as grave, acute, umlaut, etc.)
keep ' or - when they are in the middle of a word, i.e. surrounded by letters
everything else goes
So there are really two parts to this:
1. What is the best, fastest, easiest way to extract all the data from the xml.
2. What is the most reliable easiest way to break all that data into just the words.
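A minimal sketch of the "list what we want to keep" idea, assuming the text has already been pulled out as a string. The helper name `tokenizeKeep` and the sample input are made up for illustration, and the `\unnnn` escapes for accented letters and the curly apostrophe are assumed to behave as described in the AS3 RegExp documentation. URLs are not handled here.

```actionscript
// Sketch only: tokenizeKeep and the sample string are hypothetical.
function tokenizeKeep(raw:String):Array
{
    var str:String = raw.toLowerCase();
    // 1. Unwrap CDATA markers first so the tag pattern below
    //    cannot swallow the text between them.
    str = str.replace(/<!\[cdata\[|\]\]>/g, " ");
    // 2. Drop any remaining XML/HTML tags along with their attributes.
    str = str.replace(/<[^>]*>/g, " ");
    // 3. Keep runs of letters; ' (or the curly apostrophe) and - are kept
    //    only when they sit between two letters.
    var wordRe:RegExp = /[a-z\u00E0-\u00F6\u00F8-\u00FF]+(?:['\u2019\-][a-z\u00E0-\u00F6\u00F8-\u00FF]+)*/g;
    return str.match(wordRe);
}

trace(tokenizeKeep("<title><![CDATA[Key- or stop-words, don't <b>'quote'</b> it]]></title>"));
// trace: key,or,stop-words,don't,quote,it
```

Because match() with the g flag returns every match, no delimiter list is needed; anything that is not part of a word is simply never collected.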
On #1 I'd say it entirely depends on how much XML we're talking about here. I deal with files from 1k to 4MB on average and I'd greatly change my strategy depending on the ballpark average size of the data. String operations are typically cheaper than RegExp, but splitting and joining on a lot of data would change my mind quickly on that. Under the hood most of the same operations happen anyhow. RegExp is more about convenience: you describe the pattern and let the engine decide how to perform the needed string operations. Nothing beats a huge test though, running each approach millions of times and comparing the timing.
So on that I'd say take the same (longer=better) string, split/join it in a huge for-loop and compare that against a RegExp .replace() and see what your particular data reveals for timing.
On #2 I'd say RegExp hands down. The very point of it is to wildcard-match your content which is your issue. You don't know what to expect on some data so you can wildcard part of it while maintaining reasonable validation.
So how much data are we talking here and will there be any multi-byte characters involved? Sometimes strings with characters like Chinese can throw a huge performance difference (or outright glitchy behavior) between choices.
Edit:
A quick test shows how much faster string ops are running on 100,000 cycles against the same string, but also shows you how wildcards can be good (but dangerous).
New AS3 doc, paste in frame 1:
import flash.utils.getTimer;
var tokens:Array = [' ',"\t",'||','~','@'];
var i:int;
var tokenIndex:int;
var testStr:String;
// STRING OPS
var startTime:uint = getTimer();
for (i = 0; i < 100000; i++)
{
testStr = "This is a||test string~with multiple@delimiters";
for (tokenIndex = 0; tokenIndex < tokens.length; tokenIndex++)
{
// joining via && because a space is a valid token
testStr = testStr.split(tokens[tokenIndex]).join("&&");
}
}
trace("String Ops Total Time: " + (getTimer() - startTime) + "\nLast str: " + testStr);
// REGEXP
var re:RegExp = /[\s\t|~@]+/g;
startTime = getTimer();
for (i = 0; i < 100000; i++)
{
testStr = "This is a||test string~with multiple@delimiters";
for (tokenIndex = 0; tokenIndex < tokens.length; tokenIndex++)
{
testStr = testStr.replace(re,'&&');
}
}
trace("RegExp Total Time: " + (getTimer() - startTime) + "\nLast str: " + testStr);
Trace output:
String Ops Total Time: 1355
Last str: This&&is&&a&&test&&&&string&&with&&multiple&&delimiters
RegExp Total Time: 4643
Last str: This&&is&&a&&test&&string&&with&&multiple&&delimiters
The point of printing the string is to see how the operation went. You can see that splitting via string ops, while roughly 3.4x faster in this test, added an extra delimiter (the &&&& between "test" and "string") that the RegExp collapsed. What's important to note is that by specifying it like I did (very loosely), any run of any of those characters counts as a single delimiter. So || is a delimiter just as ||||||||||||| is. You can get more specific to reduce this issue, but even with the easiest example possible you can see that letting RegExp wrap the string ops is far more expensive than handling the ops yourself.
The more interesting test comes when you explain the potential variations. If there aren't a ton of them, RegExp may be slower but easier. The logic given to matching is just very time consuming. The above only illustrated tokens-gone-wild matching. Tokenizing into an array while performing a trim() (in Flex) on each item and then joining them back together may remove that "fix" RegExp performed yet not cost you. If not Flex, the "gotcha" is that there's no trim() function, and often the suggested replacement is a RegExp solution, hehe.
Re-Edit:
D'oh tried it on my home laptop (workstation dual quad Xeon x5365 32gb vs single quad i7 1.6 4GB) and my laptop beat my workstation... $900 2 years newer versus $5,250 2 years older.. I hate technology
Trace on old i7 laptop:
String Ops Total Time: 1280
Last str: This&&&&is&&a&&test&&&&string&&with&&multiple&&delimiters
RegExp Total Time: 4609
Last str: This&&is&&a&&test&&string&&with&&multiple&&delimiters
Yes, slower in string ops but faster in RegExp. One must wonder why (and assume i7 64bit enhancements, including overclocking and pipeline bubble reduction).
I think you've got an unfair comparison in your code. You have an inner loop in the regex version, but my understanding is that isn't needed. Once removed the regex comes out faster:
String Ops Total Time: 1778
Last str: This&&&&&&&&is&&a&&test&&&&string&&with&&multiple&&delimiters
RegExp Total Time: 971
Last str: This&&is&&a&&test&&string&&with&&multiple&&delimiters
I'm thinking I might need some kind of hybrid solution to deal with removing all the xml tags and such and something else to deal with the leftover bits.
I tried a hybrid version using string replace and a regexp that came out in the middle:
startTime = getTimer();
var re2:RegExp=/\s+/g
var re3:RegExp=/&+/g;
for (i = 0; i < 100000; i++)
{
testStr = "This is a||test string~with multiple@delimiters";
for (tokenIndex = tokens.length-1; tokenIndex >=0; tokenIndex--)
{
testStr = testStr.replace(tokens[tokenIndex],"&&");
}
testStr = testStr.replace(re2, '&&');
testStr = testStr.replace(re3, '&&');
}
trace("Hybrid Ops Total Time: " + (getTimer() - startTime) + "\nLast str: " + testStr);
Hybrid Ops Total Time: 1672
Last str: This&&is&&a&&test&&string&&with&&multiple&&delimiters
Each of my "documents" that needs to be indexed is pretty short, 100 to 300 words, and in the current case there are only a few hundred of them. I'm only dealing with English and a few accented characters. I am doing this in Flash to run on a webpage.
Thanks for the help; you've given me a lot to think about.
Whoops, I didn't realize that string.replace(someString," ") only replaces the first occurrence. Things slow down a lot after that!
Ugh, you're totally right, so much for copy and paste without inspection.
String.replace() absolutely will do more than the first occurrence when you hand it a RegExp; it does what the RegExp formula specifies. This formula is infantile compared to the complexity it can parse.
You just need to use modifiers to get it to match more than the first occurrence. You start and end your regex between // but you apply the modifiers after the last /. The 'g' I used means 'global' (all occurrences in the whole string, line breaks included). The 'm' (multiline) modifier only changes ^ and $ so they also match at the start and end of each line; it isn't needed for this replace, but adding it with //gm is harmless.
e.g.:
import flash.utils.getTimer;
var tokens:Array = [' ', "\t", '||', '~', '@',"\n"];
var i:int;
var tokenIndex:int;
var testStr:String;
var stringReset:String = "This is@a multiline~string\nto||test\nagainst that should~replace|the@@@@@@@@whole\nstring@with~~~a dash@@between each";
// STRING OPS
var startTime:uint = getTimer();
for (i = 0; i < 100000; i++)
{
testStr = stringReset;
for (tokenIndex = 0; tokenIndex < tokens.length; tokenIndex++)
{
testStr = testStr.split(tokens[tokenIndex]).join("-");
}
}
trace("String Ops Total Time: " + (getTimer() - startTime) + "\nLast str: " + testStr);
// REGEXP
// g=global, m=multiline
var re:RegExp = /[\s\t|~@\n]+/gm;
startTime = getTimer();
for (i = 0; i < 100000; i++)
{
testStr = stringReset;
testStr = testStr.replace(re, '-');
}
trace("RegExp Total Time: " + (getTimer() - startTime) + "\nLast str: " + testStr);
Trace:
String Ops Total Time: 2742
Last str: This-is-a-multiline-string-to-test-against-----that-should-replace|the--------whole-string-with---a-dash--between-each
RegExp Total Time: 2043
Last str: This-is-a-multiline-string-to-test-against-that-should-replace-the-whole-string-with-a-dash-between-each
The 'g' (global) modifier is what lets a single replace() call catch every occurrence, which might have been why you thought you needed to run it multiple times. I just specified the string at the top with newlines (\n) to simulate text with line breaks like a real string may have. Both approaches catch almost everything, as you can see, except a single | that String ops misses (its token is the double ||). String ops is a bit slower now, and strings with lots of spaces or repeated delimiters really add in the extra dashes.
If your sample above is true to the data you need to parse then you have the basics of what would need to be done to performance test your solution. The real answer is there is no absolute better way to parse any particular data. Depending on the data and the complexity the solution will always be different and require (as always) some testing.
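A small sketch of what the two modifiers do on a string with a line break, assuming the behaviour described in the AS3 RegExp documentation (g applies to the whole string; m only affects the ^ and $ anchors). The variable name and sample text are made up for illustration.

```actionscript
var twoLines:String = "one two\nthree four";

// 'g' alone already crosses the line break, because \s matches \n.
trace(twoLines.replace(/\s+/g, "-"));   // one-two-three-four

// Without 'm', ^ anchors only at the very start of the string.
trace(twoLines.match(/^\w+/g));         // one

// With 'm', ^ also anchors right after each \n.
trace(twoLines.match(/^\w+/gm));        // one,three
```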
I understand about the replace. You can also run it with just a string literal and not a regex. In that case it only does the first one.
var str:String="My abc is here. Your abc is there.";
trace(str.replace("abc","$"));
This is what I came up with. Currently I have just over 102 xml nodes to strip out. And this is what it finds:
Words found: 11374
Unique words: 2190
Took: 165
I think for my current purposes this is fast enough. This is only part of a larger process and the whole thing is going to need to be asynchronous so I don't think this part will contribute too much load to the whole anymore.
Thanks for your thoughts.
function loadXML(e:Event):void {
xmlLoader.removeEventListener(Event.COMPLETE, loadXML);
xmlData=new XML(e.target.data);
var entry:XML;
var s:Number=getTimer();
var allWords:Array=[];
for each (entry in xmlData.sessions.session) {
var tmp:Array=tokenize(entry.toString().toLowerCase())
allWords = allWords.concat(tmp)
}
trace("Words found: "+allWords.length);
var obj:Object={}
for(var i:int=allWords.length-1;i>=0;i--){
obj[allWords[i]]=true;
}
var results:Array=[]
for(var str:String in obj){
results.push(str);
}
results.sort();
trace("Unique words: "+results.length);
trace("Took: "+(getTimer()-s));
}
function tokenize(str:String):Array {
str=str.replace(reC," ");
str=str.replace(reE," ");
str=str.replace(reTags," ");
str=str.replace(reLink," ");
str=str.replace(reQuote," ");
str=str.replace(reHyphen," ");
var j:int=tmp.length;
while (j--) {
str=str.split(tmp[j]).join(" ");
}
str=str.replace(reSpaces," ");
return str.split(" ");
}
Can you show me what all those re* RegExps are? You can almost certainly consolidate if not all, most of them into a single RegExp for much more performance.
If you look at my RegExp it's specifying multiple characters all with a single regexp. You specify single characters in an "or" fashion between brackets like so:
[ab]
That means 'a' or 'b', just one of them, not both. More examples of matching, if you want a space:
\s
If you want "one or more" you use +, so "one or more" spaces:
\s+
If you want "zero or more" you use *, so "zero or more" 'a's or 'b's in any mix (aaaaaaa, bbbbbb or abba):
[ab]*
So you see my RegExp said "one or more" for all my options so it would match anything it enclosed in a "singular" fashion:
[\s\t\|~@\n]+
That says match "one or more" of the following characters: space, tab, |, ~, @ or newline
So for a short example you don't need to do "reQuote", "reHyphen" and "reSpaces", combine all 3 in a "one or more" fashion:
/[\"\-\s]+/gm
Adding the modifiers for g=global and m=multiline, that means any run of quotes ("), hyphens (-) and whitespace (\s). Add in as many characters between the brackets as you like.
Two more things to make your life easier: case insensitivity is the //i modifier, and you can use ranges inside brackets as well, a-z, 0-9, etc. Here's something that will match "a single" letter (any case) or number:
/[a-z0-9]/i
Inside brackets a dash between two characters is interpreted as a range, not a literal dash. If you want a literal character then escape it with \.
e.g. match a-z, 0-9 or a dash:
[a-z0-9\-]
You get the general idea. A single String.replace() with a more complex RegExp will run faster. RegExps get insanely complex, all the way up to parsing entire documents with a single crazy RegExp.
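The building blocks above can be tried out in a few lines. This is only an illustrative sketch; the sample strings and variable names are made up.

```actionscript
// [ab] matches one character that is either 'a' or 'b'.
trace("cabbage".replace(/[ab]/g, "_"));        // c____ge

// [ab]+ matches a run of a's and b's in any mix.
trace("xabbay".match(/[ab]+/g));               // abba

// Quotes, hyphens and whitespace collapsed in a single pass.
var sample:String = 'R2-D2 said "beep"  --  twice';
trace(sample.replace(/["\-\s]+/g, " "));       // R2 D2 said beep twice

// Letters, digits and a literal (escaped) dash, case-insensitive.
trace(sample.match(/[a-z0-9\-]+/gi));          // R2-D2,said,beep,--,twice
```

Note that the last pattern also matches the bare "--", which is the kind of leftover a follow-up rule would still have to deal with.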
Oh man! I thought I had pasted them into there. Talk about copy and pasting errors! I'm at home now and don't have them.
I get a lot of the concepts behind the shorthand for the regexs, but in general I've found they make almost no sense.
In general the first one is to get rid of <![cdata[ and the second one is ]]>.
The next one is one I got off the internet to remove all html tags and their attributes (if any).
The next one is to find any http or https and all the way to the next \s character and remove it.
The next is to take out any single quotes that aren't in the middle of words. (Keeping contractions, but not the use of 'single quoting' around words.)
The final one is to remove hyphens. Like the single-quote one above, it is meant to keep hyphenated words, but not hyphens or double hyphens with spaces on one side or the other (or both).
Then the tmp array is all the other characters that need to come out: "!?@#$%^&*()[]{}–—+:;<>©®™= There may be a few others, but that is the bulk of it.
Finally the space one removes all the double spaces, returns, etc.
Thanks for your help.
So I worked on putting them together and here is what I came up with:
var re1:RegExp=/<!\[cdata\[|\]\]>|<.*?>|https?:\/\/.*\s|[\/\.,_…"“”!?@#$%^&\*()\[\]{}–—+:;<>©®™=]|--+/g;
var re2:RegExp=/[\r\n\s]['‘’\-]|['‘’\-][\r\n\s]|\s+/g;
For the part about removing the various single quotes and/or hyphens when they sit at a word boundary (next to a space, etc.), I found I wanted to make sure all the other stuff was removed first. I couldn't see how to affect the order within re1. I can't tell if it is faster or not because I'm on a completely different system, but it seems pretty good.
Words found: 11382
Unique words: 2191
Took: 120
What I'm seeing in these RegExps:
re1 in English:
Globally in the current string, match any <![cdata[ or ]]> or <anything> or http(s):// followed by anything up to a whitespace character (greedy, so it runs to the last whitespace it can reach on the line), or any single character from the bracketed list, or two or more hyphens in a row.
Think of RegExp like a language whose sole purpose is to give you a ton of wildcards with programmatic-like features to "describe" the content you want. Outside of brackets, characters like . * + ? ^ $ | ( ) [ { and \ are operators rather than literal text. Inside brackets almost everything is literal; only ], \, a leading ^ (which means "not") and a - between two characters (a range) keep a special meaning. So your bracketed list works as written, including the ! and ? in it.
So to match a single character that is NOT a lowercase 'a':
/[^a]/
If you explicitly want a character, the safest thing you can do is escape it just like you did with the brackets. An exclamation point is not an operator on its own, but escaping it does no harm:
/\!/
It's just like "reserved words" in coding. You'd never make a variable name like 'for' or 'if' because you know the compiler will balk. Same deal with RegExp. Knowing which characters are operators (|, [, ], ^, $, {, }, (, ), ., *, +, ?, \) will help solidify your meaning. There are tons of reference guides out there, but since Perl was the big proponent of regular expressions I often just follow the simple PHP preg_* function syntax reference (click the links at the top for categories: http://php.net/manual/en/reference.pcre.pattern.syntax.php )
Any time you add in an "or" with | you're better off making a new RegExp for that. It's much easier to debug smaller complicated RegExps than a string of a bunch all together. trace() your string between every step to see which RegExps are misbehaving and medicate as needed.
For your example, from what I assume you want to do is just remove things. I'd do it like so:
removal of CDATA wrapper:
var str:String = '<![cdata[this is some text]]> moo';
var cdataRe:RegExp = /\<\!\[cdata\[(.*)?\]\]\>/i;
str = str.replace(cdataRe,"$1");
trace(str);
// trace: this is some text moo
This is a replace that shows you the ability of parentheses to capture text. Captured text is made available (in order of the opening parentheses) as $1, $2, $3, etc. in the replacement string. I captured the text between the CDATA tags and my replacement was only the text inside them.
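As a further sketch of capture groups, the date format from the sample XML can be rearranged with several groups at once. Reading the trailing "a" as "am" is an assumption made for the example, and the variable names are made up.

```actionscript
// Assumed layout: year-month-day-hour-minute-(a|p)
var stamp:String = "2011-2-15-10-00-a";
var stampRe:RegExp = /^(\d+)-(\d+)-(\d+)-(\d+)-(\d+)-([ap])$/;

// Each $n refers to the n-th pair of parentheses, counted left to right.
trace(stamp.replace(stampRe, "$2/$3/$1 $4:$5 $6m"));
// trace: 2/15/2011 10:00 am
```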
removal of any HTTP(S), RTMP, FTP links:
// important to note no space after, but will match
// taking out ftp,http,rtmp
var str:String = 'this is some text http://www.moo.com/a/b/c/?ref=123&q=2 and https://www.foo.com/cpanel/?a=login.do links HTTPS:// RTMP://media.someserver.com/moo.flv ftp://woo:moo@ftp.moo.com';
var httpsRe:RegExp = /[fhr]t{1,2}m*ps*\:\/\/.*?\s+|[fhr]t{1,2}m*ps*\:\/\/.*$/igm;
str = str.replace(httpsRe,'');
trace(str);
// trace: this is some text and links
You get the idea. I'm describing every bit of the text as I go. I wanted to show a decent usage of the | (or) branch in the case of removing 2 different types of links: a link in the beginning or middle of a sentence will have a space after it, while a link at the end of the string has no space. However it's not perfect. Run tests on it and you'd see that if a link ended up at the end of a sentence and there was a period, that period would get eaten too. It's exceptions like "no space after it" or "end of a sentence" that RegExps need a lot of extra conditional logic for. That's why I wouldn't bundle more than a single purpose into one RegExp, because when you REALLY field test against data those seemingly simple one-purpose RegExps end up being huge.
The re2 I see above seems to have some very specific data sent to it. It's saying: A string containing a return or newline or space followed by quotes or apostrophe or dash followed by quotes or apostrophe or dash followed by a return or newline or space or just one or more spaces.
That's a pretty weird RegExp. That would match something like this:
var a:String = "
'-
";
// or
var b:String = ' "" ';
The final 'or' is the only thing I'd condense because you have it in your bracket already. You're saying at the end [\r\n\s] or \s+. So:
var re2:RegExp = /[\r\n\s]["'\-]{2}[\r\n]*\s*/gm;
Writing it like that just states either return or newline or one or more spaces will match. You can see the usage of braces marking the range of matches I desire, so {2} means I need 2 of the previous characters specified in a row. 1-5 characters specified in a row is just as easy, /[a-z]{1,5}/ means from 1 to 5 lowercase letters from a to z.
Yeah, the problem I have is that it is like a language, and one with many dialects. Generally all the tutorials I find are for other versions of it, where things are escaped (or double escaped in AS3), or are in quotes or not, or use a different character to indicate the start, etc. It is very confusing. The other thing I've found in most tutorials is that they give you a few of the special characters and then immediately dump you into very long complex expressions with no explanation. So I very much appreciate you taking the time to explain things in English.
Thank you so much for pointing out which of my characters to remove are special and need to be escaped. I notice that sometimes I forget. There are just so many! So I think it should look like this?
\/\.,_…"“”\!\?@#$%\^\&\*\(\)\[\]\{\}–—\+:;<>©®™=
Thank you for the groups in the cdata example. I knew those existed, but didn't quite know how to use them. Makes perfect sense.
The tags one and the url one both get me quite confused. Both <.*?> and https?:\/\/.*\s seem to do what I want. But when I search for examples on the web I get things that are much longer and more complicated. So would your http one be better/faster than the one I have? If the problem is greediness, that isn't such a problem for my current purpose. I want all the periods, commas, etc. gone anyway. And even if somehow the next thing it swallows up is a whole word, it is just one word. The odds that it would be the one occurrence of an important word are pretty low.
That final one is a bit of a trick and I think that my attempt at using the | to bundle three things into one is partially what is giving you trouble. Or not. Let's start from the end of it.
I was using \s+ to get rid of any extra returns, newlines, multiple spaces, etc. no matter where they appear and make them into a single space. So that the final product will (hopefully) just be a list of words with one space between each one that can then be passed to my indexing algorithm. If a few errant, not-words get in that is fine as long as the noise to signal ratio is low. So that isn't related to the two before expressions that I ored onto it.
The two before are to remove the apostrophes and hyphens from the beginning and/or end of words, but leave them in the middle. At least that is what I'm hoping for. (BTW, did you notice that I did remember to escape the hyphen? There is hope for me.) So things like this should go to:
'word' -> word
'a quote' -> a quote
key- or stop-words -> key or stop-words
don't -> don't
And the issue is there are three kinds of single quotes that might appear ' ‘ ’
So I'm not sure the \s+ should be bundled into the apostrophe|hyphen stuff. The double quotes aren't a problem because they never appear in the middle of words and should just all be removed. Of course if someone was crazy enough to use two single apostrophes to make what looked like a double quote... well, in any event I imagine the noise ratio of that to be pretty low as well.
The absolute basic metacharacters to escape as needed are:
http://php.net/manual/en/regexp.reference.meta.php
Escape those baddies unless you mean for them to "function".
My example of an URL parser was to show you a way to describe multiple protocols by analyzing the similarities and differences between them. For example I used HTTP(S)://, RTMP:// and FTP://. The (imperfect) RegExp I used:
/[fhr]t{1,2}m*ps*\:\/\/.*?\s+|[fhr]t{1,2}m*ps*\:\/\/.*$/igm
It's a 2-part expression and an example of a situation the "or" | is good at, but it's almost the same thing twice, so let's just look at the repeated part:
[fhr]t{1,2}m*ps*\:\/\/.*?\s+
The protocols start with one of these characters [fhr]. Next they have 1 or 2 t's (one in ftp and rtmp, two in http), so use a range t{1,2} meaning 1 or 2 t's. RTMP has an M (but the others don't) so add in a possible 'M' via m* meaning "zero or more m's". I could have tightened that with m? (or m{0,1}) meaning zero or one 'm'. Next they all have a 'p' at the end so no modifier, just a 'p'. HTTPS needs an 's' so the same thing again with "zero or more" s*. After that they all have :// so I just escape them all for sheer safety with \:\/\/. Lastly I'm matching "zero or more" of everything after the forward slash with .*? up until "one or more" whitespace characters occur. No spaces can be in URL encoded URLs so that's pretty safe.
The one thing I branched the same exact RegExp with is possibly ending with an URL and there's no space after it. To match that I'm using the metachar $ meaning "end of string". So the same RegExp, minus the space (\s+) but adding in that it's matched at the "end of the string" with $:
[fhr]t{1,2}m*ps*\:\/\/.*$
I add in the modifiers with //igm, so it's i=case insensitive, g=replace all occurrences, m=multiline (so $ also matches at the end of each line, not just at the end of the whole string). Together the entire string will be parsed case insensitively.
That will capture any complex URL with the caveat I mentioned like this string:
var str:String = "This is an url http://www.moo.com. This is another sentence.";
Because the period at the end of the sentence exists before the space (\s+) it's going to get eaten and your result would be:
This is an url This is another sentence.
This is how RegExps start growing crazily with all sorts of complex looking amendments for special situations.
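One possible amendment for the "eaten period" case, as a sketch only: spell out the protocols and require the match to end on a character that is neither whitespace nor common sentence punctuation. The pattern and variable names are illustrative assumptions, not the expression used above, and a bare "HTTPS://" with nothing after it is deliberately not matched.

```actionscript
var sentence:String = "This is an url http://www.moo.com. This is another sentence.";

// \S* grabs the rest of the URL; the final class forces the match to
// back off any trailing . , ; : ! ? or ) so that punctuation survives.
var urlRe:RegExp = /\b(?:https?|ftp|rtmp):\/\/\S*[^\s.,;:!?)]/gi;

trace(sentence.replace(urlRe, ""));
// trace: This is an url . This is another sentence.
```

The period now stays behind (with the space that preceded the URL), so a later whitespace or punctuation cleanup step would still be needed.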
The best thing to do is draw up a set of rules that you can agree on about the task. I assure you the language can handle 99% of the craziest rules you can think up.
To illustrate, your example of needing to remove apostrophe quotes around words but not in the middle of them. You know the basic rules you want to translate to RegExp:
1. Find an apostrophe
2. Validate it by requiring it to have a second apostrophe
3. Validate that between the apostrophes there is either nothing or something, or only spaces
By validating that there's a beginning and ending apostrophe with characters inside that aren't a space you remove the chance of removing contractions (don't, can't, won't, etc).
Using those rules it's just about finding the RegExp tools and doing them in the proper order above.
var str:String = "'do' ''good things, don't do bad 'things', never do '''''''''''''''''(nothin') ' ' ' ' ''";
var removeAposRe:RegExp = /'([^\W]*)\s*?'+/igm;
str = str.replace(removeAposRe, '$1');
trace("[" + str + "]");
// trace: [do good things, don't do bad things, never do (nothin') ]
The simplicity of that was using the "not" operator ^ followed by the escape sequence "any non-word" \W. So saying [^\W]* is saying "zero or more of anything that is a word (or double negatively, NOT a non-word character)". I capture the contents with (parens) and supply it as the replacement with $1, minus the apostrophes. I add (outside parens) a possible space \s* to remove empty strings like ' '. The final apostrophe is also required and I saddled a "one or more" on it with '+ so it would remove sequences like this: '''''''''''''''''''''''.
That solves all your requirements in their order.
Thanks for all the help. I keep thinking it is making more sense to me, but then I run into a problem. So I'm using the CDATA with group
var reCDATA:RegExp=/\<\!\[cdata\[(.*)?\]\]\>/igm;
I'm reading the part with parens to be start a group of any characters except \r \n \t zero or more times with a lazy match.
So my string is:
var xml:XML=<session>
<name><![CDATA[Some text with a line break in it
the second line here]]></name>
<otherNode><![CDATA[Some text with a line break in it]]></otherNode>
</session>;
Because there is a line break in there the CDATA tag isn't getting removed. I thought maybe the multiline switch was missing, but it doesn't make a difference. I also tried the s (dotall) switch. It seemed that was supposed to make . match a \r \n or \t, but that didn't work either.
So what is the trick to get it to match across a line break?
I might have flailed into it:
var reCDATA:RegExp=/\<\!\[cdata\[(.*?)\]\]\>/gis;
If I put the ? inside the parens it seems to deal with the greediness. The m switch doesn't seem to do anything, but the s now works.
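A short sketch of the two effects described here, greedy versus lazy matching and the s (dotall) switch, assuming the flags behave as documented for AS3 RegExp. The sample strings and variable names are made up.

```actionscript
var wrapped:String = "<![CDATA[one]]> and <![CDATA[two]]>";
var greedyRe:RegExp = /<!\[cdata\[(.*)\]\]>/gi;
var lazyRe:RegExp = /<!\[cdata\[(.*?)\]\]>/gi;

// Greedy: .* runs to the LAST ]]> so both sections become one match.
trace(wrapped.replace(greedyRe, "$1"));   // one]]> and <![CDATA[two
// Lazy: .*? stops at the FIRST ]]> it can, so each section is unwrapped.
trace(wrapped.replace(lazyRe, "$1"));     // one and two

var broken:String = "<![CDATA[line one\nline two]]>";
var noDotallRe:RegExp = /<!\[cdata\[(.*?)\]\]>/i;
var dotallRe:RegExp = /<!\[cdata\[(.*?)\]\]>/is;
var anyCharRe:RegExp = /<!\[cdata\[([\s\S]*?)\]\]>/i;

trace(noDotallRe.test(broken));   // false: . does not match \n
trace(dotallRe.test(broken));     // true: 's' lets . match \n
trace(anyCharRe.test(broken));    // true: [\s\S] matches any character
```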
Without dealing with whitespace I'd test against:
(New AS3 doc):
var d:XML = <session>
<name><![CDATA[Some text with a line break in it
the second line here]]></name>
<otherNode><![CDATA[Some text with a line break in it]]></otherNode>
</session>;
var nNewline:RegExp = /[\r\n]+/gm;
var reCDATA:RegExp = /\<\!\[cdata\[(.*?)\]\]\>/gis;
var str:String = d.toXMLString().replace(nNewline,' ');
str = str.replace(reCDATA,"$1");
trace(str);
Trace:
<session> <name>Some text with a line break in it the second line here</name> <otherNode>Some text with a line break in it</otherNode> </session>
Just adding \n in won't help much when parsing the data. The general idea is to cleanse data in an order: first replace every \r\n with a space for safety. XML doesn't require line breaks to be valid, so it will still parse if \r\n is removed or replaced. Note that this doesn't handle possible spaces or tabs before or after a line break (and the RegExp grows again...). After that I run the same CDATA-killing RegExp.
Sometimes you really do need a first general cleanup of overall data before you run the next data cleanser. Code just grows and grows as you find more circumstances the RegExp doesn't handle.

