Filter text.contents (removing special characters)

Report · Mar 21, 2011

Hi guys,

I want to extract a string from a bunch of text (here a selection for example). This text is xml tagged.

If I do selection[0].contents, it captures the text and all the special characters (XML tags, carriage returns). I can tell something is "wrong" because contents.length is greater than expected ("John Smith" is 10 characters, but contents.length returns 14). I am not really surprised, because I already knew about this behaviour.

So I tried to filter it to remove any content which is not an alphanumeric character but here is where I fail.

If I use a regular expression, contents.match(/[\w]+/g), it works almost perfectly. But if the contents have diacritics, this pattern fails to catch them.

I could add them to the pattern, but I would very likely miss a lot of them.

So my question is: how can I extract the pure text from the contents, making sure I keep all the diacritics (if any) but without carrying the special characters along?

TIA Loic

Report · Mar 21, 2011

Rather than trying to extract only the characters you want, how about removing the ones you do not? Something like this perhaps:

contents.replace(RegExp(/\W+/g), "")

Report · Mar 21, 2011

Thx Mayhem for your proposal.

However it fails if string has diacritics. Ex:

"Loïc".replace(RegExp(/\W+/g), "") //Loc

I need Loïc in output.

Thx anyway.

Loic
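
To make the failure concrete: `\w` only covers `[A-Za-z0-9_]`, so an accented letter counts as a non-word character for both the match and the replace approach. A minimal sketch in plain JavaScript; the sample string and the positions of the invisible `\uFEFF` markers are assumptions for illustration, not taken from a real document.

```javascript
// \uFEFF stands in for the invisible marker characters (assumed sample data).
var tagged = "\uFEFFLo\u00EFc\uFEFF Smith";

// Extracting with \w splits the name at the accented letter.
var words = tagged.match(/[\w]+/g);          // ["Lo", "c", "Smith"]

// Removing \W drops the accented letter, and the space along with it.
var stripped = tagged.replace(/\W+/g, "");   // "LocSmith"
```

Note that the replace version also removes the space between the words, which is probably not wanted either.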

Report · Mar 21, 2011

I feel like there's a better solution to this (I'll post again if I come up with one), but meanwhile, please note that writing this:

contents.replace(RegExp(/whatever/), "");

is really just a more verbose way of writing this:

contents.replace(/whatever/, "");

And, if you don't have the language spec in hand, you would think the first form converts /whatever/ to a string ("whatever") and then calls new RegExp("whatever") returning /whatever/ again. Actually ECMA-262/3rd sec. 15.10.3.1 says it can return the regexp unchanged, but why make it more confusing?

I think in general it's better to reserve the RegExp() constructor for making regular expressions out of strings...
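
The difference between the two call forms can be checked directly. This is a small sketch of what the spec describes, written in plain JavaScript; a given engine could deviate, so treat the comments as what the standard says rather than a guarantee about ExtendScript.

```javascript
var literal = /whatever/g;

// RegExp() called as a function, given a regexp and no flags,
// hands back the very same object.
var viaFunction = RegExp(literal);
var sameObject = (viaFunction === literal);

// new RegExp() builds a separate object that keeps the pattern and the flags.
var viaNew = new RegExp(literal);
var distinct = (viaNew !== literal);
var keepsFlags = (viaNew.source === literal.source && viaNew.global === true);
```

Either way the wrapper adds nothing the literal did not already have.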

Report · Mar 21, 2011

One reason to use a RegExp constructor is to deal with a performance issue in CS5.

Constructing a RegExp once and reusing the reference is much less expensive than using RegExp literals in CS5 (which must get constructed each time it's used)...

Harbs

Report · Mar 21, 2011

Harbs: The question of reuse is orthogonal to the question of literal versus constructor. You can save a reference either way:

var
  ref1 = /myRE/,
  ref2 = new RegExp("myRE"),
  ref3 = RegExp("myRE");
for ( ... ) { } // tight loop here

But if indeed either one is expensive, you should not be doing BOTH! Using "RegExp(/myRE/)" creates the literal first, and then passes it through the constructor (well, in this case, technically through "The RegExp Constructor Called as a Function," see sec. 15.10.3 of the spec).

My point is simple: don't do both. Pick one.

Report · Mar 21, 2011

Yes. I understood your point. (use RegExp("") rather than RegExp(//))

I was making another point.

I was pretty sure that using the RegExp constructor (RegExp("abc")) is very different than a literal (/abc/) in terms of performance in CS5.

I just ran some tests to double-check my memory, and it turns out I did not remember very well.

Here are three tests:

Test #1:

var regex = /abc/;
var string = "abcd";
for(i=0;i<100000;i++){
    string.match(regex);
}

took about 4.745 sec.

Test #2:

var regex = RegExp("abc");
var string = "abcd";
for(i=0;i<100000;i++){
    string.match(regex);
}

took about 4.708 sec.

Test #3:

var string = "abcd";
for(i=0;i<100000;i++){
    string.match(/abc/);
}

took about 7.509 sec.

So the difference between a literal and a RegExp constructor is not the important factor, it's creating the reference and reusing it that's important...

Sorry about the confusion...

Harbs
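
In practice the takeaway is to build the regular expression once and reuse the reference inside the loop. A minimal sketch, assuming a hypothetical helper named `countMatching` (not part of any InDesign API) that counts how many strings in a list match a pattern:

```javascript
// Hypothetical helper: counts the strings in `list` that match `pattern`.
function countMatching(list, pattern) {
    var re = new RegExp(pattern);   // built once, outside the loop
    var i, hits = 0;
    for (i = 0; i < list.length; i++) {
        // The same regexp object is reused on every iteration.
        if (re.test(list[i])) hits++;
    }
    return hits;
}

var hits = countMatching(["abcd", "xyz", "zabc"], "abc");   // 2
```

A literal assigned to a variable before the loop would do the same job; the point is only that nothing gets constructed inside the loop body.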

Report · Mar 21, 2011

We think alike! Thanks! I just got done benchmarking, but I'll post it anyhow.

Benchmark	Time
Literal	10.811 sec.
Constructor	11.469 sec.
Both	18.322 sec.
Literal*	31.251 sec.
Constructor*	35.156 sec.
Both*	43.507 sec.

Code follows. *-variants use eval with different regexps to defeat any potential optimizations (I don't think there is much optimization though).

We are, of course, benchmarking different things. You're benchmarking tests; I'm benchmarking instantiation of the regexp. It's funny that we get different results though. For me, the regexp literal is always faster to create. For you, the faux constructor is faster to use. That makes no sense to me; they should be exactly the same.

function repeat(times, it) {
    var i;     
    for (i=0; i< times; i++) it(i)
}

function timeit(name, times, it) {
    var t0,t1;
    t0 = new Date().valueOf();
    repeat(times, it);
    t1 = new Date().valueOf();
    $.writeln(name+": "+(t1-t0)/1000+" sec.");
    return t1-t0;
}

var count=5e5;
timeit("Literal", count, function() { var re = /literal/; });
timeit("Constructor", count, function() { var re = new RegExp("literal"); } );
timeit("Both", count, function() { var re = new RegExp(/literal/); });

timeit("Literal*", count, function(n) { eval('var re = /literal'+n+'/') } );
timeit("Constructor*", count, function(n) { eval('var re = new RegExp("literal'+n+'")') } );
timeit("Both*", count, function(n) { eval('var re = new RegExp(/literal'+n+'/)') } );
0;

Report · Mar 21, 2011

Meh. I don't care much what others think of my coding style. Obviously I believe my way is easier to read, since there is no syntax coloring for /whatever/ regular expressions but is for the RegExp keyword. I can guarantee it does not get cast to a string and then back, as regular expressions created from strings cannot set modifiers and the global modifier in the example above does not get lost. Unless Adobe's engineers are doing something dead stupid (which admittedly wouldn't be the first time) there cannot possibly be a noticeable performance penalty.

Report · Mar 21, 2011

> I can guarantee it does not get cast to a string and then back,

> as regular expressions created from strings cannot set modifiers

> and the global modifier in the example above does not get lost.

Err...well, as I said, it (the faux constructor -- "RegExp()" called as a function, without the new) does not convert to a string. But if you call the actual constructor it does in fact do so. But it extracts the flags from the regexp and reuses them.

> Unless Adobe's engineers are doing something dead stupid

> (which admittedly wouldn't be the first time) there cannot

> possibly be a noticeable performance penalty.

Take a look at my numbers. It's not 2x as slow but it is 1.4x as slow, with the real constructor (new).

Rerunning with the "faux" constructor (no new), I get 12.371 sec for the fast case (without eval), and 36.019 sec for the eval case.

And for the "Both" case with the faux constructor, 11.537 fast and 37.404 with eval.

Other numbers all within 100ms of my original benchmark, so I won't repeat them here.

But yeah, with the faux constructor it's not appreciably slower though it is slower by epsilon (7%).

Report · Mar 21, 2011

Mayhem SWE wrote:

Obviously I believe my way is easier to read, since there is no syntax coloring for /whatever/ regular expressions but is for the RegExp keyword.

I use BBEdit which does have syntax highlighting for RegExp literals...

Mayhem SWE wrote:

and the global modifier in the example above does not get lost.

I'm not sure what you mean.

var regex = RegExp("abc","g");
"abcd".replace(regex,"bca");

and

"abcd".replace(/abc/g,"bca");

and

var regex = RegExp(/abc/g);
"abcd".replace(regex,"bca");

are all functionally equivalent.

Harbs

Report · Mar 21, 2011

Not exactly a fair test, since you don't have more than one "abc" in your test string. You'd need "abcdabcd" to test this.

But Mayhem SWE is arguing that because the modifier does not get lost (i.e. the /g works fine), the RegExp() constructor cannot be converting the pattern back to a string, since a string has no way to represent a /g without being two strings. But that argument isn't really valid, because of the complexity of what actually goes on. I was trying to avoid quoting the spec, but here we go:

15.10.4.1 new RegExp(pattern, flags)
If pattern is an object R whose [[Class]] property is "RegExp" and
flags is undefined, then let P be the pattern used to construct R
and let F be the flags used to construct R. If pattern is an 
object R whose [[Class]] property is "RegExp" and flags is not 
undefined, then throw a TypeError exception. Otherwise, let P be 
the empty string if pattern is undefined and ToString(pattern) 
otherwise, and let F be the empty string if flags is undefined 
and ToString(flags) otherwise.

It then goes on to explain what happens to F and P to construct the RegExp.
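
For anyone who wants to check the global flag with a string that actually contains two occurrences, here is a short sketch in plain JavaScript. The variable names are only for illustration.

```javascript
var input = "abcdabcd";

// All four forms carry the g flag, so both occurrences are replaced.
var a = input.replace(RegExp("abc", "g"), "bca");    // "bcadbcad"
var b = input.replace(/abc/g, "bca");                // "bcadbcad"
var c = input.replace(RegExp(/abc/g), "bca");        // "bcadbcad"
var d = input.replace(new RegExp(/abc/g), "bca");    // "bcadbcad"

// Without the g flag only the first occurrence changes.
var e = input.replace(/abc/, "bca");                 // "bcadabcd"
```

So the flag survives the trip through both the function call and the real constructor, which matches the quoted section: the flags F are taken from the existing regexp.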

Report · Mar 21, 2011

John Hawkinson wrote:
Not exactly a fair test, since you don't have more than one "abc" in your test string. You'd need "abcdabcd" to test this.

It actually was not a test at all...

I did not feel a need to test what I was writing because I know it to be true. I was simply requesting an explanation -- which you provided. Thanks!

Harbs

Report · Mar 21, 2011

var regex = RegExp("abc","g");

Hmm, interesting. The CS3 documentation browser merely says RegExp (pattern): RegExp, nothing about setting modifiers separately...?

Report · Mar 21, 2011

Yeah, the Adobe documentation on standard JavaScript functions is...incomplete. I'd recommend the MDC documentation. Definitely not w3schools, though, which pops up at the top of google hits (see http://w3fools.com/ for some reasons why not).

Report · Mar 21, 2011

Ahh, okay... Are the characters that need to remain all within Latin-1 (code points up to \xFF)? Something like this to filter out unwanted character ranges might be what you need:

replace(RegExp(/[^\x20-\x7E\xA0-\xFF]/g), '')

(I've edited this expression a couple of times, so if you already tried it, copy from above and try again!)

Report · Mar 21, 2011

Hi Mayhem,

That looks great. Loïc comes nice and length is ok. I think you gave me the perfect pattern.

Thx a lot Loic
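
One caveat worth knowing before relying on this pattern: it keeps printable ASCII and the range \xA0-\xFF only, so letters outside that range are removed too. A minimal sketch in plain JavaScript; the function name `keepLatin1` and the sample strings are made up for illustration.

```javascript
// Keeps printable ASCII and U+00A0..U+00FF, removes everything else.
function keepLatin1(text) {
    return text.replace(/[^\x20-\x7E\xA0-\xFF]/g, "");
}

// The invisible markers go away and the i-diaeresis (U+00EF) stays.
var ok = keepLatin1("\uFEFFLo\u00EFc\uFEFF");    // "Loïc"

// The oe ligature (U+0153), the tab and the carriage return are all dropped.
var lossy = keepLatin1("c\u0153ur\tfin\r");      // "curfin"
```

That is fine for names like Loïc, but text with characters beyond \xFF (for example œ, or Central European letters) would lose them.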

Report · Mar 21, 2011

OK, back to the original question.

Loic, what am I doing differently?

Report · Mar 21, 2011

Hi John,

As far as I can tell (or understand), you are facing the extra characters issue (XML tags). This is all about getting the pure text without the extra content.

Report · Mar 21, 2011

But look at my example? I have XML tags but I have no extra characters! Can you show me an example you have that gets extra characters?

There has got to be a better way to do this. But hopefully one that does not involve checking each character individually (performance). Or exporting stories to external files (again, performance). What's the size of the text you need to do this on and the rough number of times you do it?

Report · Mar 21, 2011

But look at my example? I have XML tags but I have no extra characters!

John, the characters at #0, 5, 7, and 13 cannot be displayed, and thus show 'nothing'. If you display the charCodes, you'll see it's 16#FEFF for those invisible characters.

These semi-invisible codes are a pain, because there are lots of situations where they pop up and cause mischief; for example, in text exports (not visible in a text editor, but the database that imported it choked on them), or when you create a bookmark from them (in Acrobat you see weird "unknown character" blocks).

Report · Mar 21, 2011

*sigh*. You know, I was looking and expecting to see the SpecialCharacter enumerators, but that's not what this is about.

Sorry for being sloppy.

I inserted a current-page-number between the 'Sm', and I get this:

s=app.selection[0]; sc=s.contents;
for (i=0; i<sc.length; i++) print(i+"<"+sc.charCodeAt(i)+"> '"+sc[i]+"'  "+s.characters[i].contents);
0<65279> ''  
1<74> 'J'  J
2<111> 'o'  o
3<104> 'h'  h
4<110> 'n'  n
5<65279> ''  
6<32> ' '   
7<65279> ''  
8<83> 'S'  S
9<24> ' '  1396797550
10<109> 'm'  m
11<105> 'i'  i
12<116> 't'  t
13<104> 'h'  h
14<65279> ''

But I guess these XML things are not the same as the SpecialCharacters enumerators.

It's certainly easy to filter out the 65279 characters, but that's not really sufficient. And I had thought that using anything other than .characters was supposed to save you from these things... But apparently not...

*confused again*

(And then the Jive forum just ate my post. grr.)

Report · Mar 21, 2011

I remember the first time I faced these "transparent" characters. I was comparing <tag>foo</tag>.contents.length to foo.length and it returned false. It didn't make any sense that foo was different from foo until I checked the lengths and got 5 for one and 3 for the other. That is when I realized there were extra characters.

I remember I made a topic to warn people, because it's really disturbing when you don't know.
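
The 5-versus-3 surprise is easy to reproduce with a plain string. This sketch assumes the tagged contents carry one invisible \uFEFF at each end, which is consistent with the character dump earlier in the thread but is still an assumption about any particular document.

```javascript
var tagged = "\uFEFFfoo\uFEFF";   // assumed shape of <tag>foo</tag>.contents
var plain = "foo";

var equal = (tagged === plain);                  // false
var lengths = [tagged.length, plain.length];     // [5, 3]

// Removing the markers makes the two strings compare equal again.
var cleaned = tagged.replace(/\uFEFF/g, "");
var equalAfter = (cleaned === plain);            // true
```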

Report · Apr 18, 2011

@John,
thank you for that line of code.

I just experimented a bit with that (InDesign CS4 6.0.6 German Version).
In the case of footnotes I get strange results.

If my selection is a single footnote, $.writeln returns absolutely nothing to the JavaScript console.

If my selection is a footnote plus an arbitrary character (could be a second footnote), JavaScript console is showing both characters.
In the case of two footnotes:

0 <4>' ' 1399221837
1 <4>' ' 1399221837

In the case of a footnote, it seems there must always be a second character to trigger a result.

Uwe

Report · Mar 21, 2011

All GREP related fun aside, Loïc, all you need to remove is some very special characters.

TextChar.h lists the following:

0x0003 BreakRunInStyle

0x0004 FootnoteMarker

0x0007 IndentToHere

0x0008 RightAlignTab (you might want to convert those to a regular tab, I guess)

0x0016 Table (when it's 'seen' as an inline object)

0x0017 "TableContinued" -- heyheyhey, we have something new here! Wonder when & how this one is gonna pop up.

0x0018 PageNumber (a.k.a. "AutoText")

0x0019 SectionName

0x001a NonRomanSpecialGlyph (you should probably check how this gets used)

(Then a long list of 'normal' character name definitions. This one comment is fun

kTextChar_Ellipse                    = 0x2026;          // Actually, it's "ellipsis"

The original programmers weren't really typesetters, then!)

The following are *hugely* important because you must do some special parsing if you encounter them! They are for encoding 32-bit Unicode values:

HighSurrogateStart = 0xD800; // includes private use 0xDB80 - 0xDBFF

HighSurrogateEnd = 0xDBFF;

LowSurrogateStart = 0xDC00;

LowSurrogateEnd = 0xDFFF;
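
To show what "special parsing" means here: a character outside the 16-bit range is stored as two code units, a high surrogate followed by a low one, and any filter working unit by unit must treat the pair as one character. A minimal sketch in plain JavaScript; `toCodePoints` is a hypothetical helper name.

```javascript
// Returns an array of full code points, joining surrogate pairs.
function toCodePoints(text) {
    var out = [], i, hi, lo;
    for (i = 0; i < text.length; i++) {
        hi = text.charCodeAt(i);
        if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < text.length) {
            lo = text.charCodeAt(i + 1);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the two halves into one code point.
                out.push((hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000);
                i++;            // skip the low half we just consumed
                continue;
            }
        }
        out.push(hi);           // ordinary unit, or an unpaired surrogate
    }
    return out;
}

// "a", U+1D400 (stored as D835 DC00), "b"
var points = toCodePoints("a\uD835\uDC00b");    // [97, 119808, 98]
```

A filter that removes everything above \xFF would silently delete both halves; one that removes only some ranges could leave half a pair behind, which is worse.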

This one may pop up for anchored objects (I think):

ReplacementCharacter = 0xFFFD; // an incoming character whose value is unrepresentable in Unicode

And this one dups for your XML marker codes:

ByteOrderingCharacter = 0xFFFE;

-- I think I got'em all.
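
Putting the list to work, a filter only has to remove those few codes and can leave every real letter alone, diacritics included. A minimal sketch in plain JavaScript, with some loud assumptions: `stripSpecials` is a made-up name, the marker code U+FEFF comes from the character dump earlier in the thread (65279) rather than from the list, and whether U+FFFD and U+FFFE should go is a judgment call.

```javascript
// Removes the InDesign-specific control codes listed above from a string.
function stripSpecials(text) {
    return text
        // RightAlignTab becomes a regular tab.
        .replace(/\u0008/g, "\t")
        // 0x03, 0x04, 0x07, 0x16..0x1A, plus U+FEFF (seen in the dump),
        // U+FFFD and U+FFFE (assumed unwanted as well).
        .replace(/[\u0003\u0004\u0007\u0016-\u001A\uFEFF\uFFFD\uFFFE]/g, "");
}

// Same shape as the "John Smith" dump: markers plus a page number code (0x18).
var clean = stripSpecials("\uFEFFJohn\uFEFF \uFEFFS\u0018mith\uFEFF");   // "John Smith"
```

Because nothing outside the listed codes is touched, surrogate pairs and characters beyond \xFF survive. Carriage returns are not in the list, so they would need their own replace if they should go too.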
