GREP search problems [CS4/JS]

Report · May 26, 2010

I have two problems with doing grep searches using a script written in Javascript for InDesign CS4. The first one is more serious.

Problem #1:

I am using the following search pattern, to find e-mail addresses:

[[:word:]\-\.]+@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

This pattern works perfectly to find all e-mail addresses when I use the grep feature of the Find/Change dialog box in the user interface, but fails when I use it as an argument of the findGrep() or changeGrep() functions in a script. Interestingly, the problem is the part before the "@" symbol. If I replace that part with literal text, like so:

richard@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

then the script will correctly match any e-mail address where the part before the "@" symbol is the word "richard".

Can anyone shed light on this problem?

Problem #2:

This is more general and less serious, but it is related to problem #1, because it's about things that work in the user interface but not in a script.

Strangely, a lot of the standard regular expression shorthand wildcards for character classes (like \w, \d, etc.) do not work when I use them in scripts, but their POSIX equivalents ([:word:], [:digit:], etc.) work fine. Either terminology -- \w or [:word:] -- works fine in the Find/Change dialog box of the user interface.

So this is not a serious problem, but I vastly prefer the shorter terminology. I find it easier to read, and to write.

Any ideas?

Thanks.

Report · May 26, 2010

One solution for both of your problems: use \\ instead of \ in JavaScript.

The code for Find GREP in InDesign is

[[:word:]\-\.]+@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

and the code for JavaScript find GREP is

[[:word:]\\-\\.]+@([[:word:]\\-]+\\.)+[[:word:]\\-]{2,4}

Shonky
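
To see why doubling the backslashes helps, it can be useful to look at the characters a JavaScript string literal actually holds before findGrep() ever receives them. The sketch below is illustrative only: the variable and function names are made up for this example, and it inspects plain strings rather than calling InDesign.

```javascript
// The pattern from the Find/Change dialog, spelled as a JavaScript
// string literal with every backslash doubled.
var scriptPattern = "[[:word:]\\-\\.]+@([[:word:]\\-]+\\.)+[[:word:]\\-]{2,4}";

// The same text with single backslashes: JavaScript consumes each "\"
// while parsing the literal, so none of them reach the GREP engine.
var naivePattern = "[[:word:]\-\.]+@([[:word:]\-]+\.)+[[:word:]\-]{2,4}";

// Count the backslash characters that survive in a string.
function countBackslashes(s) {
    return s.split("\\").length - 1;
}

var survivedInScriptPattern = countBackslashes(scriptPattern); // 5
var survivedInNaivePattern = countBackslashes(naivePattern);   // 0
```

In a script, alert(naivePattern) would show the pattern with every backslash gone, which is a quick way to check what the GREP engine is really being handed.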

Report · May 26, 2010

Thanks, that solved the problem.

Not sure why the pattern

richard@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

was working in the first place. I thought maybe it was that some of those backslashes after the @ symbol are unnecessary anyway (I'm still getting a handle on regular expressions), but I took them out and it broke the script, so that can't be it.

In any case, thank you, you have fixed my problem.

Report · May 26, 2010

// Here's a literal RegExp in JS:
var RE1 = /[a\-c]/;
alert( RE1.test("-") ); // TRUE
alert( RE1.test("b") ); // FALSE
// ...RE1 means a|-|c (as expected)


// Now, using explicitly the RegExp class,
// we need to pass a String :
var RE2 = RegExp("[a\-c]");
alert( RE2.test("-") ); // FALSE
alert( RE2.test("b") ); // TRUE
// ...RE2 actually means [a-c] !

// Why?

// Because "\" is a metamarker in JS literal strings.
// "\" is supposed to escape a few special chars and
// is ignored in other cases:
alert( "\a\.\-" == "a.-" ); // TRUE !

// So, to get a "\" in a string, you need
// to use "\\"
var RE3 = RegExp("[a\\-c]");
alert( RE3.test("-") ); // TRUE
alert( RE3.test("b") ); // FALSE

// Finally RE3 works like RE1.

@+

Marc
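
Following Marc's point, one way to avoid doubling backslashes by hand is to let a small helper produce the string literal. This is a minimal sketch: toStringLiteral is a hypothetical name, not part of InDesign or ExtendScript, and it handles only backslashes and double quotes.

```javascript
// Turn raw pattern text into JavaScript string-literal source code by
// doubling every backslash and escaping double quotes.
function toStringLiteral(pattern) {
    return '"' + pattern.replace(/\\/g, "\\\\").replace(/"/g, '\\"') + '"';
}

// The raw pattern here is the six characters [a\-c] .
var literalSource = toStringLiteral("[a\\-c]");
// literalSource now holds the quoted text with the backslash doubled,
// ready to paste into a script as a RegExp() or findWhat argument.
```

The raw text could come from anywhere a pattern is stored as plain characters, for example a pattern copied out of the Find/Change dialog into a text file.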

Report · May 26, 2010

/* One more example */

var str = "a\\t"; // str contains 3 chars: a\t

// How to grab str in a RegExp?

// 1) In a literal RegExp, we need to
// escape the backslash (because of \t):
alert( /a\\t/.test(str) ); // TRUE

// 2) Using explicitly the RegExp class,
// we need to express the pattern a\\t
// in a literal string, so:
alert( RegExp("a\\\\t").test(str) ); // TRUE

// 4 backslashes to target 1 backslash!

@

Marc
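
When the text to match is a literal string rather than a pattern, counting backslashes can be avoided by escaping the text programmatically. A minimal sketch, assuming a standard JavaScript RegExp engine; escapeForRegExp is a hypothetical helper name, not a built-in.

```javascript
// Put a backslash in front of every character that is special in a
// regular expression, so the result matches the input text literally.
function escapeForRegExp(text) {
    return text.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

var str = "a\\t";                                // 3 chars: a, backslash, t
var source = escapeForRegExp(str);               // 4 chars: a, 2 backslashes, t
var matchesLiteral = RegExp(source).test(str);   // true
var matchesTab = RegExp(source).test("a\t");     // false: that is a real tab
```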

Report · Jun 01, 2010

Thank you Marc, that's very helpful. I like to think I understand this stuff theoretically, but I always have problems writing my own RegExps in practice. It's good to see examples.

I also have another problem, and I wonder if anyone can help me with it.

I am trying to match markdown web links. Markdown is a markup language that we use in our office because our editors (I work at a newspaper) find it easy to read and write. The format is

[text](link)

which corresponds to the html

<a href="link">text</a>

So I wrote the following:

{findWhat: "\\[[^][]+]\\([^)(]+\\)"}

and it strangely worked on my copy of CS4 at home, but here at work it doesn't seem to be working. It works up until the first left-round-bracket -- \\[[^][]+]\\( -- and then the rest of it doesn't match.

Any ideas?

Thanks.

Report · Jun 01, 2010

Oops. I posted too soon. There is no difference between my copies of CS4 at work and home. That would have been a little strange. What happened is that I was importing two different Word documents, and one worked and one didn't. My problem is still not solved, but I will examine the documents more closely to see if I can figure out what the problem with the script is.

Report · Jun 01, 2010

I figured out the difference between the files. The one that failed to match was full of automatically generated Microsoft Word hyperlinks. The importing script I wrote is supposed to remove them completely before it does anything else to the text, and it appears to do that, but somehow there is something left over from the Word hyperlink which messes up regexp matches.

So the Word file will contain something like this:

Harrington recently published [a paper](http://www.harringtonspaper.com) in the journal of blahblahblah.

and somewhere along the line, Word automatically generates a hyperlink, starting at "http". Somehow, that section of text ends up being impossible to match, even if I just try to match the literal string "http". I'm not sure why, since I strip all the hyperlinks from the file when it comes into InDesign.

Unfortunately, it's unrealistic to try to get the writers to stop using Word.

Any ideas about what might be left over in the text from the Word hyperlink, that I cannot see?

Report · Jun 01, 2010

I figured it out. When I delete all the hyperlinks, I also need to delete all the hyperlinkTextSources, otherwise the regexp engine won't be able to smoothly find a match across a block of text that includes a hyperlinkTextSource.
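
The cleanup described above could be sketched as follows. This assumes the InDesign scripting DOM exposes hyperlinks and hyperlinkTextSources as collections on the document, each item having a remove() method, as the post implies. The names removeAll and removeWordHyperlinks are hypothetical, and the function takes the document as a parameter so it can be exercised without InDesign.

```javascript
// Remove every item of a collection, walking backwards so that removals
// do not shift the indexes still to be visited.
function removeAll(collection) {
    var removed = 0;
    for (var i = collection.length - 1; i >= 0; i--) {
        collection[i].remove();
        removed++;
    }
    return removed;
}

// Strip hyperlinks first, then the text sources they leave behind.
function removeWordHyperlinks(doc) {
    return {
        hyperlinks: removeAll(doc.hyperlinks),
        textSources: removeAll(doc.hyperlinkTextSources)
    };
}

// In InDesign this might be called as:
// removeWordHyperlinks(app.activeDocument);
```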

Report · Jun 01, 2010

Hi richardh6,

Well, I don't understand this part of your pattern: ...[^][]+...

Do you mean: ...[^]]+... ?

In a find/change approach you should also use capturing parentheses, for example:

// ID CS4
// Apply a Markdown-to-HTML conversion
// on any [...](...) pattern of the documents:
app.findGrepPreferences.findWhat = '\\[([^]]+)]\\(([^)]+)\\)';
app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
app.changeGrep();

And in a JS process you could use something like this:

var RE_MARKDOWN = /\[([^]]+)]\(([^)]+)\)/;
// Or, if you prefer:
// var RE_MARKDOWN = RegExp("\\[([^]]+)]\\(([^)]+)\\)");

var markdown2html = function(s)
     {
     var m = s.match(RE_MARKDOWN);
     if( !m ) return false;
     return '<a href="%2">%1</a>'.
               replace('%1',m[1]).
               replace('%2',m[2]);
     }

// sample code
//----------------------------
var mk = "[a paper](http://www.harringtonspaper.com)";

alert( markdown2html(mk) );
// output: <a href="http://www.harringtonspaper.com">a paper</a>

Regards,

Marc
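
Marc's function handles one link per string. For a string holding several links, a global replace is one option. The sketch below is an assumption-laden variant, not Marc's code: markdownLinksToHtml is a hypothetical name, and the class is written [^\]] rather than [^]] because in a standard JavaScript engine [^] is a complete class on its own.

```javascript
// Match [text](link) anywhere in a string; the g flag makes replace()
// convert every occurrence instead of only the first.
var RE_MARKDOWN_ALL = /\[([^\]]+)\]\(([^)]+)\)/g;

function markdownLinksToHtml(s) {
    // $1 is the link text, $2 is the URL.
    return s.replace(RE_MARKDOWN_ALL, '<a href="$2">$1</a>');
}

var sample = "Harrington recently published " +
    "[a paper](http://www.harringtonspaper.com) in the journal.";
var converted = markdownLinksToHtml(sample);
```

A string without any link comes back unchanged, so the function can be applied to arbitrary text.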

Report · Jun 01, 2010

Do you suppose it would be less confusing to use the Javascript RE even in the Find/Change case? E.g.:

var RE_MARKDOWN = /\[([^]]+)]\(([^)]+)\)/;
app.findGrepPreferences.findWhat = RE_MARKDOWN.toString().slice(1,-1);
app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
app.changeGrep();
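
A related option is the standard source property of a RegExp object, which gives the pattern text directly. The sketch below compares the two on plain strings; it assumes the scripting engine exposes source the way standard JavaScript does, and it writes the class as [^\]] so that it runs in a standard engine.

```javascript
var RE_MARKDOWN = /\[([^\]]+)\]\(([^)]+)\)/;

// Both expressions give the pattern text without the delimiting slashes.
var viaSlice = RE_MARKDOWN.toString().slice(1, -1);
var viaSource = RE_MARKDOWN.source;

// With a flag present, slice(1, -1) drops the flag but keeps the closing
// slash, while source is unaffected.
var RE_GLOBAL = /\[([^\]]+)\]\(([^)]+)\)/g;
var sliceWithFlag = RE_GLOBAL.toString().slice(1, -1);  // ends with "/"
var sourceWithFlag = RE_GLOBAL.source;                  // same as viaSource
```

So the slice trick is fine for flagless literals like the one above, and source is the safer choice if a flag might ever be added.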

Report · Jun 02, 2010

As you prefer. It should work too.

Report · Jun 02, 2010

Oh, sure. The question was whether it was better style or clearer to read...

Report · Jun 07, 2010

> The question was whether it was better style or clearer to read...

It's certainly easier to read. Nice trick, John.

Peter

Report · Jun 04, 2010

Thank you Marc and John for your insights.

Marc,

> Well, I don't understand this part of your pattern: ...[^][]+...
> Do you mean: ...[^]]+... ?

No, I actually did mean: ...[^][]+... and not ...[^]]+...

[^][]+ means one or more of the character class which includes any character except [ and ]. They are listed in reverse order (i.e. ][) in the regexp pattern because only the left square bracket is allowed to come second in that order, if you don't want to use backslashes. I'm not sure why I did it that way -- I guess I thought if any text came in with either bracket it should be rejected, but I have changed it to your suggestion, because there's no reason to exclude the opening brackets.
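
For comparison, a standard JavaScript RegExp does not follow the "closing bracket first" rule, so the same class has to be written with a backslash there. A small sketch on plain strings (the variable names are made up for this example):

```javascript
// In a standard JavaScript RegExp, "]" inside a class needs a backslash,
// so the class "anything except [ and ]" is spelled [^\][].
var NO_BRACKETS = /^[^\][]+$/;

var plainOk = NO_BRACKETS.test("a paper");         // true
var openRejected = NO_BRACKETS.test("a [paper");   // false
var closeRejected = NO_BRACKETS.test("a paper]");  // false
```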

I took your suggestions and I wrote a working script which converts markdown to InDesign hyperlinks, which is a bit more involved than just converting it to html, because part of the original match stays in the text object and part of it gets assigned to a new hyperlink object as a string.

I have another question now (it's not urgent, if you're too busy), and for the sake of brevity I am using your markdown to html example. The question is, how exactly can you deal with escaped characters? Say, for instance, you had a sentence in an article where you're quoting someone and you say 'The next day was [her] worst day ever,' and you want '[her] worst day ever' to be a link. Or say you have a URL that has round brackets in it, which quite a few of them do on Wikipedia for some weird reason.

I thought you might be able to do this using lookbehind, but Javascript's regexp flavor does not appear to support lookbehind, so I wrote the following code, which hides away all the escaped characters during the major processing, and then restores them. It works, but is there a better way? It doesn't seem very elegant, although it's certainly easy and fairly foolproof, which is probably good:

function markdown2html (myObject) {
var HIDE_ESCAPED_CHARS = [
      {before: "\\\\\\\\", after: "%_BACKSLASH"},
      {before: "\\\\\\]", after: "%_RIGHT_SQUARE_BRACKET"},
      {before: "\\\\\\)", after: "%_RIGHT_ROUND_BRACKET"},
      {before: "\\\\\\*", after: "%_ASTERISK"},
      // Support for other markdown codes
      // will be added as needed, but in the
      // meantime, delete all single backslashes:
      {before: "\\\\", after: ""} ];

var RESTORE_ESCAPED_CHARS = [
      {before: "%_BACKSLASH", after: "\\"},
      {before: "%_RIGHT_SQUARE_BRACKET", after: "]"},
      {before: "%_RIGHT_ROUND_BRACKET", after: ")"},
      {before: "%_ASTERISK", after: "*"} ];

app.changeGrepPreferences = NothingEnum.nothing;
app.findGrepPreferences = NothingEnum.nothing;

multiChangeGrep (myObject, HIDE_ESCAPED_CHARS);
// convert hyperlinks
app.findGrepPreferences.findWhat = "\\[([^]]+)]\\(([^)]+)\\)";
app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
myObject.changeGrep();
// At this point you'd start processing the rest of the
// markdown according to your needs; like asterisks to
// bold and italic, etc.
// ...
// ...
// ...
multiChangeGrep (myObject, RESTORE_ESCAPED_CHARS);

app.changeGrepPreferences = NothingEnum.nothing;
app.findGrepPreferences = NothingEnum.nothing;
}

function multiChangeGrep (obj, findChangeArray) {
var findChangePair;
for (var i=0; i<findChangeArray.length; i++) {
    findChangePair = findChangeArray[i];
    app.findGrepPreferences.findWhat = findChangePair.before;
    app.changeGrepPreferences.changeTo = findChangePair.after;
    obj.changeGrep();
}
}


// -----------------------------------

// Sample use of the function markdown2html
// (assumes you have a document open)

tf = app.activeDocument.pages[0].textFrames.add();
tf.geometricBounds = ["2cm", "2cm", "12cm", "18cm"];

st = tf.parentStory;
st.contents =
    "Yesterday there were " +
    "[three dogs](http://www.dogsrule.com) in my yard. " +
    "[Today \\[the senator\\] said](http://www.dogtimes.com/story/34) " +
    "that all the dogs have found " +
    "[good homes](http:www.homes.com/dogs_\\(and_cats\\)). " +
    "Who knows what tomorrow may bring?\r\r" +
    "And now for good measure we will have a sentence " +
    "with two backslashes in it, one inside a [link\\\\,](http://www.link.com) " +
    "and one\\\\ outside.";

alert (st.contents);
markdown2html (st);
alert (st.contents);

Report · Jun 05, 2010

I have found a much shorter solution, that requires a lot less code and uses regular expressions entirely to accomplish the exact same task that the script in my last post does, which is to process markdown hyperlinks into html hyperlinks, while taking into account certain escaped characters. I have also started using John Hawkinson's method of making the regular expressions a little easier to read, since there's now a serious proliferation of backslashes:

function markdown2html (myObject) {
  var myRegexp;
    
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
  
  // convert hyperlinks
  myRegexp = /\[((?:\\\\|\\\]|[^]])+)\]\(((?:\\\\|\\\)|[^)])+)\)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
  myObject.changeGrep(); 
  
  // remove stray backslashes
  myRegexp = /\\(.)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '$1';
  myObject.changeGrep(); 
    
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
}
                    
                    
// -----------------------------------



// Sample use of the function markdown2html 
// (assumes you have a document open)

var tf = app.activeDocument.pages[0].textFrames.add();
tf.geometricBounds = ["2cm", "2cm", "12cm", "18cm"];

var st = tf.parentStory;
st.contents = 
    "Yesterday there were " +
    "[three dogs](http://www.dogsrule.com) " +
    "in my yard.  " +
    "[Today \\[the senator\\] said]" +
    "(http://www.dogtimes.com/story/34) " +
    "that all the dogs have found " +
    "[good homes]" +
    "(http:www.homes.com/dogs_\\(and_cats\\)).  " +
    "Who knows what tomorrow may bring?\r\r" +
    "And now for the finale: a backslash in the " +
    "text, right before the end of the " +
    "[link text\\\\](http://www.onemorelink.com).";
              
alert (st.contents);
markdown2html (st);
alert (st.contents);
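
As a cross-check outside InDesign, the same idea can be tried on plain strings. The sketch below is not the pattern from the post: it swaps in the common "backslash plus any character, or any character that is neither a backslash nor the closing delimiter" idiom, which keeps a lone backslash from being matched on its own. It assumes a standard JavaScript RegExp engine, and convertEscapedLinks is a hypothetical name.

```javascript
// Link text: escaped pairs (\x) or anything except "]" and "\".
// URL:       escaped pairs (\x) or anything except ")" and "\".
var RE_LINK = /\[((?:\\.|[^\]\\])+)\]\(((?:\\.|[^)\\])+)\)/g;

function convertEscapedLinks(s) {
    return s
        .replace(RE_LINK, '<a href="$2">$1</a>')  // convert the links
        .replace(/\\(.)/g, "$1");                 // then drop the escapes
}

var escapedSample =
    "[Today \\[the senator\\] said](http://www.dogtimes.com/story/34)";
var escapedResult = convertEscapedLinks(escapedSample);
```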

But strangely, I think the version in my previous post (with some slight modifications) might be more elegant and reliable in the general case, once I start adding support for a lot more markdown codes. In a way the previous one is simpler -- just get the escaped characters out of the way, deal with everything you have to deal with, and then put them back. Of course, I would probably change the placeholder text from long strings like "%_LEFT_SQUARE_BRACKET" to single Unicode characters that no one would ever use, like Linear B syllables or something, assigned to variable names like "LEFT_SQUARE_BRACKET".

Richard Harrington
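
The single-character placeholder idea can be sketched on plain strings like this. Everything here is an assumption made for illustration: the private-use code points U+E000 to U+E002 stand in for the "characters no one would ever use", the function names are hypothetical, and only three escapes are handled.

```javascript
// Each escaped sequence is swapped for a private-use code point while
// links are converted, then swapped back to the plain character.
var PLACEHOLDERS = [
    { escaped: "\\\\", token: "\uE000", literal: "\\" },  // must come first
    { escaped: "\\]",  token: "\uE001", literal: "]" },
    { escaped: "\\)",  token: "\uE002", literal: ")" }
];

// Replace every occurrence of a plain substring (no regex involved).
function replaceAll(s, find, replacement) {
    return s.split(find).join(replacement);
}

function convertWithPlaceholders(s) {
    var i;
    for (i = 0; i < PLACEHOLDERS.length; i++) {
        s = replaceAll(s, PLACEHOLDERS[i].escaped, PLACEHOLDERS[i].token);
    }
    s = s.replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2">$1</a>');
    s = replaceAll(s, "\\", "");   // drop any remaining single backslashes
    for (i = 0; i < PLACEHOLDERS.length; i++) {
        s = replaceAll(s, PLACEHOLDERS[i].token, PLACEHOLDERS[i].literal);
    }
    return s;
}
```

Hiding the doubled backslash first matters: otherwise the second backslash of a pair could be mistaken for the start of an escaped bracket.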

Report · Jun 05, 2010

By the way, Marc, I read some of your website and I have to thank you for your very clear and thorough explanations of some advanced topics in scripting InDesign. I will be particularly carefully studying the section on adding menu items. I often think that I could write a script to have lasers shoot out of my eyes or to automatically generate a new play in the style of William Shakespeare, and my employers would just be confused. But if I could accomplish these things via menu items, then they would be official. People would be impressed.

Report · Jun 05, 2010

Or, slightly better, in case anyone else wants to use this script some day, here's the prototype version:

// markdownToHtml() is a method that can be
// invoked on any object that you can invoke
// findGrep() on.  It converts markdown
// hyperlinks to html hyperlinks.
Character.prototype.markdownToHtml = 
Word.prototype.markdownToHtml = 
TextStyleRange.prototype.markdownToHtml = 
Line.prototype.markdownToHtml = 
Paragraph.prototype.markdownToHtml = 
TextColumn.prototype.markdownToHtml = 
Text.prototype.markdownToHtml = 
Cell.prototype.markdownToHtml = 
Column.prototype.markdownToHtml = 
Row.prototype.markdownToHtml = 
Table.prototype.markdownToHtml = 
Story.prototype.markdownToHtml = 
TextFrame.prototype.markdownToHtml = 
XMLElement.prototype.markdownToHtml = 
Document.prototype.markdownToHtml = 
Application.prototype.markdownToHtml = 

function () {
  var myRegexp;
    
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
  
  // convert hyperlinks
  myRegexp = /\[((?:\\\\|\\\]|[^]])+)\]\(((?:\\\\|\\\)|[^)])+)\)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
  this.changeGrep(); 
  
  // remove stray backslashes
  myRegexp = /\\(.)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '$1';
  this.changeGrep(); 
    
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
}
                    
                    
// -----------------------------------



// Sample use of the method markdownToHtml
// (assumes you have a document open)

var tf = app.activeDocument.pages[0].textFrames.add();
tf.geometricBounds = ["2cm", "2cm", "12cm", "18cm"];

var st = tf.parentStory;
st.contents = 
    "Yesterday there were " +
    "[three dogs](http://www.dogsrule.com) in my yard.  " +
    "[Today \\[the senator\\] said](http://www.dogtimes.com/story/34) " +
    "that all the dogs have found " +
    "[good homes](http:www.homes.com/dogs_\\(and_cats\\)).  " +
    "Who knows what tomorrow may bring?\r\r" +
    "And now for the finale: a backslash in the " +
    "text, right before the end of the " +
    "[link text\\\\](http://www.onemorelink.com).";
              
alert (st.contents);
st.markdownToHtml();
alert (st.contents);
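
The long chain of prototype assignments could also be written as a loop over a list of constructors. A minimal sketch: installMethod is a hypothetical helper, and the InDesign class names in the closing comment are taken from the post above rather than checked against any particular InDesign version.

```javascript
// Assign one function as a method on the prototype of every constructor
// in the list, and report how many prototypes were extended.
function installMethod(constructors, name, fn) {
    for (var i = 0; i < constructors.length; i++) {
        constructors[i].prototype[name] = fn;
    }
    return constructors.length;
}

// In InDesign the call might look like this (class names from the post):
// installMethod([Character, Word, Paragraph, Story, TextFrame],
//               "markdownToHtml", function () { /* body as above */ });
```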

Report · Jun 07, 2010

Thanks a lot. Managing nested constructions through a regex is always a hard job. AFAIR the PHP source code of Markdown uses global parameters to control the "nested parentheses/brackets" depth.

Will your final script support the whole Markdown syntax including titles, tables, lists, images...? Would be great!!!

[Consider submitting your library to scriptopedia.org]

@+

Marc

Report · Jun 07, 2010

Yes, I've been reading up on that a bit and it seems that it might be a good idea to step outside the regex bubble to manage nested parentheses.

I've never thought of supporting the entire markdown protocol, but it's certainly an interesting project. It's a pretty big leap between putting out specific fires at my workplace and writing a script that would be generally useful to people in other contexts, but perhaps if I get enough of it done, I might as well deal with the rest of markdown. I am a bit of a newbie at programming but I'm sure if I had a half-way working version I could show it to people and it could be fixed up and made more robust.

And yes, thank you again John for that simple trick. I've been using it almost exclusively since you pointed it out.
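
Stepping outside the regex for nested parentheses could look something like the depth counter below. It is a sketch only and not how the PHP Markdown code does it: findClosingParen is a hypothetical name, the URL is an invented example, and a backslash is assumed to escape the character that follows it.

```javascript
// Return the index of the ")" that closes the "(" at openIndex, or -1 if
// the parentheses never balance. A backslash escapes the next character.
function findClosingParen(s, openIndex) {
    var depth = 0;
    for (var i = openIndex; i < s.length; i++) {
        var ch = s.charAt(i);
        if (ch === "\\") {
            i++;                 // skip the escaped character
        } else if (ch === "(") {
            depth++;
        } else if (ch === ")") {
            depth--;
            if (depth === 0) return i;
        }
    }
    return -1;
}

var url = "(http://en.wikipedia.org/wiki/Dog_(disambiguation))";
var closeAt = findClosingParen(url, 0);   // index of the final ")"
```

With the closing index in hand, the URL is simply the substring between the two positions, and unescaped round brackets inside it no longer need special treatment.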

Report · Jun 07, 2010

On second thought, a markdown-to-InDesign script would probably not be too much of a problem at all. I just took a look at the PHP code for converting markdown into html, and I could use that as a guide (I do have some experience with PHP). I'll get to work.
