GREP search problems [CS4/JS]

Community Beginner, May 26, 2010

I have two problems with GREP searches run from a script written in JavaScript for InDesign CS4. The first one is more serious.

Problem #1:

I am using the following search pattern, to find e-mail addresses:

[[:word:]\-\.]+@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

This pattern works perfectly to find all e-mail addresses when I use the grep feature of the Find/Change dialog box in the user interface, but fails when I use it as an argument of the findGrep() or changeGrep() functions in a script.  Interestingly, the problem is the part before the "@" symbol.  If I replace that part with literal text, like so:

richard@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

then the script will correctly match any e-mail address where the part before the "@" symbol is the word "richard".

Can anyone shed light on this problem?

Problem #2:

This is more general and less serious, but it is related to problem #1, because it's about things that work in the user interface but not in a script.

Strangely, a lot of the standard regular expression shorthand wildcards for character classes (like \w, \d, etc.) do not work when I use them in scripts, but their POSIX equivalents ([:word:], [:digit:], etc.) work fine.   Either terminology -- \w or [:word:] -- works fine in the Find/Change dialog box of the user interface.

So this is not a serious problem, but I vastly prefer the shorter terminology.  I find it easier to read, and to write.

Any ideas?


Thanks.

TOPICS
Scripting

Correct answer

Engaged, May 26, 2010

One solution for both of your problems: use \\ instead of \ in JavaScript.

The pattern for Find GREP in the InDesign dialog box is

[[:word:]\-\.]+@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

and the same pattern written inside a JavaScript string for findGrep is

[[:word:]\\-\\.]+@([[:word:]\\-]+\\.)+[[:word:]\\-]{2,4}

Shonky

Community Beginner, May 26, 2010

Thanks, that solved the problem.

Not sure why the pattern   

richard@([[:word:]\-]+\.)+[[:word:]\-]{2,4}

was working in the first place.  I thought maybe it was that some of those backslashes after the @ symbol are unnecessary anyway (I'm still getting a handle on regular expressions), but I took them out and it broke the script, so that can't be it.

In any case, thank you, you have fixed my problem.
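
One way to see what the answer above is fixing is to compare the strings that JavaScript actually builds from each literal. The following is only a minimal sketch in plain JavaScript (no InDesign calls; the variable names are made up for illustration). It also covers Problem #2, since \w and \d lose their backslash in the same way.

```javascript
// What a JavaScript string literal keeps of a GREP pattern.
// A backslash in front of a character that has no special meaning in a
// string literal ("-", ".", "w", "d") is silently dropped.
var singleBackslash = "[[:word:]\-\.]+@([[:word:]\-]+\.)+[[:word:]\-]{2,4}";
var doubleBackslash = "[[:word:]\\-\\.]+@([[:word:]\\-]+\\.)+[[:word:]\\-]{2,4}";

// The text that reaches findGrep() when single backslashes are used:
var received = "[[:word:]-.]+@([[:word:]-]+.)+[[:word:]-]{2,4}";

var sameAsReceived = (singleBackslash === received);  // true
var shorthandLost = ("\w\d" === "wd");                // true: \w and \d vanish too
var shorthandKept = ("\\w\\d".length === 4);          // true: both backslashes survive
```

In the stripped pattern, the part after the "@" still has the hyphen last in each bracket expression (where it is literal) and an unescaped dot (which matches any character), so it remains a usable, if looser, pattern. The part before the "@" ends up as [[:word:]-.], with the hyphen between two items, which the engine presumably rejects or reads as a range. That is a guess, and it does not account for the script breaking when the backslashes were removed by hand.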

Guide, May 26, 2010

// Here's a literal RegExp in JS:
var RE1 = /[a\-c]/;
alert( RE1.test("-") ); // TRUE
alert( RE1.test("b") ); // FALSE
// ...RE1 means a|-|c (as expected)


// Now, using explicitly the RegExp class,

// we need to pass a String :

var RE2 = RegExp("[a\-c]");
alert( RE2.test("-") ); // FALSE
alert( RE2.test("b") ); // TRUE
// ...RE2 actually means [a-c] !

// Why?

// Because "\" is the escape character in JS string literals.
// "\" escapes a few special chars and is simply
// dropped in front of any other character:
alert( "\a\.\-" == "a.-" ); // TRUE !

// So, to get a "\" in a string, you need
// to use "\\"
var RE3 = RegExp("[a\\-c]");
alert( RE3.test("-") ); // TRUE
alert( RE3.test("b") ); // FALSE

// Finally RE3 works like RE1.

@+

Marc

Guide, May 26, 2010

/* One more example */

var str = "a\\t"; // str contains 3 chars: a\t

// How to grab str in a RegExp?

// 1) In a literal RegExp, we need to
// escape the backslash (because of \t):
alert( /a\\t/.test(str) ); // TRUE

// 2) Using the RegExp class explicitly,
// we need to express the pattern a\\t
// in a literal string, so:
alert( RegExp("a\\\\t").test(str) ); // TRUE

// 4 backslashes to target 1 backslash!

@

Marc

Community Beginner, Jun 01, 2010

Thank you Marc, that's very helpful.  I like to think I understand this stuff theoretically, but I always have problems writing my own RegExps in practice.  It's good to see examples.

Just out of curiosity, I have another problem that I am curious if anyone can help me with.

I am trying to match markdown web links.  Markdown is a markup language that we use in our office because our editors (I work at a newspaper) find it easy to read and write.  The format is

[text](link)

which corresponds to the html

<a href="link">text</a>

So I wrote the following:

{findWhat: "\\[[^][]+]\\([^)(]+\\)"}

and it strangely worked on my copy of CS4 at home, but here at work it doesn't seem to be working.  It works up until the first left-round-bracket -- \\[[^][]+]\\( -- and then the rest of it doesn't match.

Any ideas?

Thanks.

Community Beginner, Jun 01, 2010

Oops.  I posted too soon.  There is no difference between my copies of CS4 at work and home.  That would have been a little strange.  What happened is that I was importing two different Word documents, and one worked and one didn't.  My problem is still not solved, but I will examine the documents more closely to see if I can figure out what the problem with the script is.

Community Beginner, Jun 01, 2010

I figured out the difference between the files.  The one that failed to match was full of automatically generated Microsoft Word hyperlinks.  The importing script I wrote is supposed to remove them completely before it does anything else to the text, and it appears to do that, but somehow there is something left over from the Word hyperlink which messes up regexp matches.

So the Word file will contain something like this:

Harrington recently published [a paper](http://www.harringtonspaper.com) in the journal of blahblahblah.

and somewhere along the line, Word automatically generates a hyperlink, starting at "http".  Somehow, that section of text ends up being impossible to match, even if I just try to match the literal string "http".  I'm not sure why, since I strip all the hyperlinks from the file when it comes into InDesign.

Unfortunately, it's unrealistic to try to get the writers to stop using Word.

Any ideas about what might be left over in the text from the Word hyperlink, that I cannot see?

Community Beginner, Jun 01, 2010

I figured it out.  When I delete all the hyperlinks, I also need to delete all the hyperlinkTextSources, otherwise the regexp engine won't be able to smoothly find a match across a block of text that includes a hyperlinkTextSource.
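
For readers with the same problem, the cleanup step described above might look roughly like the sketch below. The function name stripHyperlinks is invented for illustration, the code has not been checked against any particular InDesign version, and it assumes the document exposes the hyperlinks and hyperlinkTextSources collections with the usual everyItem() and remove() methods.

```javascript
// Sketch only: remove all hyperlinks and then all hyperlink text sources.
// "doc" is expected to be an InDesign Document, e.g. app.activeDocument.
function stripHyperlinks(doc) {
    // Skip empty collections so everyItem() is never asked to act on nothing.
    if (doc.hyperlinks.length > 0) {
        doc.hyperlinks.everyItem().remove();
    }
    if (doc.hyperlinkTextSources.length > 0) {
        doc.hyperlinkTextSources.everyItem().remove();
    }
}
```

In a script this would be called as stripHyperlinks(app.activeDocument) before any findGrep() or changeGrep() call.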

Guide, Jun 01, 2010

Hi richardh6,

Well, I don't understand this part of your pattern: ...[^][]+...

Do you mean: ...[^]]+... ?

In a find/change approach you should also use capturing parentheses, for example:

// ID CS4

// Apply a Markdown-to-HTML conversion

// on any [...](...) pattern of the documents:

app.findGrepPreferences.findWhat = '\\[([^]]+)]\\(([^)]+)\\)';
app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
app.changeGrep();

And in a JS process you could use something like this:

var RE_MARKDOWN = /\[([^]]+)]\(([^)]+)\)/;
// Or, if you prefer:
// var RE_MARKDOWN = RegExp("\\[([^]]+)]\\(([^)]+)\\)");

var markdown2html = function(s)
     {
     var m = s.match(RE_MARKDOWN);
     if( !m ) return false;
     return '<a href="%2">%1</a>'.
               replace('%1',m[1]).
               replace('%2',m[2]);
     }

// sample code
//----------------------------
var mk = "[a paper](http://www.harringtonspaper.com)";

alert( markdown2html(mk) );
// output: <a href="http://www.harringtonspaper.com">a paper</a>

Regards,

Marc
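
The function above converts only the first link it finds in a string. A possible variant that converts every [text](link) occurrence is sketched below in plain JavaScript. The names RE_MARKDOWN_ALL and markdownLinksToHtml are illustrative, not from the thread. The right bracket is escaped inside the character class because current JavaScript engines read [^] as "any character", so the unescaped form may not behave the same outside ExtendScript.

```javascript
// Convert every [text](link) in a string, not just the first one.
// "\]" is escaped inside the class so the pattern means "anything but ]"
// in any JavaScript engine.
var RE_MARKDOWN_ALL = /\[([^\]]+)\]\(([^)]+)\)/g;

function markdownLinksToHtml(s) {
    // A replacer function keeps "$" sequences in the link text or URL
    // from being interpreted as replacement patterns.
    return s.replace(RE_MARKDOWN_ALL, function (whole, text, link) {
        return '<a href="' + link + '">' + text + '</a>';
    });
}

var sample = "See [a paper](http://www.harringtonspaper.com) and [another](http://example.com/b).";
var converted = markdownLinksToHtml(sample);
// converted:
// See <a href="http://www.harringtonspaper.com">a paper</a> and <a href="http://example.com/b">another</a>.
```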

LEGEND, Jun 01, 2010

Do you suppose it would be less confusing to use the Javascript RE even in the Find/Change case? E.g.:

var RE_MARKDOWN = /\[([^]]+)]\(([^)]+)\)/;
app.findGrepPreferences.findWhat = RE_MARKDOWN.toString().slice(1,-1);
app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
app.changeGrep();
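
A small aside on the trick above, as a plain JavaScript sketch with illustrative variable names: the standard source property of a RegExp gives the same text as toString().slice(1,-1) as long as the literal carries no flags, and it stays correct when a flag is present. The right bracket is escaped inside the class here so that the literal parses the same way in any JavaScript engine.

```javascript
var RE_MARKDOWN = /\[([^\]]+)\]\(([^)]+)\)/;

var viaSlice = RE_MARKDOWN.toString().slice(1, -1);
var viaSource = RE_MARKDOWN.source;
var same = (viaSlice === viaSource);  // true: the literal has no flags

// With a flag, slice(1, -1) leaves a stray "/" at the end of the pattern.
var flagged = /\d+/g;
var slicedFlagged = flagged.toString().slice(1, -1);  // the four characters \d+/
var sourceFlagged = flagged.source;                    // the three characters \d+
```

Either string can then be assigned to app.findGrepPreferences.findWhat as in the post above.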

Guide, Jun 02, 2010

At your convenience. It should work too.


LEGEND, Jun 02, 2010

Oh, sure. The question was whether it was better style or clearer to read...

Community Expert, Jun 07, 2010

> The question was whether it was better style or clearer to read...

It's certainly easier to read. Nice trick, John.

Peter

Community Beginner, Jun 04, 2010

Thank you Marc and John for your insights.

Marc,

> Well, I don't understand this part of your pattern: ...[^][]+...
>
> Do you mean: ...[^]]+... ?

No, I actually did mean: ...[^][]+...   and not ...[^]]+...

[^][]+ means one or more of the character class which includes any character except [ and ]. They are listed in reverse order (i.e. ][) in the regexp pattern because only the left square bracket is allowed to come second in that order, if you don't want to use backslashes. I'm not sure why I did it that way -- I guess I thought if any text came in with either bracket it should be rejected, but I have changed it to your suggestion, because there's no reason to exclude the opening brackets.

I took your suggestions and I wrote a working script which converts markdown to InDesign hyperlinks, which is a bit more involved than just converting it to html, because part of the original match stays in the text object and part of it gets assigned to a new hyperlink object as a string.

I have another question now (it's not urgent, if you're too busy), and for the sake of brevity I am using your markdown to html example. The question is, how exactly can you deal with escaped characters? Say, for instance, you had a sentence in an article where you're quoting someone and you say 'The next day was [her] worst day ever,' and you want '[her] worst day ever' to be a link. Or say you have a URL that has round brackets in it, which quite a few of them do on Wikipedia for some weird reason.

I thought you might be able to do this using lookbehind, but JavaScript's regexp flavor does not appear to support lookbehind, so I wrote the following code, which hides away all the escaped characters during the major processing, and then restores them. It works, but is there a better way? It doesn't seem very elegant, although it's certainly easy and fairly foolproof, which is probably good:

function markdown2html (myObject) {
  var HIDE_ESCAPED_CHARS = [
      {before: "\\\\\\\\", after: "%_BACKSLASH"},
      {before: "\\\\\\]", after: "%_RIGHT_SQUARE_BRACKET"},
      {before: "\\\\\\)", after: "%_RIGHT_ROUND_BRACKET"},
      {before: "\\\\\\*", after: "%_ASTERISK"},
      // Support for other markdown codes
      // will be added as needed, but in the
      // meantime, delete all single backslashes:
      {before: "\\\\", after: ""} ];

  var RESTORE_ESCAPED_CHARS = [
      {before: "%_BACKSLASH", after: "\\"},
      {before: "%_RIGHT_SQUARE_BRACKET", after: "]"},
      {before: "%_RIGHT_ROUND_BRACKET", after: ")"},
      {before: "%_ASTERISK", after: "*"} ];
   
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
 
  multiChangeGrep (myObject, HIDE_ESCAPED_CHARS);
  // convert hyperlinks
  app.findGrepPreferences.findWhat = "\\[([^]]+)]\\(([^)]+)\\)";
  app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
  myObject.changeGrep();
  // At this point you'd start processing the rest of the
  // markdown according to your needs; like asterisks to
  // bold and italic, etc.
  // ...
  // ...
  // ...
  multiChangeGrep (myObject, RESTORE_ESCAPED_CHARS);
 
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
}
               
function multiChangeGrep (obj, findChangeArray) {
  var findChangePair;
  for (var i=0; i<findChangeArray.length; i++) {
    findChangePair = findChangeArray[i];
    app.findGrepPreferences.findWhat = findChangePair.before;
    app.changeGrepPreferences.changeTo = findChangePair.after;
    obj.changeGrep();
  }
}           

               
// -----------------------------------



// Sample use of the function markdown2html
// (assumes you have a document open)

tf = app.activeDocument.pages[0].textFrames.add();
tf.geometricBounds = ["2cm", "2cm", "12cm", "18cm"];

st = tf.parentStory;
st.contents =
    "Yesterday there were " +
    "[three dogs](http://www.dogsrule.com) in my yard.  " +
    "[Today \\[the senator\\] said](http://www.dogtimes.com/story/34) " +
    "that all the dogs have found " +
    "[good homes](http:www.homes.com/dogs_\\(and_cats\\)).  " +
    "Who knows what tomorrow may bring?\r\r" +
    "And now for good measure we will have a sentence " +
    "with two backslashes in it, one inside a [link\\\\,](http://www.link.com) " +
    "and one\\\\ outside.";
             
alert (st.contents);
markdown2html (st);
alert (st.contents);
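
The hide / convert / restore idea in the script above can also be tried on ordinary strings, outside InDesign. The sketch below is a plain JavaScript illustration only. The names ESCAPES, replaceAllLiteral and convertWithPlaceholders are invented, the placeholders are private-use code points chosen as an assumption, and unlike the script above it also hides escaped opening brackets and does not delete leftover single backslashes.

```javascript
// Escaped sequence -> private-use placeholder -> literal character.
// The escaped backslash comes first, so that two backslashes followed by
// "]" are read as "escaped backslash, then a real closing bracket".
var ESCAPES = [
    { escaped: "\\\\", placeholder: "\uE000", literal: "\\" },
    { escaped: "\\[",  placeholder: "\uE001", literal: "[" },
    { escaped: "\\]",  placeholder: "\uE002", literal: "]" },
    { escaped: "\\(",  placeholder: "\uE003", literal: "(" },
    { escaped: "\\)",  placeholder: "\uE004", literal: ")" }
];

// Replace every occurrence of a literal substring (no regex involved).
function replaceAllLiteral(s, from, to) {
    return s.split(from).join(to);
}

function convertWithPlaceholders(s) {
    var i;
    // 1) Hide the escaped characters.
    for (i = 0; i < ESCAPES.length; i++) {
        s = replaceAllLiteral(s, ESCAPES[i].escaped, ESCAPES[i].placeholder);
    }
    // 2) Convert the links; no escaped bracket can get in the way now.
    s = s.replace(/\[([^\]]+)\]\(([^)]+)\)/g, '<a href="$2">$1</a>');
    // 3) Restore the hidden characters as plain literals.
    for (i = 0; i < ESCAPES.length; i++) {
        s = replaceAllLiteral(s, ESCAPES[i].placeholder, ESCAPES[i].literal);
    }
    return s;
}
```

Single characters as placeholders keep step 2 simple, because a placeholder can never contain a bracket or a parenthesis itself.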

Community Beginner, Jun 05, 2010

I have found a much shorter solution, that requires a lot less code and uses regular expressions entirely to accomplish the exact same task that the script in my last post does, which is to process markdown hyperlinks into html hyperlinks, while taking into account certain escaped characters.  I have also started using John Hawkinson's method of making the regular expressions a little easier to read, since there's now a serious proliferation of backslashes:

function markdown2html (myObject) {
  var myRegexp;
   
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
 
  // convert hyperlinks
  myRegexp = /\[((?:\\\\|\\\]|[^]])+)\]\(((?:\\\\|\\\)|[^)])+)\)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
  myObject.changeGrep();
 
  // remove stray backslashes
  myRegexp = /\\(.)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '$1';
  myObject.changeGrep();
   
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
}
                    
                    
// -----------------------------------



// Sample use of the function markdown2html
// (assumes you have a document open)

var tf = app.activeDocument.pages[0].textFrames.add();
tf.geometricBounds = ["2cm", "2cm", "12cm", "18cm"];

var st = tf.parentStory;
st.contents =
    "Yesterday there were " +
    "[three dogs](http://www.dogsrule.com) " +

    "in my yard.  " +
    "[Today \\[the senator\\] said]" +

    "(http://www.dogtimes.com/story/34) " +
    "that all the dogs have found " +
    "[good homes]" +

    "(http:www.homes.com/dogs_\\(and_cats\\)).  " +
    "Who knows what tomorrow may bring?\r\r" +
    "And now for the finale: a backslash in the " +
    "text, right before the end of the " +
    "[link text\\\\](http://www.onemorelink.com).";
             
alert (st.contents);
markdown2html (st);
alert (st.contents);

But strangely, I think the version in my previous post (with some slight modifications) might be more elegant and reliable in the general case, once I start adding support for a lot more markdown codes.  In a way the previous one is simpler -- just get the escaped characters out of the way, deal with everything you have to deal with, and then put them back.  Of course, I would probably change the placeholder text from long strings like "%_LEFT_SQUARE_BRACKET" to single Unicode characters that no one would ever use, like Linear B syllables or something, assigned to variable names like "LEFT_SQUARE_BRACKET".

Richard Harrington
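
For comparison, here is a plain JavaScript sketch of the single-regex approach that can be run outside InDesign. It is a variation, not the pattern from the post. It treats a backslash followed by any character as one unit and excludes the backslash from the catch-all classes, so an escape can only be read one way. The names RE_LINK, RE_ESCAPE and convertEscapedLinks are illustrative.

```javascript
// "\\." consumes a backslash together with the character it escapes.
// The catch-all classes exclude the backslash, so a backslash can never
// be matched on its own and leave its partner behind.
var RE_LINK = /\[((?:\\.|[^\]\\])+)\]\(((?:\\.|[^)\\])+)\)/g;
var RE_ESCAPE = /\\(.)/g;

function convertEscapedLinks(s) {
    // First build the links, then drop the escaping backslashes.
    return s.replace(RE_LINK, '<a href="$2">$1</a>')
            .replace(RE_ESCAPE, "$1");
}

var finale = convertEscapedLinks("[link text\\\\](http://www.onemorelink.com).");
// finale: <a href="http://www.onemorelink.com">link text\</a>.
```

The g flag plays the role that changeGrep() plays in the script above, which already replaces every occurrence.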

Community Beginner, Jun 05, 2010

By the way, Marc, I read some of your website and I have to thank you for your very clear and thorough explanations of some advanced topics in scripting InDesign.  I will be particularly carefully studying the section on adding menu items.  I often think that I could write a script to have lasers shoot out of my eyes or to automatically generate a new play in the style of William Shakespeare, and my employers would just be confused.  But if I could accomplish these things via menu items, then they would be official.  People would be impressed.

Community Beginner, Jun 05, 2010

Or, slightly better, in case anyone else wants to use this script some day, here's the prototype version:

// markdownToHtml() is a method that can be
// invoked on any object that you can invoke
// findGrep() on.  It converts markdown
// hyperlinks to html hyperlinks.

Character.prototype.markdownToHtml =
Word.prototype.markdownToHtml =
TextStyleRange.prototype.markdownToHtml =
Line.prototype.markdownToHtml =
Paragraph.prototype.markdownToHtml =
TextColumn.prototype.markdownToHtml =
Text.prototype.markdownToHtml =
Cell.prototype.markdownToHtml =
Column.prototype.markdownToHtml =
Row.prototype.markdownToHtml =
Table.prototype.markdownToHtml =
Story.prototype.markdownToHtml =
TextFrame.prototype.markdownToHtml =
XMLElement.prototype.markdownToHtml =
Document.prototype.markdownToHtml =
Application.prototype.markdownToHtml =

function () {
  var myRegexp;
   
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
 
  // convert hyperlinks
  myRegexp = /\[((?:\\\\|\\\]|[^]])+)\]\(((?:\\\\|\\\)|[^)])+)\)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '<a href=~"$2~">$1</a>';
  this.changeGrep();
 
  // remove stray backslashes
  myRegexp = /\\(.)/;
  app.findGrepPreferences.findWhat = myRegexp.toString().slice(1,-1);
  app.changeGrepPreferences.changeTo = '$1';
  this.changeGrep();
   
  app.changeGrepPreferences = NothingEnum.nothing;
  app.findGrepPreferences = NothingEnum.nothing;
}
                    
                    
// -----------------------------------



// Sample use of the method markdownToHtml
// (assumes you have a document open)

var tf = app.activeDocument.pages[0].textFrames.add();
tf.geometricBounds = ["2cm", "2cm", "12cm", "18cm"];

var st = tf.parentStory;
st.contents =
    "Yesterday there were " +
    "[three dogs](http://www.dogsrule.com) in my yard.  " +
    "[Today \\[the senator\\] said](http://www.dogtimes.com/story/34) " +
    "that all the dogs have found " +
    "[good homes](http:www.homes.com/dogs_\\(and_cats\\)).  " +
    "Who knows what tomorrow may bring?\r\r" +
    "And now for the finale: a backslash in the " +
    "text, right before the end of the " +
    "[link text\\\\](http://www.onemorelink.com).";
             
alert (st.contents);
st.markdownToHtml();
alert (st.contents);





Guide, Jun 07, 2010

Thanks a lot. Managing nested constructions through a regex is always a hard job. AFAIR the PHP source code of Markdown uses global parameters to control the "nested parenthesis/bracket" depth.

Will your final script support the whole Markdown syntax including titles, tables, lists, images...? Would be great!!!

[Consider submitting your library to scriptopedia.org]

@+

Marc

Community Beginner, Jun 07, 2010

Yes, I've been reading up on that a bit and it seems that it might be a good idea to step outside the regex bubble to manage nested parentheses.

I've never thought of supporting the entire markdown protocol, but it's certainly an interesting project.  It's a pretty big leap between putting out specific fires at my workplace and writing a script that would be generally useful to people in other contexts, but perhaps if I get enough of it done, I might as well deal with the rest of markdown.  I am a bit of a newbie at programming but I'm sure if I had a half-way working version I could show it to people and it could be fixed up and made more robust.

And yes, thank you again John for that simple trick.  I've been using it almost exclusively since you pointed it out.

Community Beginner, Jun 07, 2010

On second thought, a markdown-to-InDesign script would probably not be too much of a problem at all.  I just took a look at the PHP code for converting markdown into html, and I could use that as a guide (I do have some experience with PHP).  I'll get to work.
