April 24, 2015
Question

Regex: since no negative look behind, what is the best way ...


I have a great Photoshop scripting routine that uses regular expressions to find all of the parts of a string that are surrounded by underlines.

Regex:  /_([\s\S]*?)_/g

Text:  Match on _this_ and also _on this_ and even _on this too_.

... and life was nice, until my paragraph contained a URL that had underlines in it!

Now, I want to make sure that if I match on an underline, it isn't an underline within a URL.

I know that URLs don't have spaces, so I modified my regular expression to say "when you find a match, look back to see if there is an http without at least one space between it and the match".

Regex: /(?<!http) +_([\s\S]*?)_/g

Text: Match on _this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_.

... but alas, it appears that this implementation of Javascript doesn't support negative look behind.

So, can anyone think of an elegant regular expression that matches on parts of a string that are surrounded by underscores, unless they are within a URL?

    - Brad


2 replies

April 25, 2015

I gave up trying to figure out a one-liner Regex, so I wrote a function that can take a string containing zero or more URLs and do a search and replace only within the URL.

This looks like a really long function, but if you remove the comments and debug code, it is only about a dozen lines long.  It is written for readability rather than efficiency.

I hope that this helps others.

function replaceURL(textString, charFind, charRepl) {

    var debug = true // set to true if you want to see all the steps in the console window

    // Look for any URLs and replace the {charFind} with {charRepl}

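    // NOTE: be careful when {charFind} is a special Regex character (it must be escaped), and never use a {charRepl} that could already appear in a URL (otherwise the replacement is not reversible).

    // The # character is a fairly safe replacement: it is not a special Regex character. It can appear in a URL as the fragment delimiter, though, so check that no URL in the text contains # before relying on the round trip.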

    //Special Regex:  \^$.|?*+()[{

    //Special URL:  $-_.+!*'(),

    // Finds all {charFind} characters within a URL.

    // Look for at least one word character \w+ followed by a ://  (e.g. http://, ftp://, etc.)

    // URLs can't have spaces, so continue through non whitespace characters \S*? until you find the {charFind} followed by any number of non whitespace characters \S*.

    // Note that we have to double escape the special characters because we first build a string and then the string is converted to Regex, which is the only way to put

    // a variable like charFind into the Regex.

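    // [\s\S] is used instead of . so that text spanning several lines is kept intact (. does not match line breaks).

    var myReURLString = "([\\s\\S]*?)(\\w+:\\/\\/\\S*?" + charFind + "\\S*)([\\s\\S]*)";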

    var myReURL = new RegExp(myReURLString);

    var myReFind = new RegExp(charFind, "g");

    if (debug) {

        $.writeln("Searching for " + charFind + " and replacing with " + charRepl + "\n");

        $.writeln("Regex to find a URL with a {charFind} within the URL: " + myReURL + "\n");

        $.writeln("Regex to find the {charFind} within the URL: " + myReFind + "\n\n");

    }

    // Each pass through the loop will find a URL with the specific character and process it.

    // Take the textString, and split it into 3 parts: everything before the nth URL, the URL, and everything after the nth URL.

    // Then do a search through the URL for all instances of charFind and replace with charRepl

    // Finally, put all three parts back together again.

    // Repeat until there are no more URLs to process

    var myParts; // declared locally so the loop does not create a global variable

    while ((myParts = myReURL.exec(textString)) !== null) {

        // [0] is original textString

        // [1] is everything before the URL

        // [2] is the URL

        // [3] is everything after the URL

        if (debug) {

            $.writeln("==== Starting ====\ntextString:\n" + textString + "\n");

            $.writeln("---- Before ----\n");

            $.writeln("myParts [1]: \n" + myParts[1] + "\n\n" +

                "myParts [2]: \n" + myParts[2] + "\n\n" +

                "myParts [3]: \n" + myParts[3] + "\n\n");

        }

        // Replace all the {charFind} in myParts[2] with {charRepl}

        myParts[2] = myParts[2].replace(myReFind, charRepl);

        if (debug) {

            $.writeln("---- After ----\n");

            $.writeln("myParts [1]: \n" + myParts[1] + "\n\n" +

                "myParts [2]: \n" + myParts[2] + "\n\n" +

                "myParts [3]: \n" + myParts[3] + "\n\n");

        }

        // Now put it back together again

        textString = myParts[1].concat(myParts[2], myParts[3]);

        if (debug) {

            $.writeln("textString:\n" + textString + "\n\n");

        }

    }

    return (textString);

}

var textString = "Lots of text _with underlines_ all _ through the _paragraph, but also a http://url.that.has/underlines_in-it__that.we.have.to.avoid followed _by_ more underlines__. And _here_ is another ftp://url.that_also_has_underlines in it_."

var newTextString = replaceURL(textString, "_", "#");

alert("\n\nOUT: " + newTextString);

// Output is:

// Lots of text _with underlines_ all _ through the _paragraph, but also a http://url.that.has/underlines#in-it##that.we.have.to.avoid followed _by_ more underlines__. And _here_ is another ftp://url.that#also#has#underlines in it_.

var newerTextString = replaceURL(newTextString, '#', "_");

alert("\n\nREVERSED: " + newerTextString);

// Output is:

// Lots of text _with underlines_ all _ through the _paragraph, but also a http://url.that.has/underlines_in-it__that.we.have.to.avoid followed _by_ more underlines__. And _here_ is another ftp://url.that_also_has_underlines in it_.
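
For comparison, the same masking step can be sketched more compactly with one replace call and a callback. This is only a sketch: maskInUrls is a hypothetical name, it treats anything of the form scheme://non-whitespace as a URL, and it assumes the scripting host accepts a function as the second argument to replace (standard since ECMAScript 3, but not verified here in Photoshop).

```javascript
// Hypothetical helper: replace every charFind inside URL-like tokens with charRepl.
// Assumption: a URL is one or more word characters, then "://", then non-whitespace.
function maskInUrls(textString, charFind, charRepl) {
    var urlRe = /\w+:\/\/\S*/g;
    return textString.replace(urlRe, function (url) {
        // split/join swaps every occurrence without building a RegExp,
        // so charFind does not need to be escaped.
        return url.split(charFind).join(charRepl);
    });
}
```

Calling maskInUrls(textString, "_", "#") on the sample paragraph above should give the same masked string as replaceURL, and calling it again with the last two arguments swapped restores the original, provided no URL in the text already contained a #.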

Pedro Cortez Marques
Legend
April 24, 2015

var myStr = "_this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_";

// first remove all links from string, then use your RegExp

$.writeln(myStr.replace(/\shttp.+?\s/g,'').match(/_([\s\S]*?)_/g));

April 24, 2015

Pedro,

  Thanks for taking time to reply.  Your solution is nice and concise.  However, I am wondering if only a regular expression can be used.  Let me provide a few more details.

  It turns out that I don't need the characters that are within the underlines.  What I need are their character positions.  Here is the scenario.

  The function takes a string and returns [0] the original text; [1] the original text with the underlines removed; [2] the number of times underlines were removed; and [3...] pairs with the start and end positions of the text that used to have underlines around it.

Text into function:

Match on _this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_.

Function returns:

returnArray[0] = Match on _this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_.

returnArray[1] = Match on this and also on this but if http://mysite.com/more_url.with?underscores_then.no.match until after you leave the URL.

returnArray[2] = 3

returnArray[3...] = 9,13,23,30,115,128

  I am not an experienced JavaScript programmer, so at the risk of putting my (inefficient and ugly?) code on display, here is the code that works - except for ignoring underscores in the URL.

function parseLine(textString) {

        // Will take a textString and find all the words that have _underlines on either side_.

        // Will then return an array:

        // [0] = original textString

        // [1] = textString with the _ removed

        // [2] = number of replacements made

        // [a,b ...] = pairs of numbers for the start and end characters where the underlines were

        // Regex to find the words between the _underlines_

        // var myRe = /(?<!http) +_([\s\S]*?)_/g;  <- would work if look behind were supported.

        var myRe = /_([\s\S]*?)_/g; 

        // Set up the first three indexes in the returned array: original textString, newTextString (without underlines), number of replacements

        var changeIndex = [textString, "", 0];

        var myArray;

        var chopText = textString;

        var newText;

        var numReplace = 0;

        // Loop through all the matches

        // Remove the underlines, count the number of replacements, and record the places in the text where they were.

        while ((myArray = myRe.exec(chopText)) !== null) {

            // record begin and end point

            changeIndex.push(myArray.index, (myArray.index + myArray[1].length));

            // remove the underlines

            newText = chopText.replace(myArray[0], myArray[1]);

            chopText = newText;

            // Count the number of replacements

            numReplace++;

        }

        // put the text without the underlines into the index to be returned.

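        changeIndex[1] = chopText; // chopText always holds the latest text, even when nothing matched (newText would still be undefined)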

        // put the number of replacements into the index to be returned.

        changeIndex[2] = numReplace;

        $.writeln(changeIndex);

        return (changeIndex);

    }
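
One way to get the positions while ignoring URLs, without look behind, is to let the regular expression match a whole URL as its first alternative and the _underlined_ text as its second; URL matches are then copied through untouched. A minimal sketch, assuming a URL is a scheme that starts with a letter, followed by :// and non-whitespace characters. parseLineSkippingUrls is a hypothetical name, not part of the routine above.

```javascript
// Hypothetical variant of parseLine that ignores underscores inside URLs.
// Returns [original, stripped, count, start1, end1, start2, end2, ...].
function parseLineSkippingUrls(textString) {
    // Alternative 1 swallows a whole URL; alternative 2 captures _text_.
    var re = /[A-Za-z]\w*:\/\/\S*|_([\s\S]*?)_/g;
    var stripped = "";   // text rebuilt without the matched underscores
    var positions = [];  // start/end pairs, measured in the stripped text
    var last = 0;        // end of the previous match in textString
    var count = 0;
    var m;
    while ((m = re.exec(textString)) !== null) {
        // copy the plain text between the previous match and this one
        stripped += textString.substring(last, m.index);
        if (m[0].charAt(0) === "_") {
            // underscore pair: record where the inner text lands, drop the underscores
            positions.push(stripped.length, stripped.length + m[1].length);
            stripped += m[1];
            count++;
        } else {
            // URL: copy it through unchanged
            stripped += m[0];
        }
        last = m.index + m[0].length;
    }
    stripped += textString.substring(last);
    return [textString, stripped, count].concat(positions);
}
```

The start and end values are positions in the text after the underscores have been removed, as in the return array described above. An underlined span that itself contains a URL is not handled by this sketch.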

Pedro Cortez Marques
Legend
April 24, 2015

Hope it helps, Brad

var myStr = "_this_ and also _on this_ but if http://mysite.com/more_url.with?underscores_then.no.match until after you _leave the URL_"; 

// Only one RegExp

$.writeln(myStr.match(/(^_([\s\S]*?)_\s)|(\s_([\s\S]*?)_\s)|(\s_([\s\S]*?)_$)/g).join('\n'));
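
If look ahead is available (it is part of ECMAScript 3, though this was not tested in Photoshop), the three alternatives above can be folded into one pattern that does not consume the character after the closing underscore, so two underlined spans separated by a single space are both found. A sketch only: findUnderlined is a hypothetical helper, and the punctuation list in the look ahead is an assumption about what may follow a closing underscore.

```javascript
// Hypothetical helper: collect the text between _underscores_.
// An opening underscore must be at the start of the text or follow whitespace,
// which is what keeps underscores inside a URL from starting a match.
function findUnderlined(text) {
    var re = /(^|\s)_([\s\S]*?)_(?=[\s.,;:!?]|$)/g;
    var found = [];
    var m;
    while ((m = re.exec(text)) !== null) {
        found.push(m[2]); // group 2 is the text without the underscores
    }
    return found;
}
```

Each match also carries m.index, so the same loop could record positions instead of the captured text.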