Fightergator
Inspiring
August 23, 2023
Answered

Help With Regexp Lookaheads to Extract Definitions

  • August 23, 2023
  • 3 replies
  • 1059 views

I am trying to extract definitions from a document glossary with a script, but have run into a problem with my lookbehind that I can't seem to sort out.  Glossary entries look like the image below:

The problem is that each entry may have one or two tabs.  The first tab is rendered as "......" separating the acronym from the definition.  Some glossaries have a second tab that appears as a blank space before the definition.  The following works fine for glossaries with a single tab. 

(?<=\bDRB\x08).*

However, if the glossary uses two tabs, the regexp picks up the second tab along with the definition. If I change my lookbehind to:

(?<=\bDRB\x08\x08).*

It works with two tabs, but not with one.  If I change it to:

(?<=\bDRB\x08+).*

...which should find one or more occurrences of the tab character, I get a "Not Found" error.  Apparently quantifiers do not work the same way inside assertions as they do in the rest of a regexp.

 

    3 replies

    frameexpert
    Adobe Expert
    August 24, 2023

    How are you using the regular expressions in your script? Are you using the doc.Find () method or using the RegExp object? The RegExp object in ExtendScript does not support lookbehind (ExtendScript is based on an older ECMAScript edition; lookbehind was only added to JavaScript in ES2018).

    frameexpert
    Adobe Expert
    August 24, 2023

    Here is an example of using a capture group instead of lookbehind:

     

    #target framemaker
    
    var doc, pgf, text, regex, definition;
    
    doc = app.ActiveDoc;
    // Get the paragraph at the cursor.
    pgf = doc.TextSelection.beg.obj;
    
    // Use a function to get the text (not shown).
    text = CP.getText (pgf, doc);
    
    // ExtendScript regular expression literal.
    regex = /DRB\x08+(.+)/;
    
    if (regex.test (text) === true) {
        definition = regex.exec (text)[1];
        alert (definition);
    }

    The best practice would be to create a function that you could call, perhaps passing in a paragraph and an acronym and then returning the definition. It depends on the overall functionality of your script.
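The function suggested above could look something like the following. This is only a minimal sketch: `getDefinition` and `escapeRegExp` are hypothetical names, and it assumes the paragraph text has already been read into a string in which each tab appears as `\x08`, as in the patterns used in this thread.

```javascript
// Hypothetical helper: escape regex metacharacters so the acronym
// is matched literally (for example "A.B" should not match "AXB").
function escapeRegExp (s) {
    return s.replace (/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the definition that follows the acronym
// and one or more tabs, or null if the text does not match.
function getDefinition (text, acronym) {
    var regex, match;

    // A capture group takes the place of the lookbehind.
    regex = new RegExp ("\\b" + escapeRegExp (acronym) + "\\x08+(.+)");
    match = regex.exec (text);
    return match ? match[1] : null;
}
```

Because the tabs are consumed by `\x08+` outside the capture group, the same call should work for glossaries with one tab or two.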

     

    frameexpert
    Correct answer
    Adobe Expert
    August 31, 2023

    Rick...thanks for the assist.  Unfortunately, every time I run it I get a "null is not an object" on the line...

    data.definition = regex.exec (text)[2];

     Thought maybe your "data = {}" was declaring an array, so changed it to "data = []" but got the same result.  Here's code snippet I tested before turning into a function:

    var regex, data;
    var text = "DTM		data transfer module";
    var regex = /([\w\/]+)\t+(.+)/ig;
    
    if (regex.test (text) === true) {
            // Make an object to return with both values.
            data = [];
            data.acronym = regex.exec (text)[1];
            data.definition = regex.exec (text)[2];
            //return data;
            }
    //$.writeln(data[acronym]);
    //$.writeln(data[1]);

     The space in the text string is two tabs. 


    I am sorry about that. The "g" flag on the regex changes the behavior of the exec method. This one will work:

    var text, data; 
    
    text = "MPM		miles per minute";  // blank space is two tabs
    
    data = getAcronymAndDefinition (text);
    if (data) {
        alert (data.acronym);
        alert (data.definition);
    }
    
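    function getAcronymAndDefinition (text) {
        
        var regex, match;
        
        // Capture the acronym (group 1) and the definition (group 2),
        // separated by one or more tabs. There is no "g" flag, so
        // every call starts matching at the beginning of the string.
        regex = /([\w\/]+)\t+(.+)/i;
        // Run exec once and reuse the result; it is null when nothing matches.
        match = regex.exec (text);
        if (match) {
            return { acronym: match[1], definition: match[2] };
        }
    }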
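
To illustrate the "g" flag behavior mentioned above: with that flag, a RegExp object remembers where its last match ended in its `lastIndex` property, and the next `test` or `exec` call starts searching from there. The sketch below shows this in plain JavaScript. The variable names are only for illustration, and which call comes back null may differ between a standard JavaScript engine and ExtendScript.

```javascript
// Sketch only: plain JavaScript, no FrameMaker objects needed.
var text = "DTM\t\tdata transfer module";

// With "g", the regex object keeps state in lastIndex between calls.
var globalRegex = /([\w\/]+)\t+(.+)/ig;
var first = globalRegex.exec (text);          // matches from index 0
var indexAfterFirst = globalRegex.lastIndex;  // now at the end of the match
var second = globalRegex.exec (text);         // searches from lastIndex, finds nothing

// Without "g", every call starts at index 0, so both calls match.
var plainRegex = /([\w\/]+)\t+(.+)/i;
var again1 = plainRegex.exec (text);
var again2 = plainRegex.exec (text);
```

Storing the result of a single `exec` call in a variable avoids the problem either way, since the regex is then only run once.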
    
    Participating Frequently
    August 24, 2023

    Hi, Fightergator

    Try this code instead.

     

    (?<=\x08)\b.*

     

     

    Adobe Expert
    August 23, 2023

    I cannot check this with FrameMaker, but here is what I would test: put the tab in parentheses:

    (?<=\bDRB(\x08)+).*

    Does this help?

    Adobe Expert
    August 24, 2023

    In FrameMaker it does not matter which of these I enter in the Find/Replace dialog:

    \x08+

    (\x08)+

    Both find one or several tabs.

    Obviously this is different in ExtendScript.
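
For comparison, here is a small sketch of the same two forms used with the RegExp object, outside of any lookbehind. It is plain JavaScript with made-up sample strings, and it has not been checked in ExtendScript itself. In a standard engine both forms match one or more tabs; the only difference is that the parentheses add a capture group, so the definition moves from group 1 to group 2.

```javascript
// Sample strings are assumptions; \x08 stands for the tab as in this thread.
var oneTab = "DRB\x08design review board";
var twoTabs = "DRB\x08\x08design review board";

var plain = /DRB\x08+(.+)/;       // quantifier directly on the tab
var grouped = /DRB(\x08)+(.+)/;   // tab wrapped in a capture group

var plainOne = plain.exec (oneTab)[1];
var plainTwo = plain.exec (twoTabs)[1];
var groupedOne = grouped.exec (oneTab)[2];
var groupedTwo = grouped.exec (twoTabs)[2];
```

This suggests the quantifier itself is not the issue; the trouble in the original pattern comes from placing it inside the lookbehind.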