Help With Regexp Lookaheads to Extract Definitions

Contributor, Aug 23, 2023

I am trying to extract definitions from a document glossary with a script, but have run into a problem with my lookbehind that I can't seem to sort out. Glossary entries look like the image below:

Glossary Entries.png

The problem is that each entry may have one or two tabs.  The first tab is rendered as "......" separating the acronym from the definition.  Some glossaries have a second tab that appears as a blank space before the definition.  The following works fine for glossaries with a single tab. 

(?<=\bDRB\x08).*

However, if the glossary uses two tabs, the regexp picks up the second tab along with the definition. If I change my lookbehind to:

(?<=\bDRB\x08\x08).*

It works with two tabs, but not with one. If I change it to:

(?<=\bDRB\x08+).*

...which should find one or more occurrences of the tab character, I get a "Not Found" error. Apparently quantifiers do not work the same way inside assertions as they do elsewhere in a regexp.
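
A side note on the quantified lookbehind, as a minimal sketch only. It runs in a modern JavaScript engine such as Node.js, which accepts a quantifier inside a lookbehind; it is not ExtendScript or FrameMaker's own engine, and the sample entry is made up. Even where `\x08+` is allowed inside the assertion, the match can still begin right after the first tab, so the second tab ends up in the result. Requiring the first matched character to be a non-tab is one way around that.

```javascript
// Hypothetical sample entry: acronym, two FrameMaker tab characters (\x08), definition.
var entry = "DRB\x08\x08Design Review Board";

// Quantified lookbehind: the leftmost position that satisfies the assertion
// is right after the FIRST tab, so the match still starts with the second tab.
var withQuantifier = entry.match(/(?<=\bDRB\x08+).*/)[0];

// Requiring the first matched character to be a non-tab skips any extra tabs.
var skippingTabs = entry.match(/(?<=\bDRB\x08+)[^\x08].*/)[0];

// JSON.stringify makes the otherwise invisible \x08 character visible.
console.log(JSON.stringify(withQuantifier));
console.log(JSON.stringify(skippingTabs));
```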

 


Community Expert, Aug 23, 2023

I cannot check this with FrameMaker, but I would test putting the tab in parentheses:

(?<=\bDRB(\x08)+).*

Does this help?

Community Expert, Aug 23, 2023

In FrameMaker it does not matter which of these I enter in the Find/Replace dialog:

\x08+

(\x08)+

Both find one or several tabs.

Obviously this is different in ExtendScript.

Explorer, Aug 24, 2023

Hi, Fightergator

Try this code instead.

 

(?<=\x08)\b.*
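
To see why this pattern copes with one or two tabs, here is a small sketch. It runs in a modern JavaScript engine such as Node.js (ExtendScript's RegExp object is reported later in the thread not to support lookbehind), and the sample strings are invented. The third case shows a limitation that follows from `\b`: a definition that starts with a non-word character gives no word boundary right after the tab.

```javascript
// \x08 is the tab character code used in the thread's patterns.
var pattern = /(?<=\x08)\b.*/;

// One tab: the word boundary sits between the tab and the first letter.
var oneTab = "DRB\x08Design Review Board".match(pattern);

// Two tabs: there is no word boundary between the two tabs, so the match
// can only start after the last tab, at the definition's first letter.
var twoTabs = "DRB\x08\x08Design Review Board".match(pattern);

// Limitation: "(" is not a word character, so no position directly after
// a tab is a word boundary and match() returns null.
var parenthesized = "DRB\x08(see DTM)".match(pattern);

console.log(oneTab[0]);
console.log(twoTabs[0]);
console.log(parenthesized);
```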

 

 

Community Expert, Aug 24, 2023

How are you using the regular expressions in your script? Are you using the doc.Find() method or the RegExp object? The RegExp object in ExtendScript, which implements an older version of JavaScript, does not support lookbehind.
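
One way to check this in a given engine is a small feature test, sketched below. The function name `supportsLookbehind` is made up for illustration, and it is an assumption that an engine without lookbehind either throws on the pattern or fails the behavioral check; how ExtendScript actually reacts has not been verified here.

```javascript
// Returns true only if the engine both accepts a lookbehind pattern and
// applies it correctly ("b" preceded by "a" matches, "b" preceded by "c" does not).
function supportsLookbehind() {
    try {
        var re = new RegExp("(?<=a)b");
        return re.test("ab") && !re.test("cb");
    } catch (e) {
        // Engines that reject the syntax are assumed to throw when compiling it.
        return false;
    }
}

console.log(supportsLookbehind());
```

If the check returns false, a capture group is the usual substitute for the lookbehind.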

Community Expert, Aug 24, 2023

Here is an example of using a capture group instead of lookbehind:

 

#target framemaker

var doc, pgf, text, regex, definition;

doc = app.ActiveDoc;
// Get the paragraph at the cursor.
pgf = doc.TextSelection.beg.obj;

// Use a function to get the text (not shown).
text = CP.getText (pgf, doc);

// ExtendScript regular expression literal.
regex = /DRB\x08+(.+)/;

if (regex.test (text) === true) {
    definition = regex.exec (text)[1];
    alert (definition);
}

The best practice would be to create a function that you could call, perhaps passing in a paragraph and an acronym and then returning the definition. It depends on the overall functionality of your script.
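
A rough sketch of such a function, not the poster's actual code: it takes the paragraph text as a string rather than a paragraph object, because the text-extraction helper is not shown in the thread. The names `getDefinitionFor` and `escapeForRegex` are hypothetical, and anchoring the acronym to the start of the paragraph is an assumption about the glossary layout.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so acronyms such as "A/C" or "C++" are matched literally.
function escapeForRegex(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the definition that follows the acronym and one or more
// FrameMaker tab characters (\x08), or null if the text does not match.
function getDefinitionFor(text, acronym) {
    var regex = new RegExp("^" + escapeForRegex(acronym) + "\\x08+(.+)");
    var match = regex.exec(text);   // single exec call, no "g" flag
    return match === null ? null : match[1];
}

console.log(getDefinitionFor("DRB\x08\x08Design Review Board", "DRB"));
```

Returning null for a non-matching paragraph lets the caller skip it with a simple `if` check.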

 

Contributor, Aug 24, 2023

Let me thank everyone for your quick responses. Yatani...your solution worked great for selecting each acronym, with single or multiple tabs, and its associated definition, which are all in one pgf. Rick, your suggestion is the piece I needed to separate out the definition from the rest of the paragraph. I'm not very smart on regex, but this was a great exercise in learning how to use capture groups instead of a lookbehind. Much learning I have yet to do.

Contributor, Aug 30, 2023

I took Rick's suggestion and came up with the following to extract both the acronym (before the tab(s)) and the definition (following the tab(s)). However, I can't figure out how to combine the two searches and capture groups into a single function. Any suggestions to streamline this?

var text = ""; 
var definition = ""; 
var acronym = "";
text = "MPM		miles per minute";  // blank space is two tabs
getAcronym(); //add acronym to array
getDefinition(); //add definition to array
function getAcronym() {
    var regex = /([\w\/]+)\t+.+/ig;
    acronym = "";
    if (regex.test(text) === true) {
        acronym = regex.exec(text)[1]
    }
    return acronym;
}
function getDefinition() {
    var regex = /[\w\/]+\t+(.+)/ig;
    definition = "";
    if (regex.test(text) === true) {
        definition = regex.exec(text)[1]
    }
    return definition;
}

Bottom line: it works, and it was a good exercise for me in working with regexes.

Community Expert, Aug 30, 2023

This is untested, but I would do something like this:

 

function getAcronymAndDefinition (text) {
    
    var regex, data;
    
    // Regular expression for capturing the data.
    regex = /([\w\/]+)\t+(.+)/ig;
    if (regex.test (text) === true) {
        // Make an object to return with both values.
        data = {};
        data.acronym = regex.exec (text)[1];
        data.definition = regex.exec (text)[2];
        return data;
    }
}

 

Contributor, Aug 31, 2023

Rick...thanks for the assist.  Unfortunately, every time I run it I get a "null is not an object" on the line...

data.definition = regex.exec (text)[2];

I thought maybe your "data = {}" was declaring an array, so I changed it to "data = []" but got the same result. Here's the code snippet I tested before turning it into a function:

var regex, data;
var text = "DTM		data transfer module";
var regex = /([\w\/]+)\t+(.+)/ig;

if (regex.test (text) === true) {
        // Make an object to return with both values.
        data = [];
        data.acronym = regex.exec (text)[1];
        data.definition = regex.exec (text)[2];
        //return data;
        }
//$.writeln(data[acronym]);
//$.writeln(data[1]);

 The space in the text string is two tabs. 

Community Expert, Aug 31, 2023

I am sorry about that. The "g" flag on the regex changes the behavior of the exec method. This one will work:

var text, data; 

text = "MPM		miles per minute";  // blank space is two tabs

data = getAcronymAndDefinition (text);
if (data) {
    alert (data.acronym);
    alert (data.definition);
}

function getAcronymAndDefinition (text) {
    
    var regex, data;
    
    // Regular expression for capturing the data.
    regex = /([\w\/]+)\t+(.+)/i;
    if (regex.test (text) === true) {
        data = {};
        data.acronym = regex.exec (text)[1];
        data.definition = regex.exec (text)[2];
        return data;
    }
}
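
For background on the "g" flag, here is a minimal sketch in standard JavaScript, run in Node.js. A global regex keeps a `lastIndex` property between calls, so a second `exec` on the same string resumes where the first match ended. Exactly when ExtendScript updates that state has not been verified here. The function name `parseGlossaryEntry` is hypothetical; it is a variant that calls `exec` once and reuses the result, so it does not depend on any state between calls.

```javascript
var text = "MPM\t\tmiles per minute";

// With "g", the regex object remembers where the previous match ended.
var globalRegex = /([\w\/]+)\t+(.+)/ig;
var first = globalRegex.exec(text);          // matches from index 0
var indexAfterFirst = globalRegex.lastIndex; // now at the end of the match
var second = globalRegex.exec(text);         // resumes there, finds nothing

console.log(first[1], "|", first[2]);
console.log(indexAfterFirst, second);

// Variant: run exec once and read both capture groups from the same result.
function parseGlossaryEntry(text) {
    var match = /([\w\/]+)\t+(.+)/i.exec(text);
    if (match === null) {
        return null;
    }
    return { acronym: match[1], definition: match[2] };
}

var entry = parseGlossaryEntry(text);
console.log(entry.acronym, "|", entry.definition);
```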

Contributor, Sep 02, 2023

Thanks Rick...that works like a champ. When I first tried to run your code, the regex.test would not test true. I had to build it up backwards to get it to work, ending up where you started. I should have closed and restarted FM and the ExtendScript Toolkit to purge any flags or variables from memory. This has been an excellent lesson for me in using regexes and the .exec method, which I had not seen before. Now I can finally use capture groups in my scripts. Hope you have a great Labor Day weekend.
