Text extraction algorithm

Report · Mar 30, 2020

Hi All,

I need help with a text extraction algorithm. Given a target string inside a larger string, I need to extract a certain number of characters with the target string in the middle. I am going to illustrate it with some nonsense paragraphs, which will hopefully show what I am trying to do. Here are three paragraphs containing the target text "zzzz":

123456789 zzzz 1234567890
1 zzzz 12345 6789 1234567
1234 6789 12345678 zzzz 1

I want to extract 10 characters with the "zzzz" in the middle. If the "zzzz" can't be in the middle (as in the 2nd and 3rd paragraphs), I still want 10 characters. So, here is the result I want:

89 zzzz 12
1 zzzz 123
678 zzzz 1

Of course, there could be instances where the overall string contains fewer characters, but I want to cap the extraction at a specific number of characters (10 in this case). So, given the overall length of a container string, the length of the target string and its position in the container string, and the total number of characters to extract, I am looking for a general algorithm to extract the characters, keeping the target string as close to the center as possible.

I am using ExtendScript but even pseudocode would be helpful. Any ideas or pointers would be appreciated. Thank you very much. -Rick
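
One way to read the requirement above, as a minimal sketch rather than a definitive answer. The function name `extractAround` and its parameters are hypothetical, and it assumes the target is no longer than the requested length. It computes the ideal centered start, then clamps it first against the end and then against the beginning of the container, so a container shorter than the requested length is returned whole:

```javascript
// Hypothetical helper: returns up to maxLength characters of text, with the
// first occurrence of target kept as close to the center as possible.
// Assumes target.length <= maxLength.
function extractAround(text, target, maxLength) {
    var pos = text.indexOf(target);
    if (pos === -1) {
        return null; // target not present
    }
    // Ideal start: split the spare characters evenly around the target.
    var start = pos - Math.floor((maxLength - target.length) / 2);
    // Do not run past the end of the container...
    if (start + maxLength > text.length) {
        start = text.length - maxLength;
    }
    // ...and never start before its beginning (this also covers short containers).
    if (start < 0) {
        start = 0;
    }
    return text.substring(start, Math.min(start + maxLength, text.length));
}

// extractAround("123456789 zzzz 1234567890", "zzzz", 10) returns "89 zzzz 12"
// extractAround("1 zzzz 12345 6789 1234567", "zzzz", 10) returns "1 zzzz 123"
// extractAround("1234 6789 12345678 zzzz 1", "zzzz", 10) returns "678 zzzz 1"
```

Clamping against the end before clamping against the beginning is a deliberate choice here: when the container is shorter than maxLength, the first clamp produces a negative start and the second one resets it to 0.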

Report · Mar 31, 2020

This works for your example text:

text = '123456789 zzzz 1234567890\n1 zzzz 12345 6789 1234567\n1234 6789 12345678 zzzz 1\n';
search = 'zzzz';
max_length = 10;
lines = text.split('\n');

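result = [];
for (i=0; i<lines.length; i++)
{
	start = lines[i].indexOf(search);
	if (start > -1)
	{
		// Back up by half of the spare characters so the match sits in the middle.
		start -= (max_length-search.length)>>1;
		// Too close to the beginning: take the first max_length characters.
		if (start < 0)
			start = 0;
		// Too close to the end: take the last max_length characters.
		// (If the line is shorter than max_length, start goes negative here.)
		if (start + max_length >= lines[i].length)
			start = lines[i].length - max_length;
		result.push (lines[i].substr(start,max_length));
	}
}

alert (result.join('\n'));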

Is this for a linguistics purpose, to display a concordance such as with AntConc? In that case the key phrase should always be centered, and left and right should be padded with spaces if necessary. That's just a case of, again, checking 'start' and 'start + max_length', but this time adding spaces until they are in range:

text = '123456789 zzzz 1234567890\n1 zzzz 12345 6789 1234567\n1234 6789 12345678 zzzz 1\n';
search = 'zzzz';
max_length = 10;
lines = text.split('\n');

result = [];
for (i=0; i<lines.length; i++)
{
	start = lines[i].indexOf(search);
	if (start > -1)
	{
		start -= (max_length-search.length)>>1;
		// Pad on the left until the window starts inside the line.
		while (start < 0)
		{
			lines[i] = ' '+lines[i];
			start++;
		}
		// Pad on the right until the window ends inside the line.
		while (start + max_length > lines[i].length)
		{
			lines[i] += ' ';
		}
		result.push (lines[i].substr(start,max_length));
	}
}

alert (result.join('\n'));

which produces padded output (the third line ends with a space):

89 zzzz 12
 1 zzzz 12
78 zzzz 1
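
The same centered layout could also be built without modifying the source lines. This is only a sketch: `spaces` and `kwicLine` are hypothetical helper names, and it assumes the requested width is at least as long as the search phrase. It cuts the context on each side of the match separately and pads whichever side comes up short:

```javascript
// Hypothetical helper: a string of n spaces (ES3 has no String.repeat).
function spaces(n) {
    var s = '';
    while (s.length < n) {
        s += ' ';
    }
    return s;
}

// Hypothetical helper: returns exactly width characters with the first
// occurrence of search at a fixed column, padding with spaces as needed.
// Assumes search.length <= width.
function kwicLine(line, search, width) {
    var pos = line.indexOf(search);
    if (pos === -1) {
        return null; // phrase not present
    }
    var left = (width - search.length) >> 1;   // context characters before the phrase
    var right = width - search.length - left;  // context characters after the phrase
    var afterStart = pos + search.length;
    var before = line.substring(Math.max(0, pos - left), pos);
    var after = line.substring(afterStart, afterStart + right);
    return spaces(left - before.length) + before + search +
           after + spaces(right - after.length);
}

// kwicLine("1 zzzz 12345 6789 1234567", "zzzz", 10) returns " 1 zzzz 12"
// kwicLine("1234 6789 12345678 zzzz 1", "zzzz", 10) returns "78 zzzz 1 "
```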

Report · Mar 31, 2020

In this case your search should be

.{0,3}zzzz.{0,3}

which parses as zzzz preceded by up to three characters and followed by up to three characters. You can generalise that using this very kludgy code:

search = 'zzzz';
extras = 10 - search.length;
// Split the spare characters between the two sides; if the count is odd,
// the right side gets the extra character.
left = Math.floor(extras / 2);
right = Math.ceil(extras / 2);

app.findGrepPreferences = null;
app.findGrepPreferences.findWhat = '.{0,' + left + '}' + search + '.{0,' + right + '}';
found = app.documents[0].findGrep();
for (i = 0; i < found.length; i++) {
  $.writeln (found[i].contents)
}
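
Outside InDesign, the same pattern can be tried with a plain JavaScript RegExp. This is a rough sketch: `escapeRegExp` and `grepExtract` are hypothetical helper names, not part of any Adobe API. Note that the pattern does not borrow unused room from one side to give to the other, so a match near the start or end of a line comes out shorter than the maximum:

```javascript
// Hypothetical helper: escape regex metacharacters so the search string
// is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: returns every match of the search string together
// with up to half of the spare characters on each side.
function grepExtract(text, search, maxLength) {
    var extras = maxLength - search.length;
    var left = Math.floor(extras / 2);
    var right = Math.ceil(extras / 2);
    var re = new RegExp(
        '.{0,' + left + '}' + escapeRegExp(search) + '.{0,' + right + '}', 'g');
    // "." does not match a line break, so matches never span two lines.
    return text.match(re) || [];
}

var sample = '123456789 zzzz 1234567890\n1 zzzz 12345 6789 1234567\n1234 6789 12345678 zzzz 1\n';
var found = grepExtract(sample, 'zzzz', 10);
// found is ["89 zzzz 12", "1 zzzz 12", "78 zzzz 1"]
```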

Report · Mar 31, 2020

Jongware beat me to it! With a better approach, too.

Report · Mar 31, 2020

Thank you Jongware and Peter! It's good to see that you InDesign wireheads are still active. I am working in FrameMaker but like to post here for the extra brain power.

One problem with Jongware's solution: if the original string is shorter than the extraction length, it doesn't work. Here is a solution that I worked up, but I am open to optimizations. Thank you very much!

var lines, extractLength, text, find, start, middle, end, result, i;

// Sample lines.
lines = [];
lines.push ("A test is cool.");
lines.push ("This is a test.");
lines.push ("This is another test.");
lines.push ("This is a test of the emergency broadcast system.");
lines.push ("test");

// The find string and the total extraction length.
find = "test";
extractLength = 24;

result = [];
for (i = 0; i < lines.length; i += 1) {
    text = lines[i];
    // Where does the find string start?
    start = text.indexOf (find);
    if (start > -1) {
        // Find the middle of the find string.
        middle = Math.ceil (start + find.length / 2);
        // Find where the extract string should start.
        start = middle - extractLength / 2;
        if (start < 0) {
            start = 0;
        }
        // Get the end of the extract string.
        end = start + extractLength;
        // Move the end back if it is past the end.
        while (end > text.length) {
            end -= 1;
            // Move the start back if there is room.
            if (start > 0) {
                start -= 1;
            }
        }
        // Extract the text between start and end. extractLength is left
        // unchanged so that the remaining lines still use the full length.
        result.push ("..." + text.substring (start, end) + "...");
    }
}

alert (result.join ("\r"));

Report · Mar 31, 2020

var lines, extractLength, text, find, start, middle, end, result, i;

// Sample lines.
lines = [];
lines.push ("Hard test!");
lines.push ("A test is cool.");
lines.push ("This is a test.");
lines.push ("test is really cool!.");
lines.push ("This is a test of the emergency broadcast system.");
lines.push ("test");

// The find string and the total extraction length.
find = "test";
//-------------------------------------------------------------------------------------
extractLength = 4;
//-------------------------------------------------------------------------------------

result = ["Corrected by FRIdNGE! Base = " + extractLength + "\r"];

for (i = 0; i < lines.length; i += 1) {
    
    text = lines[i];
    
    if (text.length < extractLength || find.length > extractLength) {
        result.push ("Error on line " + Number(i+1) + " == " + lines[i]);
        continue;
    }

    start = text.indexOf (find);
    
    if (start > -1) {
        
        // Find the middle of the find string.
        middle = Math.ceil(start + find.length / 2);
        
        // Find where the extract string should start.
        start = Math.ceil(middle - extractLength / 2);
        
        if (start < 0) start = 0;
        // Get the end of the extract string.
        end = Math.ceil(start + extractLength);
        
        // Move the end back if it is past the end.
        while (end > text.length) {
            start -= 1;
            end -= 1;
        }
    
        // Recalculate the extract length.
        extractLength = end - start;
        result.push ("Good! |" + text.substr (start, extractLength) + "|  => " + extractLength + " chars");
        
    }

}

alert(result.join ("\r"));

Adobe Community
