## Text extraction algorithm

Hi All,

I need help with a text extraction algorithm. Given a string of text inside a larger string, I need to extract a string of a certain number of characters with my given string in the middle. I am going to use some nonsense paragraphs, but hopefully they will illustrate what I am trying to do. Here are three paragraphs with the given text "zzzz":

``````123456789 zzzz 1234567890
1 zzzz 12345 6789 1234567
1234 6789 12345678 zzzz 1``````

I want to extract 10 characters with the "zzzz" in the middle. If the "zzzz" can't be in the middle (as in the 2nd and 3rd paragraphs), I still want 10 characters. So, here is the result I want:

``````89 zzzz 12
1 zzzz 123
678 zzzz 1``````

Of course, there could be instances where the overall string contains fewer characters, but I want to cap the extraction at a specific number of characters (10 in this case). So, given the overall length of a container string, the length of the target string and its position in the container string, and the total number of characters to extract, I am looking for a general algorithm to extract the characters, keeping the target string as close to the center as possible.

I am using ExtendScript but even pseudocode would be helpful. Any ideas or pointers would be appreciated. Thank you very much. -Rick
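For reference, the requirement described above boils down to "compute the ideal start, then clamp it to the container". Here is a minimal sketch of that idea. The function name `extractAround` and its parameter names are hypothetical (they do not come from this thread), and it uses only plain string methods that should also be available in ExtendScript.

```javascript
// Hypothetical helper, not from the thread: returns up to maxLength
// characters of container, keeping target as close to the center as possible.
// Returns null when target does not occur in container.
function extractAround(container, target, maxLength) {
    var pos = container.indexOf(target);
    if (pos === -1) {
        return null;
    }
    // Never ask for more characters than the container holds.
    var length = Math.min(maxLength, container.length);
    // Ideal start: split the room left over around the target evenly.
    var start = pos - Math.floor((length - target.length) / 2);
    // Clamp the window so it stays inside the container on both sides.
    start = Math.max(0, Math.min(start, container.length - length));
    return container.slice(start, start + length);
}
```

With the three sample paragraphs and a length of 10, this returns `89 zzzz 12`, `1 zzzz 123` and `678 zzzz 1`. A container shorter than the requested length is returned whole.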


Adobe Community Professional , Mar 31, 2020
Most Valuable Participant , Mar 31, 2020
This works for your example text:

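``````text = '123456789 zzzz 1234567890\n1 zzzz 12345 6789 1234567\n1234 6789 12345678 zzzz 1\n';
search = 'zzzz';
max_length = 10;
lines = text.split('\n');

result = [];
for (i = 0; i < lines.length; i++)
{
    start = lines[i].indexOf(search);
    if (start > -1)
    {
        // Shift left by half of the room left over around the search string.
        start -= (max_length - search.length) >> 1;
        // Too close to the beginning: start at the first character.
        if (start < 0)
            start = 0;
        // Too close to the end: pull the window back so it still holds
        // max_length characters. Assumes the line is at least max_length long.
        if (start + max_length >= lines[i].length)
            start = lines[i].length - max_length;
        result.push(lines[i].substr(start, max_length));
    }
}

``````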

Is this for a linguistics purpose, to display a concordance such as with AntConc? In that case the key phrase should always be centered, and left and right should be padded with spaces if necessary. That's just a case of, again, checking 'start' and 'start + max_length', but this time adding spaces until they are in range:

``````text = '123456789 zzzz 1234567890\n1 zzzz 12345 6789 1234567\n1234 6789 12345678 zzzz 1\n';
search = 'zzzz';
max_length = 10;
lines = text.split('\n');

result = [];
for (i = 0; i < lines.length; i++)
{
    start = lines[i].indexOf(search);
    if (start > -1)
    {
        start -= (max_length - search.length) >> 1;
        // Pad on the left until the window starts inside the line.
        while (start < 0)
        {
            lines[i] = ' ' + lines[i];
            start++;
        }
        // Pad on the right until the window ends inside the line.
        while (start + max_length >= lines[i].length)
        {
            lines[i] += ' ';
        }
        result.push(lines[i].substr(start, max_length));
    }
}

``````

``````89 zzzz 12
 1 zzzz 12
78 zzzz 1 ``````
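If the padded (concordance) behavior is needed in more than one place, the same idea can be wrapped in a function that leaves the original lines untouched. This is only a sketch: `kwicLine` and its parameter names are hypothetical, and it assumes `width` is at least as long as the key phrase.

```javascript
// Hypothetical helper, not from the thread: returns exactly `width`
// characters with `target` centered, padding with spaces where the line
// runs out. Returns null when target does not occur in line.
// Assumption: width >= target.length.
function kwicLine(line, target, width) {
    var pos = line.indexOf(target);
    if (pos === -1) {
        return null;
    }
    // Context sizes; an odd remainder goes to the right-hand side.
    var left = Math.floor((width - target.length) / 2);
    var right = width - target.length - left;
    // Take whatever context exists on each side of the target.
    var before = line.slice(Math.max(0, pos - left), pos);
    var after = line.slice(pos + target.length, pos + target.length + right);
    // Pad the short side(s) with spaces so the target stays centered.
    while (before.length < left) {
        before = ' ' + before;
    }
    while (after.length < right) {
        after += ' ';
    }
    return before + target + after;
}
```

For the second sample paragraph this gives `" 1 zzzz 12"` (one leading space), and for the third `"78 zzzz 1 "` (one trailing space), so every result is exactly 10 characters wide.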

Mar 31, 2020

In this case your search should be

``.{0,3}zzzz.{0,3}``

which parses as zzzz preceded by up to three characters and followed by up to three characters. You can generalise that using this very kludgy code:

``````search = 'zzzz';
extras = (10 - search.length) / 2;
if (String(extras).indexOf('.') > -1) {
    // Odd number of extra characters: the right side gets one more.
    left = parseInt(extras, 10);
    right = parseInt(extras, 10) + 1;
} else {
    left = right = extras;
}

app.findGrepPreferences = null;
app.findGrepPreferences.findWhat = '.{0,' + left + '}' + search + '.{0,' + right + '}';
found = app.documents[0].findGrep();
for (i = 0; i < found.length; i++) {
    $.writeln(found[i].contents);
}``````
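To try the pattern outside InDesign, the same expression can be built with a native RegExp. This is a rough sketch: `grepAround` is a hypothetical helper, and it assumes the search string contains no regular-expression metacharacters and is not longer than the maximum length.

```javascript
// Hypothetical helper, not from the thread: applies the
// '.{0,left}search.{0,right}' pattern with a plain RegExp.
// Assumptions: search has no regex metacharacters and
// search.length <= maxLength.
function grepAround(line, search, maxLength) {
    var extras = maxLength - search.length;
    // An odd remainder goes to the right-hand side.
    var left = Math.floor(extras / 2);
    var right = extras - left;
    var pattern = new RegExp('.{0,' + left + '}' + search + '.{0,' + right + '}');
    var match = pattern.exec(line);
    return match ? match[0] : null;
}
```

One thing to be aware of: each side is capped independently, so when the search string sits near an edge the match is shorter than the maximum rather than borrowing characters from the other side. For `1 zzzz 12345 6789 1234567` this returns `1 zzzz 12` (9 characters), where the original question asked for `1 zzzz 123`.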

Mar 31, 2020

Jongware beat me to it! With a better approach, too.

Mar 31, 2020

Thank you Jongware and Peter! It's good to see that you InDesign wireheads are still active. I am working in FrameMaker but like to post here for the extra brain power.

One problem with Jongware's solution: if the original string is shorter than the extraction length, it doesn't work. Here is a solution that I worked up, but I am open to optimizations. Thank you very much!

``````var lines, extractLength, text, find, start, middle, end, count, result, i;

// Sample lines.
lines = [];
lines.push ("A test is cool.");
lines.push ("This is a test.");
lines.push ("This is another test.");
lines.push ("This is a test of the emergency broadcast system.");
lines.push ("test");

// The find string and the total extraction length.
find = "test";
extractLength = 24;

result = [];
for (i = 0; i < lines.length; i += 1) {
    text = lines[i];
    // Where does the find string start?
    start = text.indexOf (find);
    if (start > -1) {
        // Find the middle of the find string.
        middle = Math.ceil (start + find.length / 2);
        // Find where the extract string should start.
        start = middle - extractLength / 2;
        if (start < 0) {
            start = 0;
        }
        // Get the end of the extract string.
        end = start + extractLength;
        // Move the end back if it is past the end.
        while (end > text.length) {
            end -= 1;
            // Move the start back if there is room.
            if (start > 0) {
                start -= 1;
            }
        }
        // Length actually extracted for this line. Kept in its own variable
        // so that extractLength is not shortened for the following lines.
        count = end - start;
        result.push ("..." + text.substr (start, count) + "...");
    }
}
``````

Mar 31, 2020

``````var lines, extractLength, text, find, start, middle, end, result, i;

// Sample lines.
lines = [];
lines.push ("Hard test!");
lines.push ("A test is cool.");
lines.push ("This is a test.");
lines.push ("test is really cool!.");
lines.push ("This is a test of the emergency broadcast system.");
lines.push ("test");

// The find string and the total extraction length.
find = "test";
//-------------------------------------------------------------------------------------
extractLength = 4;
//-------------------------------------------------------------------------------------

result = ["Corrected by FRIdNGE! Base = " + extractLength + "\r"];

for (i = 0; i < lines.length; i += 1) {

    text = lines[i];

    if (text.length < extractLength || find.length > extractLength) {
        result.push ("Error on line " + Number(i+1) + " == " + lines[i]);
        continue;
    }

    start = text.indexOf (find);

    if (start > -1) {

        // Find the middle of the find string.
        middle = Math.ceil(start + find.length / 2);

        // Find where the extract string should start.
        start = Math.ceil(middle - extractLength / 2);

        if (start < 0) start = 0;
        // Get the end of the extract string.
        end = Math.ceil(start + extractLength);

        // Move the end back if it is past the end.
        while (end > text.length) {
            start -= 1;
            end -= 1;
        }

        // Recalculate the extract length.
        extractLength = end - start;
        result.push ("Good! |" + text.substr (start, extractLength) + "|  => " + extractLength + " chars");

    }

}