
Text extraction algorithm

Community Expert, Mar 30, 2020

Hi All,

I need help with a text extraction algorithm. Given a target string inside a larger string, I need to extract a fixed number of characters with the target string in the middle. I am going to illustrate it with some nonsense paragraphs, but hopefully they will show what I am trying to do. Here are three paragraphs with the given text "zzzz":

 

123456789 zzzz 1234567890
1 zzzz 12345 6789 1234567
1234 6789 12345678 zzzz 1

 

I want to extract 10 characters with the "zzzz" in the middle. If the "zzzz" can't be in the middle (as in the 2nd and 3rd paragraphs), I still want 10 characters. So, here is the result I want:

 

89 zzzz 12
1 zzzz 123
678 zzzz 1

 

Of course, there could be instances where the overall string contains fewer characters, but I want to cap the extraction at a specific number of characters (10 in this case). So, given the overall length of a container string, the length of the target string and its position in the container string, and the total number of characters to extract, I am looking for a general algorithm to extract the characters, keeping the target string as close to the center as possible.

 

I am using ExtendScript but even pseudocode would be helpful. Any ideas or pointers would be appreciated. Thank you very much. -Rick
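
The requirement above can be read as a clamped window: place the window so the target is centered, then push it back inside the container if it sticks out on either side. The following is only a minimal sketch of that idea, not code from any reply in this thread. The function name `extractAround` and its parameters are hypothetical, and it assumes only the first occurrence of the target matters.

```javascript
// Hypothetical helper (not from the thread): returns up to maxLength
// characters of text with the first occurrence of find kept as close
// to the center as possible, or null if find does not occur.
function extractAround(text, find, maxLength) {
    var pos = text.indexOf(find);
    if (pos === -1) {
        return null;
    }
    // Never ask for more characters than the container holds.
    var length = Math.min(maxLength, text.length);
    // Ideal start: split the leftover characters evenly around the target.
    var start = pos - Math.floor((length - find.length) / 2);
    // Clamp the window so it stays inside the container.
    var maxStart = text.length - length;
    if (start > maxStart) {
        start = maxStart;
    }
    if (start < 0) {
        start = 0;
    }
    return text.substring(start, start + length);
}
```

With the three sample paragraphs, `extractAround(line, "zzzz", 10)` gives "89 zzzz 12", "1 zzzz 123" and "678 zzzz 1". A container shorter than the requested length is returned whole.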


Community Expert, Mar 31, 2020

This works for your example text:

 

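text = '123456789 zzzz 1234567890\n1 zzzz 12345 6789 1234567\n1234 6789 12345678 zzzz 1\n';
search = 'zzzz';
max_length = 10;
lines = text.split('\n');

result = [];
for (i=0; i<lines.length; i++)
{
	start = lines[i].indexOf(search);
	if (start > -1)
	{
		// Back up by half of the characters left over after the search string.
		start -= (max_length-search.length)>>1;
		// Too close to the beginning: start at the first character.
		if (start < 0)
			start = 0;
		// Too close to the end: shift the window left so it ends at the last character.
		// Note: a line shorter than max_length makes start negative here.
		if (start + max_length >= lines[i].length)
			start = lines[i].length - max_length;
		result.push (lines[i].substr(start,max_length));
	}
}

alert (result.join('\n'));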

 

 

Is this for a linguistics purpose, to display a concordance such as with AntConc? In that case the key phrase should always be centered, and the left and right sides should be padded with spaces if necessary. That's just a case of, again, checking 'start' and 'start+max_length', but this time adding spaces until they are in range:

 

text = '123456789 zzzz 1234567890\n1 zzzz 12345 6789 1234567\n1234 6789 12345678 zzzz 1\n';
search = 'zzzz';
max_length = 10;
lines = text.split('\n');

result = []
for (i=0; i<lines.length; i++)
{
	start = lines[i].indexOf(search);
	if (start > -1)
	{
		start -= (max_length-search.length)>>1;
		while (start < 0)
		{
			lines[i] = ' '+lines[i]
			start++;
		}
		while (start + max_length >= lines[i].length)
		{
			lines[i] += ' ';
		}
		result.push (lines[i].substr(start,max_length));
	}
}

alert (result.join('\n'))

 

 

which produces a padded result:

89 zzzz 12
 1 zzzz 12
78 zzzz 1 

 

Community Expert, Mar 31, 2020

In this case your search should be 

.{0,3}zzzz.{0,3}

which parses as zzzz preceded by up to three characters and followed by up to three characters. You can generalise that using this very kludgy code:

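search = 'zzzz';
// Characters available on either side of the search string.
extras = (10 - search.length) / 2;
// The GREP quantifier bounds must be whole numbers: when the leftover
// count is odd, give the extra character to the right-hand side.
left = Math.floor (extras);
right = Math.ceil (extras);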

app.findGrepPreferences = null;
app.findGrepPreferences.findWhat = '.{0,' + left + '}' + search + '.{0,' + right + '}';
found = app.documents[0].findGrep();
for (i = 0; i < found.length; i++) {
  $.writeln (found[i].contents)
}
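
Outside InDesign (for example in FrameMaker, where `findGrep` is not available), the same pattern can be built as a plain JavaScript `RegExp`. This is a minimal sketch under that assumption. The helper name `buildContextRegExp` is hypothetical, and the escaping step is an addition so that a search string containing regex metacharacters is matched literally.

```javascript
// Hypothetical helper (not from the thread): builds a RegExp that matches
// find with up to (maxLength - find.length) surrounding characters,
// split between the left and right sides.
function buildContextRegExp(find, maxLength) {
    var extras = Math.max(0, maxLength - find.length);
    var left = Math.floor(extras / 2);
    var right = extras - left;
    // Escape regex metacharacters so find is matched literally.
    var escaped = find.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp(".{0," + left + "}" + escaped + ".{0," + right + "}");
}
```

One design limitation: the pattern cannot hand unused characters from one side to the other. For the second sample paragraph, `buildContextRegExp("zzzz", 10).exec("1 zzzz 12345 6789 1234567")[0]` is "1 zzzz 12", nine characters rather than the ten requested, because only two characters exist to the left of the target.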

 

Community Expert, Mar 31, 2020

Jongware beat me to it! With a better approach, too.

Community Expert, Mar 31, 2020

Thank you Jongware and Peter! It's good to see that you InDesign wireheads are still active. I am working in FrameMaker but like to post here for the extra brain power.

 

One problem with Jongware's solution: If the total length of the original string is less than the length of the extract string, then it didn't work. Here is a solution that I worked up, but I am open to optimizations. Thank you very much!

 

var lines, extractLength, text, find, start, middle, end, result, i;

// Sample lines.
lines = [];
lines.push ("A test is cool.");
lines.push ("This is a test.");
lines.push ("This is another test.");
lines.push ("This is a test of the emergency broadcast system.");
lines.push ("test");

// The find string and the total extraction length.
find = "test";
extractLength = 24;

result = [];
for (i = 0; i < lines.length; i += 1) {
    text = lines[i];
    start = text.indexOf (find);
    // Where does the find string start?
    if (start > -1) {
        // Find the middle of the find string.
        middle = Math.ceil (start + find.length / 2);
        // Find where the extract string should start.
        start = middle - extractLength / 2;
        if (start < 0) {
            start = 0;
        }
        // Get the end of the extract string.
        end = start + extractLength;
        // Move the end back if it is past the end.
        while (end > text.length) {
            end -= 1;
            // Move the start back if there is room.
            if (start > 0) {
                start -= 1;
            }
        }
        // Extract from start to end; extractLength is left untouched so the
        // next line still uses the full requested length.
        result.push ("..." + text.substring (start, end) + "...");
    }
}

alert (result.join ("\r"));

Guide, Mar 31, 2020

var lines, extractLength, text, find, start, middle, end, result, i;

// Sample lines.
lines = [];
lines.push ("Hard test!");
lines.push ("A test is cool.");
lines.push ("This is a test.");
lines.push ("test is really cool!.");
lines.push ("This is a test of the emergency broadcast system.");
lines.push ("test");

// The find string and the total extraction length.
find = "test";
//-------------------------------------------------------------------------------------
extractLength = 4;
//-------------------------------------------------------------------------------------

result = ["Corrected by FRIdNGE! Base = " + extractLength + "\r"];

for (i = 0; i < lines.length; i += 1) {
    
    text = lines[i];
    
    if (text.length < extractLength || find.length > extractLength) {
        result.push ("Error on line " + Number(i+1) + " == " + lines[i]);
        continue;
    }

    start = text.indexOf (find);
    
    if (start > -1) {
        
        // Find the middle of the find string.
        middle = Math.ceil(start + find.length / 2);
        
        // Find where the extract string should start.
        start = Math.ceil(middle - extractLength / 2);
        
        if (start < 0) start = 0;
        // Get the end of the extract string.
        end = Math.ceil(start + extractLength);
        
        // Move the end back if it is past the end.
        while (end > text.length) {
            start -= 1;
            end -= 1;
        }
    
        // Recalculate the extract length.
        extractLength = end - start;
        result.push ("Good! |" + text.substr (start, extractLength) + "|  => " + extractLength + " chars");
        
    }

}

alert(result.join ("\r"));
