Known Participant
March 25, 2013
Answered

Script for ID Dictionary

Hi,

I have a large medical document (around 350 pages) containing thousands of words. The document has already been proofread, and I want to use it to build a specialized InDesign dictionary for use in ID and other text processing software. I also want to build a database of those terms for future work in that discipline. So I need a script that gathers all the words from the document (all the text is completely threaded in a single frame), eliminates the repeated words, sorts them alphabetically and puts the results in a text file. Every word should be in its own paragraph.

After that I will give the resulting text document another quick read to look for inconsistencies, other errors, typos, etc. In the end I will use the final output to build a specialized ID user dictionary and a database of those terms for the technical proofreading of this type of book.

I confess that I do not know much about scripting (I really have tried...), so I have this question: is this doable? Is there someone out there who could write this script for me?

Thank you in advance,

Maria

This topic has been closed for replies.

2 replies

Jongware
Community Expert
Correct answer
March 27, 2013

Here is a working JavaScript that does the word gathering and sorting. There are more straightforward ways to do this, but my script uses a couple of shortcuts that are possible with both JavaScript (split a string into an array on a regular expression; remove duplicates by feeding the result into an object) and InDesign (using everyItem to quickly gather all possible text). So it may be kind of unclear what happens where.

A problem (as you may already have found out yourself) is how to determine what a 'word' is. This script replaces common punctuation and digits with a space, and then only gathers what's left between the spaces. You are sure to find some weird "words" this way, but then again so does your manual way.

After processing, the script prompts for a Save File name and then opens it in your default plain text editor.

// Gather the text of every story in the active document
textList = app.activeDocument.stories.everyItem().texts.everyItem().contents.join('\r');
// Replace common punctuation and digits with a space
textList = textList.replace(/[.,:;!?()\/\d\[\]]+/g, ' ');
// Split on runs of whitespace
textList = textList.split(/\s+/);
// Remove duplicates: every word becomes a property name of tmpList
tmpList = {};
for (i=0; i<textList.length; i++)
  if (textList[i] != '')
    tmpList[textList[i]] = true;
// Copy the unique words back into an array and sort them
resultList = [];
i = 0;
for (j in tmpList)
  resultList[i++] = j;
resultList.sort();

// Ask where to save the list, write it, and open it in the default editor
defaultFile = new File (Folder.myDocuments+"/"+app.activeDocument.name.replace(/\.indd$/i, '')+".txt");
if (File.fs == "Windows")
  writeFile = defaultFile.saveDlg( 'Save list', "Plain text file:*.txt;All files:*.*" );
else
  writeFile = defaultFile.saveDlg( 'Save list');
if (writeFile != null)
{
  if (writeFile.open("w"))
  {
    writeFile.encoding = "utf8";
    writeFile.write (resultList.join("\r")+"\r");
    writeFile.close();
    writeFile.execute();
  }
}
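
To make the "what is a word" problem concrete, here is a minimal sketch of the same gather, dedupe and sort steps as a plain function that runs outside InDesign. The name uniqueSortedWords and the sample sentence are made up for illustration; only the regular expression is taken from the script above.

```javascript
// Hypothetical helper: the clean / split / dedupe / sort steps on a plain string.
// Written in old-style JavaScript so it should also be usable in ExtendScript.
function uniqueSortedWords(text) {
    // Replace the same punctuation and digits as the script above with spaces.
    var cleaned = text.replace(/[.,:;!?()\/\d\[\]]+/g, ' ');
    var pieces = cleaned.split(/\s+/);
    var seen = {};
    var result = [];
    for (var i = 0; i < pieces.length; i++) {
        var word = pieces[i];
        // Skip the empty strings split() yields at the edges, and use
        // hasOwnProperty so words such as "constructor" are not mistaken
        // for entries that were already seen.
        if (word !== '' && !Object.prototype.hasOwnProperty.call(seen, word)) {
            seen[word] = true;
            result.push(word);
        }
    }
    // Default sort: plain character-code order, so capitals come first.
    return result.sort();
}

// Made-up sample sentence, not taken from the document in question.
var sample = "The aorta (see p. 12) and the vena cava; the aorta, again.";
var words = uniqueSortedWords(sample);
// words is ["The", "again", "and", "aorta", "cava", "p", "see", "the", "vena"]
```

Note the stray "p" left over from "p. 12", and that "The" and "the" are kept as two separate entries. These are the kind of weird "words" the final proofreading pass has to catch.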

Inspiring
March 28, 2013

Hi,

I thought using everyItem might cause a speed problem, so I exported the stories instead ...

#target indesign

// http://www.shamasis.net/
// Return a new array without duplicate entries
Array.prototype.unique = function() {
    var o = {}, i, l = this.length, r = [];
    for (i = 0; i < l; i += 1) o[this[i]] = this[i];
    for (i in o) r.push(o[i]);
    return r;
};

var storyFiles = new Array();
var currDoc = app.activeDocument;
var docName = currDoc.name;
var currStories = currDoc.stories.everyItem().getElements();
var l = currStories.length;

// Export every story to a temporary text file on the desktop
while (l--) {
    var currStory = currStories[l];
    var storyFile = File('~/Desktop/' + docName + l + '.txt');
    currStory.exportFile(ExportFormat.TEXT_TYPE, storyFile);
    storyFiles.push(storyFile);
}

// Read the exported files back into one string and delete them
var masterStory = '';
l = storyFiles.length;
while (l--) {
    var currExport = storyFiles[l];
    currExport.open('r');
    masterStory = masterStory + currExport.read();
    currExport.close();
    currExport.remove();
}

// Strip punctuation and line breaks, split on spaces, dedupe, sort
var finalCut = masterStory.replace(/[?,.!\n\r]/g, ' ').split(' ').unique().sort().join('\n');

var destFile = File('~/Desktop/' + docName.replace(/indd/, 'txt'));
write_file(destFile, finalCut);
destFile.execute();

function write_file(_file, _data) {
    _file.open('w');
    _file.encoding = 'UTF-8';
    _file.write(_data);
    _file.close();
}
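
Both scripts in this thread call sort() without a comparison function, which orders strings by character code, so every capitalized word lands before every lowercase one. If a more dictionary-like order is wanted, a comparison function can be passed in. The sketch below is one possible way; sortIgnoringCase is a made-up name, and it makes no attempt to handle accented letters.

```javascript
// Hypothetical helper: sort a copy of the list, ignoring letter case.
function sortIgnoringCase(words) {
    var copy = words.slice(0); // leave the caller's array untouched
    copy.sort(function (a, b) {
        var la = a.toLowerCase();
        var lb = b.toLowerCase();
        if (la < lb) return -1;
        if (la > lb) return 1;
        // Same letters, different case: fall back to the plain comparison
        // so the order is always the same.
        if (a < b) return -1;
        if (a > b) return 1;
        return 0;
    });
    return copy;
}

var plain = ["beta", "Alpha", "alpha", "Zeta"].sort();
// plain is ["Alpha", "Zeta", "alpha", "beta"]
var friendly = sortIgnoringCase(["beta", "Alpha", "alpha", "Zeta"]);
// friendly is ["Alpha", "alpha", "beta", "Zeta"]
```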

MrTIFF
Participating Frequently
March 26, 2013

Oh yes, doable, not particularly difficult.

Maria964
Author
Known Participant
March 26, 2013

Hi, Stephen,

Thank you very much for your opinion. I am sure that this is doable and not particularly difficult (for you). For me it is rather difficult because I don't get along with scripting (I have tried...). But anyway, I solved my problem: I replaced all the spaces between words with paragraph breaks, exported to text, and used grep in Notepad++ to sort the words and delete the duplicates. After another quick proofread I will have a wonderful InDesign user dictionary for medical terms.

Now another question: do you know of any script that I can use to convert all the tables in a document to text?

Maria

MrTIFF
Participating Frequently
March 26, 2013

Finding the text in a table isn't hard ...

Where do you want the table text to be put? In a new textFrame?

What do you need to do with the table text?

How do you want the table text to look?

What happens to the original table?

Making the table text look anything like the original table is, I think, hard. How would you do this "by hand"? The answer will tell us what the script needs to do, and what decisions such a script would need to make.
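
As an illustration of one possible set of answers to these questions (not necessarily the one Maria needs), the sketch below turns a table into plain text with tabs between columns and a return between rows. It assumes the cell contents have already been collected into an array of rows; how they are read from the InDesign document is left out, and tableRowsToText and the sample cells are made up for the example.

```javascript
// Hypothetical helper: flatten rows of cell strings into delimited text.
function tableRowsToText(rows, columnSeparator, rowSeparator) {
    var lines = [];
    for (var r = 0; r < rows.length; r++) {
        // One line per row, cells joined by the column separator.
        lines.push(rows[r].join(columnSeparator));
    }
    return lines.join(rowSeparator);
}

// Made-up sample table: a header row and one body row.
var tableText = tableRowsToText(
    [["Drug", "Dose"], ["Aspirin", "100 mg"]],
    "\t", // between columns
    "\r"  // between rows
);
// tableText is "Drug\tDose\rAspirin\t100 mg"
```

Anything beyond this (keeping column widths, cell styles, or merged cells) is exactly where the decisions asked about above come in.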