Compare text using script or grep

Report · Feb 15, 2023

Hi

I have a huge txt file containing lines of the Quran separated by soft returns. I want to use this file as a base file because it has been spell checked. I have attached a screenshot.

What I want to achieve is that whenever anyone types a line of the Quran in InDesign, it should be matched with the line in the base txt file.

Can it be done through GREP? Is there any script which can do this?

Thanks

Report · Feb 15, 2023

In my experience, Word is excellent at doing file comparisons. If this is truly a 'text' file, not one necessarily formatted in InDesign, you might consider using the better tool rather than trying to adapt ID's capabilities.

Word's approach is also full-page and allows instant corrections. I can't quite imagine an ID script handling differences in any way but one at a time, which would be very, very tedious to process.

The other standard approach, if both documents are in InDesign, is to export them to PDF and use Acrobat DC to compare them. But that makes no provision for corrections; I still see Word as the right (and pretty good) tool for a job like this.

Report · Feb 15, 2023

Hello James

Thanks so much for the reply.

Actually you are right that Word has better options for this, but I am working entirely in InDesign and have little or no idea of doing page setups in Word. So I would prefer doing this in InDesign. If nothing works then I will have to do the spell check in Word and then copy-paste or import into InDesign again.

Thanks

Report · Feb 15, 2023

So, I did set up something like this for a client, a few years ago. I don't think it'll work for you, but it's worth asking a few questions. What they wanted was, anytime anyone typed a phrase that appeared in their list of phrases, they wanted the person keying it to be informed. Their list had maybe fifteen phrases on it? So I set up fifteen-or-so GREP styles, to automatically apply highlighting to any second appearance of that phrase.

Is that what you were trying to ask for? Because it wouldn't work for you, as the number of lines in the Quran is rather larger than 15, and even just 15 GREP Styles running on a medium size document was a decent performance hit.

Like James, I see using other tools or environments as the best way to achieve what I think you're trying to do, but I'm still not certain that I understand what it is that you're after. Are you trying to help users quote the Quran in InDesign? Find line numbers, maybe?

Report · Feb 15, 2023

I'm thinking I answered a slightly different version of the OP. The need is not quite for file comparison, but something of a database lookup.

I think this would be a very, very complex task to embed/automate, even with some kind of script/SQL interface to that complete reference file. Almost something like a specialized word processing system in itself.

Report · Feb 18, 2023

Easy to do on a PC in VB 😉

Report · Feb 18, 2023

Sure. Just "COPY CON QURANCONVERT.EXE"... and type, type, type away.

This very much could be achieved... but "easy" is not a word I'd use in the project definition.

It does occur to me that it (1) could have lasting commercial value, or (2) might already exist (with, obviously, a different database) in the Bible editing world. Creation or conversion might be worth getting funded.

Report · Feb 18, 2023

Yes 😉 I was thinking about LIVE comparison - script would monitor what has been just typed and query database - kind of what you can achieve on the phone when you are typing - suggestions.

Report · Feb 15, 2023

Hi Joel

Thanks for the reply.

Yes, I was looking for something like this. Actually the whole Quran text is divided into 30 books, each containing approx 20-22 pages. I would like to see what GREP solution you provided so that I can implement the same.

Actually there are many variations in Quran text, like IndoPak, Middle-Eastern and some others. I am working on the IndoPak script and I have a spell-checked txt version of the IndoPak script which I want to use as a database for comparison.

The need for comparison is that after page setup of the Al-Quran text, I would like to run the comparison one final time before printing. The reason is that there can be human error while working on the page setup. So, I want to compare line by line against the database.

Thanks

Report · Feb 16, 2023

Ohhhhhh

Now, that makes sense!

Like I said previously, my solution won't work for you. What I used is called a "GREP Style." It's something you set up inside a paragraph style. It basically is constantly running a GREP Find query against everything marked with that paragraph style, and anytime it finds a hit, it marks it with a character style. StackExchange tells me that there are around seven to eight thousand lines in the Quran, depending on how you count 'em. (Does the surah name count? How about bismillah? But it doesn't matter, because...) This means that your paragraph style would need about eight thousand GREP queries running constantly on your entire document, which would probably be impossible. Even if it didn't crash InDesign, I expect it would slow it down to the point of unusability. It's a cool technique, just not for your use case. I'll add a quick animation at the end of my post showing you how it works.

What you'd want instead, I think, would be a script that you'd run once, after layout, which would compare the lines in your doc against your master document. That way you can have each of your non-matching lines flagged for your review, without spending a vast amount of resources on constantly searching your entire document. I like m1b's idea a lot, but it's hard to adapt to your case, because I very much doubt that you're in there re-keying the Quran when you have the complete raw text of the document in another file. You're trying to automate post-layout QA review, right?

Anyhow, I'm looking at your sample screenshot, and I'm seeing a few things that confuse me. The character drops (the pink rectangles) are there because you're using Adobe Arabic here, and it doesn't have an end of ayah glyph, yes? Also, the first line doesn't have an end of ayah because it's actually the bismillah, right? I'm asking because we'd need some kind of segmentation to chop up your target document, and end of ayah is something I've used myself to that end, in the past.

Lastly, here's the GREP Style technique that I think would crash your computer. It works fine for a few searches, probably not so well for thousands.

Report · Feb 17, 2023

@Joel Cherney

First of all, thanks so much for the effort.

Yes, you are right. I want a script to flag the lines which do not match the database.

In the screenshot you will see 2 character drops which are showing in pink. Those are the opening and closing Arabic brackets. In some lines you will see that there is only one char drop. That is a special character.

Currently, I just want the line up to the first opening bracket (excluding the bracket) to be read and then compared with the database. If it does not match, apply a character style flagging that something is wrong in this line. If the comparison is OK, it moves on to the second line.

I may include the bracketed matter for comparison in future, but at present, I would like to do it without them.

To summarise, I want the script to :

a. Read a line from the database (text file saved in a folder)

b. Read a line from the InDesign file I am working on.

c. Compare the lines.

d. If all is OK, proceed to the next line.

e. If all is not OK, flag the line in InDesign file by a character style.

f. (If it flags a particular word, it would be a wonderful solution)
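Steps c to f can be sketched in plain JavaScript, independent of InDesign. This is only an illustration, not a tested solution for the real Quran files: the names `textBeforeBracket` and `firstDifferingWord` are made up here, and the sketch assumes that the brackets are the ornate parentheses U+FD3E and U+FD3F and that words are separated by whitespace.

```javascript
// Hypothetical helper: keep only the text before the first ornate bracket.
// Assumption: the brackets are U+FD3E or U+FD3F; adjust if the file uses others.
function textBeforeBracket(line) {
    var cut = line.search(/[\uFD3E\uFD3F]/);
    var text = cut === -1 ? line : line.substring(0, cut);
    return text.replace(/^\s+|\s+$/g, '');
}

// Hypothetical helper: index of the first word that differs,
// or -1 when the two lines match up to the first bracket.
function firstDifferingWord(masterLine, userLine) {
    var m = textBeforeBracket(masterLine).split(/\s+/),
        u = textBeforeBracket(userLine).split(/\s+/),
        count = Math.max(m.length, u.length);
    for (var i = 0; i < count; i++) {
        if (m[i] !== u[i]) {
            return i;
        }
    }
    return -1;
}
```

In InDesign, the returned index could then point at the matching item of the line's `words` collection, assuming InDesign splits the words the same way this sketch does.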

I tried writing a script to read from the text frame but I could not. 😞

Thanks and regards

Shahid

Report · Feb 17, 2023

Okay, so there's lots of possibilities, here. The first that occurs to me is completely free of scripting effort, and feels like a late 20th century workaround. Back then, software generally didn't have a way to mark text as e.g. Dari or Pashto or Burmese or whatever. So, if I did make a custom dictionary for a language, I'd have to pretend that it was named something else. So our Hmong dictionary was actually coded as "English - Scottish." Hmong translator thought that was funny.

Anyhow. We make you a custom dictionary file. Each "word" in the dictionary is a whole line of your Quran file. So we load it in as a custom dictionary, and make sure that it's not using the actual Hunspell dictionary. Then, if a single word is misspelled, you get the red wavy underline indicating a misspelling. Here's how it could work:

1) Take your Master Quran File, and make it into a raw text file.

a) Remove all of the line numbers from your master. This is actually the only place we'll be using GREP.

I don't know if you are using the decorative brackets or the actual end-of-ayah glyph because they've all dropped, but you can just copy them out of your text and paste them into the "Find what" field. I have two separate clauses to find either one-digit or two-digit verse numbers:

﴿\d﴾|﴿\d\d﴾

You might need to play with this query to get it to capture all of your verse numbers...
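The longest surah has 286 ayat, so a query limited to one or two digits will miss three-digit verse numbers. A quantifier such as `﴿\d{1,3}﴾` would cover all three cases in one clause. The sketch below shows the same idea as a plain JavaScript regular expression, as an illustration only. `stripVerseNumbers` is a made-up name, and because JavaScript's `\d` matches only ASCII digits, the Arabic-Indic digit ranges are listed explicitly, which is an assumption about how the numbers are encoded in the file.

```javascript
// Matches an ornate bracket, one to three digits, and another ornate bracket.
// Assumption: digits are ASCII, Arabic-Indic (U+0660-0669)
// or Extended Arabic-Indic (U+06F0-06F9).
var verseNumber = /[\uFD3E\uFD3F][0-9\u0660-\u0669\u06F0-\u06F9]{1,3}[\uFD3E\uFD3F]/g;

// Hypothetical helper: remove verse-number markers and tidy the leftover spaces.
function stripVerseNumbers(text) {
    return text
        .replace(verseNumber, '')
        .replace(/ {2,}/g, ' ')
        .replace(/^\s+|\s+$/g, '');
}
```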

b) Use the text tool to select the entire Quran, whack Control-A

c) File -> Export, choose "Text Only" as the file format. Give it a unique name (like QuranCustomDict.txt)

2) Add this custom dictionary to InDesign

a) Edit -> Preferences -> Dictionary -> Arabic

b) Add the dictionary

c) Make sure that Spelling is set to User Dictionary Only

3) Turn on Dynamic Spelling (Edit -> Preferences -> Spelling -> check the Dynamic Spelling box)

4) Go to your layout file that you want to check and ensure that all text is set to Arabic language, if it's not already:

a) Open a Find/Change dialog, go to the GREP tab

b) your "Find what" query is

.+

which means "find everything"

c) leave "Change to" blank

d) in the Change Formatting area, go to Advanced Character Formats and specify Arabic

e) whack that Change All button

Now, if that is all set up correctly, then any divergence from the lines as they appear in your custom dictionary gets marked as a spelling error. I have a fatha on the clipboard and am simply pasting it into random words, here:

Report · Feb 17, 2023

@Joel Cherney Very nice! 🙂

Report · Feb 17, 2023

@Joel Cherney Wow... Thanks a lot for so much effort, Joel.

I really appreciate the work you have done.

Actually, this part, I have already done. I have already prepared a Quran Hunspell Dictionary and written a batch script which installs it automatically.

What matters now is the sequence of words. That's the main reason I wanted to compare it line by line against the database.

Any suggestions on that?

Thanks once again.

Regards

Shahid

Report · Feb 18, 2023

@Bedazzled532, if I understand Joel's technique correctly, the idea is to enter the entire line into the dictionary so it doesn't check words, but entire lines. Is that what you have already done? Did it work?

Report · Feb 18, 2023

It was supposed to work that way. I am actually halfway between shamefaced and flabbergasted, over here, because my trick actually doesn't work in InDesign. If you swap the order of two words, the spellcheck doesn't catch it. When I export the word list, the dictionary is segmenting on spaces, not on lines. So it's not treating the whole first line, the bismillah, as one "word". I can go and edit the raw text file where InDesign stores the custom word list and it still treats each word on each line as a separate dictionary term.
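The failure is easy to reproduce outside InDesign. In this rough JavaScript sketch (the function names are invented for illustration, and this is not how InDesign's spellchecker is implemented), a checker that only knows individual words accepts a line whose words have been swapped, while a whole-line comparison rejects it.

```javascript
// Build a lookup of individual words, the way a space-segmented dictionary would.
function buildWordList(lines) {
    var known = {};
    for (var i = 0; i < lines.length; i++) {
        var words = lines[i].split(/\s+/);
        for (var j = 0; j < words.length; j++) {
            known[words[j]] = true;
        }
    }
    return known;
}

// Word-by-word check: true when every word is in the list, regardless of order.
function passesWordCheck(line, known) {
    var words = line.split(/\s+/);
    for (var i = 0; i < words.length; i++) {
        if (known[words[i]] !== true) {
            return false;
        }
    }
    return true;
}

// Whole-line check: true only when the line is identical to a master line.
function passesLineCheck(line, masterLines) {
    for (var i = 0; i < masterLines.length; i++) {
        if (masterLines[i] === line) {
            return true;
        }
    }
    return false;
}
```

With a master list holding the single line "alpha beta gamma", the swapped line "beta alpha gamma" passes `passesWordCheck` but fails `passesLineCheck`, which is the word-order problem described above.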

The flabbergasted half of my brain is agog because I've used this trick before, this trick of defining whole phrases as single words. I am guessing that maybe I did it in FrameMaker instead of InDesign? Maybe I did it in Hunspell, or maybe I'm thinking of a Trados termbase? I'm going to dig through my archives, see if I can't find the notes I assume I must have kept on it.

Report · Feb 18, 2023

Oh well @Joel Cherney, it was a terrific idea to try anyway.

Report · Feb 15, 2023

Hi @Bedazzled532, like Joel I am wondering what exactly you are trying to achieve. If you are trying to catch errors in typed quotes from the Quran, would it be better to have a script just enter the text from the master data file? The user would choose the chapter and verse and the script would look it up and insert it at the insertion point. That should be feasible. Otherwise you are talking about a quite sophisticated system that will require considerable development, I think.

- Mark

Report · Feb 15, 2023

Thanks m1b

I understand that what I want is complicated, but I am just a beginner in scripting so I wanted to do just a line-by-line comparison. I have to struggle even to write this simple script. I am just in a learning phase.

Thanks

Report · Feb 18, 2023

How about exporting the story (or stories) as plain text or RTF, then sorting and comparing in Word?

Not fully automated but if you won't have too many errors - it should be quick?

Report · Feb 18, 2023

I suggested that a ways back, and for a one-shot, labor-intensive effort it still seems to be a viable approach. But I get the idea the OP needs this on a more continuing basis, something a little more integrated into a writing and publishing workflow.

And I believe the goal, at pretty much any cost, is *zero* errors.

Report · Feb 26, 2023

I am trying the following script but for some reason it is not working.

The logic is to read one line from the original txt file (database), then read one line from the text frame and compare the lines. If they match, well and good; if they do not match, apply a char style of color red to that line.

I am not very good at writing scripts but this is what I have come up with. Any help would be appreciated.

Thanks.

//Read from file
file = File("d:/readid.txt");
file.open("r");
var content = file.read().split("\n");

for (var i = 0; i < content.length; i++)
{
    var orig = content[i];

    //alert(content.length);
    //Read from text frame
    app.findGrepPreferences = app.changeGrepPreferences = null;
    app.findGrepPreferences.findWhat = ".+";
    p = app.activeDocument.findGrep();
    //alert(p.length);
    for (var i = p.length - 1; i >= 0; i--)
    {
        var newln = p[i].lines[0].contents;

        if (orig === newln) {
            alert("same");
        }
        else {
            alert("not same");
            break;
        }
    }
}
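One likely reason the script above misbehaves is that both loops use the same counter `i`, so the inner loop overwrites the outer loop's position, and each master line is compared against the found lines starting from the last one, not against its own counterpart. A minimal sketch of the intended pairing, in plain JavaScript with no InDesign calls, walks both lists with a single index. `findMismatchedLines` is a hypothetical name, and the two arrays stand in for the file contents and the text-frame lines.

```javascript
// Hypothetical helper: returns the indexes of lines that differ from the master.
function findMismatchedLines(masterLines, userLines) {
    var mismatches = [],
        count = Math.min(masterLines.length, userLines.length);
    // One index pairs line i of the master with line i of the document.
    for (var i = 0; i < count; i++) {
        if (masterLines[i] !== userLines[i]) {
            mismatches.push(i);
        }
    }
    return mismatches;
}
```

Returning the indexes instead of showing an alert for every line leaves it to the caller to decide how to flag each mismatch, for example with a character style.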

Report · Feb 27, 2023

@Bedazzled532, you've made a terrific start! Next you will need to get the story you want to check, so in my example below I've just used the story that you have selected. I compare it paragraph-by-paragraph with the masterContentFile, and apply a characterStyle "Bad" if it doesn't match.

function main() {

    var masterContentFile = File("d:/readid.txt");
    var badCharacterStyleName = 'Bad';

    if (!masterContentFile.exists) {
        alert('Could not find master content file "' + masterContentFile + '".');
        return;
    }

    var doc = app.activeDocument,
        badCharacterStyle = doc.characterStyles.itemByName(badCharacterStyleName);

    if (!badCharacterStyle.isValid) {
        alert('Could not find character style "' + badCharacterStyleName + '".');
        return;
    }

    if (
        doc.selection[0] == undefined
        || !doc.selection[0].hasOwnProperty('parentStory')
    ) {
        alert('Please put cursor in the story you want to check and try again.');
        return;
    }

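    // Assumption: the master file is saved as UTF-8; change this if it uses another encoding.
    masterContentFile.encoding = 'UTF-8';
    masterContentFile.open('r');

    var masterContent = masterContentFile.read().split("\n"),
        userParagraphs = doc.selection[0].parentStory.paragraphs,
        userContent = userParagraphs.everyItem().contents,
        // strip all leading and trailing whitespace, including stray carriage returns
        leadingTrailingSpace = /^\s+|\s+$/g,
        contentCount = Math.min(userContent.length, masterContent.length),
        differenceCount = 0;

    // the file contents are in memory now, so release the file
    masterContentFile.close();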

    for (var i = 0; i < contentCount; i++) {

        var m = masterContent[i].replace(leadingTrailingSpace, ''),
            u = userContent[i].replace(leadingTrailingSpace, '');

        if (u != m) {
            // $.writeln('  m = ' + m);
            // $.writeln('  u = ' + u);
            userParagraphs[i].applyCharacterStyle(badCharacterStyle);
            differenceCount++;
        }

    }

    alert('Compared with master content, ' + differenceCount + ' different paragraphs were found.');

};

app.doScript(main, ScriptLanguage.JAVASCRIPT, undefined, UndoModes.ENTIRE_SCRIPT, 'Check Story Against Master Content');

If you have trouble, try uncommenting the two writeln statements in the loop and look at the script output in the console.
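When two lines look identical on screen but still fail to match, the difference is often a character that is invisible or easy to overlook, such as a stray carriage return, a non-breaking space, or a change of case. As an additional debugging aid (a sketch only; `describeFirstDifference` is a made-up name), a small helper can report the position and character codes of the first difference.

```javascript
// Hypothetical helper: describe the first position where two strings differ.
// Returns null when the strings are identical.
function describeFirstDifference(master, user) {
    var count = Math.max(master.length, user.length);
    for (var i = 0; i < count; i++) {
        if (master.charAt(i) !== user.charAt(i)) {
            return {
                index: i,
                // charCodeAt returns NaN past the end of a string.
                masterCode: master.charCodeAt(i),
                userCode: user.charCodeAt(i)
            };
        }
    }
    return null;
}
```

The result could be written to the console with `$.writeln` alongside the two commented lines in the loop.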

I have wrapped the whole thing in a function main and called it via app.doScript. This means that there is only one neat Undo if you want to go back to before the script changed the styles.

Also, for your info, you can use the Code button in this forum to paste in script code. It looks much better and doesn't get scrambled. The code button looks like < / >

- Mark

Report · Feb 27, 2023

@m1b Wow. Thanks a ton for your efforts, m1b.

I created a 'Bad' char style. The script is running without errors but unfortunately it is not able to compare, I guess.

In my master database I entered two lines:

This is line 1

This is line 2

In my text frame in InDesign, I copy-pasted the same lines, for testing.

When I run the script, it gives the message "Compared with master content, 2 different paragraphs were found."

I don't know why this is happening.

What needs to be done now? Is it the new line char or something else?

Regards

Report · Feb 27, 2023

@m1b Thanks so much m1b, it works. It was my mistake.

In the database, matter was in lower case and in text frame, it was in Upper and lower case.

I changed the case and it worked. Thanks a lot.

I will try to implement this logic in the Quran comparison.
