I have a document with lots and lots of Polytonic Greek. A lot of the Greek text is incorrect (reflecting its origin as an OCR-traced scan), with diacritics being particularly troublesome. Three different types appear in the document: separated (spacing) diacritics before the letter, decomposed characters (base letter plus combining diacritics), and precomposed characters.
It's easy enough to convert separated representations into decomposed characters in a few GREP queries. But as is well known, InDesign does not handle Unicode normalisation very well, meaning that any further find/replace or GREP styles targeted at precomposed characters will not work on decomposed characters.
Meanwhile, JavaScript string normalisation didn't arrive until ECMAScript 6 and thus doesn't work in ExtendScript (and there are still some deal-breakers in UXP scripting that mean I can't use that), so the obvious, built-in choice won't work.
Is there some way, through a script or otherwise, to normalise all decomposed characters to precomposed according to NFKC?
(If scripting, ExtendScript would be preferable, since I'm hovering somewhere slightly south of useless in both VBScript and, particularly, AppleScript.)
Hi @Janus Bahs Jacquet, would it be possible to post a one-page sample document showing those three problem cases and also another sample document with them fixed? A before and after, so to speak. That might help with answers. Or are the actual problem cases so numerous and varied that only a good understanding of the NFKC algo will do?
- Mark
There are of course a lot of different variations to account for (that's why I'm trying to find a way to script it rather than just doing it by hand), since Greek diacritics can stack: for all seven capital vowel letters, there are about ten different combinations of diacritics. But I don't think any deeper understanding of the NFKC algorithm is necessary (I certainly don't have more than a passing understanding of it); it's just the normalisation form that suits me best: everything converted to precomposed forms.
Shown here with an alpha, the ten possible combinations are:
Ἀ Ἁ Ἂ Ἃ Ἄ Ἅ Ἆ Ἇ ᾈ ᾉ
(The last two may look like double letters here, but in fact they have an iota subscript.)
I've attached here a sample file with the same line of text written in all three different ways: with separated diacritics, with decomposed characters, and with precomposed characters. The latter two look the same, but if you move the cursor back and forth through the words in InDesign, you'll notice that the initial capitals with diacritics "count as" two characters in the decomposed line, but only one character in the precomposed line. Moreover, if you copy a word like Ἄρεως in the decomposed line, and then paste it into the "Find what" box in the Find/Change dialog, you'll see that InDesign only matches the word in the precomposed line, even though you just pasted it from the decomposed line!
So the goal is to take any combination of diacritic + vowel written either with separated diacritics (Type 1) or decomposed characters (Type 2) and convert them all to precomposed characters (Type 3).
The brute-force way to do that would be to manually loop over all 140-or-so combinations of diacritics and vowel letters (70 separated, 70 decomposed) and replace them with their precomposed counterparts. But that would take hours and be very error-prone.
Slightly smarter would be to convert the separated forms to decomposed forms first. This can be done with 10 GREP queries, one for each combination of diacritics, of this type: find ῾([ΑΕΗΙΥΟΩ]), change to $1\x{0314}, with \x{0314} being the Unicode value for COMBINING REVERSED COMMA ABOVE, the second part of the decomposed form of Greek vowels with rough breathing marks. Brute-forcing decomposed to precomposed forms would then "only" require about 80 GREP queries in total; still tedious and error-prone.
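Outside InDesign, one such separated-to-decomposed rewrite can be sketched as a plain JavaScript regex replacement. This is only an illustration (the helper name is hypothetical), covering the rough-breathing case; the other nine combinations follow the same pattern:

```javascript
// Rewrite a spacing rough-breathing mark (U+1FFE) that precedes a capital
// Greek vowel into the vowel followed by COMBINING REVERSED COMMA ABOVE
// (U+0314), i.e. the decomposed form.
function separatedToDecomposed(text) {
  // [ΑΕΗΙΥΟΩ] written as escapes: U+0391 U+0395 U+0397 U+0399 U+039F U+03A5 U+03A9
  return text.replace(
    /\u1FFE([\u0391\u0395\u0397\u0399\u039F\u03A5\u03A9])/g,
    '$1\u0314'
  );
}
```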
The best way I can think of would be to do the ten GREPs to get rid of all the separated forms, and then, in the script, simply find the decomposed forms and replace them with match.normalize('NFKC') (where match is the variable holding the matched string).
Unfortunately, string.normalize() is not available in ExtendScript, so that last bit, which was supposed to be the easiest, becomes challenging.
Thanks @Janus Bahs Jacquet, the sample document was helpful. But I haven't been able to achieve anything (not surprising, as I'm no expert; this is the first I've heard of String.prototype.normalize).
I noticed (as you mention) that pasting decomposed text into the Find/Change dialog converts it to precomposed text, so I wondered if we could use that to our advantage. But when I tried this in a script, I found that InDesign doesn't do this when going via the scripting API, so that didn't work. However, if you script the UI (not something I have any skill at), you might be able to get it to work for you.
I also tried to find polyfills for String.prototype.normalize but couldn't find any. Nor could I find any simple code we could port to ExtendScript. I'm sure you've done the same searching.
At this point I'd be looking at scripting those brute-force findGreps you mentioned via a text configuration file that lists all the possible combos. Sorry I couldn't be more help. Don't lose hope, though; there are some *very* knowledgeable folk around here who might have an answer.
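For what it's worth, that combo list wouldn't have to be typed by hand: it could be generated once in any modern engine that has String.prototype.normalize (Node, a browser console), and the resulting pairs pasted into the configuration file for ExtendScript to consume. A rough sketch, with hypothetical names (vowels, breathings, accents, pairs):

```javascript
// Generate find/change pairs: decomposed capital-vowel sequences mapped to
// their precomposed forms via normalize('NFC').
var vowels = ['\u0391', '\u0395', '\u0397', '\u0399', '\u039F', '\u03A5', '\u03A9']; // ΑΕΗΙΟΥΩ
var breathings = ['\u0313', '\u0314'];            // smooth (psili), rough (dasia)
var accents = ['', '\u0300', '\u0301', '\u0342']; // none, grave, acute, circumflex
var pairs = [];
vowels.forEach(function (v) {
  breathings.forEach(function (b) {
    accents.forEach(function (a) {
      var decomposed = v + b + a;
      var precomposed = decomposed.normalize('NFC');
      if (precomposed.length === 1) { // skip combos with no precomposed form
        pairs.push([decomposed, precomposed]);
      }
    });
  });
});
```

The length check matters because not every combination exists precomposed (capital upsilon with smooth breathing, for instance, has no precomposed code point).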
- Mark
So, after some going back and forth, I decided to bite the bullet and use UXP scripting after all, and simply lump it that there would be side effects. As it turned out, the only side effect was that UXP doesn't yet have an equivalent to ExtendScript's app.doScript() that allows you to make the whole script undoable in a single move. Problematic for scripts in general, but for the nonce something I can live with (I just had to make sure I ran the script on a clean copy, so that if something went awry, I could close the file and discard the changes to effectively undo the script).
With UXP scripting, the following worked:
function main() {
    // Spacing (separated) breathing marks mapped to their combining
    // equivalents, expressed as GREP escapes for the changeTo string.
    var charmap = {
        '᾽': '\\x{0313}',          // smooth breathing
        '῎': '\\x{0313}\\x{0301}', // smooth breathing + acute
        '῍': '\\x{0313}\\x{0300}', // smooth breathing + grave
        '῏': '\\x{0313}\\x{0342}', // smooth breathing + circumflex
        '῾': '\\x{0314}',          // rough breathing
        '῞': '\\x{0314}\\x{0301}', // rough breathing + acute
        '῝': '\\x{0314}\\x{0300}', // rough breathing + grave
        '῟': '\\x{0314}\\x{0342}'  // rough breathing + circumflex
    };
    RegExp.escape = function (text) {
        return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
    };
    var r = new RegExp("[" + RegExp.escape(Object.keys(charmap).join("")) + "]", "g");
    // Operate on the selected text if there is one, otherwise on the whole document.
    var sel = (app.selection.length > 0 && app.selection[0].constructorName == 'Text') ? app.selection[0] : app.activeDocument;
    // Step 1: separated diacritics -> decomposed characters, via GREP replacements.
    for (const [find, replace] of Object.entries(charmap)) {
        app.findGrepPreferences = null;
        app.findGrepPreferences.findWhat = find + '([ΑΕΗΙΥΟΩ])';
        app.changeGrepPreferences.changeTo = '$1' + replace;
        sel.changeGrep();
    }
    // Step 2: decomposed characters -> precomposed, via NFKC normalisation.
    app.findGrepPreferences = null;
    app.findGrepPreferences.findWhat = '[ΑΕΗΙΥΟΩ][\\x{0300}-\\x{036F}]+';
    var f = sel.findGrep();
    for (var i = 0; i < f.length; i++) {
        f[i].contents = f[i].contents.normalize('NFKC');
    }
}
main();
It starts off using simple GREP replacements, based on a charmap, to replace the separated diacritics with their combining equivalents, in order to have decomposed characters.
Then it uses a GREP search to find all sequences of capital Greek letters followed by one or more combining diacritics (= decomposed representations of capitals with diacritics), passes each match through normalize('NFKC') to get the compatible, normalised (= precomposed) form, and replaces the contents of the match with this normalised form.
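For anyone following along, the normalisation step itself boils down to this, runnable in any engine that has String.prototype.normalize:

```javascript
// A decomposed capital alpha with smooth breathing and acute accent:
// three code points (U+0391 U+0313 U+0301).
var decomposed = '\u0391\u0313\u0301';
// NFKC composes the sequence into the single precomposed character Ἄ (U+1F0C).
var precomposed = decomposed.normalize('NFKC');
```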
(As it happened, there weren't any capitals with iota subscripts and diacritics in the file, so I could simplify the charmap a bit.)
Hey @Janus Bahs Jacquet, thanks for posting your solution!
- Mark
Just a suggestion. Since you perform a compatibility decomposition plus a canonical composition, there's a good chance that string length is not preserved from f[i].contents to f[i].contents.normalize('NFKC'). It is then much safer to loop backwards when processing f's items.
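A minimal sketch of that backwards loop, with the inline loop wrapped in a hypothetical normalizeMatches helper for illustration (matches stands in for the array returned by findGrep()):

```javascript
// Process matches from last to first, so that length changes caused by
// normalisation don't shift the positions of matches not yet processed.
function normalizeMatches(matches) {
  for (var i = matches.length - 1; i >= 0; i--) {
    matches[i].contents = matches[i].contents.normalize('NFKC');
  }
  return matches;
}
```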
Best,
Marc
Ah, that's a very good point, @Marc Autret; I've been bitten by that before when adding text to frames. I'll have to check with a previous version to make sure that hasn't happened here!