Normalise string to NFKC form

Report · Feb 28, 2023

I have a document with lots and lots of polytonic Greek. A lot of the Greek text is incorrect (reflecting its origin as an OCR-traced scan), with diacritics being particularly troublesome. Three different types appear in the document:

separate (full-width diacritic followed by letter): ῞Ι (U+1FDE + U+0399)
decomposed (letter followed by combining diacritics): Ἵ (U+0399 + U+0314 + U+0301)
precomposed (single glyph): Ἵ (U+1F3D)
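For reference, the three forms can be written out as code points in plain JavaScript. This is a minimal sketch for an ES6+ engine such as Node or UXP (not ExtendScript), and the variable names are purely illustrative. It suggests that normalize('NFKC') composes the decomposed form, but does not by itself turn the separate form into the precomposed one:

```javascript
// The three representations of capital iota with rough breathing and acute.
const separate    = '\u1FDE\u0399';        // spacing DASIA AND OXIA, then Ι
const decomposed  = '\u0399\u0314\u0301';  // Ι, combining rough breathing, combining acute
const precomposed = '\u1F3D';              // Ἵ as a single code point

// Decomposed and precomposed render alike but are different strings.
console.log(decomposed === precomposed);             // false
console.log(decomposed.length, precomposed.length);  // 3 1

// Normalisation composes the decomposed form into the precomposed one.
console.log(decomposed.normalize('NFKC') === precomposed);  // true

// The separate form does not compose: its spacing diacritic comes before the letter.
console.log(separate.normalize('NFKC') === precomposed);    // false
```

This is why the separate forms need their own rewrite step before any normalisation can help.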

It’s easy enough to convert separate representations into decomposed characters in a few GREP queries. But as is well known, InDesign does not handle Unicode normalisation very well, meaning that any further find/replace or GREP styles targeted at precomposed characters will not work on decomposed characters.

Meanwhile, JavaScript string normalisation didn’t arrive until ECMAScript 6 and thus doesn’t work in ExtendScript (and there are still some deal-breakers in UXP scripting that mean I can’t use that), so the obvious, built-in choice won’t work.

Is there some way – through a script or otherwise – to normalise all decomposed characters to precomposed according to NFKC?

(If scripting, ExtendScript would be preferable, since I’m hovering somewhere slightly south of useless in both VBScript and, particularly, AppleScript.)

Report · Feb 28, 2023

Hi @Janus Bahs Jacquet, would it be possible to post a one-page sample document showing those three problem cases and also another sample document with them fixed? A before and after, so to speak. That might help with answers. Or are the actual problem cases so numerous and varied that only a good understanding of the NFKC algo will do?

- Mark

Report · Feb 28, 2023

There are of course a lot of different variations to account for (that’s why I’m trying to find a way to script it rather than just doing it by hand), since Greek diacritics can stack: for all seven capital vowel letters, there are about ten different combinations of diacritics. But I don’t think any deeper understanding of the NFKC algorithm is necessary (I certainly don’t have more than a passing understanding of it) – that’s just the normalisation form that suits me best: everything converted to precomposed forms.

Shown here with an alpha, the ten possible combinations are:

Ἀ Ἁ Ἄ Ἅ Ἂ Ἃ Ἆ Ἇ ᾎ ᾏ

(The last two may look like double letters here, but in fact they have an iota subscript.)

I’ve attached here a sample file with the same line of text written in all three different ways: with separated diacritics, decomposed characters and with precomposed characters. The latter two look the same, but if you move the cursor back and forth through the words in InDesign, you’ll notice that the initial capitals with diacritics ‘count as’ two letters in the decomposed line, but only one character in the precomposed line. Moreover, if you copy a word like Ἄρεως in the decomposed line, and then paste it into the ‘Find what’ box in the Find/Change dialog, you’ll see that InDesign only matches the word in the precomposed line – even though you just pasted it from the decomposed line!

So the goal is to take any combination of diacritic + vowel written either with separated diacritics (Type 1) or decomposed characters (Type 2) and convert them all to precomposed characters (Type 3).

The brute-force way to do that would be to manually loop over all 140-or-so combinations of diacritics and vowel letters (70 separated, 70 decomposed) and replace them with their precomposed counterparts. But that would take hours and be very error-prone.

Slightly smarter would be to convert the separated forms to decomposed forms first – this can be done with 10 GREP queries, one for each combination of diacritics, of this type:

[find] ῾([ΑΕΙΟΥΗΩ])
[replace with] $1\x{0314}

– with \x{0314} being the Unicode value for COMBINING REVERSED COMMA ABOVE, the second part of the decomposed form of Greek vowels with rough breathing marks. Brute-forcing decomposed to precomposed forms would then ‘only’ require about 80 GREP queries in total; still tedious and error-prone.
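The same separated-to-decomposed rewrite can be tried outside InDesign on a plain string. The sketch below assumes an ES6+ engine; toDecomposed is a made-up helper name, and only the rough-breathing case from the GREP above is covered:

```javascript
// Move a spacing rough breathing (U+1FFE) behind the capital vowel that follows it,
// replacing it with COMBINING REVERSED COMMA ABOVE (U+0314).
function toDecomposed(text) {
  return text.replace(/\u1FFE([ΑΕΙΟΥΗΩ])/g, '$1\u0314');
}

const fixed = toDecomposed('\u1FFE\u0399');        // separated: spacing mark + Ι
console.log(fixed === '\u0399\u0314');              // true: Ι + combining mark
console.log(fixed.normalize('NFC') === '\u1F39');   // true: Ἱ as one code point
```

The other combinations would follow the same pattern with a different spacing character and a different sequence of combining marks.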

The best way I can think of would be to do the ten GREPs to get rid of all the separated forms and then, in a script, simply find the decomposed forms and replace each with match.normalize('NFKC') (where match is the variable holding the matched string).

Unfortunately, string.normalize() is not available in ExtendScript, so that last bit, which was supposed to be the easiest, becomes challenging. 😕

Report · Feb 28, 2023

Thanks @Janus Bahs Jacquet, the sample document was helpful. But I haven't been able to achieve anything (not surprising, as I'm no expert—this is the first I've heard of String.prototype.normalize).

I noticed (as you mention) that pasting decomposed text into the FindText dialog converts it to precomposed text, so I wondered if we could use that to our advantage. But when I tried this in a script I found that InDesign doesn't do this when going via the scripting API, so that didn't work. However, perhaps if you script the UI (not something I have any skill at) you might be able to get it to work for you.

I also tried to find polyfills for String.prototype.normalize but couldn't find any. Nor could I find any simple code we could port to ExtendScript. I'm sure you've done the same searching.

At this point I'd be looking at scripting those brute force findGreps you mentioned via a text configuration file that lists all the possible combos etc. Sorry I couldn't be more help. Don't lose hope, though—there are some *very* knowledgeable folk around here that might have an answer.
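One way to get such a list of combos without typing it by hand is to let a modern engine generate it. The sketch below is only an illustration under stated assumptions: buildTable and composeWithTable are hypothetical names, step 1 must run somewhere String.prototype.normalize exists (Node, a browser, UXP), and step 2 sticks to ES3-era features on the assumption that this makes it usable in ExtendScript, which has not been tested here.

```javascript
// Step 1 (ES6+ engine only): build a decomposed -> precomposed lookup table.
function buildTable() {
  var vowels = ['Α', 'Ε', 'Η', 'Ι', 'Ο', 'Υ', 'Ω'];
  var breathings = ['\u0313', '\u0314'];             // smooth, rough
  var accents = ['', '\u0301', '\u0300', '\u0342'];  // none, acute, grave, circumflex
  var subscripts = ['', '\u0345'];                   // none, iota subscript
  var table = {};
  vowels.forEach(function (v) {
    breathings.forEach(function (b) {
      accents.forEach(function (a) {
        subscripts.forEach(function (s) {
          var decomposed = v + b + a + s;
          var composed = decomposed.normalize('NFC');
          // Keep only sequences that collapse to a single code point.
          if (composed.length === 1) table[decomposed] = composed;
        });
      });
    });
  });
  return table;
}

// Step 2 (ES3-style): apply the table to a string without calling normalize().
function composeWithTable(text, table) {
  var keys = [];
  for (var k in table) {
    if (table.hasOwnProperty(k)) keys.push(k);
  }
  // Longest sequences first, so 'vowel + breathing + accent' wins over 'vowel + breathing'.
  keys.sort(function (x, y) { return y.length - x.length; });
  for (var i = 0; i < keys.length; i++) {
    text = text.split(keys[i]).join(table[keys[i]]);
  }
  return text;
}

var table = buildTable();
console.log(composeWithTable('\u0391\u0313\u0301ρεως', table) === '\u1F0Cρεως');  // true
```

The generated table could be written out once and pasted into the ExtendScript file as an object literal, so that only composeWithTable has to run inside InDesign.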

- Mark

Report · Mar 02, 2023

So, after some going back and forth, I decided to bite the bullet and use UXP scripting after all, and simply lump it that there would be side effects. As it turned out, the only side effect was that UXP doesn’t yet have an equivalent to ExtendScript’s app.doScript() that allows you to make the whole script undoable in a single move. Problematic for scripts in general, but for the nonce, something I can live with (just had to make sure I performed the script on a clean copy so if something went awry, I could just close the file and discard any changes in order to effectively undo the script).

With UXP scripting, the following worked:

function main() {
	var charmap = {
		'᾽': '\\x{0313}',
		'῎': '\\x{0313}\\x{0301}',
		'῍': '\\x{0313}\\x{0300}',
		'῏': '\\x{0313}\\x{0342}',
		'῾': '\\x{0314}',
		'῞': '\\x{0314}\\x{0301}',
		'῝': '\\x{0314}\\x{0300}',
		'῟': '\\x{0314}\\x{0342}'
	}
	
	RegExp.escape = function(text) {
	  return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
	};
	
	var r = new RegExp ("[" + RegExp.escape(Object.keys(charmap).join("")) + "]", "g");
	var sel = (app.selection.length > 0 && app.selection[0].constructorName == 'Text') ? app.selection[0] : app.activeDocument;

	for (const [find, replace] of Object.entries(charmap)) {
		app.findGrepPreferences = null;
		app.findGrepPreferences.findWhat = find + '([ΑΕΙΟΥΗΩ])';
		app.changeGrepPreferences.changeTo = '$1' + replace;
		sel.changeGrep();
	}

	app.findGrepPreferences = null;
	app.findGrepPreferences.findWhat = '[ΑΕΙΟΥΗΩ][\\x{0300}-\\x{036F}]+';
	var f = sel.findGrep();

	
	for (var i = 0; i < f.length; i++) {
		f[i].contents = f[i].contents.normalize('NFKC');
	}
}

main();

It starts off using simple GREP replacements, based on a charmap, to replace the separated diacritics with their combined equivalents, in order to have decomposed characters.

Then it uses a GREP search to find all sequences of a capital Greek vowel followed by one or more combining diacritics (= decomposed representations of capitals with diacritics), passes each match through normalize('NFKC') to get the compatibility-composed (= precomposed) form, and replaces the contents of the match with this normalised form.

(As it happened, there weren’t any capitals with iota subscripts and diacritics in the file, so I could simplify the charmap a bit.)

Report · Mar 02, 2023

Hey @Janus Bahs Jacquet, thanks for posting your solution!

- Mark

Report · Mar 02, 2023

Hi @Janus Bahs Jacquet

Just a suggestion. Since you perform a compatibility decomposition followed by canonical composition, there's a good chance that string length is not preserved from f[i].contents to f[i].contents.normalize('NFKC'). It is then much safer to loop backwards when processing f’s items.
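The effect can be simulated on a plain string, with character offsets standing in for the found text ranges. This is only a sketch for an ES6+ engine; it does not show how InDesign itself tracks the items returned by findGrep(), just why back-to-front replacement keeps earlier offsets valid when matches shrink:

```javascript
// Collect [start, end) offsets of capital vowel + combining marks.
const text = '\u0391\u0313\u0301ρεως \u0399\u0314\u0301ππος';
const pattern = /[ΑΕΙΟΥΗΩ][\u0300-\u036F]+/g;
const matches = [];
let m;
while ((m = pattern.exec(text)) !== null) {
  matches.push({ start: m.index, end: m.index + m[0].length });
}

// Replace from the last match to the first: shortening a later match
// never moves the offsets of the earlier ones still waiting to be processed.
let result = text;
for (let i = matches.length - 1; i >= 0; i--) {
  const { start, end } = matches[i];
  const composed = result.slice(start, end).normalize('NFKC');
  result = result.slice(0, start) + composed + result.slice(end);
}

console.log(result === text.normalize('NFKC'));  // true
```

Running the same loop front to back would apply the second match's stored offsets to a string that has already become two characters shorter.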

Best,

Marc

Report · Mar 05, 2023

Ah, that’s a very good point, @Marc Autret – I’ve been bitten by that before when adding text to frames. I’ll have to check with a previous version to make sure that hasn’t happened here!

Adobe Community
