Normalise string to NFKC form

Question

I have a document with lots and lots of Polytonic Greek. A lot of the Greek text is incorrect (reflecting its origin as an OCR-traced scan), with diacritics being particularly troublesome. Three different types appear in the document:

separate (standalone spacing diacritic followed by letter): ῞Ι (U+1FDE + U+0399)
decomposed (letter followed by combining diacritics): Ἵ (U+0399 + U+0314 + U+0301)
precomposed (single glyph): Ἵ (U+1F3D)
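
To make the three representations concrete, here is a small sketch in plain JavaScript (for an ES6+ engine such as Node.js, not ExtendScript). The helper name codePoints is made up for illustration. It also shows why the separate form needs its own treatment: normalisation never moves a preceding spacing diacritic onto the letter that follows it.

```javascript
// Hypothetical helper: list a string's code points as "U+XXXX".
function codePoints(str) {
  return Array.from(str, function (ch) {
    return 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  }).join(' ');
}

var separate    = '\u1FDE\u0399';        // spacing dasia-and-oxia, then capital iota
var decomposed  = '\u0399\u0314\u0301';  // capital iota + combining dasia + combining oxia
var precomposed = '\u1F3D';              // capital iota with dasia and oxia

console.log(codePoints(separate));     // U+1FDE U+0399
console.log(codePoints(decomposed));   // U+0399 U+0314 U+0301
console.log(codePoints(precomposed));  // U+1F3D

// The decomposed form composes to the precomposed one...
console.log(decomposed.normalize('NFKC') === precomposed);  // true
// ...but the separate form does not, because the diacritic precedes the letter.
console.log(separate.normalize('NFKC') === precomposed);    // false
```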

It’s easy enough to convert separate representations into decomposed characters in a few GREP queries. But as is well known, InDesign does not handle Unicode normalisation very well, meaning that any further find/replace or GREP styles targeted at precomposed characters will not work on decomposed characters.

Meanwhile, JavaScript string normalisation didn’t arrive until ECMAScript 6 and thus doesn’t work in ExtendScript (and there are still some deal-breakers in UXP scripting that mean I can’t use that), so the obvious, built-in choice won’t work.
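
As a rough illustration of that limitation, a script can check at runtime whether the engine offers String.prototype.normalize before relying on it. This is only a sketch: the function name hasNormalize is invented here, and console.log assumes an ES6+ engine (in ExtendScript, console is not available either, so $.writeln would have to stand in).

```javascript
// Hypothetical helper: true when the engine provides ES6 string normalisation.
function hasNormalize() {
  return typeof String.prototype.normalize === 'function';
}

if (hasNormalize()) {
  // Capital iota + combining dasia + combining oxia composes to U+1F3D.
  console.log('\u0399\u0314\u0301'.normalize('NFKC') === '\u1F3D');
} else {
  // Engines without ES6 string support end up here.
  console.log('normalize() is not available in this engine');
}
```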

Is there some way – through a script or otherwise – to normalise all decomposed characters to precomposed according to NFKC?

(If scripting, ExtendScript would be preferable, since I’m hovering somewhere slightly south of useless in both VBScript and, particularly, AppleScript.)

Janus Bahs Jacquet · Accepted Answer

So, after some going back and forth, I decided to bite the bullet and use UXP scripting after all, and simply lump it that there would be side effects. As it turned out, the only side effect was that UXP doesn’t yet have an equivalent to ExtendScript’s app.doScript() that allows you to make the whole script undoable in a single move. Problematic for scripts in general, but for the nonce, something I can live with (just had to make sure I performed the script on a clean copy so if something went awry, I could just close the file and discard any changes in order to effectively undo the script).

With UXP scripting, the following worked:

function main() {
	var charmap = {
		'᾽': '\\x{0313}',
		'῎': '\\x{0313}\\x{0301}',
		'῍': '\\x{0313}\\x{0300}',
		'῏': '\\x{0313}\\x{0342}',
		'῾': '\\x{0314}',
		'῞': '\\x{0314}\\x{0301}',
		'῝': '\\x{0314}\\x{0300}',
		'῟': '\\x{0314}\\x{0342}'
	}
	
	// Work on the selected text if there is one, otherwise on the whole document.
	var sel = (app.selection.length > 0 && app.selection[0].constructorName == 'Text') ? app.selection[0] : app.activeDocument;

	for (const [find, replace] of Object.entries(charmap)) {
		app.findGrepPreferences = null;
		app.findGrepPreferences.findWhat = find + '([ΑΕΙΟΥΗΩ])';
		app.changeGrepPreferences.changeTo = '$1' + replace;
		sel.changeGrep();
	}

	app.findGrepPreferences = null;
	app.findGrepPreferences.findWhat = '[ΑΕΙΟΥΗΩ][\\x{0300}-\\x{036F}]+';
	var f = sel.findGrep();

	
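	// Work backwards: composing shortens the text, which could otherwise
	// shift the positions of matches that have not been processed yet.
	for (let i = f.length - 1; i >= 0; i--) {
		f[i].contents = f[i].contents.normalize('NFKC');
	}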
}

main();

It starts off using simple GREP replacements, based on a charmap, to replace each spacing diacritic in front of a capital with its combining equivalent after the capital, in order to have decomposed characters.
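
For readers without InDesign at hand, the same first step can be sketched on a plain string. This is not the script above: it swaps InDesign's changeGrep() for String.prototype.replace, the names spacingToCombining and moveDiacriticsAfterCapital are invented for illustration, and the spacing psili is assumed to be U+1FBF (the visually identical koronis, U+1FBD, would need its own entry).

```javascript
// Spacing diacritics mapped to their combining equivalents
// (real characters here, not GREP escape strings).
var spacingToCombining = {
  '\u1FBF': '\u0313',        // psili
  '\u1FCE': '\u0313\u0301',  // psili + oxia
  '\u1FCD': '\u0313\u0300',  // psili + varia
  '\u1FCF': '\u0313\u0342',  // psili + perispomeni
  '\u1FFE': '\u0314',        // dasia
  '\u1FDE': '\u0314\u0301',  // dasia + oxia
  '\u1FDD': '\u0314\u0300',  // dasia + varia
  '\u1FDF': '\u0314\u0342'   // dasia + perispomeni
};

// Move a spacing diacritic that precedes a capital vowel to after the
// vowel, in combining form, so the result is a decomposed sequence.
function moveDiacriticsAfterCapital(text) {
  var pattern = /([\u1FBF\u1FCD-\u1FCF\u1FDD-\u1FDF\u1FFE])([\u0391\u0395\u0397\u0399\u039F\u03A5\u03A9])/g;
  return text.replace(pattern, function (match, mark, letter) {
    return letter + spacingToCombining[mark];
  });
}

var fixed = moveDiacriticsAfterCapital('\u1FDE\u0399');
console.log(fixed === '\u0399\u0314\u0301');  // true
```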

Then it uses a GREP search to find all sequences of a capital Greek letter followed by one or more combining diacritics (= decomposed representations of capitals with diacritics), passes each match through normalize('NFKC') to get the compatible, normalised (= precomposed) form, and replaces the contents of the match with this normalised form.
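
The second step can likewise be tried outside InDesign. The sketch below is an approximation on a plain string, with String.prototype.replace standing in for findGrep() and the function name composeCapitals made up for illustration. It needs an ES6+ engine for normalize().

```javascript
// Find each capital vowel followed by one or more combining marks
// and replace the match with its NFKC (precomposed) form.
function composeCapitals(text) {
  var pattern = /[\u0391\u0395\u0397\u0399\u039F\u03A5\u03A9][\u0300-\u036F]+/g;
  return text.replace(pattern, function (match) {
    return match.normalize('NFKC');
  });
}

// Capital iota + combining dasia + combining oxia becomes U+1F3D.
console.log(composeCapitals('\u0399\u0314\u0301') === '\u1F3D');  // true
// Capital alpha + combining psili + combining perispomeni becomes U+1F0E.
console.log(composeCapitals('\u0391\u0313\u0342') === '\u1F0E');  // true
```

Only the matched sequences are normalised, so the rest of the text is left exactly as it was.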

(As it happened, there weren’t any capitals with iota subscripts and diacritics in the file, so I could simplify the charmap a bit.)

m1b · Answer

Hi @Janus Bahs Jacquet, would it be possible to post a one-page sample document showing those three problem cases and also another sample document with them fixed? A before and after, so to speak. That might help with answers. Or are the actual problem cases so numerous and varied that only a good understanding of the NFKC algo will do?

- Mark
