
Normalise string to NFKC form

Contributor, Feb 28, 2023


I have a document with lots and lots of Polytonic Greek. A lot of the Greek text is incorrect (reflecting its origin as an OCR-traced scan), with diacritics being particularly troublesome. Three different types appear in the document:


  • separate (full-width diacritic followed by letter): ῞Ι (U+1FDE + U+0399)
  • decomposed (letter followed by combining diacritics): Ἵ (U+0399 + U+0314 + U+0301)
  • precomposed (single glyph): Ἵ (U+1F3D)
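
To see what separates the three forms, it can help to dump their code points. The following is only a minimal sketch in plain JavaScript (ES6 or later, so not ExtendScript); the helper name `codePoints` is made up for illustration, and the samples use capital iota with rough breathing and acute accent.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX.
function codePoints(text) {
  return Array.from(text, function (ch) {
    return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  }).join(" ");
}

var separate = "\u1FDE\u0399";          // spacing diacritic, then letter
var decomposed = "\u0399\u0314\u0301";  // letter, then combining marks
var precomposed = "\u1F3D";             // single code point

console.log(codePoints(separate));     // U+1FDE U+0399
console.log(codePoints(decomposed));   // U+0399 U+0314 U+0301
console.log(codePoints(precomposed));  // U+1F3D

// Normalisation composes the decomposed form into the precomposed one...
console.log(decomposed.normalize("NFKC") === precomposed); // true
// ...but not the separate form, because the diacritic precedes the letter.
console.log(separate.normalize("NFKC") === precomposed);   // false
```

This is why the separate forms need their own conversion step before any normalisation can help.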


It’s easy enough to convert separate representations into decomposed characters in a few GREP queries. But as is well known, InDesign does not handle Unicode normalisation very well, meaning that any further find/replace or GREP styles targeted at precomposed characters will not work on decomposed characters.


Meanwhile, JavaScript string normalisation didn’t arrive until ECMAScript 6 and thus doesn’t work in ExtendScript (and there are still some deal-breakers in UXP scripting that mean I can’t use that), so the obvious, built-in choice won’t work.


Is there some way – through a script or otherwise – to normalise all decomposed characters to precomposed according to NFKC?


(If scripting, ExtendScript would be preferable, since I’m hovering somewhere slightly south of useless in both VBScript and, particularly, AppleScript.)

Scripting, Type











Community Expert, Feb 28, 2023


Hi @Janus Bahs Jacquet, would it be possible to post a one-page sample document showing those three problem cases and also another sample document with them fixed? A before and after, so to speak. That might help with answers. Or are the actual problem cases so numerous and varied that only a good understanding of the NFKC algo will do?

- Mark





Contributor, Feb 28, 2023


There are of course a lot of different variations to account for (that’s why I’m trying to find a way to script it rather than just doing it by hand), since Greek diacritics can stack: for all seven capital vowel letters, there are about ten different combinations of diacritics. But I don’t think any deeper understanding of the NFKC algorithm is necessary (I certainly don’t have more than a passing understanding of it) – that’s just the normalisation form that suits me best: everything converted to precomposed forms.


Shown here with an alpha, the ten possible combinations are:


Ἀ Ἁ Ἄ Ἅ Ἂ Ἃ Ἆ Ἇ ᾎ ᾏ


(The last two may look like double letters here, but in fact they have an iota subscript.)


I’ve attached here a sample file with the same line of text written in all three different ways: with separated diacritics, decomposed characters and with precomposed characters. The latter two look the same, but if you move the cursor back and forth through the words in InDesign, you’ll notice that the initial capitals with diacritics ‘count as’ two letters in the decomposed line, but only one character in the precomposed line. Moreover, if you copy a word like Ἄρεως in the decomposed line, and then paste it into the ‘Find what’ box in the Find/Change dialog, you’ll see that InDesign only matches the word in the precomposed line – even though you just pasted it from the decomposed line!
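
The Find/Change mismatch has a direct analogue in plain string comparison. The following is a minimal sketch in ES6+ JavaScript (not ExtendScript), using the word Ἄρεως from the sample; the variable names are illustrative only, and the lengths are JavaScript code units, not InDesign cursor stops.

```javascript
// The same word, once with a precomposed initial and once decomposed.
var precomposed = "\u1F0C\u03C1\u03B5\u03C9\u03C2";             // Ἄρεως
var decomposed = "\u0391\u0313\u0301\u03C1\u03B5\u03C9\u03C2";  // Α + two combining marks + ρεως

console.log(precomposed.length);               // 5
console.log(decomposed.length);                // 7
console.log(precomposed === decomposed);       // false
console.log(decomposed.indexOf(precomposed));  // -1: a naive search finds nothing

// Once both sides are normalised to the same form, they compare equal.
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
```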


So the goal is to take any combination of diacritic + vowel written either with separated diacritics (Type 1) or decomposed characters (Type 2) and convert them all to precomposed characters (Type 3).


The brute-force way to do that would be to manually loop over all 140-or-so combinations of diacritics and vowel letters (70 separated, 70 decomposed) and replace them with their precomposed counterparts. But that would take hours and be very error-prone.


Slightly smarter would be to convert the separated forms to decomposed forms first – this can be done with 10 GREP queries, one for each combination of diacritics, of this type:


  • [find] ῾([ΑΕΙΟΥΗΩ])
  • [replace with] $1\x{0314}


– with \x{0314} being the Unicode value for COMBINING REVERSED COMMA ABOVE, the second part of the decomposed form of Greek vowels with rough breathing marks. Brute-forcing decomposed to precomposed forms would then ‘only’ require about 80 GREP queries in total; still tedious and error-prone.


The best way I can think of would be to do the ten GREPs to get rid of all the separated forms, and then, in the script, simply find the decomposed forms and replace them with match.normalize('NFKC') (where match is the variable holding the matched string).


Unfortunately, string.normalize() is not available in ExtendScript, so that last bit, which was supposed to be the easiest, becomes challenging. 😕





Community Expert, Feb 28, 2023


Thanks @Janus Bahs Jacquet, the sample document was helpful. But I haven't been able to achieve anything (not surprising, as I'm no expert; this is the first I've heard of String.prototype.normalize).


I noticed (as you mention) that pasting decomposed text into the Find/Change dialog converts it to precomposed text, so I wondered if we could use that to our advantage. But when I tried this in a script I found that InDesign doesn't do this when going via the scripting API, so that didn't work. However, perhaps if you script the UI (not something I have any skill at) you might be able to get it to work for you.


I also tried to find polyfills for String.prototype.normalize but couldn't find any. Nor could I find any simple code we could port to ExtendScript. I'm sure you've done the same searching.
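
For the narrow case in this thread (capital vowels with a breathing mark and an optional accent), a small lookup can stand in for a full normalize() polyfill, because those precomposed characters sit in regular runs in the Greek Extended block. What follows is a sketch only: `composeCapital` and `composeGreekCapitals` are made-up names, the code is written in ES3 style with ExtendScript in mind but has only been reasoned about as ordinary JavaScript, and it ignores iota subscripts, accents without a breathing mark, and lowercase letters.

```javascript
// Start of each capital vowel's run of breathing/accent combinations.
var BLOCK_START = {
  "\u0391": 0x1F08, // Α
  "\u0395": 0x1F18, // Ε
  "\u0397": 0x1F28, // Η
  "\u0399": 0x1F38, // Ι
  "\u039F": 0x1F48, // Ο
  "\u03A5": 0x1F58, // Υ
  "\u03A9": 0x1F68  // Ω
};
var BREATHING = { "\u0313": 0, "\u0314": 1 };                   // smooth, rough
var ACCENT = { "": 0, "\u0300": 2, "\u0301": 4, "\u0342": 6 };  // none, grave, acute, circumflex

// Returns the precomposed character, or null where Unicode has none
// (Ε and Ο with circumflex, Υ with smooth breathing).
function composeCapital(vowel, breathing, accent) {
  if ((vowel === "\u0395" || vowel === "\u039F") && accent === "\u0342") return null;
  if (vowel === "\u03A5" && breathing === "\u0313") return null;
  return String.fromCharCode(BLOCK_START[vowel] + BREATHING[breathing] + ACCENT[accent]);
}

// Replaces vowel + breathing (+ accent) sequences; sequences without a
// precomposed counterpart are left exactly as they were.
function composeGreekCapitals(text) {
  var pattern = /([\u0391\u0395\u0397\u0399\u039F\u03A5\u03A9])([\u0313\u0314])([\u0300\u0301\u0342]?)/g;
  return text.replace(pattern, function (match, vowel, breathing, accent) {
    var composed = composeCapital(vowel, breathing, accent);
    return composed === null ? match : composed;
  });
}
```

As a usage example, composeGreekCapitals("\u0391\u0313\u0301\u03C1\u03B5\u03C9\u03C2") should give the precomposed Ἄρεως. This is not NFKC in general, just the subset of compositions the document in question needs.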


At this point I'd be looking at scripting those brute force findGreps you mentioned via a text configuration file that lists all the possible combos etc. Sorry I couldn't be more help. Don't lose hope, though: there are some *very* knowledgeable folk around here that might have an answer.

- Mark





Contributor, Mar 02, 2023


So, after some going back and forth, I decided to bite the bullet and use UXP scripting after all, and simply lump it that there would be side effects. As it turned out, the only side effect was that UXP doesn’t yet have an equivalent to ExtendScript’s app.doScript() that allows you to make the whole script undoable in a single move. Problematic for scripts in general, but for the nonce, something I can live with (just had to make sure I performed the script on a clean copy so if something went awry, I could just close the file and discard any changes in order to effectively undo the script).


With UXP scripting, the following worked:




function main() {
	// Spacing (separate) diacritics mapped to their combining equivalents, in GREP escape syntax.
	var charmap = {
		'᾽': '\\x{0313}',
		'῎': '\\x{0313}\\x{0301}',
		'῍': '\\x{0313}\\x{0300}',
		'῏': '\\x{0313}\\x{0342}',
		'῾': '\\x{0314}',
		'῞': '\\x{0314}\\x{0301}',
		'῝': '\\x{0314}\\x{0300}',
		'῟': '\\x{0314}\\x{0342}'
	};
	RegExp.escape = function(text) {
	  return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
	};
	// Note: r is built here but not used further down.
	var r = new RegExp ("[" + RegExp.escape(Object.keys(charmap).join("")) + "]", "g");
	// Work on the selected text if there is one, otherwise on the whole document.
	var sel = (app.selection.length > 0 && app.selection[0].constructorName == 'Text') ? app.selection[0] : app.activeDocument;

	// Step 1: separated diacritic + vowel becomes vowel + combining marks (decomposed).
	for (const [find, replace] of Object.entries(charmap)) {
		app.findGrepPreferences = null;
		app.findGrepPreferences.findWhat = find + '([ΑΕΙΟΥΗΩ])';
		app.changeGrepPreferences.changeTo = '$1' + replace;
		sel.changeGrep(); // apply the replacement; the listing as posted has no such call
	}

	// Step 2: find each vowel followed by combining marks and normalise it.
	app.findGrepPreferences = null;
	app.findGrepPreferences.findWhat = '[ΑΕΙΟΥΗΩ][\\x{0300}-\\x{036F}]+';
	var f = sel.findGrep();

	for (var i = 0; i < f.length; i++) {
		f[i].contents = f[i].contents.normalize('NFKC');
	}
}

main();




It starts off using simple GREP replacements, based on a charmap, to replace the separated diacritics with their combining equivalents placed after the letter, in order to have decomposed characters.


Then it uses a GREP search to find all sequences of capital Greek letters followed by one or more combining diacritics (= decomposed representations of capitals with diacritics), passes each match through normalize('NFKC') to get the compatible, normalised (= precomposed) form, and replaces the contents of the match with this normalised form.
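
The same two steps can be tried on a plain string outside InDesign, which may be handy for checking the charmap before running it on a document. This is a minimal sketch, assuming an ES6+ engine (Node or UXP, not ExtendScript); `toDecomposed` and `toPrecomposed` are made-up names, and the charmap mirrors the one in the script, with JavaScript escapes in place of GREP ones.

```javascript
// Spacing diacritics mapped to their combining equivalents.
const charmap = {
  "\u1FBD": "\u0313",
  "\u1FCE": "\u0313\u0301",
  "\u1FCD": "\u0313\u0300",
  "\u1FCF": "\u0313\u0342",
  "\u1FFE": "\u0314",
  "\u1FDE": "\u0314\u0301",
  "\u1FDD": "\u0314\u0300",
  "\u1FDF": "\u0314\u0342"
};
const VOWELS = "\u0391\u0395\u0399\u039F\u03A5\u0397\u03A9"; // ΑΕΙΟΥΗΩ

// Step 1: move each spacing diacritic behind its vowel as combining marks.
function toDecomposed(text) {
  const pattern = new RegExp("([" + Object.keys(charmap).join("") + "])([" + VOWELS + "])", "g");
  return text.replace(pattern, (match, mark, vowel) => vowel + charmap[mark]);
}

// Step 2: normalise every run of vowel + combining marks.
function toPrecomposed(text) {
  const pattern = new RegExp("[" + VOWELS + "][\\u0300-\\u036F]+", "g");
  return text.replace(pattern, (run) => run.normalize("NFKC"));
}

// Ἄρεως written with a separate diacritic in front of the alpha.
const input = "\u1FCE\u0391\u03C1\u03B5\u03C9\u03C2";
const output = toPrecomposed(toDecomposed(input));
console.log(output === "\u1F0C\u03C1\u03B5\u03C9\u03C2"); // true
```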


(As it happened, there weren’t any capitals with iota subscripts and diacritics in the file, so I could simplify the charmap a bit.)





Community Expert, Mar 02, 2023


Hey @Janus Bahs Jacquet, thanks for posting your solution!

- Mark





Guide, Mar 02, 2023


Hi @Janus Bahs Jacquet 


Just a suggestion. Since you perform a compatibility decomposition followed by a canonical composition, there's a good chance that string length is not preserved from f[i].contents to f[i].contents.normalize('NFKC'). It is then much safer to loop backwards when processing f’s items.
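
The effect is easy to reproduce with plain string offsets. In this sketch (ES6+ JavaScript; `findRuns` and `replaceRuns` are made-up helpers), the offsets stand in for InDesign text ranges. That is an assumption about how stale ranges behave, not something verified against the scripting API. Each replacement shortens the text, so offsets collected up front stay valid only if the later matches are handled first.

```javascript
// Collect [start, end) offsets of vowel + combining-mark runs up front.
function findRuns(text) {
  const pattern = /[\u0391\u0395\u0397\u0399\u039F\u03A5\u03A9][\u0300-\u036F]+/g;
  const runs = [];
  let m;
  while ((m = pattern.exec(text)) !== null) {
    runs.push({ start: m.index, end: m.index + m[0].length });
  }
  return runs;
}

// Replace each run with its NFKC form, walking the list in the given order.
function replaceRuns(text, runs, backwards) {
  const order = backwards ? runs.slice().reverse() : runs;
  for (const run of order) {
    const fixed = text.slice(run.start, run.end).normalize("NFKC");
    text = text.slice(0, run.start) + fixed + text.slice(run.end);
  }
  return text;
}

const sample = "\u0391\u0314\u0301 \u0399\u0314\u0301"; // two decomposed capitals
const runs = findRuns(sample);
console.log(replaceRuns(sample, runs, true) === "\u1F0D \u1F3D");  // true
console.log(replaceRuns(sample, runs, false) === "\u1F0D \u1F3D"); // false: the second offsets are stale
```

In the script, the equivalent change would be to run the final loop from f.length - 1 down to 0.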








Contributor, Mar 05, 2023



Ah, that’s a very good point, @Marc Autret – I’ve been bitten by that before when adding text to frames. I’ll have to check with a previous version to make sure that hasn’t happened here!




