Unicode character classes in GREP

Community Beginner, Feb 14, 2025

I would like to remove the spaces in front of the closing German quotation marks in a longer text. GREP, however, does not find the closing quotation marks with the expression \p{Pf}. What's wrong?

TOPICS
Scripting

Community guidelines
Be kind and respectful, give credit to the original source of content, and search for duplicates before posting. Learn more
community guidelines

Correct answer: Community Expert, Feb 15, 2025 (full reply below)

Community Expert, Feb 14, 2025

You need to escape { and } - \{ and \} 

 

Community Expert, Feb 14, 2025

Community Beginner, Feb 16, 2025

Thank you, this Reference-Sheet is helpful!

Community Beginner, Feb 14, 2025

I use the GREP expression in the text file FindChangeList.txt together with the JavaScript FindChangeByList. There, the "/" characters are escaped, but not "{" and "}". To my knowledge, no escaping of "/" is required in InDesign's Find/Replace. There, "\p{Pi}" finds the opening quotation mark (initial punctuation), but "\p{Pf}" does not find the closing one (final punctuation). I suspect this is an error in InDesign.

Community Expert, Feb 14, 2025

Can you share your example INDD file? 

 

And FindChangeList.txt

 

Community Expert, Feb 14, 2025

And in what role are you using {}? 

 

Because {} without escaping - are used for "how many times preceding expression should be repeated". 

 

If there is only one number - {10} - then it's "exactly 10 times" - if two - {2,10} - then it's "between 2 and 10". 
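
For comparison, here is a small sketch of how brace quantifiers behave, using Python's `re` module as a stand-in. This is an assumption for illustration: InDesign's GREP engine is not Python, although counted quantifiers follow the same convention in most Perl-style engines. A single number means an exact count, and a pair means a range.

```python
import re

# Counted quantifiers in Python's re module (a stand-in for illustration,
# not InDesign's own engine).
EXACTLY_TEN = re.compile(r"a{10}")    # exactly ten repetitions
TWO_TO_TEN = re.compile(r"a{2,10}")   # between two and ten repetitions


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for count in (1, 2, 9, 10, 11):
        text = "a" * count
        print(count, matches_whole(EXACTLY_TEN, text), matches_whole(TWO_TO_TEN, text))
```

In engines that support `\p{...}`, the braces there belong to the property escape itself and are not a quantifier, so they are not escaped.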

 

 

Community Expert, Feb 14, 2025

\p{something} is used to match a Unicode character category.

 

I honestly didn't know this existed at all in InDesign's regex implementation until recently, and in testing I found it flaky enough to not want to trust it, so I've continued to use a syntax like 

[\x{####}-\x{####}]+

to specify Unicode ranges. 
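
As a rough illustration of the explicit-range approach, the sketch below uses Python's `re` module, which writes code points as `\uXXXX` rather than InDesign's `\x{XXXX}`. The range U+2018 to U+201F, the sample string, and the function name `find_quote_code_points` are assumptions chosen for illustration, not values from the post above.

```python
import re

# Explicit code point range: U+2018..U+201F covers the curly and low-9
# quotation marks in the General Punctuation block. The range is an
# illustrative choice, not one taken from the thread.
QUOTE_RANGE = re.compile(r"[\u2018-\u201F]+")


def find_quote_code_points(text):
    """List the code points of every character matched by QUOTE_RANGE."""
    return [
        f"U+{ord(ch):04X}"
        for run in QUOTE_RANGE.findall(text)
        for ch in run
    ]


if __name__ == "__main__":
    sample = "\u201EHallo\u201C, \u201Chello\u201D"
    print(find_quote_code_points(sample))
```

Matching by range depends only on the code points in the text, so the result does not change with how a given engine implements Unicode categories.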

Community Beginner, Feb 14, 2025

In FindChangeList.txt, the relevant lines are 11 and 12, where the opening and closing quotation marks are processed.

Community Expert, Feb 14, 2025

Seems to work for me - I have nothing in German to hand, but \s\p{Pf} found all instances of space-before-close-quote in French and Russian and Spanish, just out of what I have open in InDesign right now. 

 

Can you give us more details? How is your closing quotation encoded? Can you find those quotes with any other regex?

Community Beginner, Feb 14, 2025

The closing quotation in my text is coded as U+201D (E2 80 9D).

By the way, I use InDesign 20.1 on a Mac Studio M1.

When I try to Find/Replace opening quotation marks with regex \p{Pi} it finds the closing quotation.

When I try to Find/Replace closing quotation marks with regex \p{Pf} it finds nothing.

Very strange ...

Community Expert, Feb 14, 2025

@Joel Cherney 

 

In the attached sample doc, "\s\p{Pf}" finds nothing for me, even in the UI:

[Screenshot: RobertatIDTasker_0-1739582853172.png]

But "\s\p{Pi}" works:

[Screenshot: RobertatIDTasker_2-1739583009772.png]

 

Community Expert, Feb 14, 2025

I had to cut myself off; what I wrote feels a little bit like the notes for an academic article, the kind I've not written in a few decades. You want to find all instances of a space followed by a German close quote? I suggest this query:

 

\s\x{201C}
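
To make the intent of that query concrete, here is a hedged sketch of the same find-and-delete in Python. Python's `re` spells the code point `\u201C` instead of `\x{201C}`; the lookahead and the function name `strip_space_before_close_quote` are hypothetical choices for illustration, not InDesign syntax.

```python
import re

# Whitespace that is immediately followed by U+201C, the German closing
# double quotation mark. The lookahead keeps the quote itself in place.
SPACE_BEFORE_CLOSE_QUOTE = re.compile(r"\s+(?=\u201C)")


def strip_space_before_close_quote(text):
    """Delete whitespace that sits directly before U+201C."""
    return SPACE_BEFORE_CLOSE_QUOTE.sub("", text)


if __name__ == "__main__":
    german = "\u201EGuten Tag \u201C, sagte sie."
    print(strip_space_before_close_quote(german))
```

Because U+201C is also the English opening quote, a pattern like this is only safe on text known to follow German conventions; on English text it would close up the space before an opening quote.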

 

 

Here's the longer version:

 

My disclaimer, once again, is that "I found this \p{Something} feature flaky, so I don't rely on it." It's the kind of thing I've avoided my entire career, to be honest; I remember encountering POSIX Unicode classes in Perl, and finding that I couldn't rely on e.g. [:alpha:] to behave the same way in different environments. I guess it's down to how regular expressions are implemented in any given app or language or environment, right? And they're not all the same.

 

As far as the Unicode spec goes, I find the "Double Low-9 Quotation Mark" and the "Double Reversed Low-9 Quotation Mark" in the category \p{Ps} which is "Open Punctuation." That contains many varieties of open parentheses or brackets, and those two quote marks. Why? That makes no sense at all. The "Close Punctuation" category has... no quote marks of any kind. In what circumstance could you possibly match a double low-9 as initial punctuation, but some non-quote character as final? The "Initial Punctuation" category for open quotes includes the "Left Double Quotation Mark" encoded at 201C, but it doesn't have the "Double Low-9 Quotation Mark." The "Final Punctuation" category for close quotes has... well, it has nothing useful for German, in any case. 
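
These assignments can be checked against the Unicode Character Database. The sketch below uses Python's standard `unicodedata` module to look up the General_Category of a few quotation marks. The selection of characters and the helper name `general_categories` are illustrative assumptions, and the result reflects the Unicode data bundled with the Python build rather than anything inside InDesign. Note that the abbreviation for Open Punctuation is Ps.

```python
import unicodedata

# A few double quotation marks relevant to German, English and French text.
QUOTE_MARKS = ["\u201E", "\u201C", "\u201D", "\u201F", "\u00AB", "\u00BB"]


def general_categories(chars):
    """Map each character to its two-letter Unicode General_Category."""
    return {ch: unicodedata.category(ch) for ch in chars}


if __name__ == "__main__":
    for ch, cat in general_categories(QUOTE_MARKS).items():
        print(f"U+{ord(ch):04X}  {cat}  {unicodedata.name(ch)}")
```

For German text set with U+201E and U+201C, the opening mark is Ps and the closing mark is Pi, so neither of them falls in Pf.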

 

I can find evidence of this stuff in the Unicode spec dating back to 2008 or so. This is Very Weird. Even back in the 1990s, I can't imagine that you could assemble a room full of academics at a Unicode Consortium conference where nobody knew what kind of quotation marks were used in typesetting in Germany. In 2008? They must have hammered this stuff out over email, and wedged it into the spec when no one was looking. Thinking about the kinds of people I knew at the Unicode Consortium conferences (mostly fearsomely multilingual professors of linguistics), I just can't see them letting stuff this obvious just slide by. 

 

As a localization wonk, I've been making sure that translations had The Right Kind Of Quote Marks going back to the late 1990s. I managed to get through decades of writing regular expressions in multiple environments before I encountered these "Categories," and I can easily imagine why none of my elder localization engineers (or DTP wonks, or Perl nerds, or Javascript pals, or translation wizards) ever mentioned them to me; because they're just not very useful. The categories seem to be grouped by someone with a heavily monolingual-English attitude. It looks half-baked. 

 

So @BMeyendriesch how did you learn about these Categories? Do they work as you'd expect them to in some other environment? It looks to me like a problem with the Unicode spec. Your German Anführungszeichen don't map onto the very English-y assumptions built into these Categories. That being said, there's something wrong with InDesign as well. I like to search for Unicode values with \x{####} but I've been experimenting with other methods for searching, and my results are somewhat inconsistent. But language settings matter, sometimes; if I take some text that is marked as German, and I use the Text section of the Find/Change dialog to search for ^{ I find German-style open quotes, and when searching ^} I find German-style close quotes. If I take the same German text and mark it as English, then ^{ finds the close quote, and ^} finds nothing at all. That makes sense, right? Because your German close quote is encoded at the same point as my North American English open quote. But in the GREP section, ~{ which is ostensibly a Double Open Quote, finds a German close quote, whether it's marked as German or English or Romanian. 

 

I'm sure that we could track down what's implemented correctly and what's implemented incorrectly, given time and dedication. But I think I'm personally just going to continue specifying exactly which Unicode value I'm looking for.

Community Beginner, Feb 15, 2025

Thank you, Joel Cherney, for your detailed explanation of the issue! I had no idea that Unicode character classes were this unreliable. As a scientist, I tend to formulate things as generally as possible, which is why I used character classes in my GREP expressions. However, I will follow your advice from now on and use specific Unicode characters in my regular expressions instead of relying on these classes.

 

Do you think it would be worthwhile to notify the Unicode Consortium about these classification errors? Is there a reasonable chance that they might be corrected?

Community Expert, Feb 15, 2025

Do you think it would be worthwhile to notify the Unicode Consortium about these classification errors? Is there a reasonable chance that they might be corrected?

 

I started writing a "Yes" answer in reply to this question. In the course of reading up on the issue, I've decided to flip to a "No." The reason is I went and found commentary on the issue directly from the source:

 

Many characters have multiple uses, and not all such cases can be captured entirely by the General_Category value.  

[...]

The distinctions between some General_Category values are somewhat arbitrary for edge cases, particularly those involving symbols and punctuation. 

[...]

Characters with the quotation-related General_Category values Pi or Pf may behave like opening punctuation (gc=Ps) or closing punctuation (gc=Pe), depending on usage and quotation conventions.

 

I think that, if we wanted to somehow communicate this error in categorization, the best thing to do at the receiving end would be to revise that last bit to say "depending on locale, usage, and quotation conventions." Plenty of additional descriptions of ambiguity in categorization of punctuation can be found in the spec. It seems unlikely that we're the first people to stumble across this particular question, as well; I'm not likely to try to use the Contact Us form at unicode.org about this. I do intend to try to dig up some old mailing list archives, however, so I can try to trace exactly how these decisions came about. It seems like it should be doable to develop a more thorough historical understanding of what took place than, er, the supposition that they "wedged it into the spec when no one was looking." (Sorry, I'm an American, conspiracy-theory nonsense comes with the territory at the moment.)
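
The point that a closing quote's category depends on the quotation convention can be sketched in a few lines. The table of conventions below is a small illustrative assumption (the labels, the selection, and the function name `closing_quote_category` are hypothetical and do not come from the Unicode text quoted above); the categories themselves are read from Python's `unicodedata` module.

```python
import unicodedata

# (opening, closing) double quotation marks for a few common conventions.
# Illustrative only; real typography has more variants than this.
QUOTE_PAIRS = {
    "English": ("\u201C", "\u201D"),
    "German": ("\u201E", "\u201C"),
    "German (guillemets)": ("\u00BB", "\u00AB"),
    "French": ("\u00AB", "\u00BB"),
}


def closing_quote_category(convention):
    """Return the General_Category of the closing quote for a convention."""
    _opening, closing = QUOTE_PAIRS[convention]
    return unicodedata.category(closing)


if __name__ == "__main__":
    for name in QUOTE_PAIRS:
        print(f"{name}: closing quote is {closing_quote_category(name)}")
```

Under these assumptions, a category-based search for Pf would be expected to catch closing quotes in English or French text but not in German text, where the closing marks are classified as Pi.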

 

 

 

Community Beginner, Feb 16, 2025

Thanks for the detailed explanation of Unicode character categorization! I completely understand that, given the nature of human languages, it’s impossible to have a perfectly clear and consistent classification of special characters. Programming languages definitely have the upper hand in that regard!

 

My suggestion would be to forgo character classes entirely when a clear-cut classification isn’t possible. As it stands now, they only cause confusion.

 

That said, I’ve solved my specific issue using custom GREP classes […] of single Unicode characters.

 

Thanks again for your dedication, Joel Cherney!
