Unicode character classes in GREP

February 14, 2025
5 replies
1113 views

I would like to remove the spaces in front of the closing German quotation marks in a longer text. GREP, however, does not find the closing quotation marks with the expression \p{Pf}. What's wrong?

Scripting

Correct answer Joel Cherney

Thank you, Joel Cherney, for your detailed explanation of the issue! I had no idea that Unicode character classes were this unreliable. As a scientist, I tend to formulate things as generally as possible, which is why I used character classes in my GREP expressions. However, I will follow your advice from now on and use specific Unicode characters in my regular expressions instead of relying on these classes.

Do you think it would be worthwhile to notify the Unicode Consortium about these classification errors? Is there a reasonable chance that they might be corrected?


I started writing a "Yes" answer in reply to this question. In the course of reading up on the issue, I've decided to flip to a "No." The reason is I went and found commentary on the issue directly from the source:

Many characters have multiple uses, and not all such cases can be captured entirely by the General_Category value.

[...]

The distinctions between some General_Category values are somewhat arbitrary for edge cases, particularly those involving symbols and punctuation.

[...]

Characters with the quotation-related General_Category values Pi or Pf may behave like opening punctuation (gc=Ps) or closing punctuation (gc=Pe), depending on usage and quotation conventions.

I think that, if we wanted to somehow communicate this error in categorization, the best thing to do at the receiving end would be to revise that last bit to say "depending on locale, usage, and quotation conventions." Plenty of additional descriptions of ambiguity in categorization of punctuation can be found in the spec. It seems unlikely that we're the first people to stumble across this particular question, as well; I'm not likely to try to use the Contact Us form at unicode.org about this. I do intend to try to dig up some old mailing list archives, however, so I can try to trace exactly how these decisions came about. It seems like it should be doable to develop a more thorough historical understanding of what took place than, er, the supposition that they "wedged it into the spec when no one was looking." (Sorry, I'm an American, conspiracy-theory nonsense comes with the territory at the moment.)
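To make the categorization concrete, here is a minimal Python sketch (not from the thread itself) that looks up the General_Category of a few quotation marks with the standard-library `unicodedata` module. The role labels in the dictionary are annotations added for illustration, and the values reported depend on the Unicode version bundled with the Python build.

```python
import unicodedata

# Quotation marks and the roles they commonly play.
# The role descriptions are illustrative annotations, not Unicode data.
QUOTES = {
    "\u201E": "German opening double quote",
    "\u201C": "German closing / English opening double quote",
    "\u201D": "English closing double quote",
    "\u00AB": "French opening guillemet",
    "\u00BB": "French closing guillemet",
}


def category_of(ch):
    """Return the two-letter General_Category of a single character."""
    return unicodedata.category(ch)


for ch, role in QUOTES.items():
    # Prints the code point, its General_Category, and the role label.
    print(f"U+{ord(ch):04X}  {category_of(ch)}  {role}")
```

The point to look for is U+201C: it is reported as Pi (initial punctuation) even though German typography uses it as the closing mark, which is why a class-based search for Pf passes over it.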

Joel Cherney

Community Expert

Seems to work for me - I have nothing in German to hand, but \s\p{Pf} found all instances of space-before-close-quote in French and Russian and Spanish, just out of what I have open in InDesign right now.

Can you give us more details? How is your closing quotation encoded? Can you find those quotes with any other regex?

Robert at ID-Tasker

Legend

@Joel Cherney

In the attached sample doc, "\s\p{Pf}" doesn't find anything for me, even in the UI.

But "\s\p{Pi}" works.
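The difference between the two searches can be reproduced outside InDesign. The sketch below is only an approximation under stated assumptions: Python's built-in `re` module has no `\p{...}` support, so the hypothetical helper `find_space_before` emulates "whitespace followed by a character of a given General_Category" with `unicodedata`. The sample sentences are invented for illustration.

```python
import unicodedata


def find_space_before(text, category):
    """Return the indexes of whitespace characters that are immediately
    followed by a character whose General_Category equals `category`."""
    hits = []
    for i in range(len(text) - 1):
        if text[i].isspace() and unicodedata.category(text[i + 1]) == category:
            hits.append(i)
    return hits


# Invented sample sentences with a stray space before the closing quote.
german = "Er sagte: \u201EGuten Tag \u201C und ging."
french = "Il a dit : \u00AB Bonjour \u00BB et il est parti."

for label, text in (("German", german), ("French", french)):
    print(label, "Pf hits:", find_space_before(text, "Pf"))
    print(label, "Pi hits:", find_space_before(text, "Pi"))
```

In the German sample only the Pi search reports a hit, while in the French sample the Pf search finds the space before the closing guillemet. That matches both observations in this thread: the class works for French, Russian, and Spanish closing quotes but not for the German one.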


BMeyendriesch (Author)

Participating Frequently

In FindChangeList.txt, the relevant lines are 11 and 12, which process the opening and closing quotation marks.

FindChangeList.txt

Doppelpunkt.zip

BMeyendriesch (Author)

Participating Frequently

I use the GREP expressions in the text file FindChangeList.txt together with the FindChangeByList JavaScript. There, the "\" is escaped, but not "{" and "}". To my knowledge, no escaping of "\" is required in InDesign's Find/Change dialog. There, "\p{Pi}" finds the opening quotation mark (initial punctuation), but "\p{Pf}" does not find the closing one (final punctuation). I suspect this is an error in InDesign.
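As a workaround that does not depend on the Pi/Pf classes, the closing marks can be named by code point. The following is a minimal Python sketch, assuming the text follows the German convention in which U+201C closes a double quotation and U+2018 closes a single one. The names `GERMAN_CLOSING` and `strip_space_before_closing` are hypothetical, and this is not the FindChangeList.txt syntax.

```python
import re

# Assumption: German convention, where U+201C and U+2018 are closing marks.
GERMAN_CLOSING = "\u201C\u2018"

# One or more whitespace characters, only if a closing mark follows.
# The lookahead keeps the quotation mark itself out of the match.
SPACE_BEFORE_CLOSING = re.compile(rf"\s+(?=[{GERMAN_CLOSING}])")


def strip_space_before_closing(text):
    """Remove whitespace that directly precedes a German closing quote."""
    return SPACE_BEFORE_CLOSING.sub("", text)


sample = "\u201EGuten Tag \u201C sagte er."
print(strip_space_before_closing(sample))
```

Because U+201C and U+2018 are opening marks in English typography, this pattern would also eat the space before an English opening quote, so it is only safe on text that is German throughout. In InDesign's GREP the same idea would presumably be written with a code-point escape such as \x{201C} inside the lookahead; that is an assumption to verify in the Find/Change dialog before adding it to a list file.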