Unicode character classes in GREP
I would like to remove the spaces in front of the closing German quotation marks in a longer text. GREP, however, does not find the closing quotation marks with the expression \p{Pf}. What's wrong?
Thank you, Joel Cherney, for your detailed explanation of the issue! I had no idea that Unicode character classes were this unreliable. As a scientist, I tend to formulate things as generally as possible, which is why I used character classes in my GREP expressions. However, I will follow your advice from now on and use specific Unicode characters in my regular expressions instead of relying on these classes.
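For anyone landing here with the same problem, here is a minimal sketch of the "specific characters" approach, written in Python rather than in a GREP dialog. It assumes the closing marks are U+201C (double) and U+2018 (single), as in the German „…“ convention, and that the stray spaces are ordinary or no-break spaces. The function and constant names are made up for illustration.

```python
import re

# Assumption: German-style closing marks, U+201C (double) and U+2018 (single).
CLOSING_GERMAN_QUOTES = "\u201C\u2018"

# One or more ordinary or no-break spaces directly before a closing mark.
# The lookahead keeps the quotation mark itself out of the match.
SPACE_BEFORE_CLOSING = re.compile(
    r"[ \u00A0]+(?=[" + CLOSING_GERMAN_QUOTES + r"])"
)


def strip_space_before_closing_quotes(text: str) -> str:
    """Remove spaces that sit immediately before a closing German quote."""
    return SPACE_BEFORE_CLOSING.sub("", text)


sample = "Er sagte: \u201EGuten Tag \u201C und ging."
print(strip_space_before_closing_quotes(sample))
# prints: Er sagte: „Guten Tag“ und ging.
```

The same pattern should carry over to any GREP flavor that accepts code-point escapes, though the exact escape syntax varies by tool and is not something I have checked here. Note that U+201C and U+2018 are opening marks in English text, so this is only safe on text that follows the German convention throughout.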
Do you think it would be worthwhile to notify the Unicode Consortium about these classification errors? Is there a reasonable chance that they might be corrected?
I started writing a "Yes" answer in reply to this question. In the course of reading up on the issue, I've decided to flip to a "No." The reason is that I found commentary on the issue directly from the source:
Many characters have multiple uses, and not all such cases can be captured entirely by the General_Category value.
[...]
The distinctions between some General_Category values are somewhat arbitrary for edge cases, particularly those involving symbols and punctuation.
[...]
Characters with the quotation-related General_Category values Pi or Pf may behave like opening punctuation (gc=Ps) or closing punctuation (gc=Pe), depending on usage and quotation conventions.
I think that, if we wanted to somehow communicate this error in categorization, the best outcome at the receiving end would be a revision of that last bit to say "depending on locale, usage, and quotation conventions." Plenty of additional descriptions of ambiguity in the categorization of punctuation can be found in the spec. It also seems unlikely that we're the first people to stumble across this particular question, so I'm not likely to use the Contact Us form at unicode.org about this. I do intend to dig up some old mailing list archives, however, so I can try to trace exactly how these decisions came about. It should be doable to develop a more thorough historical understanding of what took place than, er, the supposition that they "wedged it into the spec when no one was looking." (Sorry, I'm an American; conspiracy-theory nonsense comes with the territory at the moment.)
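The General_Category values discussed above can be checked directly. This is a small sketch using Python's standard `unicodedata` module, which reports the category from the Unicode database bundled with the interpreter. The choice of characters and the helper name are mine, picked to cover the common German and guillemet styles.

```python
import unicodedata

# Quotation marks used in German typography (selection is illustrative).
QUOTE_MARKS = [
    "\u201E",  # „ opening mark in the „…“ style
    "\u201C",  # “ closing mark in the „…“ style
    "\u201D",  # ” closing mark in the English “…” style
    "\u00BB",  # » opens in the »…« style, closes in the «…» style
    "\u00AB",  # « closes in the »…« style, opens in the «…» style
]


def general_categories(chars):
    """Map each character's code point to its General_Category value."""
    return {f"U+{ord(ch):04X}": unicodedata.category(ch) for ch in chars}


for code_point, category in general_categories(QUOTE_MARKS).items():
    print(code_point, category)
# prints:
# U+201E Ps
# U+201C Pi
# U+201D Pf
# U+00BB Pf
# U+00AB Pi
```

The German closing mark U+201C comes back as Pi, not Pf, which would explain why \p{Pf} never matches it. The same holds for U+00AB when it closes a »…« quotation.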