I've recently uploaded a GREP list/cheat sheet/reference card to a couple of topics. It's one I created a few years ago and have been updating for my own uses, since I can't seem to find one that meets all my wishes — complete, accurate, organized, reasonably tidy-looking and without the points of murky example or understanding that so many of the references out there seem to have. (Okay, I'm fussy.)
After my upload yesterday I noticed an error, and some omissions, and one thing led to another, and so a very polished version of what I think a GREP reference should be is attached. I hope both newcomers and grizzled GREPpers find it useful.
PDF and IDML also available on my digital publishing reference site.
All due credit, both in general and as a resource for this update/polishing round, to @Peter Kahrel's authoritative GREP in InDesign.
[Older version removed; updated file in later post.]
Hi @Peter Kahrel I am conscious that you have given this your last shot, but (sorry!) I do seem to want to comment on a couple of things. 😬
> ... and some people do, not only because it's more flexible, but also because it's less typing and you don't have to wonder whether the < comes before the = or the other way around.
Totally valid point, and gives a good reason to use \K over a lookbehind. It is not a valid reason to *call* it a lookbehind, however.
> As to your comparison chart, you can phrase things any way to suit a purpose. ...
Yes! And that's my point: I prefer to use phrases—where pedagogically appropriate—to correspond to how it actually is. I mean, your hypothetical phrasing isn't remotely correct—on the right, the entire matched contents is *never* discarded and the "apple" in the lookbehind is never captured. On these two points the \K performs the exact opposite to a positive lookbehind!
Anyway, while I might enjoy talking about this kind of minutiae when it takes my fancy, I don't expect others to be similarly moved—except maybe Robert 😉 I see you there!—so I fully understand if you want to leave this where it is.
- Mark
I've come out of retirement for lesser things!
> on the right, the entire matched contents is *never* discarded and the "apple" in the lookbehind is never captured. On these two points the \K performs the exact opposite to a positive lookbehind!
I don't think that 'discarded' is a useful term here because neither lookbehind captures what it's looking behind for, so to speak. When a search pattern is placed in parentheses the results are captured (and can be referred to using \1, \2, etc. or $1, $2, etc.). The lookbehind parts of the two expressions (\D+ and apple) are matched but never captured.
Only now, by the way, because you mentioned the opposite behaviour of the two constructs, do I begin to understand how you understand the difference: the \K lookbehind (in your example) looks for \D+ and when found, checks whether it's followed by \d+, and the classic lookbehind looks for \d+ and checks whether it's preceded by 'apple'. Are you sure that that's how it works?
Coming back to your dialog, the teacher should have pointed out to the student that they could use
((?<=apple)|(?<=banana))\d+
as a kind-of-variable-length lookbehind in their particular example. It's just that you can't use any of the repeat operators.
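A minimal sketch of the alternation trick above, written in JavaScript only so it can be run outside InDesign. The sample string and the helper name `numbersAfterFruit` are invented for illustration, and JavaScript's regex engine is not InDesign's, so treat this as an illustration of the pattern rather than proof of InDesign's behaviour.

```javascript
// Hypothetical helper: find digit runs preceded by "apple" or "banana".
// Each lookbehind is fixed-length; the alternation offers a choice of lengths.
function numbersAfterFruit(text) {
  const pattern = /((?<=apple)|(?<=banana))\d+/g;
  return text.match(pattern) || [];
}

// "cherry6" is skipped because neither lookbehind succeeds before the 6.
console.log(numbersAfterFruit("apple12 banana345 cherry6"));
```

The outer parentheses capture nothing useful here (an empty string), so a non-marking group would do the same job.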
> I've come out of retirement for lesser things!
 
@Peter Kahrel Haha, love it! I sometimes enjoy a bit of nitpicking when the mood takes me—and I am grateful for the indulgence here—but I promise to not make it a regular habit on other threads!
tl;dr
I've realised that we simply don't agree on a point you made earlier, Peter, that
"Lookbehind is a functional notion, not a formal one."
I guess I would feel awkward referring to \K as a lookbehind in the same way a car mechanic might feel awkward in a discussion wherein "turbocharger" and "supercharger" were interchanged. These mechanisms might give very similar, high-level results, but are otherwise fundamentally different. And, importantly, those very slight high-level differences are 100% explained by the fundamental differences between them (e.g. a turbocharger will have a small delay because it is driven by exhaust gases, while the supercharger is driven by the crankshaft and has no delay). The words "supercharger" and "turbocharger" are engineering terms, not day-to-day terms, and in many contexts they could be used imprecisely with no problems. However, if I was teaching anybody in a related field (say, auto-electrician?) I would be more careful to use correct terminology.
This is why I, personally, would hesitate to describe \K as a lookbehind. But I can see why it is a perfectly reasonable thing to do in some circumstances. It would rarely be a big deal in the real world, and certainly isn't here on this thread. 🙂
And with that I will sign off, and thanks for indulging me. I had no idea my initial comment was going to provoke so much examination. It was fun.
- Mark
___________________________
Further nitpicking and details, from your last reply. This is for the masochists only!
> because neither lookbehind captures what it's looking behind for
This claim is wrong. During the \K procedure, "apple" IS captured (or "matched", or "consumed"; let me know if the wording is the problem here) and when the \K activates, the captured contents is discarded: the bucket is emptied. During the positive lookbehind procedure "apple" is never captured at all. The grep engine is first looking to match the "1" of the regular expression. As it marches on it sees a "1" and then evaluates the "apple", probably backwards. Arrows show the difference. If you draw arrows showing the location of the "current character" under scrutiny by the state machine, in the \K example it marches left to right, with no backtracking, but in any lookbehind example it will see a match (but not yet capture), e.g. the "1", then cast its eye back to the left ("e", "l", "p", "p", "a"), evaluating the symbols in the lookbehind (?<=), and then actually capture the "1". It is a totally different process and here you can see why the latter is a lookbehind and the former is not.
> I don't think that 'discarded' is a useful term here
It was useful to me, writing the previous paragraph.
> When a search pattern is placed in parentheses the results are captured (and can be referred to using \1, \2, etc or $1, $2, etc).
You are forgetting the root capture group $0, which DID capture "apple" in the \K version (before it was discarded), but DIDN'T in the positive lookbehind version. But capture groups, per se, aren't relevant to my point.
> now do I begin to understand how you understand the difference: the \K lookbehind (in your example) looks for \D+ and when found, checks whether it's followed by \d+,
That is a strange way of describing the normal grep engine character-by-character matching behaviour which, yes, goes left-to-right matching as it goes. The \K just resets in the middle of this process. There is no lookbehind happening in the \K process: it is a simple forward-only process and will be very fast code. I'm sure that's why they implemented it.
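JavaScript has no \K, so the following sketch cannot run the \K pattern itself. It only contrasts what each approach consumes: the lookbehind match starts at the digits, while a plain left-to-right match of \D+(\d+) consumes the letters first, which is the part \K would then drop. The sample text and variable names are assumptions for illustration.

```javascript
const sample = "apple123";

// Classic lookbehind: "apple" is checked but never part of the match.
const viaLookbehind = /(?<=apple)\d+/.exec(sample);

// Forward-only match: "apple" is consumed as part of the overall match.
// In a flavour with \K (\D+\K\d+), the consumed prefix would be dropped
// at the \K; here the capture group stands in for what would remain.
const viaConsume = /\D+(\d+)/.exec(sample);

console.log(viaLookbehind[0], viaLookbehind.index); // match text and start position
console.log(viaConsume[0], viaConsume[1], viaConsume.index);
```

The start positions differ: the lookbehind match begins at the first digit, while the consuming match begins at the "a" of "apple".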
A student in one school might ask: "I love this \K version of the positive lookbehind—it's so simple and fast. What is the negative lookbehind version?" At the same time a student in another school might never ponder what the opposite of "discard the captured content" is.
> Are you sure that that's how it works?
At the level of this discussion, yep!
> Coming back to your dialog, the teacher should have pointed out to the student that they could use ((?<=apple)|(?<=banana))\d+
Yes, that would be a great thing for the teacher to explain—no matter what anyone thinks about our nitpicking!
Nice analogy, Mark, the compressors. So you can have two types of charger. Maybe we can return the analogy to GREP and say that there are two types of behind (no pun intended): lookbehind and keepbehind.
P.
> Lookbehind and keepbehind.
Not too bad! 🙂
Which, really, I think covers 99% of what InDesign GREP users would want to know in order to choose one over the other. Mastering the regular lookbehind is a fairly basic — okay, fairly midlevel — skill. Adding the flexibility of \K to the toolkit gives some options to get around the variable-length issue and is worth knowing.
But unless the user is also a coding/systems/'nix maven who routinely writes 100-character GREP strings, I can't see that all the subtleties matter much to them. As with those obsessed with the minutiae of EPUB code, for example, there's always-always another layer to approaches and fixes, but the practical, everyday usefulness is found at a much simpler level.
So —
Absolutely, James. Never mind us nitpicking.
Thank you for the updated chart, it's a wonderful thing.
One comment though (hadn't spotted it earlier): in a note you indicate that non-marking sub-expressions are obsolete. They aren't, they may not be used much, but they're certainly not obsolete.
Peter wrote:
> ... in a note you indicate that non-marking sub-expressions are obsolete. They aren't, they may not be used much, but they're certainly not obsolete.
Yes, definitely! Non-marking (AKA non-capturing) groups are very useful when needed.
James wrote:
> Which, really, I think covers 99% of what InDesign GREP users would want to know in order to choose one over the other.
True, but it misses my point, which I can summarise in one sentence:
Teaching \K as "Reset match" or "Discard capture" or "Clear capture" (or, for the mnemonic, "Klear capture"? or even—sigh—"Keep out" [Edit: or Peter's improvement: "keepbehind"?]) is no harder than teaching it as an "(Inclusive) Lookbehind*" and has the benefit that this description is real, not notional, and stands a chance to implant an accurate mental model in the student, that will withstand contact with the wider world.
Now I am imagining, for the sake of the exercise, a cohort of students leaving the academy and going out into the world to say things like "Oh, you'll need an Inclusive Lookbehind in that case." If that doesn't make the hair on your neck stand up, then, good for you: you are a normal healthy person!   🙂
- Mark
Edit: had to reformat because forum software was putting my reply into 5 columns! It has clearly had enough of nitpicking!
When I've asked you for clarification - you've told me to get lost.
Yeah, your lengthy example to clarify things looks like no big deal at all ... almost...
Oh boy, Robert, perhaps I am just not correctly parsing your conversational style here. It seems to me that you have asked several closed-ended questions to which I answered immediately. I don't remember telling you to get lost.
Perhaps it is the difference between your post
> Or it all works as intended - and I just misunderstood it?
and Peter's post
> I don't see why you wouldn't want to call \K a lookbehind. Like the classic lookbehind, it finds things if they're preceded by a certain pattern. Lookbehind is a functional notion, not a formal one.
Can you see the difference? I didn't even know how to answer your question, whereas Peter's reply—which wasn't even a question—not only clarified his thoughts on the topic somewhat, but also gave me something concrete to respond-to, and I responded at considerable length, for no other reasons than (a) I wanted Peter and other readers to understand where I was coming from, (b) I wanted to explore a subtle philosophical concept, and (c) I wanted to practice writing and presenting a topic in a helpful way (and yes, yes, I know I did a poor job—but I've done worse before so I am not totally unhappy). A long, detailed response does not mean the topic is "a big deal"—it might be just interesting, or whatever.
Also, Peter is not being weirdly combative, which I appreciate.
- Mark
Changing narrative again... OK, you win.
James -- Just one small comment on your excellent chart: in the Character-class box the label for \x{■} says "Hex Code (2 or 4 digit)", but that can be 1- or 3-digit too. You can omit leading zeros: \x{9} finds tab characters, \x{14b}, the eng. It's no big deal, but maybe say '1-4 digits'.
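A small sketch of the leading-zero point, in JavaScript so it can be run outside InDesign. Note the assumption: JavaScript spells this escape \u{...} (and requires the u flag) where InDesign GREP uses \x{...}, so this illustrates the same idea in a different flavour. The test strings are invented.

```javascript
// Code point escapes with and without leading zeros.
const tabShort = /\u{9}/u;     // one hex digit
const tabPadded = /\u{0009}/u; // same character, padded to four digits
const eng = /\u{14b}/u;        // three hex digits: the eng (U+014B)

console.log(tabShort.test("a\tb"), tabPadded.test("a\tb"), eng.test("si\u014B"));
```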
P.
Noted, thanks. I will probably do an update once all the dust settles here... but (re the \K thing) I did mean this to be a quickref chart and not a full annotated distillation of your book. 🙂
James, I applaud your neat and orderly layout presentation. It looks nice.
This is the way I have been describing it (\K) in my GREP code list:
Lookbehind (keep text found out of match) ............. \K
Do you think that should be rewritten for better clarity?
Okay.
(1) I changed the note for non-marking expression from "* Mostly Obsolete" to "* Obsolete" to make room for a few more words about our pal \K, there not being another millimeter for text on this sheet. I settled on Mostly Obsolete after the last discussion in which it was explained to me that 'non marking' means it doesn't use up system memory over execution, which can be, or was, important in systems with small working memory. Since pretty much all systems have vast-to-unlimited memory these days, I suppose some massive GREP string might benefit from this efficiency twist, but my thought is anyone writing such huge search strings already knows that. So I'm changing that to "*Rarely needed in place of regular Subexpression."
(2) I find it amusing/vexing/frustrating that there is hardly ANY authoritative info on our pal \K... searches turn up little but brief notes and/or endless discussion/argument, most of which are written in pure gibberish. It isn't even included on some majority of references/charts. Were I not a compleatist, I'd just delete the whole line and be done with it; this sheet is meant for beginners to occasional GREPpers, not those who think in terms of combined lookbehind/lookahead structure. So to make the chart compleat, avoid looking like a compleat idiot to GREP-fu Masters, and make use of the limited space, I'm naming it Reset Match and making the note "†Complicated geekahol."
The IDML version is all yours for further, personal, deeply meaningful edits. 🙂
If it was just about memory then I'd agree with you that there's not much need for non-marking expressions. But it's not only about memory, it's about speed as well. For the occasional GREP expression it's not going to make much difference. But if you stuff your document with GREP styles and use a lot of grouping -- expressions enclosed in parentheses -- it will certainly make a difference.
Edit 2025-02-20: Don't bother reading this example. Sorry to all, but something was nagging me and I realised that example I dredged up is a useless one. I mis-remembered the reason I had for using the non-capturing group: it was just for code-readability—I probably just liked the neatness of matching 1, 2, 3. I could have just used a normal capture group and accessed indexes 1 , 3, and 4 and it would have been fine I think. So please ignore this example—it is useless. Sorry!
Funnily enough I had never considered that memory or speed were a reason to use non-marking groups (AKA non-capture groups), although both of those are good reasons.
I use them when I need to control the indexing of the captured text. My example is in a scripting context, and off the top of my head I don't remember ever using this in the InDesign UI but I don't see why it wouldn't apply when using group references in the changeTo field. [EDIT: yes it does apply perfectly; see my answer to Robert below for the same example converted into normal InDesign find/change grep.]
Here is a simple example:
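function getJobNumber(doc) {
    // Group 1: full job number; group 2: optional two-letter country code;
    // group 3: the numeric part. (?: ) groups the optional "UK-" prefix without capturing it.
    const matchJobNumber = /(^(?:([A-Z]{2})-?)?([-\d]+))_/;
    var match = doc.name.match(matchJobNumber);
    if (match && match.length > 1) {
        return {
            fullCode: match[1],
            countryCode: match[2],
            numberCode: match[3]
        };
    }
    // Falls through (returns undefined) when the name has no job number.
}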
The brief was that some document names started with job numbers in "UK-12345_" format, some had no hyphen "UK12345_" and some were just "12345_". I needed to collect (1) the full job number, (2) the countryCode by itself, and (3) the number.
(?:([A-Z]{2})-?)?
This part collects the country code. I needed to group it because this whole part is optional (the question mark at the end). But I never want to use its captured string because I only want the two letter country code without the possibly trailing hyphen.
Using a normal capture group (([A-Z]{2})-?)? would have misaligned the indices of the groups between the case where a country code existed and when it didn't. So using (?: ) keeps that optional grouping out of my results. The inner capture group ([A-Z]{2}) happily is still allocated its rightful index even if the outer non-capture group is not found (in which case the country code match[2] is an empty string).
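For anyone who wants to check the point made in the edit note at the top of this post, here is a minimal sketch comparing the two patterns side by side. The file names and the helper `jobParts` are hypothetical, and this runs in a modern JavaScript engine rather than ExtendScript (where an unmatched group may be reported differently), so the helper normalises a missing country code to an empty string.

```javascript
// Hypothetical document names covering the three formats in the brief.
const names = ["UK-12345_brochure.indd", "UK12345_brochure.indd", "12345_brochure.indd"];

const nonCapturing = /(^(?:([A-Z]{2})-?)?([-\d]+))_/; // groups 1, 2, 3
const capturing = /(^(([A-Z]{2})-?)?([-\d]+))_/;      // groups 1, 2, 3, 4

// Pull the three parts out of a name, given which group holds which part.
function jobParts(name, pattern, countryIndex, numberIndex) {
  const m = name.match(pattern);
  if (!m) return null;
  return { fullCode: m[1], countryCode: m[countryIndex] || "", numberCode: m[numberIndex] };
}

for (const name of names) {
  // Non-capturing version reads groups 2 and 3; capturing version reads 3 and 4.
  console.log(jobParts(name, nonCapturing, 2, 3), jobParts(name, capturing, 3, 4));
}
```

Both versions give the same parts for every name, because group numbers are fixed by the position of the opening parenthesis, not by whether an optional group took part in the match. The non-marking group only spares you the unused extra index.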
Having said all that, I would judge that non-marking groups were well beyond the needs of the 99% users and if you needed space for something I would be tempted to leave it off your chart altogether.
- Mark
But this:
    var match = doc.name.match(matchJobNumber);
has nothing to do with InDesign's GREP implementation?
Exactly @Robert at ID-Tasker! That line would be a poor choice to highlight.
- Mark
I was under the impression that you can read between the lines...
You're using JavaScript's RegEx on a text variable/string - not InDesign's GREP, that works on text objects.
Hi Robert, I try not to read between the lines, due to my often poor success rate.
> You're using JavaScript's RegEx on a text variable/string - not InDesign's GREP, that works on text objects.
Again, you are exactly correct!
But for your sake I will go into reading-between-the-lines territory and guess that you are concerned that, because my example uses a different flavour of grep (ExtendScript's RegExp vs InDesign's PCRE), it will not be applicable. This is a legitimate concern, but rest assured, I knew what I was doing: the grep pattern I used in my example is 100% compatible with both flavours.
So why didn't I just use an actual Indesign grep example? I was lazy. Sorry.
But I will rectify my laziness now—
Here is the exact same example, converted into a normal InDesign Find/Change GREP context:
EDIT 2025-02-20: Do not bother reading this. Yes, this is a faithful conversion of my previous example, but it is a useless example—I could have used a normal capture group along with $1, $3 and $4, and it would have been fine. See my note on the earlier post. My apologies for wasting your time!
1. Using the non-capture group gives me what I want.
2. In this case using a normal capture group messes up the capture group indices and I get a mess.
Hope that helps.
- Mark
Edit: a couple of typos.
Last comments/suggestions on your latest idml, @James Gifford—NitroPress: (apologies for wrong fonts in screen shots).
Paragraph Return and Carriage Return: I wondered whether it might be more readable to combine these, consistent with "End of Story".
1. If you did decide to remove the non-marking group, I would nominate the "negated character set" for its spot. They are really useful—let me know if you want examples.
2. the term "subexpression" in my opinion focuses on the wrong thing (I mean there are subexpressions elsewhere, eg. lookbehinds have subexpressions). I would suggest "Capture Group" which is a standard term.
3. Regarding "Reset Match": I would suggest the note "Discards any text already captured". Also it mustn't show a magenta box because there are no parameters for \K.
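On the negated character set mentioned in item 1, a minimal sketch of what one does, in JavaScript so it can be run outside InDesign. The sample string is invented, and the same `[^,]+` pattern is assumed to behave equivalently in InDesign's GREP.

```javascript
// [^,]+ matches a run of one or more characters that are NOT commas,
// which splits a comma-separated line into its fields.
const line = "Smith, John, Editor";
const fields = line.match(/[^,]+/g).map(function (s) { return s.trim(); });

console.log(fields);
```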
Your chart is a fantastic addition and I sincerely hope you aren't regretting sharing it here! 🙂
- Mark
I am largely following existing terms, established formats and accepted presentation. While there are areas where I am confident in staking out new territories, terms and interpretations (let's chat about EPUB some time), this is not one of them.
@James Gifford—NitroPress Thanks for the chart. And although this thread has become nearly impossible to follow in any navigational way, all that back and forth discussion about \K has finally, I think, given me an understanding of the functional use for it, so thanks to all the folks who chimed in about that, too.
Just wanted to say I'm enjoying this discussion in the deep end of the swimming pool. I'm enjoying the learning, and it gives me insights when teaching GREP to others. Thanks James, Mark, Peter, Robert, and all you code-oriented folks on this extraordinary forum!