A Quick Behind the Scenes of Greedy and Lazy Matching
Regular expressions can be a bit of a nightmare — if you encounter them for the first time.
That was what I used to think as well, before I eventually started to like them. You could sense the power of regular expressions, and they seemed to be ubiquitous — they show up in all kinds of places, and you know that there is no escaping from a regular expression (or, a regex). And, one of the things you would keep hearing on your encounters is the concept of a greedy versus lazy matching. Just when you think you finally grasp them somehow, they might seem to be those mysterious concepts once again a week later. So, let's see what is happening here.
Let's say we have a string: 'One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.'
. (If you're not familiar with the koan — not the wind, not the flag; mind is moving.)
What we want to do is to match each of the quoted sentences: "The flag is moving."
and "The wind is moving."
. Using JavaScript, the first thing comes to mind is to perhaps use something like this regex to get a match:
let regex = /".+"/g;
let regex = /".+"/g;
However, if you try this, it gets a match for everything after the first quote:
let s = 'One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.';
let regex = /".+"/g;
s.match(regex);
// -> [ '"The flag is moving." The other said: "The wind is moving."' ]
let s = 'One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.';
let regex = /".+"/g;
s.match(regex);
// -> [ '"The flag is moving." The other said: "The wind is moving."' ]
This is not what we want. But, why that happens is because of the greedy algorithm that our regex uses. What it does is basically going to the next position if there is no match for a given position. Let's see what that means.
The first character of the pattern ".+"
to match is "
.
The regex engine starts with the 0th index of the string, which is O
, therefore gets no match. If there is no match, remember, it goes to the next position. It is n
, still no match. It continues advancing until it finds "
at index 10:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Now that the regex engine has found it, it tries to match the next part of the pattern, which is the dot (.
) character. The dot is a special quantifier that will match "any character except a newline." The character after the first quote on our string was T
, so, it matches. Because of the +
quantifier after the dot, it will repeat as long as a character matches the dot — which will be all the rest of the characters in the string in this case:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Next, the rest of the pattern ("
at the end) needs to match as well. However, we are at the end of the string already. So, what happens now is, that the engine backtracks. It goes back one character, and will try to match from that character onwards:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Since the character after the match is not a quote ("
), it does not match. Therefore, it backtracks again:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Now, e
is not "
as well. The engine backtracks again and again until it finds the "
:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Our regex engine has found it, and is happy now.
Though, we have a g
flag for global match, and the search would continue searching for the pattern again after the first match. But, the rest of the string (They argued but could not agree.
) has no other quotes ("
) — therefore, we don't get any more results.
Now that we know that was not our intention to get this result, let's take a look at lazy matching.
We can do the lazy match with the quantifier ?
after the +
. So, our new regex would be:
let regex = /".+?"/g;
let regex = /".+?"/g;
It means that we want to repeat as little as possible. Now it works as intended:
let s = 'One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.';
let regex = /".+?"/g;
s.match(regex);
// -> [ '"The flag is moving."', '"The wind is moving."' ]
let s = 'One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.';
let regex = /".+?"/g;
s.match(regex);
// -> [ '"The flag is moving."', '"The wind is moving."' ]
The first thing it does is to look for the quote ("
) to match the first character in the pattern ".+?"
. Similar to before, it tries until it finds it at index 10:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Next, it matches the dot:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Now, things are different from greedy search. Instead of matching the dot continually, the engine tries matching the rest of the pattern now. In this case, instead of trying to match h
for the dot (.
) first, it checks if the h
character matches the quote ("
). Since it does not, the engine matches it with the dot and continue:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Now, it checks e
for the quote ("
). No match again. It continues until finds the quote:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
We've found the first match! Because of the global flag, it will start again from the end of the first match, and find another result for the pattern again eventually:
One said: "The flag is moving." The other said: "The wind is moving." They argued but could not agree.
Finally, we've got the result we wanted and are now happy — thanks to understanding the difference between greedy and lazy matching.
Note: If you're not satisfied with the koan example and still confused, here is a shorter and simpler explanation: https://stackoverflow.com/a/2301298.