Enum regex_automata::util::search::MatchKind
source · #[non_exhaustive]pub enum MatchKind {
All,
LeftmostFirst,
}
Expand description
The kind of match semantics to use for a regex pattern.
The default match kind is LeftmostFirst
, and this corresponds to the
match semantics used by most backtracking engines, such as Perl.
§Leftmost first or “preference order” match semantics
Leftmost-first semantics determine which match to report when there are
multiple paths through a regex that match at the same position. The tie is
essentially broken by how a backtracker would behave. For example, consider
running the regex foofoofoo|foofoo|foo
on the haystack foofoo
. In this
case, both the foofoo
and foo
branches match at position 0
. So should
the end of the match be 3
or 6
?
A backtracker will conceptually work by trying foofoofoo
and failing.
Then it will try foofoo
, find the match and stop there. Thus, the
leftmost-first match position is 6
. This is called “leftmost-first” or
“preference order” because the order of the branches as written in the
regex pattern is what determines how to break the tie.
(Note that leftmost-longest match semantics, which break ties by always taking the longest matching string, are not currently supported by this crate. These match semantics tend to be found in POSIX regex engines.)
This example shows how leftmost-first semantics work, and how it even applies to multi-pattern regexes:
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
Match,
};
let re = PikeVM::new_many(&[
r"foofoofoo",
r"foofoo",
r"foo",
])?;
let mut cache = re.create_cache();
let got: Vec<Match> = re.find_iter(&mut cache, "foofoo").collect();
let expected = vec![Match::must(1, 0..6)];
assert_eq!(expected, got);
§All matches
The All
match semantics report any and all matches, and generally will
attempt to match as much as possible. It doesn’t respect any sort of match
priority at all, so things like non-greedy matching don’t work in this
mode.
The fact that non-greedy matching doesn’t work generally makes most forms
of unanchored non-overlapping searches have unintuitive behavior. Namely,
unanchored searches behave as if there is a (?s-u:.)*?
prefix at the
beginning of the pattern, which is specifically non-greedy. Since it will
be treated as greedy in All
match semantics, this generally means that
it will first attempt to consume all of the haystack and is likely to wind
up skipping matches.
Generally speaking, All
should only be used in two circumstances:
- When running an anchored search and there is a desire to match as much as possible. For example, when building a reverse regex matcher to find the start of a match after finding the end. In this case, the reverse search is anchored to the end of the match found by the forward search.
- When running overlapping searches. Since
All
encodes all possible matches, this is generally what you want for an overlapping search. If you try to use leftmost-first in an overlapping search, it is likely to produce counter-intuitive results since leftmost-first specifically excludes some matches from its underlying finite state machine.
This example demonstrates the counter-intuitive behavior of All
semantics
when using a standard leftmost unanchored search:
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
Match, MatchKind,
};
let re = PikeVM::builder()
.configure(PikeVM::config().match_kind(MatchKind::All))
.build("foo")?;
let hay = "first foo second foo wat";
let mut cache = re.create_cache();
let got: Vec<Match> = re.find_iter(&mut cache, hay).collect();
// Notice that it completely skips the first 'foo'!
let expected = vec![Match::must(0, 17..20)];
assert_eq!(expected, got);
This second example shows how All
semantics are useful for an overlapping
search. Note that we use lower level lazy DFA APIs here since the NFA
engines only currently support a very limited form of overlapping search.
use regex_automata::{
hybrid::dfa::{DFA, OverlappingState},
HalfMatch, Input, MatchKind,
};
let re = DFA::builder()
// If we didn't set 'All' semantics here, then the regex would only
// match 'foo' at offset 3 and nothing else. Why? Because the state
// machine implements preference order and knows that the 'foofoo' and
// 'foofoofoo' branches can never match since 'foo' will always match
// when they match and take priority.
.configure(DFA::config().match_kind(MatchKind::All))
.build(r"foo|foofoo|foofoofoo")?;
let mut cache = re.create_cache();
let mut state = OverlappingState::start();
let input = Input::new("foofoofoo");
let mut got = vec![];
loop {
re.try_search_overlapping_fwd(&mut cache, &input, &mut state)?;
let m = match state.get_match() {
None => break,
Some(m) => m,
};
got.push(m);
}
let expected = vec![
HalfMatch::must(0, 3),
HalfMatch::must(0, 6),
HalfMatch::must(0, 9),
];
assert_eq!(expected, got);
Variants (Non-exhaustive)§
This enum is marked as non-exhaustive
All
Report all possible matches.
LeftmostFirst
Report only the leftmost matches. When multiple leftmost matches exist, report the match corresponding to the part of the regex that appears first in the syntax.