Struct regex_automata::util::iter::Searcher
source · pub struct Searcher<'h> {
input: Input<'h>,
last_match_end: Option<usize>,
}
Expand description
A searcher for creating iterators and performing lower level iteration.
This searcher encapsulates the logic required for finding all successive non-overlapping matches in a haystack. In theory, iteration would look something like this:
- Setting the start position to
0
. - Execute a regex search. If no match, end iteration.
- Report the match and set the start position to the end of the match.
- Go back to (2).
And if this were indeed the case, it’s likely that Searcher
wouldn’t
exist. Unfortunately, because a regex may match the empty string, the above
logic won’t work for all possible regexes. Namely, if an empty match is
found, then step (3) would set the start position of the search to the
position it was at. Thus, iteration would never end.
Instead, a Searcher
knows how to detect these cases and forcefully
advance iteration in the case of an empty match that overlaps with a
previous match.
If you know that your regex cannot match any empty string, then the simple algorithm described above will work correctly.
When possible, prefer the iterators defined on the regex engine you’re using. This tries to abstract over the regex engine and is thus a bit more unwieldy to use.
In particular, a Searcher
is not itself an iterator. Instead, it provides
advance
routines that permit moving the search along explicitly. It also
provides various routines, like Searcher::into_matches_iter
, that
accept a closure (representing how a regex engine executes a search) and
returns a conventional iterator.
The lifetime parameters come from the Input
type passed to
Searcher::new
:
'h
is the lifetime of the underlying haystack.
§Searcher vs Iterator
Why does a search type with “advance” APIs exist at all when we also have iterators? Unfortunately, the reasoning behind this split is a complex combination of the following things:
- While many of the regex engines expose their own iterators, it is also
nice to expose this lower level iteration helper because it permits callers
to provide their own
Input
configuration. Moreover, aSearcher
can work with any regex engine instead of only the ones defined in this crate. This way, everyone benefits from a shared iteration implementation. - There are many different regex engines that, while they have the same match semantics, they have slightly different APIs. Iteration is just complex enough to want to share code, and so we need a way of abstracting over those different regex engines. While we could define a new trait that describes any regex engine search API, it would wind up looking very close to a closure. While there may still be reasons for the more generic trait to exist, for now and for the purposes of iteration, we use a closure. Closures also provide a lot of easy flexibility at the call site, in that they permit the caller to borrow any kind of state they want for use during each search call.
- As a result of using closures, and because closures are anonymous types
that cannot be named, it is difficult to encapsulate them without both
costs to speed and added complexity to the public API. For example, in
defining an iterator type like
dfa::regex::FindMatches
, if we use a closure internally, it’s not possible to name this type in the return type of the iterator constructor. Thus, the only way around it is to erase the type by boxing it and turning it into aBox<dyn FnMut ...>
. This boxed closure is unlikely to be inlined and it infects the public API in subtle ways. Namely, unless you declare the closure as implementingSend
andSync
, then the resulting iterator type won’t implement it either. But there are practical issues with requiring the closure to implementSend
andSync
that result in other API complexities that are beyond the scope of this already long exposition. - Some regex engines expose more complex match information than just
“which pattern matched” and “at what offsets.” For example, the PikeVM
exposes match spans for each capturing group that participated in the
match. In such cases, it can be quite beneficial to reuse the capturing
group allocation on subsequent searches. A proper iterator doesn’t permit
this API due to its interface, so it’s useful to have something a bit lower
level that permits callers to amortize allocations while also reusing a
shared implementation of iteration. (See the documentation for
Searcher::advance
for an example of using the “advance” API with the PikeVM.)
What this boils down to is that there are “advance” APIs which require handing a closure to it for every call, and there are also APIs to create iterators from a closure. The former are useful for implementing iterators or when you need more flexibility, while the latter are useful for conveniently writing custom iterators on-the-fly.
§Example: iterating with captures
Several regex engines in this crate over convenient iterator APIs over
Captures
values. To do so, this requires allocating a new Captures
value for each iteration step. This can perhaps be more costly than you
might want. Instead of implementing your own iterator to avoid that
cost (which can be a little subtle if you want to handle empty matches
correctly), you can use this Searcher
to do it for you:
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
util::iter::Searcher,
Input, Span,
};
let re = PikeVM::new("foo(?P<numbers>[0-9]+)")?;
let haystack = "foo1 foo12 foo123";
let mut caps = re.create_captures();
let mut cache = re.create_cache();
let mut matches = vec![];
let mut searcher = Searcher::new(Input::new(haystack));
while let Some(_) = searcher.advance(|input| {
re.search(&mut cache, input, &mut caps);
Ok(caps.get_match())
}) {
// The unwrap is OK since 'numbers' matches if the pattern matches.
matches.push(caps.get_group_by_name("numbers").unwrap());
}
assert_eq!(matches, vec![
Span::from(3..4),
Span::from(8..10),
Span::from(14..17),
]);
Fields§
§input: Input<'h>
The input parameters to give to each regex engine call.
The start position of the search is mutated during iteration.
last_match_end: Option<usize>
Records the end offset of the most recent match. This is necessary to handle a corner case for preventing empty matches from overlapping with the ending bounds of a prior match.
Implementations§
source§impl<'h> Searcher<'h>
impl<'h> Searcher<'h>
sourcepub fn new(input: Input<'h>) -> Searcher<'h>
pub fn new(input: Input<'h>) -> Searcher<'h>
Create a new fallible non-overlapping matches iterator.
The given input
provides the parameters (including the haystack),
while the finder
represents a closure that calls the underlying regex
engine. The closure may borrow any additional state that is needed,
such as a prefilter scanner.
sourcepub fn input<'s>(&'s self) -> &'s Input<'h>
pub fn input<'s>(&'s self) -> &'s Input<'h>
Returns the current Input
used by this searcher.
The Input
returned is generally equivalent to the one given to
Searcher::new
, but its start position may be different to reflect
the start of the next search to be executed.
sourcepub fn advance_half<F>(&mut self, finder: F) -> Option<HalfMatch>
pub fn advance_half<F>(&mut self, finder: F) -> Option<HalfMatch>
Return the next half match for an infallible search if one exists, and advance to the next position.
This is like try_advance_half
, except errors are converted into
panics.
§Panics
If the given closure returns an error, then this panics. This is useful when you know your underlying regex engine has been configured to not return an error.
§Example
This example shows how to use a Searcher
to iterate over all matches
when using a DFA, which only provides “half” matches.
use regex_automata::{
hybrid::dfa::DFA,
util::iter::Searcher,
HalfMatch, Input,
};
let re = DFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
let mut cache = re.create_cache();
let input = Input::new("2010-03-14 2016-10-08 2020-10-22");
let mut it = Searcher::new(input);
let expected = Some(HalfMatch::must(0, 10));
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
let expected = Some(HalfMatch::must(0, 21));
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
let expected = Some(HalfMatch::must(0, 32));
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
let expected = None;
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
This correctly moves iteration forward even when an empty match occurs:
use regex_automata::{
hybrid::dfa::DFA,
util::iter::Searcher,
HalfMatch, Input,
};
let re = DFA::new(r"a|")?;
let mut cache = re.create_cache();
let input = Input::new("abba");
let mut it = Searcher::new(input);
let expected = Some(HalfMatch::must(0, 1));
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
let expected = Some(HalfMatch::must(0, 2));
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
let expected = Some(HalfMatch::must(0, 4));
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
let expected = None;
let got = it.advance_half(|input| re.try_search_fwd(&mut cache, input));
assert_eq!(expected, got);
sourcepub fn advance<F>(&mut self, finder: F) -> Option<Match>
pub fn advance<F>(&mut self, finder: F) -> Option<Match>
Return the next match for an infallible search if one exists, and advance to the next position.
The search is advanced even in the presence of empty matches by forbidding empty matches from overlapping with any other match.
This is like try_advance
, except errors are converted into panics.
§Panics
If the given closure returns an error, then this panics. This is useful when you know your underlying regex engine has been configured to not return an error.
§Example
This example shows how to use a Searcher
to iterate over all matches
when using a regex based on lazy DFAs:
use regex_automata::{
hybrid::regex::Regex,
util::iter::Searcher,
Match, Input,
};
let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
let mut cache = re.create_cache();
let input = Input::new("2010-03-14 2016-10-08 2020-10-22");
let mut it = Searcher::new(input);
let expected = Some(Match::must(0, 0..10));
let got = it.advance(|input| re.try_search(&mut cache, input));
assert_eq!(expected, got);
let expected = Some(Match::must(0, 11..21));
let got = it.advance(|input| re.try_search(&mut cache, input));
assert_eq!(expected, got);
let expected = Some(Match::must(0, 22..32));
let got = it.advance(|input| re.try_search(&mut cache, input));
assert_eq!(expected, got);
let expected = None;
let got = it.advance(|input| re.try_search(&mut cache, input));
assert_eq!(expected, got);
This example shows the same as above, but with the PikeVM. This example
is useful because it shows how to use this API even when the regex
engine doesn’t directly return a Match
.
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
util::iter::Searcher,
Match, Input,
};
let re = PikeVM::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new("2010-03-14 2016-10-08 2020-10-22");
let mut it = Searcher::new(input);
let expected = Some(Match::must(0, 0..10));
let got = it.advance(|input| {
re.search(&mut cache, input, &mut caps);
Ok(caps.get_match())
});
// Note that if we wanted to extract capturing group spans, we could
// do that here with 'caps'.
assert_eq!(expected, got);
let expected = Some(Match::must(0, 11..21));
let got = it.advance(|input| {
re.search(&mut cache, input, &mut caps);
Ok(caps.get_match())
});
assert_eq!(expected, got);
let expected = Some(Match::must(0, 22..32));
let got = it.advance(|input| {
re.search(&mut cache, input, &mut caps);
Ok(caps.get_match())
});
assert_eq!(expected, got);
let expected = None;
let got = it.advance(|input| {
re.search(&mut cache, input, &mut caps);
Ok(caps.get_match())
});
assert_eq!(expected, got);
sourcepub fn try_advance_half<F>(
&mut self,
finder: F,
) -> Result<Option<HalfMatch>, MatchError>
pub fn try_advance_half<F>( &mut self, finder: F, ) -> Result<Option<HalfMatch>, MatchError>
Return the next half match for a fallible search if one exists, and advance to the next position.
This is like advance_half
, except it permits callers to handle errors
during iteration.
sourcepub fn try_advance<F>(&mut self, finder: F) -> Result<Option<Match>, MatchError>
pub fn try_advance<F>(&mut self, finder: F) -> Result<Option<Match>, MatchError>
Return the next match for a fallible search if one exists, and advance to the next position.
This is like advance
, except it permits callers to handle errors
during iteration.
sourcepub fn into_half_matches_iter<F>(self, finder: F) -> TryHalfMatchesIter<'h, F> ⓘ
pub fn into_half_matches_iter<F>(self, finder: F) -> TryHalfMatchesIter<'h, F> ⓘ
Given a closure that executes a single search, return an iterator over all successive non-overlapping half matches.
The iterator returned yields result values. If the underlying regex
engine is configured to never return an error, consider calling
TryHalfMatchesIter::infallible
to convert errors into panics.
§Example
This example shows how to use a Searcher
to create a proper
iterator over half matches.
use regex_automata::{
hybrid::dfa::DFA,
util::iter::Searcher,
HalfMatch, Input,
};
let re = DFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
let mut cache = re.create_cache();
let input = Input::new("2010-03-14 2016-10-08 2020-10-22");
let mut it = Searcher::new(input).into_half_matches_iter(|input| {
re.try_search_fwd(&mut cache, input)
});
let expected = Some(Ok(HalfMatch::must(0, 10)));
assert_eq!(expected, it.next());
let expected = Some(Ok(HalfMatch::must(0, 21)));
assert_eq!(expected, it.next());
let expected = Some(Ok(HalfMatch::must(0, 32)));
assert_eq!(expected, it.next());
let expected = None;
assert_eq!(expected, it.next());
sourcepub fn into_matches_iter<F>(self, finder: F) -> TryMatchesIter<'h, F> ⓘ
pub fn into_matches_iter<F>(self, finder: F) -> TryMatchesIter<'h, F> ⓘ
Given a closure that executes a single search, return an iterator over all successive non-overlapping matches.
The iterator returned yields result values. If the underlying regex
engine is configured to never return an error, consider calling
TryMatchesIter::infallible
to convert errors into panics.
§Example
This example shows how to use a Searcher
to create a proper
iterator over matches.
use regex_automata::{
hybrid::regex::Regex,
util::iter::Searcher,
Match, Input,
};
let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;
let mut cache = re.create_cache();
let input = Input::new("2010-03-14 2016-10-08 2020-10-22");
let mut it = Searcher::new(input).into_matches_iter(|input| {
re.try_search(&mut cache, input)
});
let expected = Some(Ok(Match::must(0, 0..10)));
assert_eq!(expected, it.next());
let expected = Some(Ok(Match::must(0, 11..21)));
assert_eq!(expected, it.next());
let expected = Some(Ok(Match::must(0, 22..32)));
assert_eq!(expected, it.next());
let expected = None;
assert_eq!(expected, it.next());
sourcepub fn into_captures_iter<F>(
self,
caps: Captures,
finder: F,
) -> TryCapturesIter<'h, F> ⓘ
pub fn into_captures_iter<F>( self, caps: Captures, finder: F, ) -> TryCapturesIter<'h, F> ⓘ
Given a closure that executes a single search, return an iterator over
all successive non-overlapping Captures
values.
The iterator returned yields result values. If the underlying regex
engine is configured to never return an error, consider calling
TryCapturesIter::infallible
to convert errors into panics.
Unlike the other iterator constructors, this accepts an initial
Captures
value. This Captures
value is reused for each search, and
the iterator implementation clones it before returning it. The caller
must provide this value because the iterator is purposely ignorant
of the underlying regex engine and thus doesn’t know how to create
one itself. More to the point, a Captures
value itself has a few
different constructors, which change which kind of information is
available to query in exchange for search performance.
§Example
This example shows how to use a Searcher
to create a proper iterator
over Captures
values, which provides access to all capturing group
spans for each match.
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
util::iter::Searcher,
Input,
};
let re = PikeVM::new(
r"(?P<y>[0-9]{4})-(?P<m>[0-9]{2})-(?P<d>[0-9]{2})",
)?;
let (mut cache, caps) = (re.create_cache(), re.create_captures());
let haystack = "2010-03-14 2016-10-08 2020-10-22";
let input = Input::new(haystack);
let mut it = Searcher::new(input)
.into_captures_iter(caps, |input, caps| {
re.search(&mut cache, input, caps);
Ok(())
});
let got = it.next().expect("first date")?;
let year = got.get_group_by_name("y").expect("must match");
assert_eq!("2010", &haystack[year]);
let got = it.next().expect("second date")?;
let month = got.get_group_by_name("m").expect("must match");
assert_eq!("10", &haystack[month]);
let got = it.next().expect("third date")?;
let day = got.get_group_by_name("d").expect("must match");
assert_eq!("22", &haystack[day]);
assert!(it.next().is_none());
sourcefn handle_overlapping_empty_half_match<F>(
&mut self,
_: HalfMatch,
finder: F,
) -> Result<Option<HalfMatch>, MatchError>
fn handle_overlapping_empty_half_match<F>( &mut self, _: HalfMatch, finder: F, ) -> Result<Option<HalfMatch>, MatchError>
Handles the special case of a match that begins where the previous match ended. Without this special handling, it’d be possible to get stuck where an empty match never results in forward progress. This also makes it more consistent with how presiding general purpose regex engines work.
sourcefn handle_overlapping_empty_match<F>(
&mut self,
m: Match,
finder: F,
) -> Result<Option<Match>, MatchError>
fn handle_overlapping_empty_match<F>( &mut self, m: Match, finder: F, ) -> Result<Option<Match>, MatchError>
Handles the special case of an empty match by ensuring that 1) the iterator always advances and 2) empty matches never overlap with other matches.
(1) is necessary because we principally make progress by setting the starting location of the next search to the ending location of the last match. But if a match is empty, then this results in a search that does not advance and thus does not terminate.
(2) is not strictly necessary, but makes intuitive sense and matches the presiding behavior of most general purpose regex engines. The “intuitive sense” here is that we want to report NON-overlapping matches. So for example, given the regex ‘a|(?:)’ against the haystack ‘a’, without the special handling, you’d get the matches [0, 1) and [1, 1), where the latter overlaps with the end bounds of the former.
Note that we mark this cold and forcefully prevent inlining because handling empty matches like this is extremely rare and does require quite a bit of code, comparatively. Keeping this code out of the main iterator function keeps it smaller and more amenable to inlining itself.