pub struct PikeVM {
config: Config,
nfa: NFA,
}
Expand description
A virtual machine for executing regex searches with capturing groups.
§Infallible APIs
Unlike most other regex engines in this crate, a PikeVM
never returns an
error at search time. It supports all Anchored
configurations, never
quits and works on haystacks of arbitrary length.
There are two caveats to mention though:
- If an invalid pattern ID is given to a search via
Anchored::Pattern
, then the PikeVM will report “no match.” This is consistent with all other regex engines in this crate. - When using
PikeVM::which_overlapping_matches
with aPatternSet
that has insufficient capacity to store all valid pattern IDs, then if a match occurs for aPatternID
that cannot be inserted, it is silently dropped as if it did not match.
§Advice
The PikeVM
is generally the most “powerful” regex engine in this crate.
“Powerful” in this context means that it can handle any regular expression
that is parseable by regex-syntax
and any size haystack. Regretably,
the PikeVM
is also simultaneously often the slowest regex engine in
practice. This results in an annoying situation where one generally tries
to pick any other regex engine (or perhaps none at all) before being
forced to fall back to a PikeVM
.
For example, a common strategy for dealing with capturing groups is to
actually look for the overall match of the regex using a faster regex
engine, like a lazy DFA. Once the overall
match is found, one can then run the PikeVM
on just the match span to
find the spans of the capturing groups. In this way, the faster regex
engine does the majority of the work, while the PikeVM
only lends its
power in a more limited role.
Unfortunately, this isn’t always possible because the faster regex engines
don’t support all of the regex features in regex-syntax
. This notably
includes (and is currently limited to) Unicode word boundaries. So if
your pattern has Unicode word boundaries, you typically can’t use a
DFA-based regex engine at all (unless you enable heuristic support for
it). (The one-pass
DFA can handle Unicode word boundaries for
anchored searches only, but in a cruel sort of joke, many Unicode features
tend to result in making the regex not one-pass.)
§Example
This example shows that the PikeVM
implements Unicode word boundaries
correctly by default.
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
let re = PikeVM::new(r"\b\w+\b")?;
let mut cache = re.create_cache();
let mut it = re.find_iter(&mut cache, "Шерлок Холмс");
assert_eq!(Some(Match::must(0, 0..12)), it.next());
assert_eq!(Some(Match::must(0, 13..23)), it.next());
assert_eq!(None, it.next());
Fields§
§config: Config
§nfa: NFA
Implementations§
source§impl PikeVM
impl PikeVM
sourcepub fn new(pattern: &str) -> Result<PikeVM, BuildError>
pub fn new(pattern: &str) -> Result<PikeVM, BuildError>
Parse the given regular expression using the default configuration and
return the corresponding PikeVM
.
If you want a non-default configuration, then use the Builder
to
set your own configuration.
§Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
let re = PikeVM::new("foo[0-9]+bar")?;
let mut cache = re.create_cache();
assert_eq!(
Some(Match::must(0, 3..14)),
re.find_iter(&mut cache, "zzzfoo12345barzzz").next(),
);
sourcepub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<PikeVM, BuildError>
pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<PikeVM, BuildError>
Like new
, but parses multiple patterns into a single “multi regex.”
This similarly uses the default regex configuration.
§Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
let re = PikeVM::new_many(&["[a-z]+", "[0-9]+"])?;
let mut cache = re.create_cache();
let mut it = re.find_iter(&mut cache, "abc 1 foo 4567 0 quux");
assert_eq!(Some(Match::must(0, 0..3)), it.next());
assert_eq!(Some(Match::must(1, 4..5)), it.next());
assert_eq!(Some(Match::must(0, 6..9)), it.next());
assert_eq!(Some(Match::must(1, 10..14)), it.next());
assert_eq!(Some(Match::must(1, 15..16)), it.next());
assert_eq!(Some(Match::must(0, 17..21)), it.next());
assert_eq!(None, it.next());
sourcepub fn new_from_nfa(nfa: NFA) -> Result<PikeVM, BuildError>
pub fn new_from_nfa(nfa: NFA) -> Result<PikeVM, BuildError>
Like new
, but builds a PikeVM directly from an NFA. This is useful
if you already have an NFA, or even if you hand-assembled the NFA.
§Example
This shows how to hand assemble a regular expression via its HIR, compile an NFA from it and build a PikeVM from the NFA.
use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};
use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};
let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![
ClassBytesRange::new(b'0', b'9'),
ClassBytesRange::new(b'A', b'Z'),
ClassBytesRange::new(b'_', b'_'),
ClassBytesRange::new(b'a', b'z'),
])));
let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;
let re = PikeVM::new_from_nfa(nfa)?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let expected = Some(Match::must(0, 3..4));
re.captures(&mut cache, "!@#A#@!", &mut caps);
assert_eq!(expected, caps.get_match());
sourcepub fn always_match() -> Result<PikeVM, BuildError>
pub fn always_match() -> Result<PikeVM, BuildError>
Create a new PikeVM
that matches every input.
§Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
let re = PikeVM::always_match()?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..0);
assert_eq!(Some(expected), re.find_iter(&mut cache, "").next());
assert_eq!(Some(expected), re.find_iter(&mut cache, "foo").next());
sourcepub fn never_match() -> Result<PikeVM, BuildError>
pub fn never_match() -> Result<PikeVM, BuildError>
Create a new PikeVM
that never matches any input.
§Example
use regex_automata::nfa::thompson::pikevm::PikeVM;
let re = PikeVM::never_match()?;
let mut cache = re.create_cache();
assert_eq!(None, re.find_iter(&mut cache, "").next());
assert_eq!(None, re.find_iter(&mut cache, "foo").next());
sourcepub fn config() -> Config
pub fn config() -> Config
Return a default configuration for a PikeVM
.
This is a convenience routine to avoid needing to import the Config
type when customizing the construction of a PikeVM
.
§Example
This example shows how to disable UTF-8 mode. When UTF-8 mode is disabled, zero-width matches that split a codepoint are allowed. Otherwise they are never reported.
In the code below, notice that ""
is permitted to match positions
that split the encoding of a codepoint.
use regex_automata::{nfa::thompson::{self, pikevm::PikeVM}, Match};
let re = PikeVM::builder()
.thompson(thompson::Config::new().utf8(false))
.build(r"")?;
let mut cache = re.create_cache();
let haystack = "a☃z";
let mut it = re.find_iter(&mut cache, haystack);
assert_eq!(Some(Match::must(0, 0..0)), it.next());
assert_eq!(Some(Match::must(0, 1..1)), it.next());
assert_eq!(Some(Match::must(0, 2..2)), it.next());
assert_eq!(Some(Match::must(0, 3..3)), it.next());
assert_eq!(Some(Match::must(0, 4..4)), it.next());
assert_eq!(Some(Match::must(0, 5..5)), it.next());
assert_eq!(None, it.next());
sourcepub fn builder() -> Builder
pub fn builder() -> Builder
Return a builder for configuring the construction of a PikeVM
.
This is a convenience routine to avoid needing to import the
Builder
type in common cases.
§Example
This example shows how to use the builder to disable UTF-8 mode everywhere.
use regex_automata::{
nfa::thompson::{self, pikevm::PikeVM},
util::syntax,
Match,
};
let re = PikeVM::builder()
.syntax(syntax::Config::new().utf8(false))
.thompson(thompson::Config::new().utf8(false))
.build(r"foo(?-u:[^b])ar.*")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = b"\xFEfoo\xFFarzz\xE2\x98\xFF\n";
let expected = Some(Match::must(0, 1..9));
re.captures(&mut cache, haystack, &mut caps);
assert_eq!(expected, caps.get_match());
sourcepub fn create_captures(&self) -> Captures
pub fn create_captures(&self) -> Captures
Create a new empty set of capturing groups that is guaranteed to be
valid for the search APIs on this PikeVM
.
A Captures
value created for a specific PikeVM
cannot be used with
any other PikeVM
.
This is a convenience function for Captures::all
. See the
Captures
documentation for an explanation of its alternative
constructors that permit the PikeVM
to do less work during a search,
and thus might make it faster.
sourcepub fn create_cache(&self) -> Cache
pub fn create_cache(&self) -> Cache
Create a new cache for this PikeVM
.
The cache returned should only be used for searches for this
PikeVM
. If you want to reuse the cache for another PikeVM
, then
you must call Cache::reset
with that PikeVM
(or, equivalently,
PikeVM::reset_cache
).
sourcepub fn reset_cache(&self, cache: &mut Cache)
pub fn reset_cache(&self, cache: &mut Cache)
Reset the given cache such that it can be used for searching with the
this PikeVM
(and only this PikeVM
).
A cache reset permits reusing memory already allocated in this cache
with a different PikeVM
.
§Example
This shows how to re-purpose a cache for use with a different PikeVM
.
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
let re1 = PikeVM::new(r"\w")?;
let re2 = PikeVM::new(r"\W")?;
let mut cache = re1.create_cache();
assert_eq!(
Some(Match::must(0, 0..2)),
re1.find_iter(&mut cache, "Δ").next(),
);
// Using 'cache' with re2 is not allowed. It may result in panics or
// incorrect results. In order to re-purpose the cache, we must reset
// it with the PikeVM we'd like to use it with.
//
// Similarly, after this reset, using the cache with 're1' is also not
// allowed.
re2.reset_cache(&mut cache);
assert_eq!(
Some(Match::must(0, 0..3)),
re2.find_iter(&mut cache, "☃").next(),
);
sourcepub fn pattern_len(&self) -> usize
pub fn pattern_len(&self) -> usize
Returns the total number of patterns compiled into this PikeVM
.
In the case of a PikeVM
that contains no patterns, this returns 0
.
§Example
This example shows the pattern length for a PikeVM
that never
matches:
use regex_automata::nfa::thompson::pikevm::PikeVM;
let re = PikeVM::never_match()?;
assert_eq!(re.pattern_len(), 0);
And another example for a PikeVM
that matches at every position:
use regex_automata::nfa::thompson::pikevm::PikeVM;
let re = PikeVM::always_match()?;
assert_eq!(re.pattern_len(), 1);
And finally, a PikeVM
that was constructed from multiple patterns:
use regex_automata::nfa::thompson::pikevm::PikeVM;
let re = PikeVM::new_many(&["[0-9]+", "[a-z]+", "[A-Z]+"])?;
assert_eq!(re.pattern_len(), 3);
sourcepub fn get_config(&self) -> &Config
pub fn get_config(&self) -> &Config
Return the config for this PikeVM
.
source§impl PikeVM
impl PikeVM
sourcepub fn is_match<'h, I: Into<Input<'h>>>(
&self,
cache: &mut Cache,
input: I,
) -> bool
pub fn is_match<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, ) -> bool
Returns true if and only if this PikeVM
matches the given haystack.
This routine may short circuit if it knows that scanning future
input will never lead to a different result. In particular, if the
underlying NFA enters a match state, then this routine will return
true
immediately without inspecting any future input. (Consider how
this might make a difference given the regex a+
on the haystack
aaaaaaaaaaaaaaa
. This routine can stop after it sees the first a
,
but routines like find
need to continue searching because +
is
greedy by default.)
§Example
This shows basic usage:
use regex_automata::nfa::thompson::pikevm::PikeVM;
let re = PikeVM::new("foo[0-9]+bar")?;
let mut cache = re.create_cache();
assert!(re.is_match(&mut cache, "foo12345bar"));
assert!(!re.is_match(&mut cache, "foobar"));
§Example: consistency with search APIs
is_match
is guaranteed to return true
whenever find
returns a
match. This includes searches that are executed entirely within a
codepoint:
use regex_automata::{nfa::thompson::pikevm::PikeVM, Input};
let re = PikeVM::new("a*")?;
let mut cache = re.create_cache();
assert!(!re.is_match(&mut cache, Input::new("☃").span(1..2)));
Notice that when UTF-8 mode is disabled, then the above reports a match because the restriction against zero-width matches that split a codepoint has been lifted:
use regex_automata::{nfa::thompson::{pikevm::PikeVM, NFA}, Input};
let re = PikeVM::builder()
.thompson(NFA::config().utf8(false))
.build("a*")?;
let mut cache = re.create_cache();
assert!(re.is_match(&mut cache, Input::new("☃").span(1..2)));
sourcepub fn find<'h, I: Into<Input<'h>>>(
&self,
cache: &mut Cache,
input: I,
) -> Option<Match>
pub fn find<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, ) -> Option<Match>
Executes a leftmost forward search and returns a Match
if one exists.
This routine only includes the overall match span. To get access to the
individual spans of each capturing group, use PikeVM::captures
.
§Example
Leftmost first match semantics corresponds to the match with the
smallest starting offset, but where the end offset is determined by
preferring earlier branches in the original regular expression. For
example, Sam|Samwise
will match Sam
in Samwise
, but Samwise|Sam
will match Samwise
in Samwise
.
Generally speaking, the “leftmost first” match is how most backtracking
regular expressions tend to work. This is in contrast to POSIX-style
regular expressions that yield “leftmost longest” matches. Namely,
both Sam|Samwise
and Samwise|Sam
match Samwise
when using
leftmost longest semantics. (This crate does not currently support
leftmost longest semantics.)
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
let re = PikeVM::new("foo[0-9]+")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..8);
assert_eq!(Some(expected), re.find(&mut cache, "foo12345"));
// Even though a match is found after reading the first byte (`a`),
// the leftmost first match semantics demand that we find the earliest
// match that prefers earlier parts of the pattern over later parts.
let re = PikeVM::new("abc|a")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..3);
assert_eq!(Some(expected), re.find(&mut cache, "abc"));
sourcepub fn captures<'h, I: Into<Input<'h>>>(
&self,
cache: &mut Cache,
input: I,
caps: &mut Captures,
)
pub fn captures<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, caps: &mut Captures, )
Executes a leftmost forward search and writes the spans of capturing
groups that participated in a match into the provided Captures
value. If no match was found, then Captures::is_match
is guaranteed
to return false
.
§Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};
let re = PikeVM::new(r"^([0-9]{4})-([0-9]{2})-([0-9]{2})$")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
re.captures(&mut cache, "2010-03-14", &mut caps);
assert!(caps.is_match());
assert_eq!(Some(Span::from(0..4)), caps.get_group(1));
assert_eq!(Some(Span::from(5..7)), caps.get_group(2));
assert_eq!(Some(Span::from(8..10)), caps.get_group(3));
sourcepub fn find_iter<'r, 'c, 'h, I: Into<Input<'h>>>(
&'r self,
cache: &'c mut Cache,
input: I,
) -> FindMatches<'r, 'c, 'h> ⓘ
pub fn find_iter<'r, 'c, 'h, I: Into<Input<'h>>>( &'r self, cache: &'c mut Cache, input: I, ) -> FindMatches<'r, 'c, 'h> ⓘ
Returns an iterator over all non-overlapping leftmost matches in the given bytes. If no match exists, then the iterator yields no elements.
§Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
let re = PikeVM::new("foo[0-9]+")?;
let mut cache = re.create_cache();
let text = "foo1 foo12 foo123";
let matches: Vec<Match> = re.find_iter(&mut cache, text).collect();
assert_eq!(matches, vec![
Match::must(0, 0..4),
Match::must(0, 5..10),
Match::must(0, 11..17),
]);
sourcepub fn captures_iter<'r, 'c, 'h, I: Into<Input<'h>>>(
&'r self,
cache: &'c mut Cache,
input: I,
) -> CapturesMatches<'r, 'c, 'h> ⓘ
pub fn captures_iter<'r, 'c, 'h, I: Into<Input<'h>>>( &'r self, cache: &'c mut Cache, input: I, ) -> CapturesMatches<'r, 'c, 'h> ⓘ
Returns an iterator over all non-overlapping Captures
values. If no
match exists, then the iterator yields no elements.
This yields the same matches as PikeVM::find_iter
, but it includes
the spans of all capturing groups that participate in each match.
Tip: See util::iter::Searcher
for
how to correctly iterate over all matches in a haystack while avoiding
the creation of a new Captures
value for every match. (Which you are
forced to do with an Iterator
.)
§Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};
let re = PikeVM::new("foo(?P<numbers>[0-9]+)")?;
let mut cache = re.create_cache();
let text = "foo1 foo12 foo123";
let matches: Vec<Span> = re
.captures_iter(&mut cache, text)
// The unwrap is OK since 'numbers' matches if the pattern matches.
.map(|caps| caps.get_group_by_name("numbers").unwrap())
.collect();
assert_eq!(matches, vec![
Span::from(3..4),
Span::from(8..10),
Span::from(14..17),
]);
source§impl PikeVM
impl PikeVM
sourcepub fn search(&self, cache: &mut Cache, input: &Input<'_>, caps: &mut Captures)
pub fn search(&self, cache: &mut Cache, input: &Input<'_>, caps: &mut Captures)
Executes a leftmost forward search and writes the spans of capturing
groups that participated in a match into the provided Captures
value. If no match was found, then Captures::is_match
is guaranteed
to return false
.
This is like PikeVM::captures
, but it accepts a concrete &Input
instead of an Into<Input>
.
§Example: specific pattern search
This example shows how to build a multi-PikeVM that permits searching for specific patterns.
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
Anchored, Match, PatternID, Input,
};
let re = PikeVM::new_many(&["[a-z0-9]{6}", "[a-z][a-z0-9]{5}"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "foo123";
// Since we are using the default leftmost-first match and both
// patterns match at the same starting position, only the first pattern
// will be returned in this case when doing a search for any of the
// patterns.
let expected = Some(Match::must(0, 0..6));
re.search(&mut cache, &Input::new(haystack), &mut caps);
assert_eq!(expected, caps.get_match());
// But if we want to check whether some other pattern matches, then we
// can provide its pattern ID.
let expected = Some(Match::must(1, 0..6));
let input = Input::new(haystack)
.anchored(Anchored::Pattern(PatternID::must(1)));
re.search(&mut cache, &input, &mut caps);
assert_eq!(expected, caps.get_match());
§Example: specifying the bounds of a search
This example shows how providing the bounds of a search can produce different results than simply sub-slicing the haystack.
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match, Input};
let re = PikeVM::new(r"\b[0-9]{3}\b")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "foo123bar";
// Since we sub-slice the haystack, the search doesn't know about
// the larger context and assumes that `123` is surrounded by word
// boundaries. And of course, the match position is reported relative
// to the sub-slice as well, which means we get `0..3` instead of
// `3..6`.
let expected = Some(Match::must(0, 0..3));
re.search(&mut cache, &Input::new(&haystack[3..6]), &mut caps);
assert_eq!(expected, caps.get_match());
// But if we provide the bounds of the search within the context of the
// entire haystack, then the search can take the surrounding context
// into account. (And if we did find a match, it would be reported
// as a valid offset into `haystack` instead of its sub-slice.)
let expected = None;
let input = Input::new(haystack).range(3..6);
re.search(&mut cache, &input, &mut caps);
assert_eq!(expected, caps.get_match());
sourcepub fn search_slots(
&self,
cache: &mut Cache,
input: &Input<'_>,
slots: &mut [Option<NonMaxUsize>],
) -> Option<PatternID>
pub fn search_slots( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Option<PatternID>
Executes a leftmost forward search and writes the spans of capturing
groups that participated in a match into the provided slots
, and
returns the matching pattern ID. The contents of the slots for patterns
other than the matching pattern are unspecified. If no match was found,
then None
is returned and the contents of slots
is unspecified.
This is like PikeVM::search
, but it accepts a raw slots slice
instead of a Captures
value. This is useful in contexts where you
don’t want or need to allocate a Captures
.
It is legal to pass any number of slots to this routine. If the regex engine would otherwise write a slot offset that doesn’t fit in the provided slice, then it is simply skipped. In general though, there are usually three slice lengths you might want to use:
- An empty slice, if you only care about which pattern matched.
- A slice with
pattern_len() * 2
slots, if you only care about the overall match spans for each matching pattern. - A slice with
slot_len()
slots, which permits recording match offsets for every capturing group in every pattern.
§Example
This example shows how to find the overall match offsets in a
multi-pattern search without allocating a Captures
value. Indeed, we
can put our slots right on the stack.
use regex_automata::{nfa::thompson::pikevm::PikeVM, PatternID, Input};
let re = PikeVM::new_many(&[
r"\pL+",
r"\d+",
])?;
let mut cache = re.create_cache();
let input = Input::new("!@#123");
// We only care about the overall match offsets here, so we just
// allocate two slots for each pattern. Each slot records the start
// and end of the match.
let mut slots = [None; 4];
let pid = re.search_slots(&mut cache, &input, &mut slots);
assert_eq!(Some(PatternID::must(1)), pid);
// The overall match offsets are always at 'pid * 2' and 'pid * 2 + 1'.
// See 'GroupInfo' for more details on the mapping between groups and
// slot indices.
let slot_start = pid.unwrap().as_usize() * 2;
let slot_end = slot_start + 1;
assert_eq!(Some(3), slots[slot_start].map(|s| s.get()));
assert_eq!(Some(6), slots[slot_end].map(|s| s.get()));
sourcefn search_slots_imp(
&self,
cache: &mut Cache,
input: &Input<'_>,
slots: &mut [Option<NonMaxUsize>],
) -> Option<HalfMatch>
fn search_slots_imp( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Option<HalfMatch>
This is the actual implementation of search_slots_imp
that
doesn’t account for the special case when 1) the NFA has UTF-8 mode
enabled, 2) the NFA can match the empty string and 3) the caller has
provided an insufficient number of slots to record match offsets.
sourcepub fn which_overlapping_matches(
&self,
cache: &mut Cache,
input: &Input<'_>,
patset: &mut PatternSet,
)
pub fn which_overlapping_matches( &self, cache: &mut Cache, input: &Input<'_>, patset: &mut PatternSet, )
Writes the set of patterns that match anywhere in the given search
configuration to patset
. If multiple patterns match at the same
position and this PikeVM
was configured with MatchKind::All
semantics, then all matching patterns are written to the given set.
Unless all of the patterns in this PikeVM
are anchored, then
generally speaking, this will visit every byte in the haystack.
This search routine does not clear the pattern set. This gives some flexibility to the caller (e.g., running multiple searches with the same pattern set), but does make the API bug-prone if you’re reusing the same pattern set for multiple searches but intended them to be independent.
If a pattern ID matched but the given PatternSet
does not have
sufficient capacity to store it, then it is not inserted and silently
dropped.
§Example
This example shows how to find all matching patterns in a haystack, even when some patterns match at the same position as other patterns.
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
Input, MatchKind, PatternSet,
};
let patterns = &[
r"\w+", r"\d+", r"\pL+", r"foo", r"bar", r"barfoo", r"foobar",
];
let re = PikeVM::builder()
.configure(PikeVM::config().match_kind(MatchKind::All))
.build_many(patterns)?;
let mut cache = re.create_cache();
let input = Input::new("foobar");
let mut patset = PatternSet::new(re.pattern_len());
re.which_overlapping_matches(&mut cache, &input, &mut patset);
let expected = vec![0, 2, 3, 4, 6];
let got: Vec<usize> = patset.iter().map(|p| p.as_usize()).collect();
assert_eq!(expected, got);
source§impl PikeVM
impl PikeVM
sourcefn search_imp(
&self,
cache: &mut Cache,
input: &Input<'_>,
slots: &mut [Option<NonMaxUsize>],
) -> Option<HalfMatch>
fn search_imp( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Option<HalfMatch>
The implementation of standard leftmost search.
Capturing group spans are written to slots
, but only if requested.
slots
can be any length. Any slot in the NFA that is activated but
which is out of bounds for the given slots
is ignored.
sourcefn which_overlapping_imp(
&self,
cache: &mut Cache,
input: &Input<'_>,
patset: &mut PatternSet,
)
fn which_overlapping_imp( &self, cache: &mut Cache, input: &Input<'_>, patset: &mut PatternSet, )
The implementation for the ‘which_overlapping_matches’ API. Basically, we do a single scan through the entire haystack (unless our regex or search is anchored) and record every pattern that matched. In particular, when MatchKind::All is used, this supports overlapping matches. So if we have the regexes ‘sam’ and ‘samwise’, they will both be reported in the pattern set when searching the haystack ‘samwise’.
sourcefn nexts(
&self,
stack: &mut Vec<FollowEpsilon>,
curr: &mut ActiveStates,
next: &mut ActiveStates,
input: &Input<'_>,
at: usize,
slots: &mut [Option<NonMaxUsize>],
) -> Option<PatternID>
fn nexts( &self, stack: &mut Vec<FollowEpsilon>, curr: &mut ActiveStates, next: &mut ActiveStates, input: &Input<'_>, at: usize, slots: &mut [Option<NonMaxUsize>], ) -> Option<PatternID>
Process the active states in ‘curr’ to find the states (written to ‘next’) we should process for the next byte in the haystack.
‘stack’ is used to perform a depth first traversal of the NFA when computing an epsilon closure.
When a match is found, the slots for that match state (in ‘curr’) are copied to ‘caps’. Moreover, once a match is seen, processing for ‘curr’ stops (unless the PikeVM was configured with MatchKind::All semantics).
sourcefn nexts_overlapping(
&self,
stack: &mut Vec<FollowEpsilon>,
curr: &mut ActiveStates,
next: &mut ActiveStates,
input: &Input<'_>,
at: usize,
patset: &mut PatternSet,
)
fn nexts_overlapping( &self, stack: &mut Vec<FollowEpsilon>, curr: &mut ActiveStates, next: &mut ActiveStates, input: &Input<'_>, at: usize, patset: &mut PatternSet, )
Like ‘nexts’, but for the overlapping case. This doesn’t write any slots, and instead just writes which pattern matched in ‘patset’.
sourcefn next(
&self,
stack: &mut Vec<FollowEpsilon>,
curr_slot_table: &mut SlotTable,
next: &mut ActiveStates,
input: &Input<'_>,
at: usize,
sid: StateID,
) -> Option<PatternID>
fn next( &self, stack: &mut Vec<FollowEpsilon>, curr_slot_table: &mut SlotTable, next: &mut ActiveStates, input: &Input<'_>, at: usize, sid: StateID, ) -> Option<PatternID>
Starting from ‘sid’, if the position ‘at’ in the ‘input’ haystack has a transition defined out of ‘sid’, then add the state transitioned to and its epsilon closure to the ‘next’ set of states to explore.
‘stack’ is used by the epsilon closure computation to perform a depth first traversal of the NFA.
‘curr_slot_table’ should be the table of slots for the current set of states being explored. If there is a transition out of ‘sid’, then sid’s row in the slot table is used to perform the epsilon closure.
sourcefn epsilon_closure(
&self,
stack: &mut Vec<FollowEpsilon>,
curr_slots: &mut [Option<NonMaxUsize>],
next: &mut ActiveStates,
input: &Input<'_>,
at: usize,
sid: StateID,
)
fn epsilon_closure( &self, stack: &mut Vec<FollowEpsilon>, curr_slots: &mut [Option<NonMaxUsize>], next: &mut ActiveStates, input: &Input<'_>, at: usize, sid: StateID, )
Compute the epsilon closure of ‘sid’, writing the closure into ‘next’ while copying slot values from ‘curr_slots’ into corresponding states in ‘next’. ‘curr_slots’ should be the slot values corresponding to ‘sid’.
The given ‘stack’ is used to perform a depth first traversal of the NFA by recursively following all epsilon transitions out of ‘sid’. Conditional epsilon transitions are followed if and only if they are satisfied for the position ‘at’ in the ‘input’ haystack.
While this routine may write to ‘curr_slots’, once it returns, any writes are undone and the original values (even if absent) are restored.
sourcefn epsilon_closure_explore(
&self,
stack: &mut Vec<FollowEpsilon>,
curr_slots: &mut [Option<NonMaxUsize>],
next: &mut ActiveStates,
input: &Input<'_>,
at: usize,
sid: StateID,
)
fn epsilon_closure_explore( &self, stack: &mut Vec<FollowEpsilon>, curr_slots: &mut [Option<NonMaxUsize>], next: &mut ActiveStates, input: &Input<'_>, at: usize, sid: StateID, )
Explore all of the epsilon transitions out of ‘sid’. This is mostly split out from ‘epsilon_closure’ in order to clearly delineate the actual work of computing an epsilon closure from the stack book-keeping.
This will push any additional explorations needed on to ‘stack’.
‘curr_slots’ should refer to the slots for the currently active NFA state. That is, the current state we are stepping through. These slots are mutated in place as new ‘Captures’ states are traversed during epsilon closure, but the slots are restored to their original values once the full epsilon closure is completed. The ultimate use of ‘curr_slots’ is to copy them to the corresponding ‘next_slots’, so that the capturing group spans are forwarded from the currently active state to the next.
‘next’ refers to the next set of active states. Computing an epsilon closure may increase the next set of active states.
‘input’ refers to the caller’s input configuration and ‘at’ refers to the current position in the haystack. These are used to check whether conditional epsilon transitions (like look-around) are satisfied at the current position. If they aren’t, then the epsilon closure won’t include them.
sourcefn start_config(&self, input: &Input<'_>) -> Option<(bool, StateID)>
fn start_config(&self, input: &Input<'_>) -> Option<(bool, StateID)>
Return the starting configuration of a PikeVM search.
The “start config” is basically whether the search should be anchored or not and the NFA state ID at which to begin the search. The state ID returned always corresponds to an anchored starting state even when the search is unanchored. This is because the PikeVM search loop deals with unanchored searches with an explicit epsilon closure out of the start state.
This routine accounts for both the caller’s Input
configuration
and the pattern itself. For example, even if the caller asks for an
unanchored search, if the pattern itself is anchored, then this will
always return ‘true’ because implementing an unanchored search in that
case would be incorrect.
Similarly, if the caller requests an anchored search for a particular pattern, then the starting state ID returned will reflect that.
If a pattern ID is given in the input configuration that is not in
this regex, then None
is returned.