Struct regex_automata::dfa::onepass::DFA
source · pub struct DFA {
config: Config,
nfa: NFA,
table: Vec<Transition>,
starts: Vec<StateID>,
min_match_id: StateID,
classes: ByteClasses,
alphabet_len: usize,
stride2: usize,
pateps_offset: usize,
explicit_slot_start: usize,
}
Expand description
A one-pass DFA for executing a subset of anchored regex searches while resolving capturing groups.
A one-pass DFA can be built from an NFA that is one-pass. An NFA is
one-pass when there is never any ambiguity about how to continue a search.
For example, a*a
is not one-pass becuase during a search, it’s not
possible to know whether to continue matching the a*
or to move on to
the single a
. However, a*b
is one-pass, because for every byte in the
input, it’s always clear when to move on from a*
to b
.
§Only anchored searches are supported
In this crate, especially for DFAs, unanchored searches are implemented by
treating the pattern as if it had a (?s-u:.)*?
prefix. While the prefix
is one-pass on its own, adding anything after it, e.g., (?s-u:.)*?a
will
make the overall pattern not one-pass. Why? Because the (?s-u:.)
matches
any byte, and there is therefore ambiguity as to when the prefix should
stop matching and something else should start matching.
Therefore, one-pass DFAs do not support unanchored searches. In addition
to many regexes simply not being one-pass, it implies that one-pass DFAs
have limited utility. With that said, when a one-pass DFA can be used, it
can potentially provide a dramatic speed up over alternatives like the
BoundedBacktracker
and the PikeVM
. In particular,
a one-pass DFA is the only DFA capable of reporting the spans of matching
capturing groups.
To clarify, when we say that unanchored searches are not supported, what that actually means is:
- The high level routines,
DFA::is_match
andDFA::captures
, always do anchored searches. - Since iterators are most useful in the context of unanchored searches,
there is no
DFA::captures_iter
method. - For lower level routines like
DFA::try_search
, an error will be returned if the givenInput
is configured to do an unanchored search or search for an invalid pattern ID. (Note that anInput
is configured to do an unanchored search by default, so just giving aInput::new
is guaranteed to return an error.)
§Other limitations
In addition to the configurable heap limit and the requirement that a regex pattern be one-pass, there are some other limitations:
- There is an internal limit on the total number of explicit capturing groups that appear across all patterns. It is somewhat small and there is no way to configure it. If your pattern(s) exceed this limit, then building a one-pass DFA will fail.
- If the number of patterns exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you’re unlikely to hit it.
- If the total number of states exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you’re unlikely to hit it.
§Other examples of regexes that aren’t one-pass
One particularly unfortunate example is that enabling Unicode can cause
regexes that were one-pass to no longer be one-pass. Consider the regex
(?-u)\w*\s
for example. It is one-pass because there is exactly no
overlap between the ASCII definitions of \w
and \s
. But \w*\s
(i.e., with Unicode enabled) is not one-pass because \w
and \s
get
translated to UTF-8 automatons. And while the codepoints in \w
and \s
do not overlap, the underlying UTF-8 encodings do. Indeed, because of the
overlap between UTF-8 automata, the use of Unicode character classes will
tend to vastly increase the likelihood of a regex not being one-pass.
§How does one know if a regex is one-pass or not?
At the time of writing, the only way to know is to try and build a one-pass DFA. The one-pass property is checked while constructing the DFA.
This does mean that you might potentially waste some CPU cycles and memory by optimistically trying to build a one-pass DFA. But this is currently the only way. In the future, building a one-pass DFA might be able to use some heuristics to detect common violations of the one-pass property and bail more quickly.
§Resource usage
Unlike a general DFA, a one-pass DFA has stricter bounds on its resource
usage. Namely, construction of a one-pass DFA has a time and space
complexity of O(n)
, where n ~ nfa.states().len()
. (A general DFA’s time
and space complexity is O(2^n)
.) This smaller time bound is achieved
because there is at most one DFA state created for each NFA state. If
additional DFA states would be required, then the pattern is not one-pass
and construction will fail.
Note though that currently, this DFA uses a fully dense representation. This means that while its space complexity is no worse than an NFA, it may in practice use more memory because of higher constant factors. The reason for this trade off is two-fold. Firstly, a dense representation makes the search faster. Secondly, the bigger an NFA, the more unlikely it is to be one-pass. Therefore, most one-pass DFAs are usually pretty small.
§Example
This example shows that the one-pass DFA implements Unicode word boundaries correctly while simultaneously reporting spans for capturing groups that participate in a match. (This is the only DFA that implements full support for Unicode word boundaries.)
use regex_automata::{dfa::onepass::DFA, Match, Span};
let re = DFA::new(r"\b(?P<first>\w+)[[:space:]]+(?P<last>\w+)\b")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
re.captures(&mut cache, "Шерлок Холмс", &mut caps);
assert_eq!(Some(Match::must(0, 0..23)), caps.get_match());
assert_eq!(Some(Span::from(0..12)), caps.get_group_by_name("first"));
assert_eq!(Some(Span::from(13..23)), caps.get_group_by_name("last"));
§Example: iteration
Unlike other regex engines in this crate, this one does not provide iterator search functions. This is because a one-pass DFA only supports anchored searches, and so iterator functions are generally not applicable.
However, if you know that all of your matches are
directly adjacent, then an iterator can be used. The
util::iter::Searcher
type can be used for
this purpose:
use regex_automata::{
dfa::onepass::DFA,
util::iter::Searcher,
Anchored, Input, Span,
};
let re = DFA::new(r"\w(\d)\w")?;
let (mut cache, caps) = (re.create_cache(), re.create_captures());
let input = Input::new("a1zb2yc3x").anchored(Anchored::Yes);
let mut it = Searcher::new(input).into_captures_iter(caps, |input, caps| {
Ok(re.try_search(&mut cache, input, caps)?)
}).infallible();
let caps0 = it.next().unwrap();
assert_eq!(Some(Span::from(1..2)), caps0.get_group(1));
Fields§
§config: Config
The configuration provided by the caller.
nfa: NFA
The NFA used to build this DFA.
NOTE: We probably don’t need to store the NFA here, but we use enough bits from it that it’s convenient to do so. And there really isn’t much cost to doing so either, since an NFA is reference counted internally.
table: Vec<Transition>
The transition table. Given a state ID ‘s’ and a byte of haystack ‘b’,
the next state is table[sid + classes[byte]]
.
The stride of this table (i.e., the number of columns) is always a power of 2, even if the alphabet length is smaller. This makes converting between state IDs and state indices very cheap.
Note that the stride always includes room for one extra “transition” that isn’t actually a transition. It is a ‘PatternEpsilons’ that is used for match states only. Because of this, the maximum number of active columns in the transition table is 257, which means the maximum stride is 512 (the next power of 2 greater than or equal to 257).
starts: Vec<StateID>
The DFA state IDs of the starting states.
starts[0]
is always present and corresponds to the starting state
when searching for matches of any pattern in the DFA.
starts[i]
where i>0 corresponds to the starting state for the pattern
ID ‘i-1’. These starting states are optional.
min_match_id: StateID
Every state ID >= this value corresponds to a match state.
This is what a search uses to detect whether a state is a match state or not. It requires only a simple comparison instead of bit-unpacking the PatternEpsilons from every state.
classes: ByteClasses
The alphabet of this DFA, split into equivalence classes. Bytes in the same equivalence class can never discriminate between a match and a non-match.
alphabet_len: usize
The number of elements in each state in the transition table. This may be less than the stride, since the stride is always a power of 2 and the alphabet length can be anything up to and including 256.
stride2: usize
The number of columns in the transition table, expressed as a power of 2.
pateps_offset: usize
The offset at which the PatternEpsilons for a match state is stored in the transition table.
PERF: One wonders whether it would be better to put this in a separate allocation, since only match states have a non-empty PatternEpsilons and the number of match states tends be dwarfed by the number of non-match states. So this would save ‘8*len(non_match_states)’ for each DFA. The question is whether moving this to a different allocation will lead to a perf hit during searches. You might think dealing with match states is rare, but some regexes spend a lot of time in match states gobbling up input. But… match state handling is already somewhat expensive, so maybe this wouldn’t do much? Either way, it’s worth experimenting.
explicit_slot_start: usize
The first explicit slot index. This refers to the first slot appearing immediately after the last implicit slot. It is always ’patterns.len()
- 2’.
We record this because we only store the explicit slots in our DFA transition table that need to be saved. Implicit slots are handled automatically as part of the search.
Implementations§
source§impl DFA
impl DFA
sourcepub fn new(pattern: &str) -> Result<DFA, BuildError>
pub fn new(pattern: &str) -> Result<DFA, BuildError>
Parse the given regular expression using the default configuration and return the corresponding one-pass DFA.
If you want a non-default configuration, then use the Builder
to
set your own configuration.
§Example
use regex_automata::{dfa::onepass::DFA, Match};
let re = DFA::new("foo[0-9]+bar")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
re.captures(&mut cache, "foo12345barzzz", &mut caps);
assert_eq!(Some(Match::must(0, 0..11)), caps.get_match());
sourcepub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA, BuildError>
pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA, BuildError>
Like new
, but parses multiple patterns into a single “multi regex.”
This similarly uses the default regex configuration.
§Example
use regex_automata::{dfa::onepass::DFA, Match};
let re = DFA::new_many(&["[a-z]+", "[0-9]+"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
re.captures(&mut cache, "abc123", &mut caps);
assert_eq!(Some(Match::must(0, 0..3)), caps.get_match());
re.captures(&mut cache, "123abc", &mut caps);
assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());
sourcepub fn new_from_nfa(nfa: NFA) -> Result<DFA, BuildError>
pub fn new_from_nfa(nfa: NFA) -> Result<DFA, BuildError>
Like new
, but builds a one-pass DFA directly from an NFA. This is
useful if you already have an NFA, or even if you hand-assembled the
NFA.
§Example
This shows how to hand assemble a regular expression via its HIR, compile an NFA from it and build a one-pass DFA from the NFA.
use regex_automata::{
dfa::onepass::DFA,
nfa::thompson::NFA,
Match,
};
use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};
let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![
ClassBytesRange::new(b'0', b'9'),
ClassBytesRange::new(b'A', b'Z'),
ClassBytesRange::new(b'_', b'_'),
ClassBytesRange::new(b'a', b'z'),
])));
let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;
let re = DFA::new_from_nfa(nfa)?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let expected = Some(Match::must(0, 0..1));
re.captures(&mut cache, "A", &mut caps);
assert_eq!(expected, caps.get_match());
sourcepub fn always_match() -> Result<DFA, BuildError>
pub fn always_match() -> Result<DFA, BuildError>
Create a new one-pass DFA that matches every input.
§Example
use regex_automata::{dfa::onepass::DFA, Match};
let dfa = DFA::always_match()?;
let mut cache = dfa.create_cache();
let mut caps = dfa.create_captures();
let expected = Match::must(0, 0..0);
dfa.captures(&mut cache, "", &mut caps);
assert_eq!(Some(expected), caps.get_match());
dfa.captures(&mut cache, "foo", &mut caps);
assert_eq!(Some(expected), caps.get_match());
sourcepub fn never_match() -> Result<DFA, BuildError>
pub fn never_match() -> Result<DFA, BuildError>
Create a new one-pass DFA that never matches any input.
§Example
use regex_automata::dfa::onepass::DFA;
let dfa = DFA::never_match()?;
let mut cache = dfa.create_cache();
let mut caps = dfa.create_captures();
dfa.captures(&mut cache, "", &mut caps);
assert_eq!(None, caps.get_match());
dfa.captures(&mut cache, "foo", &mut caps);
assert_eq!(None, caps.get_match());
sourcepub fn config() -> Config
pub fn config() -> Config
Return a default configuration for a DFA.
This is a convenience routine to avoid needing to import the Config
type when customizing the construction of a DFA.
§Example
This example shows how to change the match semantics of this DFA from its default “leftmost first” to “all.” When using “all,” non-greediness doesn’t apply and neither does preference order matching. Instead, the longest match possible is always returned. (Although, by construction, it’s impossible for a one-pass DFA to have a different answer for “preference order” vs “longest match.”)
use regex_automata::{dfa::onepass::DFA, Match, MatchKind};
let re = DFA::builder()
.configure(DFA::config().match_kind(MatchKind::All))
.build(r"(abc)+?")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
re.captures(&mut cache, "abcabc", &mut caps);
// Normally, the non-greedy repetition would give us a 0..3 match.
assert_eq!(Some(Match::must(0, 0..6)), caps.get_match());
sourcepub fn builder() -> Builder
pub fn builder() -> Builder
Return a builder for configuring the construction of a DFA.
This is a convenience routine to avoid needing to import the
Builder
type in common cases.
§Example
This example shows how to use the builder to disable UTF-8 mode.
use regex_automata::{
dfa::onepass::DFA,
nfa::thompson,
util::syntax,
Match,
};
let re = DFA::builder()
.syntax(syntax::Config::new().utf8(false))
.thompson(thompson::Config::new().utf8(false))
.build(r"foo(?-u:[^b])ar.*")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = b"foo\xFFarzz\xE2\x98\xFF\n";
let expected = Some(Match::must(0, 0..8));
re.captures(&mut cache, haystack, &mut caps);
assert_eq!(expected, caps.get_match());
sourcepub fn create_captures(&self) -> Captures
pub fn create_captures(&self) -> Captures
Create a new empty set of capturing groups that is guaranteed to be valid for the search APIs on this DFA.
A Captures
value created for a specific DFA cannot be used with any
other DFA.
This is a convenience function for Captures::all
. See the
Captures
documentation for an explanation of its alternative
constructors that permit the DFA to do less work during a search, and
thus might make it faster.
sourcepub fn create_cache(&self) -> Cache
pub fn create_cache(&self) -> Cache
Create a new cache for this DFA.
The cache returned should only be used for searches for this
DFA. If you want to reuse the cache for another DFA, then you
must call Cache::reset
with that DFA (or, equivalently,
DFA::reset_cache
).
sourcepub fn reset_cache(&self, cache: &mut Cache)
pub fn reset_cache(&self, cache: &mut Cache)
Reset the given cache such that it can be used for searching with the this DFA (and only this DFA).
A cache reset permits reusing memory already allocated in this cache with a different DFA.
§Example
This shows how to re-purpose a cache for use with a different DFA.
use regex_automata::{dfa::onepass::DFA, Match};
let re1 = DFA::new(r"\w")?;
let re2 = DFA::new(r"\W")?;
let mut caps1 = re1.create_captures();
let mut caps2 = re2.create_captures();
let mut cache = re1.create_cache();
assert_eq!(
Some(Match::must(0, 0..2)),
{ re1.captures(&mut cache, "Δ", &mut caps1); caps1.get_match() },
);
// Using 'cache' with re2 is not allowed. It may result in panics or
// incorrect results. In order to re-purpose the cache, we must reset
// it with the one-pass DFA we'd like to use it with.
//
// Similarly, after this reset, using the cache with 're1' is also not
// allowed.
re2.reset_cache(&mut cache);
assert_eq!(
Some(Match::must(0, 0..3)),
{ re2.captures(&mut cache, "☃", &mut caps2); caps2.get_match() },
);
sourcepub fn get_config(&self) -> &Config
pub fn get_config(&self) -> &Config
Return the config for this one-pass DFA.
sourcepub fn pattern_len(&self) -> usize
pub fn pattern_len(&self) -> usize
Returns the total number of patterns compiled into this DFA.
In the case of a DFA that contains no patterns, this returns 0
.
sourcepub fn state_len(&self) -> usize
pub fn state_len(&self) -> usize
Returns the total number of states in this one-pass DFA.
Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.
sourcepub fn alphabet_len(&self) -> usize
pub fn alphabet_len(&self) -> usize
Returns the total number of elements in the alphabet for this DFA.
That is, this returns the total number of transitions that each state in this DFA must have. The maximum alphabet size is 256, which corresponds to each possible byte value.
The alphabet size may be less than 256 though, and unless
Config::byte_classes
is disabled, it is typically must less than
256. Namely, bytes are grouped into equivalence classes such that no
two bytes in the same class can distinguish a match from a non-match.
For example, in the regex ^[a-z]+$
, the ASCII bytes a-z
could
all be in the same equivalence class. This leads to a massive space
savings.
Note though that the alphabet length does not necessarily equal the
total stride space taken up by a single DFA state in the transition
table. Namely, for performance reasons, the stride is always the
smallest power of two that is greater than or equal to the alphabet
length. For this reason, DFA::stride
or DFA::stride2
are
often more useful. The alphabet length is typically useful only for
informational purposes.
Note also that unlike dense or sparse DFAs, a one-pass DFA does
not have a special end-of-input (EOI) transition. This is because
a one-pass DFA handles look-around assertions explicitly (like the
PikeVM
) and does not build
them into the transitions of the DFA.
sourcepub fn stride2(&self) -> usize
pub fn stride2(&self) -> usize
Returns the total stride for every state in this DFA, expressed as the exponent of a power of 2. The stride is the amount of space each state takes up in the transition table, expressed as a number of transitions. (Unused transitions map to dead states.)
The stride of a DFA is always equivalent to the smallest power of 2 that is greater than or equal to the DFA’s alphabet length. This definition uses extra space, but possibly permits faster translation between state identifiers and their corresponding offsets in this DFA’s transition table.
For example, if the DFA’s stride is 16 transitions, then its stride2
is 4
since 2^4 = 16
.
The minimum stride2
value is 1
(corresponding to a stride of 2
)
while the maximum stride2
value is 9
(corresponding to a stride
of 512
). The maximum in theory should be 8
, but because of some
implementation quirks that may be relaxed in the future, it is one more
than 8
. (Do note that a maximal stride is incredibly rare, as it
would imply that there is almost no redundant in the regex pattern.)
Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.
sourcepub fn stride(&self) -> usize
pub fn stride(&self) -> usize
Returns the total stride for every state in this DFA. This corresponds to the total number of transitions used by each state in this DFA’s transition table.
Please see DFA::stride2
for more information. In particular, this
returns the stride as the number of transitions, where as stride2
returns it as the exponent of a power of 2.
Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.
sourcepub fn memory_usage(&self) -> usize
pub fn memory_usage(&self) -> usize
Returns the memory usage, in bytes, of this DFA.
The memory usage is computed based on the number of bytes used to represent this DFA.
This does not include the stack size used up by this DFA. To
compute that, use std::mem::size_of::<onepass::DFA>()
.
source§impl DFA
impl DFA
sourcepub fn is_match<'h, I: Into<Input<'h>>>(
&self,
cache: &mut Cache,
input: I,
) -> bool
pub fn is_match<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, ) -> bool
Executes an anchored leftmost forward search, and returns true if and only if this one-pass DFA matches the given haystack.
This routine may short circuit if it knows that scanning future
input will never lead to a different result. In particular, if the
underlying DFA enters a match state, then this routine will return
true
immediately without inspecting any future input. (Consider how
this might make a difference given the regex a+
on the haystack
aaaaaaaaaaaaaaa
. This routine can stop after it sees the first a
,
but routines like find
need to continue searching because +
is
greedy by default.)
The given Input
is forcefully set to use Anchored::Yes
if the
given configuration was Anchored::No
(which is the default).
§Panics
This routine panics if the search could not complete. This can occur in the following circumstances:
- When the provided
Input
configuration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Pattern
without enablingConfig::starts_for_each_pattern
.
When a search panics, callers cannot know whether a match exists or not.
Use DFA::try_search
if you want to handle these panics as error
values instead.
§Example
This shows basic usage:
use regex_automata::dfa::onepass::DFA;
let re = DFA::new("foo[0-9]+bar")?;
let mut cache = re.create_cache();
assert!(re.is_match(&mut cache, "foo12345bar"));
assert!(!re.is_match(&mut cache, "foobar"));
§Example: consistency with search APIs
is_match
is guaranteed to return true
whenever captures
returns
a match. This includes searches that are executed entirely within a
codepoint:
use regex_automata::{dfa::onepass::DFA, Input};
let re = DFA::new("a*")?;
let mut cache = re.create_cache();
assert!(!re.is_match(&mut cache, Input::new("☃").span(1..2)));
Notice that when UTF-8 mode is disabled, then the above reports a match because the restriction against zero-width matches that split a codepoint has been lifted:
use regex_automata::{dfa::onepass::DFA, nfa::thompson::NFA, Input};
let re = DFA::builder()
.thompson(NFA::config().utf8(false))
.build("a*")?;
let mut cache = re.create_cache();
assert!(re.is_match(&mut cache, Input::new("☃").span(1..2)));
sourcepub fn find<'h, I: Into<Input<'h>>>(
&self,
cache: &mut Cache,
input: I,
) -> Option<Match>
pub fn find<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, ) -> Option<Match>
Executes an anchored leftmost forward search, and returns a Match
if
and only if this one-pass DFA matches the given haystack.
This routine only includes the overall match span. To get access to the
individual spans of each capturing group, use DFA::captures
.
The given Input
is forcefully set to use Anchored::Yes
if the
given configuration was Anchored::No
(which is the default).
§Panics
This routine panics if the search could not complete. This can occur in the following circumstances:
- When the provided
Input
configuration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Pattern
without enablingConfig::starts_for_each_pattern
.
When a search panics, callers cannot know whether a match exists or not.
Use DFA::try_search
if you want to handle these panics as error
values instead.
§Example
Leftmost first match semantics corresponds to the match with the
smallest starting offset, but where the end offset is determined by
preferring earlier branches in the original regular expression. For
example, Sam|Samwise
will match Sam
in Samwise
, but Samwise|Sam
will match Samwise
in Samwise
.
Generally speaking, the “leftmost first” match is how most backtracking
regular expressions tend to work. This is in contrast to POSIX-style
regular expressions that yield “leftmost longest” matches. Namely,
both Sam|Samwise
and Samwise|Sam
match Samwise
when using
leftmost longest semantics. (This crate does not currently support
leftmost longest semantics.)
use regex_automata::{dfa::onepass::DFA, Match};
let re = DFA::new("foo[0-9]+")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..8);
assert_eq!(Some(expected), re.find(&mut cache, "foo12345"));
// Even though a match is found after reading the first byte (`a`),
// the leftmost first match semantics demand that we find the earliest
// match that prefers earlier parts of the pattern over later parts.
let re = DFA::new("abc|a")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..3);
assert_eq!(Some(expected), re.find(&mut cache, "abc"));
sourcepub fn captures<'h, I: Into<Input<'h>>>(
&self,
cache: &mut Cache,
input: I,
caps: &mut Captures,
)
pub fn captures<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, caps: &mut Captures, )
Executes an anchored leftmost forward search and writes the spans
of capturing groups that participated in a match into the provided
Captures
value. If no match was found, then Captures::is_match
is guaranteed to return false
.
The given Input
is forcefully set to use Anchored::Yes
if the
given configuration was Anchored::No
(which is the default).
§Panics
This routine panics if the search could not complete. This can occur in the following circumstances:
- When the provided
Input
configuration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Pattern
without enablingConfig::starts_for_each_pattern
.
When a search panics, callers cannot know whether a match exists or not.
Use DFA::try_search
if you want to handle these panics as error
values instead.
§Example
This shows a simple example of a one-pass regex that extracts capturing group spans.
use regex_automata::{dfa::onepass::DFA, Match, Span};
let re = DFA::new(
// Notice that we use ASCII here. The corresponding Unicode regex
// is sadly not one-pass.
"(?P<first>[[:alpha:]]+)[[:space:]]+(?P<last>[[:alpha:]]+)",
)?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
re.captures(&mut cache, "Bruce Springsteen", &mut caps);
assert_eq!(Some(Match::must(0, 0..17)), caps.get_match());
assert_eq!(Some(Span::from(0..5)), caps.get_group(1));
assert_eq!(Some(Span::from(6..17)), caps.get_group_by_name("last"));
sourcepub fn try_search(
&self,
cache: &mut Cache,
input: &Input<'_>,
caps: &mut Captures,
) -> Result<(), MatchError>
pub fn try_search( &self, cache: &mut Cache, input: &Input<'_>, caps: &mut Captures, ) -> Result<(), MatchError>
Executes an anchored leftmost forward search and writes the spans
of capturing groups that participated in a match into the provided
Captures
value. If no match was found, then Captures::is_match
is guaranteed to return false
.
The differences with DFA::captures
are:
- This returns an error instead of panicking if the search fails.
- Accepts an
&Input
instead of aInto<Input>
. This permits reusing the same input for multiple searches, which may be important for latency. - This does not automatically change the
Anchored
mode fromNo
toYes
. Instead, ifInput::anchored
isAnchored::No
, then an error is returned.
§Errors
This routine errors if the search could not complete. This can occur in the following circumstances:
- When the provided
Input
configuration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Pattern
without enablingConfig::starts_for_each_pattern
.
When a search returns an error, callers cannot know whether a match exists or not.
§Example: specific pattern search
This example shows how to build a multi-regex that permits searching for specific patterns. Note that this is somewhat less useful than in other regex engines, since a one-pass DFA by definition has no ambiguity about which pattern can match at a position. That is, if it were possible for two different patterns to match at the same starting position, then the multi-regex would not be one-pass and construction would have failed.
Nevertheless, this can still be useful if you only care about matches for a specific pattern, and want the DFA to report “no match” even if some other pattern would have matched.
Note that in order to make use of this functionality,
Config::starts_for_each_pattern
must be enabled. It is disabled
by default since it may result in higher memory usage.
use regex_automata::{
dfa::onepass::DFA, Anchored, Input, Match, PatternID,
};
let re = DFA::builder()
.configure(DFA::config().starts_for_each_pattern(true))
.build_many(&["[a-z]+", "[0-9]+"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "123abc";
let input = Input::new(haystack).anchored(Anchored::Yes);
// A normal multi-pattern search will show pattern 1 matches.
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());
// If we only want to report pattern 0 matches, then we'll get no
// match here.
let input = input.anchored(Anchored::Pattern(PatternID::must(0)));
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(None, caps.get_match());
§Example: specifying the bounds of a search
This example shows how providing the bounds of a search can produce different results than simply sub-slicing the haystack.
use regex_automata::{dfa::onepass::DFA, Anchored, Input, Match};
// one-pass DFAs fully support Unicode word boundaries!
// A sad joke is that a Unicode aware regex like \w+\s is not one-pass.
// :-(
let re = DFA::new(r"\b[0-9]{3}\b")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "foo123bar";
// Since we sub-slice the haystack, the search doesn't know about
// the larger context and assumes that `123` is surrounded by word
// boundaries. And of course, the match position is reported relative
// to the sub-slice as well, which means we get `0..3` instead of
// `3..6`.
let expected = Some(Match::must(0, 0..3));
let input = Input::new(&haystack[3..6]).anchored(Anchored::Yes);
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(expected, caps.get_match());
// But if we provide the bounds of the search within the context of the
// entire haystack, then the search can take the surrounding context
// into account. (And if we did find a match, it would be reported
// as a valid offset into `haystack` instead of its sub-slice.)
let expected = None;
let input = Input::new(haystack).range(3..6).anchored(Anchored::Yes);
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(expected, caps.get_match());
sourcepub fn try_search_slots(
&self,
cache: &mut Cache,
input: &Input<'_>,
slots: &mut [Option<NonMaxUsize>],
) -> Result<Option<PatternID>, MatchError>
pub fn try_search_slots( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Result<Option<PatternID>, MatchError>
Executes an anchored leftmost forward search and writes the spans
of capturing groups that participated in a match into the provided
slots
, and returns the matching pattern ID. The contents of the
slots for patterns other than the matching pattern are unspecified. If
no match was found, then None
is returned and the contents of all
slots
is unspecified.
This is like DFA::try_search
, but it accepts a raw slots slice
instead of a Captures
value. This is useful in contexts where you
don’t want or need to allocate a Captures
.
It is legal to pass any number of slots to this routine. If the regex engine would otherwise write a slot offset that doesn’t fit in the provided slice, then it is simply skipped. In general though, there are usually three slice lengths you might want to use:
- An empty slice, if you only care about which pattern matched.
- A slice with
pattern_len() * 2
slots, if you only care about the overall match spans for each matching pattern. - A slice with
slot_len()
slots, which permits recording match offsets for every capturing group in every pattern.
§Errors
This routine errors if the search could not complete. This can occur in the following circumstances:
- When the provided
Input
configuration is not supported. For example, by providing an unsupported anchor mode. Concretely, this occurs when usingAnchored::Pattern
without enablingConfig::starts_for_each_pattern
.
When a search returns an error, callers cannot know whether a match exists or not.
§Example
This example shows how to find the overall match offsets in a
multi-pattern search without allocating a Captures
value. Indeed, we
can put our slots right on the stack.
use regex_automata::{dfa::onepass::DFA, Anchored, Input, PatternID};
let re = DFA::new_many(&[
r"[a-zA-Z]+",
r"[0-9]+",
])?;
let mut cache = re.create_cache();
let input = Input::new("123").anchored(Anchored::Yes);
// We only care about the overall match offsets here, so we just
// allocate two slots for each pattern. Each slot records the start
// and end of the match.
let mut slots = [None; 4];
let pid = re.try_search_slots(&mut cache, &input, &mut slots)?;
assert_eq!(Some(PatternID::must(1)), pid);
// The overall match offsets are always at 'pid * 2' and 'pid * 2 + 1'.
// See 'GroupInfo' for more details on the mapping between groups and
// slot indices.
let slot_start = pid.unwrap().as_usize() * 2;
let slot_end = slot_start + 1;
assert_eq!(Some(0), slots[slot_start].map(|s| s.get()));
assert_eq!(Some(3), slots[slot_end].map(|s| s.get()));
fn try_search_slots_imp( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Result<Option<PatternID>, MatchError>
source§impl DFA
impl DFA
fn search_imp( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Result<Option<PatternID>, MatchError>
sourcefn find_match(
&self,
cache: &mut Cache,
input: &Input<'_>,
at: usize,
sid: StateID,
slots: &mut [Option<NonMaxUsize>],
matched_pid: &mut Option<PatternID>,
) -> bool
fn find_match( &self, cache: &mut Cache, input: &Input<'_>, at: usize, sid: StateID, slots: &mut [Option<NonMaxUsize>], matched_pid: &mut Option<PatternID>, ) -> bool
Assumes ‘sid’ is a match state and looks for whether a match can be reported. If so, appropriate offsets are written to ‘slots’ and ‘matched_pid’ is set to the matching pattern ID.
Even when ‘sid’ is a match state, it’s possible that a match won’t be reported. For example, when the conditional epsilon transitions leading to the match state aren’t satisfied at the given position in the haystack.
source§impl DFA
impl DFA
sourcefn start(&self) -> StateID
fn start(&self) -> StateID
Returns the anchored start state for matching any pattern in this DFA.
sourcefn start_pattern(&self, pid: PatternID) -> Result<StateID, MatchError>
fn start_pattern(&self, pid: PatternID) -> Result<StateID, MatchError>
Returns the anchored start state for matching the given pattern. If
‘starts_for_each_pattern’
was not enabled, then this returns an error. If the given pattern is
not in this DFA, then Ok(None)
is returned.
sourcefn transition(&self, sid: StateID, byte: u8) -> Transition
fn transition(&self, sid: StateID, byte: u8) -> Transition
Returns the transition from the given state ID and byte of input. The transition includes the next state ID, the slots that should be saved and any conditional epsilon transitions that must be satisfied in order to take this transition.
sourcefn set_transition(&mut self, sid: StateID, byte: u8, to: Transition)
fn set_transition(&mut self, sid: StateID, byte: u8, to: Transition)
Set the transition from the given state ID and byte of input to the transition given.
sourcefn sparse_transitions(&self, sid: StateID) -> SparseTransitionIter<'_> ⓘ
fn sparse_transitions(&self, sid: StateID) -> SparseTransitionIter<'_> ⓘ
Return an iterator of “sparse” transitions for the given state ID. “sparse” in this context means that consecutive transitions that are equivalent are returned as one group, and transitions to the DEAD state are ignored.
This winds up being useful for debug printing, since it’s much terser to display runs of equivalent transitions than the transition for every possible byte value. Indeed, in practice, it’s very common for runs of equivalent transitions to appear.
sourcefn pattern_epsilons(&self, sid: StateID) -> PatternEpsilons
fn pattern_epsilons(&self, sid: StateID) -> PatternEpsilons
Return the pattern epsilons for the given state ID.
If the given state ID does not correspond to a match state ID, then the pattern epsilons returned is empty.
sourcefn set_pattern_epsilons(&mut self, sid: StateID, pateps: PatternEpsilons)
fn set_pattern_epsilons(&mut self, sid: StateID, pateps: PatternEpsilons)
Set the pattern epsilons for the given state ID.
sourcefn prev_state_id(&self, id: StateID) -> Option<StateID>
fn prev_state_id(&self, id: StateID) -> Option<StateID>
Returns the state ID prior to the one given. This returns None if the given ID is the first DFA state.
sourcefn last_state_id(&self) -> StateID
fn last_state_id(&self) -> StateID
Returns the state ID of the last state in this DFA’s transition table. “last” in this context means the last state to appear in memory, i.e., the one with the greatest ID.
sourcepub(super) fn swap_states(&mut self, id1: StateID, id2: StateID)
pub(super) fn swap_states(&mut self, id1: StateID, id2: StateID)
Move the transitions from ‘id1’ to ‘id2’ and vice versa.
WARNING: This does not update the rest of the transition table to have transitions to ‘id1’ changed to ‘id2’ and vice versa. This merely moves the states in memory.
Trait Implementations§
source§impl Remappable for DFA
impl Remappable for DFA
source§fn stride2(&self) -> usize
fn stride2(&self) -> usize
source§fn swap_states(&mut self, id1: StateID, id2: StateID)
fn swap_states(&mut self, id1: StateID, id2: StateID)
id1
are now in the memory region where the transitions for id2
were, and all of the transitions in id2
are now in the memory region
where the transitions for id1
were. Read more