Struct regex_automata::dfa::onepass::DFA

source ·
pub struct DFA {
    config: Config,
    nfa: NFA,
    table: Vec<Transition>,
    starts: Vec<StateID>,
    min_match_id: StateID,
    classes: ByteClasses,
    alphabet_len: usize,
    stride2: usize,
    pateps_offset: usize,
    explicit_slot_start: usize,
}
Expand description

A one-pass DFA for executing a subset of anchored regex searches while resolving capturing groups.

A one-pass DFA can be built from an NFA that is one-pass. An NFA is one-pass when there is never any ambiguity about how to continue a search. For example, a*a is not one-pass becuase during a search, it’s not possible to know whether to continue matching the a* or to move on to the single a. However, a*b is one-pass, because for every byte in the input, it’s always clear when to move on from a* to b.

§Only anchored searches are supported

In this crate, especially for DFAs, unanchored searches are implemented by treating the pattern as if it had a (?s-u:.)*? prefix. While the prefix is one-pass on its own, adding anything after it, e.g., (?s-u:.)*?a will make the overall pattern not one-pass. Why? Because the (?s-u:.) matches any byte, and there is therefore ambiguity as to when the prefix should stop matching and something else should start matching.

Therefore, one-pass DFAs do not support unanchored searches. In addition to many regexes simply not being one-pass, it implies that one-pass DFAs have limited utility. With that said, when a one-pass DFA can be used, it can potentially provide a dramatic speed up over alternatives like the BoundedBacktracker and the PikeVM. In particular, a one-pass DFA is the only DFA capable of reporting the spans of matching capturing groups.

To clarify, when we say that unanchored searches are not supported, what that actually means is:

  • The high level routines, DFA::is_match and DFA::captures, always do anchored searches.
  • Since iterators are most useful in the context of unanchored searches, there is no DFA::captures_iter method.
  • For lower level routines like DFA::try_search, an error will be returned if the given Input is configured to do an unanchored search or search for an invalid pattern ID. (Note that an Input is configured to do an unanchored search by default, so just giving a Input::new is guaranteed to return an error.)

§Other limitations

In addition to the configurable heap limit and the requirement that a regex pattern be one-pass, there are some other limitations:

  • There is an internal limit on the total number of explicit capturing groups that appear across all patterns. It is somewhat small and there is no way to configure it. If your pattern(s) exceed this limit, then building a one-pass DFA will fail.
  • If the number of patterns exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you’re unlikely to hit it.
  • If the total number of states exceeds an internal unconfigurable limit, then building a one-pass DFA will fail. This limit is quite large and you’re unlikely to hit it.

§Other examples of regexes that aren’t one-pass

One particularly unfortunate example is that enabling Unicode can cause regexes that were one-pass to no longer be one-pass. Consider the regex (?-u)\w*\s for example. It is one-pass because there is exactly no overlap between the ASCII definitions of \w and \s. But \w*\s (i.e., with Unicode enabled) is not one-pass because \w and \s get translated to UTF-8 automatons. And while the codepoints in \w and \s do not overlap, the underlying UTF-8 encodings do. Indeed, because of the overlap between UTF-8 automata, the use of Unicode character classes will tend to vastly increase the likelihood of a regex not being one-pass.

§How does one know if a regex is one-pass or not?

At the time of writing, the only way to know is to try and build a one-pass DFA. The one-pass property is checked while constructing the DFA.

This does mean that you might potentially waste some CPU cycles and memory by optimistically trying to build a one-pass DFA. But this is currently the only way. In the future, building a one-pass DFA might be able to use some heuristics to detect common violations of the one-pass property and bail more quickly.

§Resource usage

Unlike a general DFA, a one-pass DFA has stricter bounds on its resource usage. Namely, construction of a one-pass DFA has a time and space complexity of O(n), where n ~ nfa.states().len(). (A general DFA’s time and space complexity is O(2^n).) This smaller time bound is achieved because there is at most one DFA state created for each NFA state. If additional DFA states would be required, then the pattern is not one-pass and construction will fail.

Note though that currently, this DFA uses a fully dense representation. This means that while its space complexity is no worse than an NFA, it may in practice use more memory because of higher constant factors. The reason for this trade off is two-fold. Firstly, a dense representation makes the search faster. Secondly, the bigger an NFA, the more unlikely it is to be one-pass. Therefore, most one-pass DFAs are usually pretty small.

§Example

This example shows that the one-pass DFA implements Unicode word boundaries correctly while simultaneously reporting spans for capturing groups that participate in a match. (This is the only DFA that implements full support for Unicode word boundaries.)

use regex_automata::{dfa::onepass::DFA, Match, Span};

let re = DFA::new(r"\b(?P<first>\w+)[[:space:]]+(?P<last>\w+)\b")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

re.captures(&mut cache, "Шерлок Холмс", &mut caps);
assert_eq!(Some(Match::must(0, 0..23)), caps.get_match());
assert_eq!(Some(Span::from(0..12)), caps.get_group_by_name("first"));
assert_eq!(Some(Span::from(13..23)), caps.get_group_by_name("last"));

§Example: iteration

Unlike other regex engines in this crate, this one does not provide iterator search functions. This is because a one-pass DFA only supports anchored searches, and so iterator functions are generally not applicable.

However, if you know that all of your matches are directly adjacent, then an iterator can be used. The util::iter::Searcher type can be used for this purpose:

use regex_automata::{
    dfa::onepass::DFA,
    util::iter::Searcher,
    Anchored, Input, Span,
};

let re = DFA::new(r"\w(\d)\w")?;
let (mut cache, caps) = (re.create_cache(), re.create_captures());
let input = Input::new("a1zb2yc3x").anchored(Anchored::Yes);

let mut it = Searcher::new(input).into_captures_iter(caps, |input, caps| {
    Ok(re.try_search(&mut cache, input, caps)?)
}).infallible();
let caps0 = it.next().unwrap();
assert_eq!(Some(Span::from(1..2)), caps0.get_group(1));

Fields§

§config: Config

The configuration provided by the caller.

§nfa: NFA

The NFA used to build this DFA.

NOTE: We probably don’t need to store the NFA here, but we use enough bits from it that it’s convenient to do so. And there really isn’t much cost to doing so either, since an NFA is reference counted internally.

§table: Vec<Transition>

The transition table. Given a state ID ‘s’ and a byte of haystack ‘b’, the next state is table[sid + classes[byte]].

The stride of this table (i.e., the number of columns) is always a power of 2, even if the alphabet length is smaller. This makes converting between state IDs and state indices very cheap.

Note that the stride always includes room for one extra “transition” that isn’t actually a transition. It is a ‘PatternEpsilons’ that is used for match states only. Because of this, the maximum number of active columns in the transition table is 257, which means the maximum stride is 512 (the next power of 2 greater than or equal to 257).

§starts: Vec<StateID>

The DFA state IDs of the starting states.

starts[0] is always present and corresponds to the starting state when searching for matches of any pattern in the DFA.

starts[i] where i>0 corresponds to the starting state for the pattern ID ‘i-1’. These starting states are optional.

§min_match_id: StateID

Every state ID >= this value corresponds to a match state.

This is what a search uses to detect whether a state is a match state or not. It requires only a simple comparison instead of bit-unpacking the PatternEpsilons from every state.

§classes: ByteClasses

The alphabet of this DFA, split into equivalence classes. Bytes in the same equivalence class can never discriminate between a match and a non-match.

§alphabet_len: usize

The number of elements in each state in the transition table. This may be less than the stride, since the stride is always a power of 2 and the alphabet length can be anything up to and including 256.

§stride2: usize

The number of columns in the transition table, expressed as a power of 2.

§pateps_offset: usize

The offset at which the PatternEpsilons for a match state is stored in the transition table.

PERF: One wonders whether it would be better to put this in a separate allocation, since only match states have a non-empty PatternEpsilons and the number of match states tends be dwarfed by the number of non-match states. So this would save ‘8*len(non_match_states)’ for each DFA. The question is whether moving this to a different allocation will lead to a perf hit during searches. You might think dealing with match states is rare, but some regexes spend a lot of time in match states gobbling up input. But… match state handling is already somewhat expensive, so maybe this wouldn’t do much? Either way, it’s worth experimenting.

§explicit_slot_start: usize

The first explicit slot index. This refers to the first slot appearing immediately after the last implicit slot. It is always ’patterns.len()

  • 2’.

We record this because we only store the explicit slots in our DFA transition table that need to be saved. Implicit slots are handled automatically as part of the search.

Implementations§

source§

impl DFA

source

pub fn new(pattern: &str) -> Result<DFA, BuildError>

Parse the given regular expression using the default configuration and return the corresponding one-pass DFA.

If you want a non-default configuration, then use the Builder to set your own configuration.

§Example
use regex_automata::{dfa::onepass::DFA, Match};

let re = DFA::new("foo[0-9]+bar")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

re.captures(&mut cache, "foo12345barzzz", &mut caps);
assert_eq!(Some(Match::must(0, 0..11)), caps.get_match());
source

pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA, BuildError>

Like new, but parses multiple patterns into a single “multi regex.” This similarly uses the default regex configuration.

§Example
use regex_automata::{dfa::onepass::DFA, Match};

let re = DFA::new_many(&["[a-z]+", "[0-9]+"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

re.captures(&mut cache, "abc123", &mut caps);
assert_eq!(Some(Match::must(0, 0..3)), caps.get_match());

re.captures(&mut cache, "123abc", &mut caps);
assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());
source

pub fn new_from_nfa(nfa: NFA) -> Result<DFA, BuildError>

Like new, but builds a one-pass DFA directly from an NFA. This is useful if you already have an NFA, or even if you hand-assembled the NFA.

§Example

This shows how to hand assemble a regular expression via its HIR, compile an NFA from it and build a one-pass DFA from the NFA.

use regex_automata::{
    dfa::onepass::DFA,
    nfa::thompson::NFA,
    Match,
};
use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};

let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![
    ClassBytesRange::new(b'0', b'9'),
    ClassBytesRange::new(b'A', b'Z'),
    ClassBytesRange::new(b'_', b'_'),
    ClassBytesRange::new(b'a', b'z'),
])));

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;

let re = DFA::new_from_nfa(nfa)?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let expected = Some(Match::must(0, 0..1));
re.captures(&mut cache, "A", &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn always_match() -> Result<DFA, BuildError>

Create a new one-pass DFA that matches every input.

§Example
use regex_automata::{dfa::onepass::DFA, Match};

let dfa = DFA::always_match()?;
let mut cache = dfa.create_cache();
let mut caps = dfa.create_captures();

let expected = Match::must(0, 0..0);
dfa.captures(&mut cache, "", &mut caps);
assert_eq!(Some(expected), caps.get_match());
dfa.captures(&mut cache, "foo", &mut caps);
assert_eq!(Some(expected), caps.get_match());
source

pub fn never_match() -> Result<DFA, BuildError>

Create a new one-pass DFA that never matches any input.

§Example
use regex_automata::dfa::onepass::DFA;

let dfa = DFA::never_match()?;
let mut cache = dfa.create_cache();
let mut caps = dfa.create_captures();

dfa.captures(&mut cache, "", &mut caps);
assert_eq!(None, caps.get_match());
dfa.captures(&mut cache, "foo", &mut caps);
assert_eq!(None, caps.get_match());
source

pub fn config() -> Config

Return a default configuration for a DFA.

This is a convenience routine to avoid needing to import the Config type when customizing the construction of a DFA.

§Example

This example shows how to change the match semantics of this DFA from its default “leftmost first” to “all.” When using “all,” non-greediness doesn’t apply and neither does preference order matching. Instead, the longest match possible is always returned. (Although, by construction, it’s impossible for a one-pass DFA to have a different answer for “preference order” vs “longest match.”)

use regex_automata::{dfa::onepass::DFA, Match, MatchKind};

let re = DFA::builder()
    .configure(DFA::config().match_kind(MatchKind::All))
    .build(r"(abc)+?")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();

re.captures(&mut cache, "abcabc", &mut caps);
// Normally, the non-greedy repetition would give us a 0..3 match.
assert_eq!(Some(Match::must(0, 0..6)), caps.get_match());
source

pub fn builder() -> Builder

Return a builder for configuring the construction of a DFA.

This is a convenience routine to avoid needing to import the Builder type in common cases.

§Example

This example shows how to use the builder to disable UTF-8 mode.

use regex_automata::{
    dfa::onepass::DFA,
    nfa::thompson,
    util::syntax,
    Match,
};

let re = DFA::builder()
    .syntax(syntax::Config::new().utf8(false))
    .thompson(thompson::Config::new().utf8(false))
    .build(r"foo(?-u:[^b])ar.*")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

let haystack = b"foo\xFFarzz\xE2\x98\xFF\n";
let expected = Some(Match::must(0, 0..8));
re.captures(&mut cache, haystack, &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn create_captures(&self) -> Captures

Create a new empty set of capturing groups that is guaranteed to be valid for the search APIs on this DFA.

A Captures value created for a specific DFA cannot be used with any other DFA.

This is a convenience function for Captures::all. See the Captures documentation for an explanation of its alternative constructors that permit the DFA to do less work during a search, and thus might make it faster.

source

pub fn create_cache(&self) -> Cache

Create a new cache for this DFA.

The cache returned should only be used for searches for this DFA. If you want to reuse the cache for another DFA, then you must call Cache::reset with that DFA (or, equivalently, DFA::reset_cache).

source

pub fn reset_cache(&self, cache: &mut Cache)

Reset the given cache such that it can be used for searching with the this DFA (and only this DFA).

A cache reset permits reusing memory already allocated in this cache with a different DFA.

§Example

This shows how to re-purpose a cache for use with a different DFA.

use regex_automata::{dfa::onepass::DFA, Match};

let re1 = DFA::new(r"\w")?;
let re2 = DFA::new(r"\W")?;
let mut caps1 = re1.create_captures();
let mut caps2 = re2.create_captures();

let mut cache = re1.create_cache();
assert_eq!(
    Some(Match::must(0, 0..2)),
    { re1.captures(&mut cache, "Δ", &mut caps1); caps1.get_match() },
);

// Using 'cache' with re2 is not allowed. It may result in panics or
// incorrect results. In order to re-purpose the cache, we must reset
// it with the one-pass DFA we'd like to use it with.
//
// Similarly, after this reset, using the cache with 're1' is also not
// allowed.
re2.reset_cache(&mut cache);
assert_eq!(
    Some(Match::must(0, 0..3)),
    { re2.captures(&mut cache, "☃", &mut caps2); caps2.get_match() },
);
source

pub fn get_config(&self) -> &Config

Return the config for this one-pass DFA.

source

pub fn get_nfa(&self) -> &NFA

Returns a reference to the underlying NFA.

source

pub fn pattern_len(&self) -> usize

Returns the total number of patterns compiled into this DFA.

In the case of a DFA that contains no patterns, this returns 0.

source

pub fn state_len(&self) -> usize

Returns the total number of states in this one-pass DFA.

Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.

source

pub fn alphabet_len(&self) -> usize

Returns the total number of elements in the alphabet for this DFA.

That is, this returns the total number of transitions that each state in this DFA must have. The maximum alphabet size is 256, which corresponds to each possible byte value.

The alphabet size may be less than 256 though, and unless Config::byte_classes is disabled, it is typically must less than 256. Namely, bytes are grouped into equivalence classes such that no two bytes in the same class can distinguish a match from a non-match. For example, in the regex ^[a-z]+$, the ASCII bytes a-z could all be in the same equivalence class. This leads to a massive space savings.

Note though that the alphabet length does not necessarily equal the total stride space taken up by a single DFA state in the transition table. Namely, for performance reasons, the stride is always the smallest power of two that is greater than or equal to the alphabet length. For this reason, DFA::stride or DFA::stride2 are often more useful. The alphabet length is typically useful only for informational purposes.

Note also that unlike dense or sparse DFAs, a one-pass DFA does not have a special end-of-input (EOI) transition. This is because a one-pass DFA handles look-around assertions explicitly (like the PikeVM) and does not build them into the transitions of the DFA.

source

pub fn stride2(&self) -> usize

Returns the total stride for every state in this DFA, expressed as the exponent of a power of 2. The stride is the amount of space each state takes up in the transition table, expressed as a number of transitions. (Unused transitions map to dead states.)

The stride of a DFA is always equivalent to the smallest power of 2 that is greater than or equal to the DFA’s alphabet length. This definition uses extra space, but possibly permits faster translation between state identifiers and their corresponding offsets in this DFA’s transition table.

For example, if the DFA’s stride is 16 transitions, then its stride2 is 4 since 2^4 = 16.

The minimum stride2 value is 1 (corresponding to a stride of 2) while the maximum stride2 value is 9 (corresponding to a stride of 512). The maximum in theory should be 8, but because of some implementation quirks that may be relaxed in the future, it is one more than 8. (Do note that a maximal stride is incredibly rare, as it would imply that there is almost no redundant in the regex pattern.)

Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.

source

pub fn stride(&self) -> usize

Returns the total stride for every state in this DFA. This corresponds to the total number of transitions used by each state in this DFA’s transition table.

Please see DFA::stride2 for more information. In particular, this returns the stride as the number of transitions, where as stride2 returns it as the exponent of a power of 2.

Note that unlike dense or sparse DFAs, a one-pass DFA does not expose a low level DFA API. Therefore, this routine has little use other than being informational.

source

pub fn memory_usage(&self) -> usize

Returns the memory usage, in bytes, of this DFA.

The memory usage is computed based on the number of bytes used to represent this DFA.

This does not include the stack size used up by this DFA. To compute that, use std::mem::size_of::<onepass::DFA>().

source§

impl DFA

source

pub fn is_match<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, ) -> bool

Executes an anchored leftmost forward search, and returns true if and only if this one-pass DFA matches the given haystack.

This routine may short circuit if it knows that scanning future input will never lead to a different result. In particular, if the underlying DFA enters a match state, then this routine will return true immediately without inspecting any future input. (Consider how this might make a difference given the regex a+ on the haystack aaaaaaaaaaaaaaa. This routine can stop after it sees the first a, but routines like find need to continue searching because + is greedy by default.)

The given Input is forcefully set to use Anchored::Yes if the given configuration was Anchored::No (which is the default).

§Panics

This routine panics if the search could not complete. This can occur in the following circumstances:

When a search panics, callers cannot know whether a match exists or not.

Use DFA::try_search if you want to handle these panics as error values instead.

§Example

This shows basic usage:

use regex_automata::dfa::onepass::DFA;

let re = DFA::new("foo[0-9]+bar")?;
let mut cache = re.create_cache();

assert!(re.is_match(&mut cache, "foo12345bar"));
assert!(!re.is_match(&mut cache, "foobar"));
§Example: consistency with search APIs

is_match is guaranteed to return true whenever captures returns a match. This includes searches that are executed entirely within a codepoint:

use regex_automata::{dfa::onepass::DFA, Input};

let re = DFA::new("a*")?;
let mut cache = re.create_cache();

assert!(!re.is_match(&mut cache, Input::new("☃").span(1..2)));

Notice that when UTF-8 mode is disabled, then the above reports a match because the restriction against zero-width matches that split a codepoint has been lifted:

use regex_automata::{dfa::onepass::DFA, nfa::thompson::NFA, Input};

let re = DFA::builder()
    .thompson(NFA::config().utf8(false))
    .build("a*")?;
let mut cache = re.create_cache();

assert!(re.is_match(&mut cache, Input::new("☃").span(1..2)));
source

pub fn find<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, ) -> Option<Match>

Executes an anchored leftmost forward search, and returns a Match if and only if this one-pass DFA matches the given haystack.

This routine only includes the overall match span. To get access to the individual spans of each capturing group, use DFA::captures.

The given Input is forcefully set to use Anchored::Yes if the given configuration was Anchored::No (which is the default).

§Panics

This routine panics if the search could not complete. This can occur in the following circumstances:

When a search panics, callers cannot know whether a match exists or not.

Use DFA::try_search if you want to handle these panics as error values instead.

§Example

Leftmost first match semantics corresponds to the match with the smallest starting offset, but where the end offset is determined by preferring earlier branches in the original regular expression. For example, Sam|Samwise will match Sam in Samwise, but Samwise|Sam will match Samwise in Samwise.

Generally speaking, the “leftmost first” match is how most backtracking regular expressions tend to work. This is in contrast to POSIX-style regular expressions that yield “leftmost longest” matches. Namely, both Sam|Samwise and Samwise|Sam match Samwise when using leftmost longest semantics. (This crate does not currently support leftmost longest semantics.)

use regex_automata::{dfa::onepass::DFA, Match};

let re = DFA::new("foo[0-9]+")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..8);
assert_eq!(Some(expected), re.find(&mut cache, "foo12345"));

// Even though a match is found after reading the first byte (`a`),
// the leftmost first match semantics demand that we find the earliest
// match that prefers earlier parts of the pattern over later parts.
let re = DFA::new("abc|a")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..3);
assert_eq!(Some(expected), re.find(&mut cache, "abc"));
source

pub fn captures<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, caps: &mut Captures, )

Executes an anchored leftmost forward search and writes the spans of capturing groups that participated in a match into the provided Captures value. If no match was found, then Captures::is_match is guaranteed to return false.

The given Input is forcefully set to use Anchored::Yes if the given configuration was Anchored::No (which is the default).

§Panics

This routine panics if the search could not complete. This can occur in the following circumstances:

When a search panics, callers cannot know whether a match exists or not.

Use DFA::try_search if you want to handle these panics as error values instead.

§Example

This shows a simple example of a one-pass regex that extracts capturing group spans.

use regex_automata::{dfa::onepass::DFA, Match, Span};

let re = DFA::new(
    // Notice that we use ASCII here. The corresponding Unicode regex
    // is sadly not one-pass.
    "(?P<first>[[:alpha:]]+)[[:space:]]+(?P<last>[[:alpha:]]+)",
)?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

re.captures(&mut cache, "Bruce Springsteen", &mut caps);
assert_eq!(Some(Match::must(0, 0..17)), caps.get_match());
assert_eq!(Some(Span::from(0..5)), caps.get_group(1));
assert_eq!(Some(Span::from(6..17)), caps.get_group_by_name("last"));

Executes an anchored leftmost forward search and writes the spans of capturing groups that participated in a match into the provided Captures value. If no match was found, then Captures::is_match is guaranteed to return false.

The differences with DFA::captures are:

  1. This returns an error instead of panicking if the search fails.
  2. Accepts an &Input instead of a Into<Input>. This permits reusing the same input for multiple searches, which may be important for latency.
  3. This does not automatically change the Anchored mode from No to Yes. Instead, if Input::anchored is Anchored::No, then an error is returned.
§Errors

This routine errors if the search could not complete. This can occur in the following circumstances:

When a search returns an error, callers cannot know whether a match exists or not.

This example shows how to build a multi-regex that permits searching for specific patterns. Note that this is somewhat less useful than in other regex engines, since a one-pass DFA by definition has no ambiguity about which pattern can match at a position. That is, if it were possible for two different patterns to match at the same starting position, then the multi-regex would not be one-pass and construction would have failed.

Nevertheless, this can still be useful if you only care about matches for a specific pattern, and want the DFA to report “no match” even if some other pattern would have matched.

Note that in order to make use of this functionality, Config::starts_for_each_pattern must be enabled. It is disabled by default since it may result in higher memory usage.

use regex_automata::{
    dfa::onepass::DFA, Anchored, Input, Match, PatternID,
};

let re = DFA::builder()
    .configure(DFA::config().starts_for_each_pattern(true))
    .build_many(&["[a-z]+", "[0-9]+"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "123abc";
let input = Input::new(haystack).anchored(Anchored::Yes);

// A normal multi-pattern search will show pattern 1 matches.
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());

// If we only want to report pattern 0 matches, then we'll get no
// match here.
let input = input.anchored(Anchored::Pattern(PatternID::must(0)));
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(None, caps.get_match());

This example shows how providing the bounds of a search can produce different results than simply sub-slicing the haystack.

use regex_automata::{dfa::onepass::DFA, Anchored, Input, Match};

// one-pass DFAs fully support Unicode word boundaries!
// A sad joke is that a Unicode aware regex like \w+\s is not one-pass.
// :-(
let re = DFA::new(r"\b[0-9]{3}\b")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "foo123bar";

// Since we sub-slice the haystack, the search doesn't know about
// the larger context and assumes that `123` is surrounded by word
// boundaries. And of course, the match position is reported relative
// to the sub-slice as well, which means we get `0..3` instead of
// `3..6`.
let expected = Some(Match::must(0, 0..3));
let input = Input::new(&haystack[3..6]).anchored(Anchored::Yes);
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(expected, caps.get_match());

// But if we provide the bounds of the search within the context of the
// entire haystack, then the search can take the surrounding context
// into account. (And if we did find a match, it would be reported
// as a valid offset into `haystack` instead of its sub-slice.)
let expected = None;
let input = Input::new(haystack).range(3..6).anchored(Anchored::Yes);
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(expected, caps.get_match());
source

pub fn try_search_slots( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Result<Option<PatternID>, MatchError>

Executes an anchored leftmost forward search and writes the spans of capturing groups that participated in a match into the provided slots, and returns the matching pattern ID. The contents of the slots for patterns other than the matching pattern are unspecified. If no match was found, then None is returned and the contents of all slots is unspecified.

This is like DFA::try_search, but it accepts a raw slots slice instead of a Captures value. This is useful in contexts where you don’t want or need to allocate a Captures.

It is legal to pass any number of slots to this routine. If the regex engine would otherwise write a slot offset that doesn’t fit in the provided slice, then it is simply skipped. In general though, there are usually three slice lengths you might want to use:

  • An empty slice, if you only care about which pattern matched.
  • A slice with pattern_len() * 2 slots, if you only care about the overall match spans for each matching pattern.
  • A slice with slot_len() slots, which permits recording match offsets for every capturing group in every pattern.
§Errors

This routine errors if the search could not complete. This can occur in the following circumstances:

When a search returns an error, callers cannot know whether a match exists or not.

§Example

This example shows how to find the overall match offsets in a multi-pattern search without allocating a Captures value. Indeed, we can put our slots right on the stack.

use regex_automata::{dfa::onepass::DFA, Anchored, Input, PatternID};

let re = DFA::new_many(&[
    r"[a-zA-Z]+",
    r"[0-9]+",
])?;
let mut cache = re.create_cache();
let input = Input::new("123").anchored(Anchored::Yes);

// We only care about the overall match offsets here, so we just
// allocate two slots for each pattern. Each slot records the start
// and end of the match.
let mut slots = [None; 4];
let pid = re.try_search_slots(&mut cache, &input, &mut slots)?;
assert_eq!(Some(PatternID::must(1)), pid);

// The overall match offsets are always at 'pid * 2' and 'pid * 2 + 1'.
// See 'GroupInfo' for more details on the mapping between groups and
// slot indices.
let slot_start = pid.unwrap().as_usize() * 2;
let slot_end = slot_start + 1;
assert_eq!(Some(0), slots[slot_start].map(|s| s.get()));
assert_eq!(Some(3), slots[slot_end].map(|s| s.get()));
source

fn try_search_slots_imp( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Result<Option<PatternID>, MatchError>

source§

impl DFA

source

fn search_imp( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>], ) -> Result<Option<PatternID>, MatchError>

source

fn find_match( &self, cache: &mut Cache, input: &Input<'_>, at: usize, sid: StateID, slots: &mut [Option<NonMaxUsize>], matched_pid: &mut Option<PatternID>, ) -> bool

Assumes ‘sid’ is a match state and looks for whether a match can be reported. If so, appropriate offsets are written to ‘slots’ and ‘matched_pid’ is set to the matching pattern ID.

Even when ‘sid’ is a match state, it’s possible that a match won’t be reported. For example, when the conditional epsilon transitions leading to the match state aren’t satisfied at the given position in the haystack.

source§

impl DFA

source

fn start(&self) -> StateID

Returns the anchored start state for matching any pattern in this DFA.

source

fn start_pattern(&self, pid: PatternID) -> Result<StateID, MatchError>

Returns the anchored start state for matching the given pattern. If ‘starts_for_each_pattern’ was not enabled, then this returns an error. If the given pattern is not in this DFA, then Ok(None) is returned.

source

fn transition(&self, sid: StateID, byte: u8) -> Transition

Returns the transition from the given state ID and byte of input. The transition includes the next state ID, the slots that should be saved and any conditional epsilon transitions that must be satisfied in order to take this transition.

source

fn set_transition(&mut self, sid: StateID, byte: u8, to: Transition)

Set the transition from the given state ID and byte of input to the transition given.

source

fn sparse_transitions(&self, sid: StateID) -> SparseTransitionIter<'_>

Return an iterator of “sparse” transitions for the given state ID. “sparse” in this context means that consecutive transitions that are equivalent are returned as one group, and transitions to the DEAD state are ignored.

This winds up being useful for debug printing, since it’s much terser to display runs of equivalent transitions than the transition for every possible byte value. Indeed, in practice, it’s very common for runs of equivalent transitions to appear.

source

fn pattern_epsilons(&self, sid: StateID) -> PatternEpsilons

Return the pattern epsilons for the given state ID.

If the given state ID does not correspond to a match state ID, then the pattern epsilons returned is empty.

source

fn set_pattern_epsilons(&mut self, sid: StateID, pateps: PatternEpsilons)

Set the pattern epsilons for the given state ID.

source

fn prev_state_id(&self, id: StateID) -> Option<StateID>

Returns the state ID prior to the one given. This returns None if the given ID is the first DFA state.

source

fn last_state_id(&self) -> StateID

Returns the state ID of the last state in this DFA’s transition table. “last” in this context means the last state to appear in memory, i.e., the one with the greatest ID.

source

pub(super) fn swap_states(&mut self, id1: StateID, id2: StateID)

Move the transitions from ‘id1’ to ‘id2’ and vice versa.

WARNING: This does not update the rest of the transition table to have transitions to ‘id1’ changed to ‘id2’ and vice versa. This merely moves the states in memory.

source

pub(super) fn remap(&mut self, map: impl Fn(StateID) -> StateID)

Map all state IDs in this DFA (transition table + start states) according to the closure given.

Trait Implementations§

source§

impl Clone for DFA

source§

fn clone(&self) -> DFA

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for DFA

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Remappable for DFA

source§

fn state_len(&self) -> usize

Return the total number of states.
source§

fn stride2(&self) -> usize

Return the power-of-2 exponent that yields the stride. The pertinent laws here are, where N=stride2: 2^N=stride and len(alphabet) <= stride.
source§

fn swap_states(&mut self, id1: StateID, id2: StateID)

Swap the states pointed to by the given IDs. The underlying finite state machine should be mutated such that all of the transitions in id1 are now in the memory region where the transitions for id2 were, and all of the transitions in id2 are now in the memory region where the transitions for id1 were. Read more
source§

fn remap(&mut self, map: impl Fn(StateID) -> StateID)

This must remap every single state ID in the underlying value according to the function given. For example, in a DFA, this should remap every transition and every starting state ID.

Auto Trait Implementations§

§

impl Freeze for DFA

§

impl RefUnwindSafe for DFA

§

impl Send for DFA

§

impl Sync for DFA

§

impl Unpin for DFA

§

impl UnwindSafe for DFA

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.