Struct regex_automata::hybrid::dfa::Config

source ·
pub struct Config {
    match_kind: Option<MatchKind>,
    pre: Option<Option<Prefilter>>,
    starts_for_each_pattern: Option<bool>,
    byte_classes: Option<bool>,
    unicode_word_boundary: Option<bool>,
    quitset: Option<ByteSet>,
    specialize_start_states: Option<bool>,
    cache_capacity: Option<usize>,
    skip_cache_capacity_check: Option<bool>,
    minimum_cache_clear_count: Option<Option<usize>>,
    minimum_bytes_per_state: Option<Option<usize>>,
}
Expand description

The configuration used for building a lazy DFA.

As a convenience, DFA::config is an alias for Config::new. The advantage of the former is that it often lets you avoid importing the Config type directly.

A lazy DFA configuration is a simple data object that is typically used with Builder::configure.

The default configuration guarantees that a search will never return a “gave up” or “quit” error, although it is possible for a search to fail if Config::starts_for_each_pattern wasn’t enabled (which it is not by default) and an Anchored::Pattern mode is requested via Input.

Fields§

§match_kind: Option<MatchKind>§pre: Option<Option<Prefilter>>§starts_for_each_pattern: Option<bool>§byte_classes: Option<bool>§unicode_word_boundary: Option<bool>§quitset: Option<ByteSet>§specialize_start_states: Option<bool>§cache_capacity: Option<usize>§skip_cache_capacity_check: Option<bool>§minimum_cache_clear_count: Option<Option<usize>>§minimum_bytes_per_state: Option<Option<usize>>

Implementations§

source§

impl Config

source

pub fn new() -> Config

Return a new default lazy DFA builder configuration.

source

pub fn match_kind(self, kind: MatchKind) -> Config

Set the desired match semantics.

The default is MatchKind::LeftmostFirst, which corresponds to the match semantics of Perl-like regex engines. That is, when multiple patterns would match at the same leftmost position, the pattern that appears first in the concrete syntax is chosen.

Currently, the only other kind of match semantics supported is MatchKind::All. This corresponds to classical DFA construction where all possible matches are added to the lazy DFA.

Typically, All is used when one wants to execute an overlapping search and LeftmostFirst otherwise. In particular, it rarely makes sense to use All with the various “leftmost” find routines, since the leftmost routines depend on the LeftmostFirst automata construction strategy. Specifically, LeftmostFirst adds dead states to the lazy DFA as a way to terminate the search and report a match. LeftmostFirst also supports non-greedy matches using this strategy where as All does not.

This example shows the typical use of MatchKind::All, which is to report overlapping matches.

use regex_automata::{
    hybrid::dfa::{DFA, OverlappingState},
    HalfMatch, Input, MatchKind,
};

let dfa = DFA::builder()
    .configure(DFA::config().match_kind(MatchKind::All))
    .build_many(&[r"\w+$", r"\S+$"])?;
let mut cache = dfa.create_cache();
let haystack = "@foo";
let mut state = OverlappingState::start();

let expected = Some(HalfMatch::must(1, 4));
dfa.try_search_overlapping_fwd(
    &mut cache, &Input::new(haystack), &mut state,
)?;
assert_eq!(expected, state.get_match());

// The first pattern also matches at the same position, so re-running
// the search will yield another match. Notice also that the first
// pattern is returned after the second. This is because the second
// pattern begins its match before the first, is therefore an earlier
// match and is thus reported first.
let expected = Some(HalfMatch::must(0, 4));
dfa.try_search_overlapping_fwd(
    &mut cache, &Input::new(haystack), &mut state,
)?;
assert_eq!(expected, state.get_match());
§Example: reverse automaton to find start of match

Another example for using MatchKind::All is for constructing a reverse automaton to find the start of a match. All semantics are used for this in order to find the longest possible match, which corresponds to the leftmost starting position.

Note that if you need the starting position then hybrid::regex::Regex will handle this for you, so it’s usually not necessary to do this yourself.

use regex_automata::{
    hybrid::dfa::DFA,
    nfa::thompson::NFA,
    Anchored, HalfMatch, Input, MatchKind,
};

let input = Input::new("123foobar456");
let pattern = r"[a-z]+r";

let dfa_fwd = DFA::new(pattern)?;
let dfa_rev = DFA::builder()
    .thompson(NFA::config().reverse(true))
    .configure(DFA::config().match_kind(MatchKind::All))
    .build(pattern)?;
let mut cache_fwd = dfa_fwd.create_cache();
let mut cache_rev = dfa_rev.create_cache();

let expected_fwd = HalfMatch::must(0, 9);
let expected_rev = HalfMatch::must(0, 3);
let got_fwd = dfa_fwd.try_search_fwd(&mut cache_fwd, &input)?.unwrap();
// Here we don't specify the pattern to search for since there's only
// one pattern and we're doing a leftmost search. But if this were an
// overlapping search, you'd need to specify the pattern that matched
// in the forward direction. (Otherwise, you might wind up finding the
// starting position of a match of some other pattern.) That in turn
// requires building the reverse automaton with starts_for_each_pattern
// enabled.
let input = input
    .clone()
    .range(..got_fwd.offset())
    .anchored(Anchored::Yes);
let got_rev = dfa_rev.try_search_rev(&mut cache_rev, &input)?.unwrap();
assert_eq!(expected_fwd, got_fwd);
assert_eq!(expected_rev, got_rev);
source

pub fn prefilter(self, pre: Option<Prefilter>) -> Config

Set a prefilter to be used whenever a start state is entered.

A Prefilter in this context is meant to accelerate searches by looking for literal prefixes that every match for the corresponding pattern (or patterns) must start with. Once a prefilter produces a match, the underlying search routine continues on to try and confirm the match.

Be warned that setting a prefilter does not guarantee that the search will be faster. While it’s usually a good bet, if the prefilter produces a lot of false positive candidates (i.e., positions matched by the prefilter but not by the regex), then the overall result can be slower than if you had just executed the regex engine without any prefilters.

Note that unless Config::specialize_start_states has been explicitly set, then setting this will also enable (when pre is Some) or disable (when pre is None) start state specialization. This occurs because without start state specialization, a prefilter is likely to be less effective. And without a prefilter, start state specialization is usually pointless.

By default no prefilter is set.

§Example
use regex_automata::{
    hybrid::dfa::DFA,
    util::prefilter::Prefilter,
    Input, HalfMatch, MatchKind,
};

let pre = Prefilter::new(MatchKind::LeftmostFirst, &["foo", "bar"]);
let re = DFA::builder()
    .configure(DFA::config().prefilter(pre))
    .build(r"(foo|bar)[a-z]+")?;
let mut cache = re.create_cache();
let input = Input::new("foo1 barfox bar");
assert_eq!(
    Some(HalfMatch::must(0, 11)),
    re.try_search_fwd(&mut cache, &input)?,
);

Be warned though that an incorrect prefilter can lead to incorrect results!

use regex_automata::{
    hybrid::dfa::DFA,
    util::prefilter::Prefilter,
    Input, HalfMatch, MatchKind,
};

let pre = Prefilter::new(MatchKind::LeftmostFirst, &["foo", "car"]);
let re = DFA::builder()
    .configure(DFA::config().prefilter(pre))
    .build(r"(foo|bar)[a-z]+")?;
let mut cache = re.create_cache();
let input = Input::new("foo1 barfox bar");
assert_eq!(
    // No match reported even though there clearly is one!
    None,
    re.try_search_fwd(&mut cache, &input)?,
);
source

pub fn starts_for_each_pattern(self, yes: bool) -> Config

Whether to compile a separate start state for each pattern in the lazy DFA.

When enabled, a separate anchored start state is added for each pattern in the lazy DFA. When this start state is used, then the DFA will only search for matches for the pattern specified, even if there are other patterns in the DFA.

The main downside of this option is that it can potentially increase the size of the DFA and/or increase the time it takes to build the DFA at search time. However, since this is configuration for a lazy DFA, these states aren’t actually built unless they’re used. Enabling this isn’t necessarily free, however, as it may result in higher cache usage.

There are a few reasons one might want to enable this (it’s disabled by default):

  1. When looking for the start of an overlapping match (using a reverse DFA), doing it correctly requires starting the reverse search using the starting state of the pattern that matched in the forward direction. Indeed, when building a Regex, it will automatically enable this option when building the reverse DFA internally.
  2. When you want to use a DFA with multiple patterns to both search for matches of any pattern or to search for anchored matches of one particular pattern while using the same DFA. (Otherwise, you would need to compile a new DFA for each pattern.)

By default this is disabled.

§Example

This example shows how to use this option to permit the same lazy DFA to run both general searches for any pattern and anchored searches for a specific pattern.

use regex_automata::{
    hybrid::dfa::DFA,
    Anchored, HalfMatch, Input, PatternID,
};

let dfa = DFA::builder()
    .configure(DFA::config().starts_for_each_pattern(true))
    .build_many(&[r"[a-z0-9]{6}", r"[a-z][a-z0-9]{5}"])?;
let mut cache = dfa.create_cache();
let haystack = "bar foo123";

// Here's a normal unanchored search that looks for any pattern.
let expected = HalfMatch::must(0, 10);
let input = Input::new(haystack);
assert_eq!(Some(expected), dfa.try_search_fwd(&mut cache, &input)?);
// We can also do a normal anchored search for any pattern. Since it's
// an anchored search, we position the start of the search where we
// know the match will begin.
let expected = HalfMatch::must(0, 10);
let input = Input::new(haystack).range(4..);
assert_eq!(Some(expected), dfa.try_search_fwd(&mut cache, &input)?);
// Since we compiled anchored start states for each pattern, we can
// also look for matches of other patterns explicitly, even if a
// different pattern would have normally matched.
let expected = HalfMatch::must(1, 10);
let input = Input::new(haystack)
    .range(4..)
    .anchored(Anchored::Pattern(PatternID::must(1)));
assert_eq!(Some(expected), dfa.try_search_fwd(&mut cache, &input)?);
source

pub fn byte_classes(self, yes: bool) -> Config

Whether to attempt to shrink the size of the lazy DFA’s alphabet or not.

This option is enabled by default and should never be disabled unless one is debugging the lazy DFA.

When enabled, the lazy DFA will use a map from all possible bytes to their corresponding equivalence class. Each equivalence class represents a set of bytes that does not discriminate between a match and a non-match in the DFA. For example, the pattern [ab]+ has at least two equivalence classes: a set containing a and b and a set containing every byte except for a and b. a and b are in the same equivalence classes because they never discriminate between a match and a non-match.

The advantage of this map is that the size of the transition table can be reduced drastically from #states * 256 * sizeof(LazyStateID) to #states * k * sizeof(LazyStateID) where k is the number of equivalence classes (rounded up to the nearest power of 2). As a result, total space usage can decrease substantially. Moreover, since a smaller alphabet is used, DFA compilation during search becomes faster as well since it will potentially be able to reuse a single transition for multiple bytes.

WARNING: This is only useful for debugging lazy DFAs. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.

source

pub fn unicode_word_boundary(self, yes: bool) -> Config

Heuristically enable Unicode word boundaries.

When set, this will attempt to implement Unicode word boundaries as if they were ASCII word boundaries. This only works when the search input is ASCII only. If a non-ASCII byte is observed while searching, then a MatchError::quit error is returned.

A possible alternative to enabling this option is to simply use an ASCII word boundary, e.g., via (?-u:\b). The main reason to use this option is if you absolutely need Unicode support. This option lets one use a fast search implementation (a DFA) for some potentially very common cases, while providing the option to fall back to some other regex engine to handle the general case when an error is returned.

If the pattern provided has no Unicode word boundary in it, then this option has no effect. (That is, quitting on a non-ASCII byte only occurs when this option is enabled and a Unicode word boundary is present in the pattern.)

This is almost equivalent to setting all non-ASCII bytes to be quit bytes. The only difference is that this will cause non-ASCII bytes to be quit bytes only when a Unicode word boundary is present in the pattern.

When enabling this option, callers must be prepared to handle a MatchError error during search. When using a Regex, this corresponds to using the try_ suite of methods. Alternatively, if callers can guarantee that their input is ASCII only, then a MatchError::quit error will never be returned while searching.

This is disabled by default.

§Example

This example shows how to heuristically enable Unicode word boundaries in a pattern. It also shows what happens when a search comes across a non-ASCII byte.

use regex_automata::{
    hybrid::dfa::DFA,
    HalfMatch, Input, MatchError,
};

let dfa = DFA::builder()
    .configure(DFA::config().unicode_word_boundary(true))
    .build(r"\b[0-9]+\b")?;
let mut cache = dfa.create_cache();

// The match occurs before the search ever observes the snowman
// character, so no error occurs.
let haystack = "foo 123  ☃";
let expected = Some(HalfMatch::must(0, 7));
let got = dfa.try_search_fwd(&mut cache, &Input::new(haystack))?;
assert_eq!(expected, got);

// Notice that this search fails, even though the snowman character
// occurs after the ending match offset. This is because search
// routines read one byte past the end of the search to account for
// look-around, and indeed, this is required here to determine whether
// the trailing \b matches.
let haystack = "foo 123 ☃";
let expected = MatchError::quit(0xE2, 8);
let got = dfa.try_search_fwd(&mut cache, &Input::new(haystack));
assert_eq!(Err(expected), got);

// Another example is executing a search where the span of the haystack
// we specify is all ASCII, but there is non-ASCII just before it. This
// correctly also reports an error.
let input = Input::new("β123").range(2..);
let expected = MatchError::quit(0xB2, 1);
let got = dfa.try_search_fwd(&mut cache, &input);
assert_eq!(Err(expected), got);

// And similarly for the trailing word boundary.
let input = Input::new("123β").range(..3);
let expected = MatchError::quit(0xCE, 3);
let got = dfa.try_search_fwd(&mut cache, &input);
assert_eq!(Err(expected), got);
source

pub fn quit(self, byte: u8, yes: bool) -> Config

Add a “quit” byte to the lazy DFA.

When a quit byte is seen during search time, then search will return a MatchError::quit error indicating the offset at which the search stopped.

A quit byte will always overrule any other aspects of a regex. For example, if the x byte is added as a quit byte and the regex \w is used, then observing x will cause the search to quit immediately despite the fact that x is in the \w class.

This mechanism is primarily useful for heuristically enabling certain features like Unicode word boundaries in a DFA. Namely, if the input to search is ASCII, then a Unicode word boundary can be implemented via an ASCII word boundary with no change in semantics. Thus, a DFA can attempt to match a Unicode word boundary but give up as soon as it observes a non-ASCII byte. Indeed, if callers set all non-ASCII bytes to be quit bytes, then Unicode word boundaries will be permitted when building lazy DFAs. Of course, callers should enable Config::unicode_word_boundary if they want this behavior instead. (The advantage being that non-ASCII quit bytes will only be added if a Unicode word boundary is in the pattern.)

When enabling this option, callers must be prepared to handle a MatchError error during search. When using a Regex, this corresponds to using the try_ suite of methods.

By default, there are no quit bytes set.

§Panics

This panics if heuristic Unicode word boundaries are enabled and any non-ASCII byte is removed from the set of quit bytes. Namely, enabling Unicode word boundaries requires setting every non-ASCII byte to a quit byte. So if the caller attempts to undo any of that, then this will panic.

§Example

This example shows how to cause a search to terminate if it sees a \n byte. This could be useful if, for example, you wanted to prevent a user supplied pattern from matching across a line boundary.

use regex_automata::{hybrid::dfa::DFA, MatchError, Input};

let dfa = DFA::builder()
    .configure(DFA::config().quit(b'\n', true))
    .build(r"foo\p{any}+bar")?;
let mut cache = dfa.create_cache();

let haystack = "foo\nbar";
// Normally this would produce a match, since \p{any} contains '\n'.
// But since we instructed the automaton to enter a quit state if a
// '\n' is observed, this produces a match error instead.
let expected = MatchError::quit(b'\n', 3);
let got = dfa.try_search_fwd(
    &mut cache,
    &Input::new(haystack),
).unwrap_err();
assert_eq!(expected, got);
source

pub fn specialize_start_states(self, yes: bool) -> Config

Enable specializing start states in the lazy DFA.

When start states are specialized, an implementor of a search routine using a lazy DFA can tell when the search has entered a starting state. When start states aren’t specialized, then it is impossible to know whether the search has entered a start state.

Ideally, this option wouldn’t need to exist and we could always specialize start states. The problem is that start states can be quite active. This in turn means that an efficient search routine is likely to ping-pong between a heavily optimized hot loop that handles most states and to a less optimized specialized handling of start states. This causes branches to get heavily mispredicted and overall can materially decrease throughput. Therefore, specializing start states should only be enabled when it is needed.

Knowing whether a search is in a start state is typically useful when a prefilter is active for the search. A prefilter is typically only run when in a start state and a prefilter can greatly accelerate a search. Therefore, the possible cost of specializing start states is worth it in this case. Otherwise, if you have no prefilter, there is likely no reason to specialize start states.

This is disabled by default, but note that it is automatically enabled (or disabled) if Config::prefilter is set. Namely, unless specialize_start_states has already been set, Config::prefilter will automatically enable or disable it based on whether a prefilter is present or not, respectively. This is done because a prefilter’s effectiveness is rooted in being executed whenever the DFA is in a start state, and that’s only possible to do when they are specialized.

Note that it is plausibly reasonable to disable this option explicitly while enabling a prefilter. In that case, a prefilter will still be run at the beginning of a search, but never again. This in theory could strike a good balance if you’re in a situation where a prefilter is likely to produce many false positive candidates.

§Example

This example shows how to enable start state specialization and then shows how to check whether a state is a start state or not.

use regex_automata::{hybrid::dfa::DFA, MatchError, Input};

let dfa = DFA::builder()
    .configure(DFA::config().specialize_start_states(true))
    .build(r"[a-z]+")?;
let mut cache = dfa.create_cache();

let haystack = "123 foobar 4567".as_bytes();
let sid = dfa.start_state_forward(&mut cache, &Input::new(haystack))?;
// The ID returned by 'start_state_forward' will always be tagged as
// a start state when start state specialization is enabled.
assert!(sid.is_tagged());
assert!(sid.is_start());

Compare the above with the default lazy DFA configuration where start states are not specialized. In this case, the start state is not tagged and sid.is_start() returns false.

use regex_automata::{hybrid::dfa::DFA, MatchError, Input};

let dfa = DFA::new(r"[a-z]+")?;
let mut cache = dfa.create_cache();

let haystack = "123 foobar 4567".as_bytes();
let sid = dfa.start_state_forward(&mut cache, &Input::new(haystack))?;
// Start states are not tagged in the default configuration!
assert!(!sid.is_tagged());
assert!(!sid.is_start());
source

pub fn cache_capacity(self, bytes: usize) -> Config

Sets the maximum amount of heap memory, in bytes, to allocate to the cache for use during a lazy DFA search. If the lazy DFA would otherwise use more heap memory, then, depending on other configuration knobs, either stop the search and return an error or clear the cache and continue the search.

The default cache capacity is some “reasonable” number that will accommodate most regular expressions. You may find that if you need to build a large DFA then it may be necessary to increase the cache capacity.

Note that while building a lazy DFA will do a “minimum” check to ensure the capacity is big enough, this is more or less about correctness. If the cache is bigger than the minimum but still “too small,” then the lazy DFA could wind up spending a lot of time clearing the cache and recomputing transitions, thus negating the performance benefits of a lazy DFA. Thus, setting the cache capacity is mostly an experimental endeavor. For most common patterns, however, the default should be sufficient.

For more details on how the lazy DFA’s cache is used, see the documentation for Cache.

§Example

This example shows what happens if the configured cache capacity is too small. In such cases, one can override the cache capacity to make it bigger. Alternatively, one might want to use less memory by setting a smaller cache capacity.

use regex_automata::{hybrid::dfa::DFA, HalfMatch, Input};

let pattern = r"\p{L}{1000}";

// The default cache capacity is likely too small to deal with regexes
// that are very large. Large repetitions of large Unicode character
// classes are a common way to make very large regexes.
let _ = DFA::new(pattern).unwrap_err();
// Bump up the capacity to something bigger.
let dfa = DFA::builder()
    .configure(DFA::config().cache_capacity(100 * (1<<20))) // 100 MB
    .build(pattern)?;
let mut cache = dfa.create_cache();

let haystack = "ͰͲͶͿΆΈΉΊΌΎΏΑΒΓΔΕΖΗΘΙ".repeat(50);
let expected = Some(HalfMatch::must(0, 2000));
let got = dfa.try_search_fwd(&mut cache, &Input::new(&haystack))?;
assert_eq!(expected, got);
source

pub fn skip_cache_capacity_check(self, yes: bool) -> Config

Configures construction of a lazy DFA to use the minimum cache capacity if the configured capacity is otherwise too small for the provided NFA.

This is useful if you never want lazy DFA construction to fail because of a capacity that is too small.

In general, this option is typically not a good idea. In particular, while a minimum cache capacity does permit the lazy DFA to function where it otherwise couldn’t, it’s plausible that it may not function well if it’s constantly running out of room. In that case, the speed advantages of the lazy DFA may be negated. On the other hand, the “minimum” cache capacity computed may not be completely accurate and could actually be bigger than what is really necessary. Therefore, it is plausible that using the minimum cache capacity could still result in very good performance.

This is disabled by default.

§Example

This example shows what happens if the configured cache capacity is too small. In such cases, one could override the capacity explicitly. An alternative, demonstrated here, let’s us force construction to use the minimum cache capacity if the configured capacity is otherwise too small.

use regex_automata::{hybrid::dfa::DFA, HalfMatch, Input};

let pattern = r"\p{L}{1000}";

// The default cache capacity is likely too small to deal with regexes
// that are very large. Large repetitions of large Unicode character
// classes are a common way to make very large regexes.
let _ = DFA::new(pattern).unwrap_err();
// Configure construction such it automatically selects the minimum
// cache capacity if it would otherwise be too small.
let dfa = DFA::builder()
    .configure(DFA::config().skip_cache_capacity_check(true))
    .build(pattern)?;
let mut cache = dfa.create_cache();

let haystack = "ͰͲͶͿΆΈΉΊΌΎΏΑΒΓΔΕΖΗΘΙ".repeat(50);
let expected = Some(HalfMatch::must(0, 2000));
let got = dfa.try_search_fwd(&mut cache, &Input::new(&haystack))?;
assert_eq!(expected, got);
source

pub fn minimum_cache_clear_count(self, min: Option<usize>) -> Config

Configure a lazy DFA search to quit after a certain number of cache clearings.

When a minimum is set, then a lazy DFA search will possibly “give up” after the minimum number of cache clearings has occurred. This is typically useful in scenarios where callers want to detect whether the lazy DFA search is “efficient” or not. If the cache is cleared too many times, this is a good indicator that it is not efficient, and thus, the caller may wish to use some other regex engine.

Note that the number of times a cache is cleared is a property of the cache itself. Thus, if a cache is used in a subsequent search with a similarly configured lazy DFA, then it could cause the search to “give up” if the cache needed to be cleared, depending on its internal count and configured minimum. The cache clear count can only be reset to 0 via DFA::reset_cache (or Regex::reset_cache if you’re using the Regex API).

By default, no minimum is configured. Thus, a lazy DFA search will never give up due to cache clearings. If you do set this option, you might consider also setting Config::minimum_bytes_per_state in order for the lazy DFA to take efficiency into account before giving up.

§Example

This example uses a somewhat pathological configuration to demonstrate the possible behavior of cache clearing and how it might result in a search that returns an error.

It is important to note that the precise mechanics of how and when a cache gets cleared is an implementation detail.

use regex_automata::{hybrid::dfa::DFA, Input, MatchError, MatchErrorKind};

// This is a carefully chosen regex. The idea is to pick one
// that requires some decent number of states (hence the bounded
// repetition). But we specifically choose to create a class with an
// ASCII letter and a non-ASCII letter so that we can check that no new
// states are created once the cache is full. Namely, if we fill up the
// cache on a haystack of 'a's, then in order to match one 'β', a new
// state will need to be created since a 'β' is encoded with multiple
// bytes. Since there's no room for this state, the search should quit
// at the very first position.
let pattern = r"[aβ]{100}";
let dfa = DFA::builder()
    .configure(
        // Configure it so that we have the minimum cache capacity
        // possible. And that if any clearings occur, the search quits.
        DFA::config()
            .skip_cache_capacity_check(true)
            .cache_capacity(0)
            .minimum_cache_clear_count(Some(0)),
    )
    .build(pattern)?;
let mut cache = dfa.create_cache();

// Our search will give up before reaching the end!
let haystack = "a".repeat(101).into_bytes();
let result = dfa.try_search_fwd(&mut cache, &Input::new(&haystack));
assert!(matches!(
    *result.unwrap_err().kind(),
    MatchErrorKind::GaveUp { .. },
));

// Now that we know the cache is full, if we search a haystack that we
// know will require creating at least one new state, it should not
// be able to make much progress.
let haystack = "β".repeat(101).into_bytes();
let result = dfa.try_search_fwd(&mut cache, &Input::new(&haystack));
assert!(matches!(
    *result.unwrap_err().kind(),
    MatchErrorKind::GaveUp { .. },
));

// If we reset the cache, then we should be able to create more states
// and make more progress with searching for betas.
cache.reset(&dfa);
let haystack = "β".repeat(101).into_bytes();
let result = dfa.try_search_fwd(&mut cache, &Input::new(&haystack));
assert!(matches!(
    *result.unwrap_err().kind(),
    MatchErrorKind::GaveUp { .. },
));

// ... switching back to ASCII still makes progress since it just needs
// to set transitions on existing states!
let haystack = "a".repeat(101).into_bytes();
let result = dfa.try_search_fwd(&mut cache, &Input::new(&haystack));
assert!(matches!(
    *result.unwrap_err().kind(),
    MatchErrorKind::GaveUp { .. },
));
source

pub fn minimum_bytes_per_state(self, min: Option<usize>) -> Config

Configure a lazy DFA search to quit only when its efficiency drops below the given minimum.

The efficiency of the cache is determined by the number of DFA states compiled per byte of haystack searched. For example, if the efficiency is 2, then it means the lazy DFA is creating a new DFA state after searching approximately 2 bytes in a haystack. Generally speaking, 2 is quite bad and it’s likely that even a slower regex engine like the PikeVM would be faster.

This has no effect if Config::minimum_cache_clear_count is not set. Namely, this option only kicks in when the cache has been cleared more than the minimum number. If no minimum is set, then the cache is simply cleared whenever it fills up and it is impossible for the lazy DFA to quit due to ineffective use of the cache.

In general, if one is setting Config::minimum_cache_clear_count, then one should probably also set this knob as well. The reason is that the absolute number of times the cache is cleared is generally not a great predictor of efficiency. For example, if a new DFA state is created for every 1,000 bytes searched, then it wouldn’t be hard for the cache to get cleared more than N times and then cause the lazy DFA to quit. But a new DFA state every 1,000 bytes is likely quite good from a performance perspective, and it’s likely that the lazy DFA should continue searching, even if it requires clearing the cache occasionally.

Finally, note that if you’re implementing your own lazy DFA search routine and also want this efficiency check to work correctly, then you’ll need to use the following routines to record search progress:

  • Call Cache::search_start at the beginning of every search.
  • Call Cache::search_update whenever DFA::next_state is called.
  • Call Cache::search_finish before completing a search. (It is not strictly necessary to call this when an error is returned, as Cache::search_start will automatically finish the previous search for you. But calling it where possible before returning helps improve the accuracy of how many bytes have actually been searched.)
source

pub fn get_match_kind(&self) -> MatchKind

Returns the match semantics set in this configuration.

source

pub fn get_prefilter(&self) -> Option<&Prefilter>

Returns the prefilter set in this configuration, if one at all.

source

pub fn get_starts_for_each_pattern(&self) -> bool

Returns whether this configuration has enabled anchored starting states for every pattern in the DFA.

source

pub fn get_byte_classes(&self) -> bool

Returns whether this configuration has enabled byte classes or not. This is typically a debugging oriented option, as disabling it confers no speed benefit.

source

pub fn get_unicode_word_boundary(&self) -> bool

Returns whether this configuration has enabled heuristic Unicode word boundary support. When enabled, it is possible for a search to return an error.

source

pub fn get_quit(&self, byte: u8) -> bool

Returns whether this configuration will instruct the lazy DFA to enter a quit state whenever the given byte is seen during a search. When at least one byte has this enabled, it is possible for a search to return an error.

source

pub fn get_specialize_start_states(&self) -> bool

Returns whether this configuration will instruct the lazy DFA to “specialize” start states. When enabled, the lazy DFA will tag start states so that search routines using the lazy DFA can detect when it’s in a start state and do some kind of optimization (like run a prefilter).

source

pub fn get_cache_capacity(&self) -> usize

Returns the cache capacity set on this configuration.

source

pub fn get_skip_cache_capacity_check(&self) -> bool

Returns whether the cache capacity check should be skipped.

source

pub fn get_minimum_cache_clear_count(&self) -> Option<usize>

Returns, if set, the minimum number of times the cache must be cleared before a lazy DFA search can give up. When no minimum is set, then a search will never quit and will always clear the cache whenever it fills up.

source

pub fn get_minimum_bytes_per_state(&self) -> Option<usize>

Returns, if set, the minimum number of bytes per state that need to be processed in order for the lazy DFA to keep going. If the minimum falls below this number (and the cache has been cleared a minimum number of times), then the lazy DFA will return a “gave up” error.

source

pub fn get_minimum_cache_capacity(&self, nfa: &NFA) -> Result<usize, BuildError>

Returns the minimum lazy DFA cache capacity required for the given NFA.

The cache capacity required for a particular NFA may change without notice. Callers should not rely on it being stable.

This is useful for informational purposes, but can also be useful for other reasons. For example, if one wants to check the minimum cache capacity themselves or if one wants to set the capacity based on the minimum.

This may return an error if this configuration does not support all of the instructions used in the given NFA. For example, if the NFA has a Unicode word boundary but this configuration does not enable heuristic support for Unicode word boundaries.

source

fn byte_classes_from_nfa(&self, nfa: &NFA, quit: &ByteSet) -> ByteClasses

Returns the byte class map used during search from the given NFA.

If byte classes are disabled on this configuration, then a map is returned that puts each byte in its own equivalent class.

source

fn quit_set_from_nfa(&self, nfa: &NFA) -> Result<ByteSet, BuildError>

Return the quit set for this configuration and the given NFA.

This may return an error if the NFA is incompatible with this configuration’s quit set. For example, if the NFA has a Unicode word boundary and the quit set doesn’t include non-ASCII bytes.

source

fn overwrite(&self, o: Config) -> Config

Overwrite the default configuration such that the options in o are always used. If an option in o is not set, then the corresponding option in self is used. If it’s not set in self either, then it remains not set.

Trait Implementations§

source§

impl Clone for Config

source§

fn clone(&self) -> Config

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for Config

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Default for Config

source§

fn default() -> Config

Returns the “default value” for a type. Read more

Auto Trait Implementations§

§

impl Freeze for Config

§

impl RefUnwindSafe for Config

§

impl Send for Config

§

impl Sync for Config

§

impl Unpin for Config

§

impl UnwindSafe for Config

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.