Struct AhoCorasickBuilder

Help

pub struct AhoCorasickBuilder {
    nfa_noncontiguous: Builder,
    nfa_contiguous: Builder,
    dfa: Builder,
    kind: Option<AhoCorasickKind>,
    start_kind: StartKind,
}

Expand description

A builder for configuring an Aho-Corasick automaton.

§Quick advice

Use AhoCorasickBuilder::match_kind to configure your searcher with MatchKind::LeftmostFirst if you want to match how backtracking regex engines execute searches for pat1|pat2|..|patN. Use MatchKind::LeftmostLongest if you want to match how POSIX regex engines do it.
If you need an anchored search, use AhoCorasickBuilder::start_kind to set the StartKind::Anchored mode since StartKind::Unanchored is the default. Or just use StartKind::Both to support both types of searches.
You might want to use AhoCorasickBuilder::kind to set your searcher to always use a AhoCorasickKind::DFA if search speed is critical and memory usage isn’t a concern. Otherwise, not setting a kind will probably make the right choice for you. Beware that if you use StartKind::Both to build a searcher that supports both unanchored and anchored searches and you set AhoCorasickKind::DFA, then the DFA will essentially be duplicated to support both simultaneously. This results in very high memory usage.
For all other options, their defaults are almost certainly what you want.

Fields§

§nfa_noncontiguous: Builder§nfa_contiguous: Builder§dfa: Builder§kind: Option<AhoCorasickKind>§start_kind: StartKind

Implementations§

Source §

impl AhoCorasickBuilder

Source

pub fn new() -> AhoCorasickBuilder

Create a new builder for configuring an Aho-Corasick automaton.

The builder provides a way to configure a number of things, including ASCII case insensitivity and what kind of match semantics are used.

Source

pub fn build<I, P>(&self, patterns: I) -> Result<AhoCorasick, BuildError>
where I: IntoIterator<Item = P>, P: AsRef<[u8]>,

Build an Aho-Corasick automaton using the configuration set on this builder.

A builder may be reused to create more automatons.

§Examples

Basic usage:

use aho_corasick::{AhoCorasickBuilder, PatternID};

let patterns = &["foo", "bar", "baz"];
let ac = AhoCorasickBuilder::new().build(patterns).unwrap();
assert_eq!(
    Some(PatternID::must(1)),
    ac.find("xxx bar xxx").map(|m| m.pattern()),
);

Source

fn build_auto(&self, nfa: NFA) -> (Arc<dyn AcAutomaton>, AhoCorasickKind)

Implements the automatic selection logic for the Aho-Corasick implementation to use. Since all Aho-Corasick automatons are built from a non-contiguous NFA, the caller is responsible for building that first.

Source

pub fn match_kind(&mut self, kind: MatchKind) -> &mut AhoCorasickBuilder

Set the desired match semantics.

The default is MatchKind::Standard, which corresponds to the match semantics supported by the standard textbook description of the Aho-Corasick algorithm. Namely, matches are reported as soon as they are found. Moreover, this is the only way to get overlapping matches or do stream searching.

The other kinds of match semantics that are supported are MatchKind::LeftmostFirst and MatchKind::LeftmostLongest. The former corresponds to the match you would get if you were to try to match each pattern at each position in the haystack in the same order that you give to the automaton. That is, it returns the leftmost match corresponding to the earliest pattern given to the automaton. The latter corresponds to finding the longest possible match among all leftmost matches.

For more details on match semantics, see the documentation for MatchKind.

Note that setting this to MatchKind::LeftmostFirst or MatchKind::LeftmostLongest will cause some search routines on AhoCorasick to return an error (or panic if you’re using the infallible API). Notably, this includes stream and overlapping searches.

§Examples

In these examples, we demonstrate the differences between match semantics for a particular set of patterns in a specific order: b, abc, abcd.

Standard semantics:

use aho_corasick::{AhoCorasick, MatchKind};

let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";

let ac = AhoCorasick::builder()
    .match_kind(MatchKind::Standard) // default, not necessary
    .build(patterns)
    .unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("b", &haystack[mat.start()..mat.end()]);

Leftmost-first semantics:

use aho_corasick::{AhoCorasick, MatchKind};

let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";

let ac = AhoCorasick::builder()
    .match_kind(MatchKind::LeftmostFirst)
    .build(patterns)
    .unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abc", &haystack[mat.start()..mat.end()]);

Leftmost-longest semantics:

use aho_corasick::{AhoCorasick, MatchKind};

let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";

let ac = AhoCorasick::builder()
    .match_kind(MatchKind::LeftmostLongest)
    .build(patterns)
    .unwrap();
let mat = ac.find(haystack).expect("should have a match");
assert_eq!("abcd", &haystack[mat.start()..mat.end()]);

Source

pub fn start_kind(&mut self, kind: StartKind) -> &mut AhoCorasickBuilder

Sets the starting state configuration for the automaton.

Every Aho-Corasick automaton is capable of having two start states: one that is used for unanchored searches and one that is used for anchored searches. Some automatons, like the NFAs, support this with almost zero additional cost. Other automatons, like the DFA, require two copies of the underlying transition table to support both simultaneously.

Because there may be an added non-trivial cost to supporting both, it is possible to configure which starting state configuration is needed.

Indeed, since anchored searches tend to be somewhat more rare, only unanchored searches are supported by default. Thus, StartKind::Unanchored is the default.

Note that when this is set to StartKind::Unanchored, then running an anchored search will result in an error (or a panic if using the infallible APIs). Similarly, when this is set to StartKind::Anchored, then running an unanchored search will result in an error (or a panic if using the infallible APIs). When StartKind::Both is used, then both unanchored and anchored searches are always supported.

Also note that even if an AhoCorasick searcher is using an NFA internally (which always supports both unanchored and anchored searches), an error will still be reported for a search that isn’t supported by the configuration set via this method. This means, for example, that an error is never dependent on which internal implementation of Aho-Corasick is used.

§Example: anchored search

This shows how to build a searcher that only supports anchored searches:

use aho_corasick::{
    AhoCorasick, Anchored, Input, Match, MatchKind, StartKind,
};

let ac = AhoCorasick::builder()
    .match_kind(MatchKind::LeftmostFirst)
    .start_kind(StartKind::Anchored)
    .build(&["b", "abc", "abcd"])
    .unwrap();

// An unanchored search is not supported! An error here is guaranteed
// given the configuration above regardless of which kind of
// Aho-Corasick implementation ends up being used internally.
let input = Input::new("foo abcd").anchored(Anchored::No);
assert!(ac.try_find(input).is_err());

let input = Input::new("foo abcd").anchored(Anchored::Yes);
assert_eq!(None, ac.try_find(input)?);

let input = Input::new("abcd").anchored(Anchored::Yes);
assert_eq!(Some(Match::must(1, 0..3)), ac.try_find(input)?);

§Example: unanchored and anchored searches

This shows how to build a searcher that supports both unanchored and anchored searches:

use aho_corasick::{
    AhoCorasick, Anchored, Input, Match, MatchKind, StartKind,
};

let ac = AhoCorasick::builder()
    .match_kind(MatchKind::LeftmostFirst)
    .start_kind(StartKind::Both)
    .build(&["b", "abc", "abcd"])
    .unwrap();

let input = Input::new("foo abcd").anchored(Anchored::No);
assert_eq!(Some(Match::must(1, 4..7)), ac.try_find(input)?);

let input = Input::new("foo abcd").anchored(Anchored::Yes);
assert_eq!(None, ac.try_find(input)?);

let input = Input::new("abcd").anchored(Anchored::Yes);
assert_eq!(Some(Match::must(1, 0..3)), ac.try_find(input)?);

Source

pub fn ascii_case_insensitive(&mut self, yes: bool) -> &mut AhoCorasickBuilder

Enable ASCII-aware case insensitive matching.

When this option is enabled, searching will be performed without respect to case for ASCII letters (a-z and A-Z) only.

Enabling this option does not change the search algorithm, but it may increase the size of the automaton.

NOTE: It is unlikely that support for Unicode case folding will be added in the future. The ASCII case works via a simple hack to the underlying automaton, but full Unicode handling requires a fair bit of sophistication. If you do need Unicode handling, you might consider using the regex crate or the lower level regex-automata crate.

§Examples

Basic usage:

use aho_corasick::AhoCorasick;

let patterns = &["FOO", "bAr", "BaZ"];
let haystack = "foo bar baz";

let ac = AhoCorasick::builder()
    .ascii_case_insensitive(true)
    .build(patterns)
    .unwrap();
assert_eq!(3, ac.find_iter(haystack).count());

Source

pub fn kind(&mut self, kind: Option<AhoCorasickKind>) -> &mut AhoCorasickBuilder

Choose the type of underlying automaton to use.

Currently, there are four choices:

AhoCorasickKind::NoncontiguousNFA instructs the searcher to use a noncontiguous::NFA. A noncontiguous NFA is the fastest to be built, has moderate memory usage and is typically the slowest to execute a search.
AhoCorasickKind::ContiguousNFA instructs the searcher to use a contiguous::NFA. A contiguous NFA is a little slower to build than a noncontiguous NFA, has excellent memory usage and is typically a little slower than a DFA for a search.
AhoCorasickKind::DFA instructs the searcher to use a dfa::DFA. A DFA is very slow to build, uses exorbitant amounts of memory, but will typically execute searches the fastest.
None (the default) instructs the searcher to choose the “best” Aho-Corasick implementation. This choice is typically based primarily on the number of patterns.

Setting this configuration does not change the time complexity for constructing the Aho-Corasick automaton (which is O(p) where p is the total number of patterns being compiled). Setting this to AhoCorasickKind::DFA does however reduce the time complexity of non-overlapping searches from O(n + p) to O(n), where n is the length of the haystack.

In general, you should probably stick to the default unless you have some kind of reason to use a specific Aho-Corasick implementation. For example, you might choose AhoCorasickKind::DFA if you don’t care about memory usage and want the fastest possible search times.

Setting this guarantees that the searcher returned uses the chosen implementation. If that implementation could not be constructed, then an error will be returned. In contrast, when None is used, it is possible for it to attempt to construct, for example, a contiguous NFA and have it fail. In which case, it will fall back to using a noncontiguous NFA.

If None is given, then one may use AhoCorasick::kind to determine which Aho-Corasick implementation was chosen.

Note that the heuristics used for choosing which AhoCorasickKind may be changed in a semver compatible release.

Source

pub fn prefilter(&mut self, yes: bool) -> &mut AhoCorasickBuilder

Enable heuristic prefilter optimizations.

When enabled, searching will attempt to quickly skip to match candidates using specialized literal search routines. A prefilter cannot always be used, and is generally treated as a heuristic. It can be useful to disable this if the prefilter is observed to be sub-optimal for a particular workload.

Currently, prefilters are typically only active when building searchers with a small (less than 100) number of patterns.

This is enabled by default.

Source

pub fn dense_depth(&mut self, depth: usize) -> &mut AhoCorasickBuilder

Set the limit on how many states use a dense representation for their transitions. Other states will generally use a sparse representation.

A dense representation uses more memory but is generally faster, since the next transition in a dense representation can be computed in a constant number of instructions. A sparse representation uses less memory but is generally slower, since the next transition in a sparse representation requires executing a variable number of instructions.

This setting is only used when an Aho-Corasick implementation is used that supports the dense versus sparse representation trade off. Not all do.

This limit is expressed in terms of the depth of a state, i.e., the number of transitions from the starting state of the automaton. The idea is that most of the time searching will be spent near the starting state of the automaton, so states near the start state should use a dense representation. States further away from the start state would then use a sparse representation.

By default, this is set to a low but non-zero number. Setting this to 0 is almost never what you want, since it is likely to make searches very slow due to the start state itself being forced to use a sparse representation. However, it is unlikely that increasing this number will help things much, since the most active states have a small depth. More to the point, the memory usage increases superlinearly as this number increases.

Source

pub fn byte_classes(&mut self, yes: bool) -> &mut AhoCorasickBuilder

A debug settting for whether to attempt to shrink the size of the automaton’s alphabet or not.

This option is enabled by default and should never be disabled unless one is debugging the underlying automaton.

When enabled, some (but not all) Aho-Corasick automatons will use a map from all possible bytes to their corresponding equivalence class. Each equivalence class represents a set of bytes that does not discriminate between a match and a non-match in the automaton.

The advantage of this map is that the size of the transition table can be reduced drastically from #states * 256 * sizeof(u32) to #states * k * sizeof(u32) where k is the number of equivalence classes (rounded up to the nearest power of 2). As a result, total space usage can decrease substantially. Moreover, since a smaller alphabet is used, automaton compilation becomes faster as well.

WARNING: This is only useful for debugging automatons. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.