Struct aho_corasick::nfa::noncontiguous::NFA

source ·
pub struct NFA {
    match_kind: MatchKind,
    states: Vec<State>,
    sparse: Vec<Transition>,
    dense: Vec<StateID>,
    matches: Vec<Match>,
    pattern_lens: Vec<SmallIndex>,
    prefilter: Option<Prefilter>,
    byte_classes: ByteClasses,
    min_pattern_len: usize,
    max_pattern_len: usize,
    special: Special,
}
Expand description

A noncontiguous NFA implementation of Aho-Corasick.

When possible, prefer using AhoCorasick instead of this type directly. Using an NFA directly is typically only necessary when one needs access to the Automaton trait implementation.

This NFA represents the “core” implementation of Aho-Corasick in this crate. Namely, constructing this NFA involving building a trie and then filling in the failure transitions between states, similar to what is described in any standard textbook description of Aho-Corasick.

In order to minimize heap usage and to avoid additional construction costs, this implementation represents the transitions of all states as distinct sparse memory allocations. This is where it gets its name from. That is, this NFA has no contiguous memory allocation for its transition table. Each state gets its own allocation.

While the sparse representation keeps memory usage to somewhat reasonable levels, it is still quite large and also results in somewhat mediocre search performance. For this reason, it is almost always a good idea to use a contiguous::NFA instead. It is marginally slower to build, but has higher throughput and can sometimes use an order of magnitude less memory. The main reason to use a noncontiguous NFA is when you need the fastest possible construction time, or when a contiguous NFA does not have the desired capacity. (The total number of NFA states it can have is fewer than a noncontiguous NFA.)

§Example

This example shows how to build an NFA directly and use it to execute Automaton::try_find:

use aho_corasick::{
    automaton::Automaton,
    nfa::noncontiguous::NFA,
    Input, Match,
};

let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";

let nfa = NFA::new(patterns).unwrap();
assert_eq!(
    Some(Match::must(0, 1..2)),
    nfa.try_find(&Input::new(haystack))?,
);

It is also possible to implement your own version of try_find. See the Automaton documentation for an example.

Fields§

§match_kind: MatchKind

The match semantics built into this NFA.

§states: Vec<State>

A set of states. Each state defines its own transitions, a fail transition and a set of indices corresponding to matches.

The first state is always the fail state, which is used only as a sentinel. Namely, in the final NFA, no transition into the fail state exists. (Well, they do, but they aren’t followed. Instead, the state’s failure transition is followed.)

The second state (index 1) is always the dead state. Dead states are in every automaton, but only used when leftmost-{first,longest} match semantics are enabled. Specifically, they instruct search to stop at specific points in order to report the correct match location. In the standard Aho-Corasick construction, there are no transitions to the dead state.

The third state (index 2) is generally intended to be the starting or “root” state.

§sparse: Vec<Transition>

Transitions stored in a sparse representation via a linked list.

Each transition contains three pieces of information: the byte it is defined for, the state it transitions to and a link to the next transition in the same state (or StateID::ZERO if it is the last transition).

The first transition for each state is determined by State::sparse.

Note that this contains a complete set of all transitions in this NFA, including states that have a dense representation for transitions. (Adding dense transitions for a state doesn’t remove its sparse transitions, since deleting transitions from this particular sparse representation would be fairly expensive.)

§dense: Vec<StateID>

Transitions stored in a dense representation.

A state has a row in this table if and only if State::dense is not equal to StateID::ZERO. When not zero, there are precisely NFA::byte_classes::alphabet_len() entries beginning at State::dense in this table.

Generally a very small minority of states have a dense representation since it uses so much memory.

§matches: Vec<Match>

Matches stored in linked list for each state.

Like sparse transitions, each match has a link to the next match in the state.

The first match for each state is determined by State::matches.

§pattern_lens: Vec<SmallIndex>

The length, in bytes, of each pattern in this NFA. This slice is indexed by PatternID.

The number of entries in this vector corresponds to the total number of patterns in this automaton.

§prefilter: Option<Prefilter>

A prefilter for quickly skipping to candidate matches, if pertinent.

§byte_classes: ByteClasses

A set of equivalence classes in terms of bytes. We compute this while building the NFA, but don’t use it in the NFA’s states. Instead, we use this for building the DFA. We store it on the NFA since it’s easy to compute while visiting the patterns.

§min_pattern_len: usize

The length, in bytes, of the shortest pattern in this automaton. This information is useful for detecting whether an automaton matches the empty string or not.

§max_pattern_len: usize

The length, in bytes, of the longest pattern in this automaton. This information is useful for keeping correct buffer sizes when searching on streams.

§special: Special

The information required to deduce which states are “special” in this NFA.

Since the DEAD and FAIL states are always the first two states and there are only ever two start states (which follow all of the match states), it follows that we can determine whether a state is a fail, dead, match or start with just a few comparisons on the ID itself:

is_dead(sid): sid == NFA::DEAD is_fail(sid): sid == NFA::FAIL is_match(sid): NFA::FAIL < sid && sid <= max_match_id is_start(sid): sid == start_unanchored_id || sid == start_anchored_id

Note that this only applies to the NFA after it has been constructed. During construction, the start states are the first ones added and the match states are inter-leaved with non-match states. Once all of the states have been added, the states are shuffled such that the above predicates hold.

Implementations§

source§

impl NFA

source

pub fn new<I, P>(patterns: I) -> Result<NFA, BuildError>
where I: IntoIterator<Item = P>, P: AsRef<[u8]>,

Create a new Aho-Corasick noncontiguous NFA using the default configuration.

Use a Builder if you want to change the configuration.

source

pub fn builder() -> Builder

A convenience method for returning a new Aho-Corasick noncontiguous NFA builder.

This usually permits one to just import the NFA type.

source§

impl NFA

source

pub(crate) const DEAD: StateID = _

The DEAD state is a sentinel state like the FAIL state. The DEAD state instructs any search to stop and return any currently recorded match, or no match otherwise. Generally speaking, it is impossible for an unanchored standard search to enter a DEAD state. But an anchored search can, and so to can a leftmost search.

We put DEAD before FAIL so that DEAD is always 0. We repeat this decision across the other Aho-Corasicm automata, so that DEAD states there are always 0 too. It’s not that we need all of the implementations to agree, but rather, the contiguous NFA and the DFA use a sort of “premultiplied” state identifier where the only state whose ID is always known and constant is the first state. Subsequent state IDs depend on how much space has already been used in the transition table.

source

pub(crate) const FAIL: StateID = _

The FAIL state mostly just corresponds to the ID of any transition on a state that isn’t explicitly defined. When one transitions into the FAIL state, one must follow the previous state’s failure transition before doing the next state lookup. In this way, FAIL is more of a sentinel than a state that one actually transitions into. In particular, it is never exposed in the Automaton interface.

source

pub(crate) fn byte_classes(&self) -> &ByteClasses

Returns the equivalence classes of bytes found while constructing this NFA.

Note that the NFA doesn’t actually make use of these equivalence classes. Instead, these are useful for building the DFA when desired.

source

pub(crate) fn pattern_lens_raw(&self) -> &[SmallIndex]

Returns a slice containing the length of each pattern in this searcher. It is indexed by PatternID and has length NFA::patterns_len.

This is exposed for convenience when building a contiguous NFA. But it can be reconstructed from the Automaton API if necessary.

source

pub(crate) fn states(&self) -> &[State]

Returns a slice of all states in this non-contiguous NFA.

source

pub(crate) fn special(&self) -> &Special

Returns the underlying “special” state information for this NFA.

source

pub(crate) fn swap_states(&mut self, id1: StateID, id2: StateID)

Swaps the states at id1 and id2.

This does not update the transitions of any state to account for the state swap.

source

pub(crate) fn remap(&mut self, map: impl Fn(StateID) -> StateID)

Re-maps all state IDs in this NFA according to the map function given.

source

pub(crate) fn iter_trans( &self, sid: StateID, ) -> impl Iterator<Item = Transition> + '_

Iterate over all of the transitions for the given state ID.

source

pub(crate) fn iter_matches( &self, sid: StateID, ) -> impl Iterator<Item = PatternID> + '_

Iterate over all of the matches for the given state ID.

Return the link following the one given. If the one given is the last link for the given state, then return None.

If no previous link is given, then this returns the first link in the state, if one exists.

This is useful for manually iterating over the transitions in a single state without borrowing the NFA. This permits mutating other parts of the NFA during iteration. Namely, one can access the transition pointed to by the link via self.sparse[link].

source

fn follow_transition(&self, sid: StateID, byte: u8) -> StateID

Follow the transition for the given byte in the given state. If no such transition exists, then the FAIL state ID is returned.

source

fn follow_transition_sparse(&self, sid: StateID, byte: u8) -> StateID

Like follow_transition, but always uses the sparse representation.

source

fn add_transition( &mut self, prev: StateID, byte: u8, next: StateID, ) -> Result<(), BuildError>

Set the transition for the given byte to the state ID given.

Note that one should not set transitions to the FAIL state. It is not technically incorrect, but it wastes space. If a transition is not defined, then it is automatically assumed to lead to the FAIL state.

source

fn init_full_state( &mut self, prev: StateID, next: StateID, ) -> Result<(), BuildError>

This sets every possible transition (all 255 of them) for the given state to the name next value.

This is useful for efficiently initializing start/dead states.

§Panics

This requires that the state has no transitions added to it already. If it has any transitions, then this panics. It will also panic if the state has been densified prior to calling this.

source

fn add_match(&mut self, sid: StateID, pid: PatternID) -> Result<(), BuildError>

Add a match for the given pattern ID to the state for the given ID.

source

fn copy_matches(&mut self, src: StateID, dst: StateID) -> Result<(), BuildError>

Copy matches from the src state to the dst state. This is useful when a match state can be reached via a failure transition. In which case, you’ll want to copy the matches (if any) from the state reached by the failure transition to the original state you were at.

source

fn alloc_transition(&mut self) -> Result<StateID, BuildError>

Create a new entry in NFA::trans, if there’s room, and return that entry’s ID. If there’s no room, then an error is returned.

source

fn alloc_match(&mut self) -> Result<StateID, BuildError>

Create a new entry in NFA::matches, if there’s room, and return that entry’s ID. If there’s no room, then an error is returned.

source

fn alloc_dense_state(&mut self) -> Result<StateID, BuildError>

Create a new set of N transitions in this NFA’s dense transition table. The ID return corresponds to the index at which the N transitions begin. So id+0 is the first transition and id+(N-1) is the last.

N is determined via NFA::byte_classes::alphabet_len.

source

fn alloc_state(&mut self, depth: usize) -> Result<StateID, BuildError>

Allocate and add a fresh state to the underlying NFA and return its ID (guaranteed to be one more than the ID of the previously allocated state). If the ID would overflow StateID, then this returns an error.

Trait Implementations§

source§

impl Automaton for NFA

source§

fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>

Returns the starting state for the given anchor mode. Read more
source§

fn next_state(&self, anchored: Anchored, sid: StateID, byte: u8) -> StateID

Performs a state transition from sid for byte and returns the next state. Read more
source§

fn is_special(&self, sid: StateID) -> bool

Returns true if the given ID represents a “special” state. A special state is a dead, match or start state. Read more
source§

fn is_dead(&self, sid: StateID) -> bool

Returns true if the given ID represents a dead state. Read more
source§

fn is_match(&self, sid: StateID) -> bool

Returns true if the given ID represents a match state. Read more
source§

fn is_start(&self, sid: StateID) -> bool

Returns true if the given ID represents a start state. Read more
source§

fn match_kind(&self) -> MatchKind

Returns the match semantics that this automaton was built with.
source§

fn patterns_len(&self) -> usize

Returns the total number of patterns compiled into this automaton.
source§

fn pattern_len(&self, pid: PatternID) -> usize

Returns the length of the pattern for the given ID. Read more
source§

fn min_pattern_len(&self) -> usize

Returns the length, in bytes, of the shortest pattern in this automaton.
source§

fn max_pattern_len(&self) -> usize

Returns the length, in bytes, of the longest pattern in this automaton.
source§

fn match_len(&self, sid: StateID) -> usize

Returns the total number of matches for the given state ID. Read more
source§

fn match_pattern(&self, sid: StateID, index: usize) -> PatternID

Returns the pattern ID for the match state given by sid at the index given. Read more
source§

fn memory_usage(&self) -> usize

Returns the heap memory usage, in bytes, used by this automaton.
source§

fn prefilter(&self) -> Option<&Prefilter>

Returns a prefilter, if available, that can be used to accelerate searches for this automaton. Read more
source§

fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>

Executes a non-overlapping search with this automaton using the given configuration. Read more
source§

fn try_find_overlapping( &self, input: &Input<'_>, state: &mut OverlappingState, ) -> Result<(), MatchError>

Executes a overlapping search with this automaton using the given configuration. Read more
source§

fn try_find_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result<FindIter<'a, 'h, Self>, MatchError>
where Self: Sized,

Returns an iterator of non-overlapping matches with this automaton using the given configuration. Read more
source§

fn try_find_overlapping_iter<'a, 'h>( &'a self, input: Input<'h>, ) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>
where Self: Sized,

Returns an iterator of overlapping matches with this automaton using the given configuration. Read more
source§

fn try_replace_all<B>( &self, haystack: &str, replace_with: &[B], ) -> Result<String, MatchError>
where Self: Sized, B: AsRef<str>,

Replaces all non-overlapping matches in haystack with strings from replace_with depending on the pattern that matched. The replace_with slice must have length equal to Automaton::patterns_len. Read more
source§

fn try_replace_all_bytes<B>( &self, haystack: &[u8], replace_with: &[B], ) -> Result<Vec<u8>, MatchError>
where Self: Sized, B: AsRef<[u8]>,

Replaces all non-overlapping matches in haystack with strings from replace_with depending on the pattern that matched. The replace_with slice must have length equal to Automaton::patterns_len. Read more
source§

fn try_replace_all_with<F>( &self, haystack: &str, dst: &mut String, replace_with: F, ) -> Result<(), MatchError>
where Self: Sized, F: FnMut(&Match, &str, &mut String) -> bool,

Replaces all non-overlapping matches in haystack by calling the replace_with closure given. Read more
source§

fn try_replace_all_with_bytes<F>( &self, haystack: &[u8], dst: &mut Vec<u8>, replace_with: F, ) -> Result<(), MatchError>
where Self: Sized, F: FnMut(&Match, &[u8], &mut Vec<u8>) -> bool,

Replaces all non-overlapping matches in haystack by calling the replace_with closure given. Read more
source§

fn try_stream_find_iter<'a, R: Read>( &'a self, rdr: R, ) -> Result<StreamFindIter<'a, Self, R>, MatchError>
where Self: Sized,

Returns an iterator of non-overlapping matches with this automaton from the stream given. Read more
source§

fn try_stream_replace_all<R, W, B>( &self, rdr: R, wtr: W, replace_with: &[B], ) -> Result<()>
where Self: Sized, R: Read, W: Write, B: AsRef<[u8]>,

Replaces all non-overlapping matches in rdr with strings from replace_with depending on the pattern that matched, and writes the result to wtr. The replace_with slice must have length equal to Automaton::patterns_len. Read more
source§

fn try_stream_replace_all_with<R, W, F>( &self, rdr: R, wtr: W, replace_with: F, ) -> Result<()>
where Self: Sized, R: Read, W: Write, F: FnMut(&Match, &[u8], &mut W) -> Result<()>,

Replaces all non-overlapping matches in rdr by calling the replace_with closure given and writing the result to wtr. Read more
source§

impl Clone for NFA

source§

fn clone(&self) -> NFA

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for NFA

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Remappable for NFA

source§

fn state_len(&self) -> usize

Return the total number of states.
source§

fn swap_states(&mut self, id1: StateID, id2: StateID)

Swap the states pointed to by the given IDs. The underlying finite state machine should be mutated such that all of the transitions in id1 are now in the memory region where the transitions for id2 were, and all of the transitions in id2 are now in the memory region where the transitions for id1 were. Read more
source§

fn remap(&mut self, map: impl Fn(StateID) -> StateID)

This must remap every single state ID in the underlying value according to the function given. For example, in a DFA, this should remap every transition and every starting state ID.
source§

impl Sealed for NFA

Auto Trait Implementations§

§

impl Freeze for NFA

§

impl RefUnwindSafe for NFA

§

impl Send for NFA

§

impl Sync for NFA

§

impl Unpin for NFA

§

impl UnwindSafe for NFA

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.