Struct aho_corasick::nfa::noncontiguous::NFA
source · pub struct NFA {
match_kind: MatchKind,
states: Vec<State>,
sparse: Vec<Transition>,
dense: Vec<StateID>,
matches: Vec<Match>,
pattern_lens: Vec<SmallIndex>,
prefilter: Option<Prefilter>,
byte_classes: ByteClasses,
min_pattern_len: usize,
max_pattern_len: usize,
special: Special,
}
Expand description
A noncontiguous NFA implementation of Aho-Corasick.
When possible, prefer using AhoCorasick
instead of
this type directly. Using an NFA
directly is typically only necessary
when one needs access to the Automaton
trait implementation.
This NFA represents the “core” implementation of Aho-Corasick in this crate. Namely, constructing this NFA involving building a trie and then filling in the failure transitions between states, similar to what is described in any standard textbook description of Aho-Corasick.
In order to minimize heap usage and to avoid additional construction costs, this implementation represents the transitions of all states as distinct sparse memory allocations. This is where it gets its name from. That is, this NFA has no contiguous memory allocation for its transition table. Each state gets its own allocation.
While the sparse representation keeps memory usage to somewhat reasonable
levels, it is still quite large and also results in somewhat mediocre
search performance. For this reason, it is almost always a good idea to
use a contiguous::NFA
instead. It is
marginally slower to build, but has higher throughput and can sometimes use
an order of magnitude less memory. The main reason to use a noncontiguous
NFA is when you need the fastest possible construction time, or when a
contiguous NFA does not have the desired capacity. (The total number of NFA
states it can have is fewer than a noncontiguous NFA.)
§Example
This example shows how to build an NFA
directly and use it to execute
Automaton::try_find
:
use aho_corasick::{
automaton::Automaton,
nfa::noncontiguous::NFA,
Input, Match,
};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let nfa = NFA::new(patterns).unwrap();
assert_eq!(
Some(Match::must(0, 1..2)),
nfa.try_find(&Input::new(haystack))?,
);
It is also possible to implement your own version of try_find
. See the
Automaton
documentation for an example.
Fields§
§match_kind: MatchKind
The match semantics built into this NFA.
states: Vec<State>
A set of states. Each state defines its own transitions, a fail transition and a set of indices corresponding to matches.
The first state is always the fail state, which is used only as a sentinel. Namely, in the final NFA, no transition into the fail state exists. (Well, they do, but they aren’t followed. Instead, the state’s failure transition is followed.)
The second state (index 1) is always the dead state. Dead states are in every automaton, but only used when leftmost-{first,longest} match semantics are enabled. Specifically, they instruct search to stop at specific points in order to report the correct match location. In the standard Aho-Corasick construction, there are no transitions to the dead state.
The third state (index 2) is generally intended to be the starting or “root” state.
sparse: Vec<Transition>
Transitions stored in a sparse representation via a linked list.
Each transition contains three pieces of information: the byte it
is defined for, the state it transitions to and a link to the next
transition in the same state (or StateID::ZERO
if it is the last
transition).
The first transition for each state is determined by State::sparse
.
Note that this contains a complete set of all transitions in this NFA, including states that have a dense representation for transitions. (Adding dense transitions for a state doesn’t remove its sparse transitions, since deleting transitions from this particular sparse representation would be fairly expensive.)
dense: Vec<StateID>
Transitions stored in a dense representation.
A state has a row in this table if and only if State::dense
is
not equal to StateID::ZERO
. When not zero, there are precisely
NFA::byte_classes::alphabet_len()
entries beginning at State::dense
in this table.
Generally a very small minority of states have a dense representation since it uses so much memory.
matches: Vec<Match>
Matches stored in linked list for each state.
Like sparse transitions, each match has a link to the next match in the state.
The first match for each state is determined by State::matches
.
pattern_lens: Vec<SmallIndex>
The length, in bytes, of each pattern in this NFA. This slice is
indexed by PatternID
.
The number of entries in this vector corresponds to the total number of patterns in this automaton.
prefilter: Option<Prefilter>
A prefilter for quickly skipping to candidate matches, if pertinent.
byte_classes: ByteClasses
A set of equivalence classes in terms of bytes. We compute this while building the NFA, but don’t use it in the NFA’s states. Instead, we use this for building the DFA. We store it on the NFA since it’s easy to compute while visiting the patterns.
min_pattern_len: usize
The length, in bytes, of the shortest pattern in this automaton. This information is useful for detecting whether an automaton matches the empty string or not.
max_pattern_len: usize
The length, in bytes, of the longest pattern in this automaton. This information is useful for keeping correct buffer sizes when searching on streams.
special: Special
The information required to deduce which states are “special” in this NFA.
Since the DEAD and FAIL states are always the first two states and there are only ever two start states (which follow all of the match states), it follows that we can determine whether a state is a fail, dead, match or start with just a few comparisons on the ID itself:
is_dead(sid): sid == NFA::DEAD is_fail(sid): sid == NFA::FAIL is_match(sid): NFA::FAIL < sid && sid <= max_match_id is_start(sid): sid == start_unanchored_id || sid == start_anchored_id
Note that this only applies to the NFA after it has been constructed. During construction, the start states are the first ones added and the match states are inter-leaved with non-match states. Once all of the states have been added, the states are shuffled such that the above predicates hold.
Implementations§
source§impl NFA
impl NFA
source§impl NFA
impl NFA
sourcepub(crate) const DEAD: StateID = _
pub(crate) const DEAD: StateID = _
The DEAD state is a sentinel state like the FAIL state. The DEAD state instructs any search to stop and return any currently recorded match, or no match otherwise. Generally speaking, it is impossible for an unanchored standard search to enter a DEAD state. But an anchored search can, and so to can a leftmost search.
We put DEAD before FAIL so that DEAD is always 0. We repeat this decision across the other Aho-Corasicm automata, so that DEAD states there are always 0 too. It’s not that we need all of the implementations to agree, but rather, the contiguous NFA and the DFA use a sort of “premultiplied” state identifier where the only state whose ID is always known and constant is the first state. Subsequent state IDs depend on how much space has already been used in the transition table.
sourcepub(crate) const FAIL: StateID = _
pub(crate) const FAIL: StateID = _
The FAIL state mostly just corresponds to the ID of any transition on a
state that isn’t explicitly defined. When one transitions into the FAIL
state, one must follow the previous state’s failure transition before
doing the next state lookup. In this way, FAIL is more of a sentinel
than a state that one actually transitions into. In particular, it is
never exposed in the Automaton
interface.
sourcepub(crate) fn byte_classes(&self) -> &ByteClasses
pub(crate) fn byte_classes(&self) -> &ByteClasses
Returns the equivalence classes of bytes found while constructing this NFA.
Note that the NFA doesn’t actually make use of these equivalence classes. Instead, these are useful for building the DFA when desired.
sourcepub(crate) fn pattern_lens_raw(&self) -> &[SmallIndex]
pub(crate) fn pattern_lens_raw(&self) -> &[SmallIndex]
Returns a slice containing the length of each pattern in this searcher.
It is indexed by PatternID
and has length NFA::patterns_len
.
This is exposed for convenience when building a contiguous NFA. But it
can be reconstructed from the Automaton
API if necessary.
sourcepub(crate) fn states(&self) -> &[State]
pub(crate) fn states(&self) -> &[State]
Returns a slice of all states in this non-contiguous NFA.
sourcepub(crate) fn special(&self) -> &Special
pub(crate) fn special(&self) -> &Special
Returns the underlying “special” state information for this NFA.
sourcepub(crate) fn swap_states(&mut self, id1: StateID, id2: StateID)
pub(crate) fn swap_states(&mut self, id1: StateID, id2: StateID)
Swaps the states at id1
and id2
.
This does not update the transitions of any state to account for the state swap.
sourcepub(crate) fn remap(&mut self, map: impl Fn(StateID) -> StateID)
pub(crate) fn remap(&mut self, map: impl Fn(StateID) -> StateID)
Re-maps all state IDs in this NFA according to the map
function
given.
sourcepub(crate) fn iter_trans(
&self,
sid: StateID,
) -> impl Iterator<Item = Transition> + '_
pub(crate) fn iter_trans( &self, sid: StateID, ) -> impl Iterator<Item = Transition> + '_
Iterate over all of the transitions for the given state ID.
sourcepub(crate) fn iter_matches(
&self,
sid: StateID,
) -> impl Iterator<Item = PatternID> + '_
pub(crate) fn iter_matches( &self, sid: StateID, ) -> impl Iterator<Item = PatternID> + '_
Iterate over all of the matches for the given state ID.
sourcefn next_link(&self, sid: StateID, prev: Option<StateID>) -> Option<StateID>
fn next_link(&self, sid: StateID, prev: Option<StateID>) -> Option<StateID>
Return the link following the one given. If the one given is the last
link for the given state, then return None
.
If no previous link is given, then this returns the first link in the state, if one exists.
This is useful for manually iterating over the transitions in a single
state without borrowing the NFA. This permits mutating other parts of
the NFA during iteration. Namely, one can access the transition pointed
to by the link via self.sparse[link]
.
sourcefn follow_transition(&self, sid: StateID, byte: u8) -> StateID
fn follow_transition(&self, sid: StateID, byte: u8) -> StateID
Follow the transition for the given byte in the given state. If no such transition exists, then the FAIL state ID is returned.
sourcefn follow_transition_sparse(&self, sid: StateID, byte: u8) -> StateID
fn follow_transition_sparse(&self, sid: StateID, byte: u8) -> StateID
Like follow_transition
, but always uses the sparse representation.
sourcefn add_transition(
&mut self,
prev: StateID,
byte: u8,
next: StateID,
) -> Result<(), BuildError>
fn add_transition( &mut self, prev: StateID, byte: u8, next: StateID, ) -> Result<(), BuildError>
Set the transition for the given byte to the state ID given.
Note that one should not set transitions to the FAIL state. It is not technically incorrect, but it wastes space. If a transition is not defined, then it is automatically assumed to lead to the FAIL state.
sourcefn init_full_state(
&mut self,
prev: StateID,
next: StateID,
) -> Result<(), BuildError>
fn init_full_state( &mut self, prev: StateID, next: StateID, ) -> Result<(), BuildError>
This sets every possible transition (all 255 of them) for the given
state to the name next
value.
This is useful for efficiently initializing start/dead states.
§Panics
This requires that the state has no transitions added to it already. If it has any transitions, then this panics. It will also panic if the state has been densified prior to calling this.
sourcefn add_match(&mut self, sid: StateID, pid: PatternID) -> Result<(), BuildError>
fn add_match(&mut self, sid: StateID, pid: PatternID) -> Result<(), BuildError>
Add a match for the given pattern ID to the state for the given ID.
sourcefn copy_matches(&mut self, src: StateID, dst: StateID) -> Result<(), BuildError>
fn copy_matches(&mut self, src: StateID, dst: StateID) -> Result<(), BuildError>
Copy matches from the src
state to the dst
state. This is useful
when a match state can be reached via a failure transition. In which
case, you’ll want to copy the matches (if any) from the state reached
by the failure transition to the original state you were at.
sourcefn alloc_transition(&mut self) -> Result<StateID, BuildError>
fn alloc_transition(&mut self) -> Result<StateID, BuildError>
Create a new entry in NFA::trans
, if there’s room, and return that
entry’s ID. If there’s no room, then an error is returned.
sourcefn alloc_match(&mut self) -> Result<StateID, BuildError>
fn alloc_match(&mut self) -> Result<StateID, BuildError>
Create a new entry in NFA::matches
, if there’s room, and return that
entry’s ID. If there’s no room, then an error is returned.
sourcefn alloc_dense_state(&mut self) -> Result<StateID, BuildError>
fn alloc_dense_state(&mut self) -> Result<StateID, BuildError>
Create a new set of N
transitions in this NFA’s dense transition
table. The ID return corresponds to the index at which the N
transitions begin. So id+0
is the first transition and id+(N-1)
is
the last.
N
is determined via NFA::byte_classes::alphabet_len
.
sourcefn alloc_state(&mut self, depth: usize) -> Result<StateID, BuildError>
fn alloc_state(&mut self, depth: usize) -> Result<StateID, BuildError>
Allocate and add a fresh state to the underlying NFA and return its
ID (guaranteed to be one more than the ID of the previously allocated
state). If the ID would overflow StateID
, then this returns an error.
Trait Implementations§
source§impl Automaton for NFA
impl Automaton for NFA
source§fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>
fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>
source§fn is_special(&self, sid: StateID) -> bool
fn is_special(&self, sid: StateID) -> bool
source§fn is_dead(&self, sid: StateID) -> bool
fn is_dead(&self, sid: StateID) -> bool
source§fn is_match(&self, sid: StateID) -> bool
fn is_match(&self, sid: StateID) -> bool
source§fn is_start(&self, sid: StateID) -> bool
fn is_start(&self, sid: StateID) -> bool
source§fn match_kind(&self) -> MatchKind
fn match_kind(&self) -> MatchKind
source§fn patterns_len(&self) -> usize
fn patterns_len(&self) -> usize
source§fn pattern_len(&self, pid: PatternID) -> usize
fn pattern_len(&self, pid: PatternID) -> usize
source§fn min_pattern_len(&self) -> usize
fn min_pattern_len(&self) -> usize
source§fn max_pattern_len(&self) -> usize
fn max_pattern_len(&self) -> usize
source§fn match_len(&self, sid: StateID) -> usize
fn match_len(&self, sid: StateID) -> usize
source§fn memory_usage(&self) -> usize
fn memory_usage(&self) -> usize
source§fn prefilter(&self) -> Option<&Prefilter>
fn prefilter(&self) -> Option<&Prefilter>
source§fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>
fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>
source§fn try_find_overlapping(
&self,
input: &Input<'_>,
state: &mut OverlappingState,
) -> Result<(), MatchError>
fn try_find_overlapping( &self, input: &Input<'_>, state: &mut OverlappingState, ) -> Result<(), MatchError>
source§fn try_find_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindIter<'a, 'h, Self>, MatchError>where
Self: Sized,
fn try_find_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindIter<'a, 'h, Self>, MatchError>where
Self: Sized,
source§fn try_find_overlapping_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>where
Self: Sized,
fn try_find_overlapping_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>where
Self: Sized,
source§fn try_replace_all<B>(
&self,
haystack: &str,
replace_with: &[B],
) -> Result<String, MatchError>
fn try_replace_all<B>( &self, haystack: &str, replace_with: &[B], ) -> Result<String, MatchError>
haystack
with
strings from replace_with
depending on the pattern that
matched. The replace_with
slice must have length equal to
Automaton::patterns_len
. Read moresource§fn try_replace_all_bytes<B>(
&self,
haystack: &[u8],
replace_with: &[B],
) -> Result<Vec<u8>, MatchError>
fn try_replace_all_bytes<B>( &self, haystack: &[u8], replace_with: &[B], ) -> Result<Vec<u8>, MatchError>
haystack
with
strings from replace_with
depending on the pattern that
matched. The replace_with
slice must have length equal to
Automaton::patterns_len
. Read moresource§fn try_replace_all_with<F>(
&self,
haystack: &str,
dst: &mut String,
replace_with: F,
) -> Result<(), MatchError>
fn try_replace_all_with<F>( &self, haystack: &str, dst: &mut String, replace_with: F, ) -> Result<(), MatchError>
haystack
by calling the
replace_with
closure given. Read moresource§fn try_replace_all_with_bytes<F>(
&self,
haystack: &[u8],
dst: &mut Vec<u8>,
replace_with: F,
) -> Result<(), MatchError>
fn try_replace_all_with_bytes<F>( &self, haystack: &[u8], dst: &mut Vec<u8>, replace_with: F, ) -> Result<(), MatchError>
haystack
by calling the
replace_with
closure given. Read moresource§fn try_stream_find_iter<'a, R: Read>(
&'a self,
rdr: R,
) -> Result<StreamFindIter<'a, Self, R>, MatchError>where
Self: Sized,
fn try_stream_find_iter<'a, R: Read>(
&'a self,
rdr: R,
) -> Result<StreamFindIter<'a, Self, R>, MatchError>where
Self: Sized,
source§fn try_stream_replace_all<R, W, B>(
&self,
rdr: R,
wtr: W,
replace_with: &[B],
) -> Result<()>
fn try_stream_replace_all<R, W, B>( &self, rdr: R, wtr: W, replace_with: &[B], ) -> Result<()>
rdr
with strings from
replace_with
depending on the pattern that matched, and writes the
result to wtr
. The replace_with
slice must have length equal to
Automaton::patterns_len
. Read moresource§impl Remappable for NFA
impl Remappable for NFA
source§fn swap_states(&mut self, id1: StateID, id2: StateID)
fn swap_states(&mut self, id1: StateID, id2: StateID)
id1
are now in the memory region where the transitions for id2
were, and all of the transitions in id2
are now in the memory region
where the transitions for id1
were. Read more