Struct aho_corasick::nfa::contiguous::NFA
source · pub struct NFA {
repr: Vec<u32>,
pattern_lens: Vec<SmallIndex>,
state_len: usize,
prefilter: Option<Prefilter>,
match_kind: MatchKind,
alphabet_len: usize,
byte_classes: ByteClasses,
min_pattern_len: usize,
max_pattern_len: usize,
special: Special,
}
Expand description
A contiguous NFA implementation of Aho-Corasick.
When possible, prefer using AhoCorasick
instead of
this type directly. Using an NFA
directly is typically only necessary
when one needs access to the Automaton
trait implementation.
This NFA can only be built by first constructing a noncontiguous::NFA
.
Both NFA::new
and Builder::build
do this for you automatically, but
Builder::build_from_noncontiguous
permits doing it explicitly.
The main difference between a noncontiguous NFA and a contiguous NFA is that the latter represents all of its states and transitions in a single allocation, where as the former uses a separate allocation for each state. Doing this at construction time while keeping a low memory footprint isn’t feasible, which is primarily why there are two different NFA types: one that does the least amount of work possible to build itself, and another that does a little extra work to compact itself and make state transitions faster by making some states use a dense representation.
Because a contiguous NFA uses a single allocation, there is a lot more opportunity for compression tricks to reduce the heap memory used. Indeed, it is not uncommon for a contiguous NFA to use an order of magnitude less heap memory than a noncontiguous NFA. Since building a contiguous NFA usually only takes a fraction of the time it takes to build a noncontiguous NFA, the overall build time is not much slower. Thus, in most cases, a contiguous NFA is the best choice.
Since a contiguous NFA uses various tricks for compression and to achieve
faster state transitions, currently, its limit on the number of states
is somewhat smaller than what a noncontiguous NFA can achieve. Generally
speaking, you shouldn’t expect to run into this limit if the number of
patterns is under 1 million. It is plausible that this limit will be
increased in the future. If the limit is reached, building a contiguous NFA
will return an error. Often, since building a contiguous NFA is relatively
cheap, it can make sense to always try it even if you aren’t sure if it
will fail or not. If it does, you can always fall back to a noncontiguous
NFA. (Indeed, the main AhoCorasick
type employs a
strategy similar to this at construction time.)
§Example
This example shows how to build an NFA
directly and use it to execute
Automaton::try_find
:
use aho_corasick::{
automaton::Automaton,
nfa::contiguous::NFA,
Input, Match,
};
let patterns = &["b", "abc", "abcd"];
let haystack = "abcd";
let nfa = NFA::new(patterns).unwrap();
assert_eq!(
Some(Match::must(0, 1..2)),
nfa.try_find(&Input::new(haystack))?,
);
It is also possible to implement your own version of try_find
. See the
Automaton
documentation for an example.
Fields§
§repr: Vec<u32>
The raw NFA representation. Each state is packed with a header (containing the format of the state, the failure transition and, for a sparse state, the number of transitions), its transitions and any matching pattern IDs for match states.
pattern_lens: Vec<SmallIndex>
The length of each pattern. This is used to compute the start offset of a match.
state_len: usize
The total number of states in this NFA.
prefilter: Option<Prefilter>
A prefilter for accelerating searches, if one exists.
match_kind: MatchKind
The match semantics built into this NFA.
alphabet_len: usize
The alphabet size, or total number of equivalence classes, for this NFA. Dense states always have this many transitions.
byte_classes: ByteClasses
The equivalence classes for this NFA. All transitions, dense and sparse, are defined on equivalence classes and not on the 256 distinct byte values.
min_pattern_len: usize
The length of the shortest pattern in this automaton.
max_pattern_len: usize
The length of the longest pattern in this automaton.
special: Special
The information required to deduce which states are “special” in this NFA.
Implementations§
source§impl NFA
impl NFA
source§impl NFA
impl NFA
sourceconst DEAD: StateID = _
const DEAD: StateID = _
A sentinel state ID indicating that a search should stop once it has entered this state. When a search stops, it returns a match if one has been found, otherwise no match. A contiguous NFA always has an actual dead state at this ID.
sourceconst FAIL: StateID = _
const FAIL: StateID = _
Another sentinel state ID indicating that a search should move through current state’s failure transition.
Note that unlike DEAD, this does not actually point to a valid state in a contiguous NFA. (noncontiguous::NFA::FAIL does point to a valid state.) Instead, this points to the position that is guaranteed to never be a valid state ID (by making sure it points to a place in the middle of the encoding of the DEAD state). Since we never need to actually look at the FAIL state itself, this works out.
By why do it this way? So that FAIL is a constant. I don’t have any concrete evidence that this materially helps matters, but it’s easy to do. The alternative would be making the FAIL ID point to the second state, which could be made a constant but is a little trickier to do. The easiest path is to just make the FAIL state a runtime value, but since comparisons with FAIL occur in perf critical parts of the search, we want it to be as tight as possible and not waste any registers.
Very hand wavy… But the code complexity that results from this is very mild.
Trait Implementations§
source§impl Automaton for NFA
impl Automaton for NFA
source§fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>
fn start_state(&self, anchored: Anchored) -> Result<StateID, MatchError>
source§fn is_special(&self, sid: StateID) -> bool
fn is_special(&self, sid: StateID) -> bool
source§fn is_dead(&self, sid: StateID) -> bool
fn is_dead(&self, sid: StateID) -> bool
source§fn is_match(&self, sid: StateID) -> bool
fn is_match(&self, sid: StateID) -> bool
source§fn is_start(&self, sid: StateID) -> bool
fn is_start(&self, sid: StateID) -> bool
source§fn match_kind(&self) -> MatchKind
fn match_kind(&self) -> MatchKind
source§fn patterns_len(&self) -> usize
fn patterns_len(&self) -> usize
source§fn pattern_len(&self, pid: PatternID) -> usize
fn pattern_len(&self, pid: PatternID) -> usize
source§fn min_pattern_len(&self) -> usize
fn min_pattern_len(&self) -> usize
source§fn max_pattern_len(&self) -> usize
fn max_pattern_len(&self) -> usize
source§fn match_len(&self, sid: StateID) -> usize
fn match_len(&self, sid: StateID) -> usize
source§fn memory_usage(&self) -> usize
fn memory_usage(&self) -> usize
source§fn prefilter(&self) -> Option<&Prefilter>
fn prefilter(&self) -> Option<&Prefilter>
source§fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>
fn try_find(&self, input: &Input<'_>) -> Result<Option<Match>, MatchError>
source§fn try_find_overlapping(
&self,
input: &Input<'_>,
state: &mut OverlappingState,
) -> Result<(), MatchError>
fn try_find_overlapping( &self, input: &Input<'_>, state: &mut OverlappingState, ) -> Result<(), MatchError>
source§fn try_find_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindIter<'a, 'h, Self>, MatchError>where
Self: Sized,
fn try_find_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindIter<'a, 'h, Self>, MatchError>where
Self: Sized,
source§fn try_find_overlapping_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>where
Self: Sized,
fn try_find_overlapping_iter<'a, 'h>(
&'a self,
input: Input<'h>,
) -> Result<FindOverlappingIter<'a, 'h, Self>, MatchError>where
Self: Sized,
source§fn try_replace_all<B>(
&self,
haystack: &str,
replace_with: &[B],
) -> Result<String, MatchError>
fn try_replace_all<B>( &self, haystack: &str, replace_with: &[B], ) -> Result<String, MatchError>
haystack
with
strings from replace_with
depending on the pattern that
matched. The replace_with
slice must have length equal to
Automaton::patterns_len
. Read moresource§fn try_replace_all_bytes<B>(
&self,
haystack: &[u8],
replace_with: &[B],
) -> Result<Vec<u8>, MatchError>
fn try_replace_all_bytes<B>( &self, haystack: &[u8], replace_with: &[B], ) -> Result<Vec<u8>, MatchError>
haystack
with
strings from replace_with
depending on the pattern that
matched. The replace_with
slice must have length equal to
Automaton::patterns_len
. Read moresource§fn try_replace_all_with<F>(
&self,
haystack: &str,
dst: &mut String,
replace_with: F,
) -> Result<(), MatchError>
fn try_replace_all_with<F>( &self, haystack: &str, dst: &mut String, replace_with: F, ) -> Result<(), MatchError>
haystack
by calling the
replace_with
closure given. Read moresource§fn try_replace_all_with_bytes<F>(
&self,
haystack: &[u8],
dst: &mut Vec<u8>,
replace_with: F,
) -> Result<(), MatchError>
fn try_replace_all_with_bytes<F>( &self, haystack: &[u8], dst: &mut Vec<u8>, replace_with: F, ) -> Result<(), MatchError>
haystack
by calling the
replace_with
closure given. Read moresource§fn try_stream_find_iter<'a, R: Read>(
&'a self,
rdr: R,
) -> Result<StreamFindIter<'a, Self, R>, MatchError>where
Self: Sized,
fn try_stream_find_iter<'a, R: Read>(
&'a self,
rdr: R,
) -> Result<StreamFindIter<'a, Self, R>, MatchError>where
Self: Sized,
source§fn try_stream_replace_all<R, W, B>(
&self,
rdr: R,
wtr: W,
replace_with: &[B],
) -> Result<()>
fn try_stream_replace_all<R, W, B>( &self, rdr: R, wtr: W, replace_with: &[B], ) -> Result<()>
rdr
with strings from
replace_with
depending on the pattern that matched, and writes the
result to wtr
. The replace_with
slice must have length equal to
Automaton::patterns_len
. Read more