Struct aho_corasick::nfa::noncontiguous::Compiler
struct Compiler<'a> {
builder: &'a Builder,
prefilter: Builder,
nfa: NFA,
byteset: ByteClassSet,
}
A compiler uses a builder configuration and builds up the NFA formulation of an Aho-Corasick automaton. This roughly corresponds to the standard formulation described in textbooks, with some tweaks to support leftmost searching.
Fields§
builder: &'a Builder
prefilter: Builder
nfa: NFA
byteset: ByteClassSet
Implementations§
impl<'a> Compiler<'a>
fn new(builder: &'a Builder) -> Result<Compiler<'a>, BuildError>
fn compile<I, P>(self, patterns: I) -> Result<NFA, BuildError>
fn build_trie<I, P>(&mut self, patterns: I) -> Result<(), BuildError>
This sets up the initial prefix trie that makes up the Aho-Corasick automaton. Effectively, it creates the basic structure of the automaton, where every pattern given has a path from the start state to the end of the pattern.
fn fill_failure_transitions(&mut self) -> Result<(), BuildError>
This routine creates failure transitions according to the standard textbook formulation of the Aho-Corasick algorithm, with a couple small tweaks to support “leftmost” semantics.
Building failure transitions is the most interesting part of building the Aho-Corasick automaton, because they are what allow searches to be performed in linear time. Specifically, a failure transition is a single transition associated with each state that points back to the longest proper suffix of the pattern being searched. The failure transition is followed whenever there exists no transition on the current state for the current input byte. If there is no other proper suffix, then the failure transition points back to the starting state.
For example, let’s say we built an Aho-Corasick automaton with the following patterns: ‘abcd’ and ‘cef’. The trie looks like this:
        a - S1 - b - S2 - c - S3 - d - S4*
       /
    S0 - c - S5 - e - S6 - f - S7*
At this point, it should be fairly straightforward to see how this trie can be used in a simplistic way. At any given position in the text we’re searching (called the “subject” string), all we need to do is follow the transitions in the trie by consuming one transition for each byte in the subject string. If we reach a match state, then we can report that location as a match.
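This simplistic strategy can be sketched in Rust. The `TrieNode` type and functions below are illustrative only; the actual NFA in this crate stores states as compact IDs with sparse or dense transition tables:

```rust
use std::collections::HashMap;

// Toy trie node; illustrative only. The real NFA stores states as
// compact IDs with sparse or dense transition tables.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    is_match: bool,
}

fn insert(root: &mut TrieNode, pattern: &[u8]) {
    let mut node = root;
    for &b in pattern {
        node = node.children.entry(b).or_default();
    }
    node.is_match = true;
}

// Naive multi-pattern search: restart the trie walk at every position
// of the subject. This is the O(n * m) approach that failure
// transitions later improve on.
fn naive_find(root: &TrieNode, subject: &[u8]) -> Option<usize> {
    for start in 0..subject.len() {
        let mut node = root;
        for &b in &subject[start..] {
            match node.children.get(&b) {
                Some(next) if next.is_match => return Some(start),
                Some(next) => node = next,
                None => break,
            }
        }
    }
    None
}

fn main() {
    let mut root = TrieNode::default();
    insert(&mut root, b"abcd");
    insert(&mut root, b"cef");
    // 'abcef' dies in the 'abcd' branch, restarts at 'b', then at 'c',
    // and finally matches 'cef' starting at offset 2.
    assert_eq!(naive_find(&root, b"abcef"), Some(2));
    assert_eq!(naive_find(&root, b"zzzz"), None);
    println!("ok");
}
```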
The trick comes when searching a subject string like ‘abcef’. We’ll initially follow the transition from S0 to S1 and wind up in S3 after observing the ‘c’ byte. At this point, the next byte is ‘e’ but state S3 has no transition for ‘e’, so the search fails. We then would need to restart the search at the next position in ‘abcef’, which corresponds to ‘b’. The match would fail, but the next search starting at ‘c’ would finally succeed. The problem with this approach is that we wind up searching the subject string potentially many times. In effect, this makes the algorithm have worst case O(n * m) complexity, where n ~ len(subject) and m ~ len(all patterns). We would instead like to achieve O(n + m) worst case complexity.
This is where failure transitions come in. Instead of dying at S3 in the first search, the automaton can instruct the search to move to another part of the automaton that corresponds to a suffix of what we’ve seen so far. Recall that we’ve seen ‘abc’ in the subject string, and the automaton does indeed have a non-empty suffix, ‘c’, that could potentially lead to another match. Thus, the actual Aho-Corasick automaton for our patterns in this case looks like this:
        a - S1 - b - S2 - c - S3 - d - S4*
       /                     /
      /        --------------
     /        /
    S0 - c - S5 - e - S6 - f - S7*
That is, we have a failure transition from S3 to S5, which is followed exactly in cases when we are in state S3 but see any byte other than ‘d’ (that is, we’ve “failed” to find a match in this portion of our trie). We know we can transition back to S5 because we’ve already seen a ‘c’ byte, so we don’t need to re-scan it. We can then pick back up with the search starting at S5 and complete our match.
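To make the walk concrete, here is a hand-built toy version of the automaton above, with state numbering matching the diagram. The table layout is purely illustrative (the real NFA packs its transitions very differently), but the search follows ordinary and failure transitions in the same way:

```rust
use std::collections::HashMap;

// Transitions and failure links for the patterns 'abcd' and 'cef',
// hand-built to match the diagram. The index is the state ID.
fn build() -> (Vec<HashMap<u8, usize>>, Vec<usize>) {
    let mut trans = vec![HashMap::new(); 8];
    for &(s, b, t) in &[
        (0, b'a', 1), (1, b'b', 2), (2, b'c', 3), (3, b'd', 4), // 'abcd'
        (0, b'c', 5), (5, b'e', 6), (6, b'f', 7),               // 'cef'
    ] {
        trans[s].insert(b, t);
    }
    // Every state fails back to S0 except S3, whose longest proper
    // suffix 'c' puts it at S5.
    let mut fail = vec![0; 8];
    fail[3] = 5;
    (trans, fail)
}

// Follow failure transitions until some state has a transition on `b`.
// The start state implicitly loops back to itself on failure.
fn next_state(trans: &[HashMap<u8, usize>], fail: &[usize], mut s: usize, b: u8) -> usize {
    loop {
        if let Some(&t) = trans[s].get(&b) {
            return t;
        }
        if s == 0 {
            return 0;
        }
        s = fail[s];
    }
}

fn main() {
    let (trans, fail) = build();
    let matches = [4usize, 7]; // S4 and S7 are match states
    let mut s = 0;
    let mut found = None;
    for (i, &b) in b"abcef".iter().enumerate() {
        s = next_state(&trans, &fail, s, b);
        if matches.contains(&s) {
            found = Some(i); // offset at which the match ends
            break;
        }
    }
    // The walk goes S0, S1, S2, S3, fails over to S5 on 'e', then S6, S7.
    assert_eq!(found, Some(4));
    println!("ok");
}
```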
Adding failure transitions to a trie is fairly simple, but subtle. The key issue is that you might have multiple failure transitions that you need to follow. For example, look at the trie for the patterns ‘abcd’, ‘b’, ‘bcd’ and ‘cd’:
         - a - S1 - b - S2* - c - S3 - d - S4*
        /              /         /
       /        -------   -------
      /         /         /
    S0 --- b - S5* - c - S6 - d - S7*
      \                  /
       \        ---------
        \      /
         - c - S8 - d - S9*
The failure transitions for this trie are defined from S2 to S5, S3 to S6 and S6 to S8. Moreover, state S2 needs to track that it corresponds to a match, since the target of its failure transition, S5, is itself a match state.
Perhaps the simplest way to think about adding these failure transitions is recursively. That is, if you know the failure transitions for every possible previous state that could be visited (e.g., when computing the failure transition for S3, you already know the failure transitions for S0, S1 and S2), then you can simply follow the failure transition of the previous state and check whether the incoming transition is defined after following the failure transition.
For example, when determining the failure state for S3, by our assumptions, we already know that there is a failure transition from S2 (the previous state) to S5. So we follow that transition and check whether the transition connecting S2 to S3 is defined. Indeed, it is, as there is a transition from S5 to S6 for the byte ‘c’. If no such transition existed, we could keep following the failure transitions until we reach the start state, which is the failure transition for every state that has no corresponding proper suffix.
We don’t actually use recursion to implement this, but instead, use a breadth first search of the automaton. Our base case is the start state, whose failure transition is just a transition to itself.
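That breadth first search might be sketched like this, again over a toy transition-table representation rather than the crate's own, and omitting the leftmost-specific tweaks:

```rust
use std::collections::{HashMap, VecDeque};

// Standard BFS construction of failure transitions over a trie stored
// as a vector of per-state transition maps; state 0 is the start.
fn fill_failures(trans: &[HashMap<u8, usize>]) -> Vec<usize> {
    let mut fail = vec![0; trans.len()];
    let mut queue = VecDeque::new();
    // Base case: children of the start state fail back to the start.
    for &child in trans[0].values() {
        queue.push_back(child);
    }
    while let Some(s) = queue.pop_front() {
        for (&b, &child) in &trans[s] {
            // Follow the parent's failure transitions until some state
            // has a transition on `b`, or we bottom out at the start.
            let mut f = fail[s];
            fail[child] = loop {
                if let Some(&t) = trans[f].get(&b) {
                    break t;
                }
                if f == 0 {
                    break 0;
                }
                f = fail[f];
            };
            queue.push_back(child);
        }
    }
    fail
}

fn main() {
    // Trie for 'abcd' and 'cef' with the same numbering as the diagram.
    let mut trans = vec![HashMap::new(); 8];
    for &(s, b, t) in &[
        (0, b'a', 1), (1, b'b', 2), (2, b'c', 3), (3, b'd', 4),
        (0, b'c', 5), (5, b'e', 6), (6, b'f', 7),
    ] {
        trans[s].insert(b, t);
    }
    // Only S3 gets a non-trivial failure transition: to S5 via suffix 'c'.
    assert_eq!(fill_failures(&trans), vec![0, 0, 0, 5, 0, 0, 0, 0]);
    println!("ok");
}
```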
When building a leftmost automaton, we proceed as above, but only include a subset of failure transitions. Namely, we omit any failure transitions that appear after a match state in the trie. This is because failure transitions always point back to a proper suffix of what has been seen so far. Thus, following a failure transition after a match implies looking for a match that starts after the one that has already been seen, which is of course therefore not the leftmost match.
N.B. I came up with this algorithm on my own, and after scouring all of the other AC implementations I know of (Perl, Snort, many on GitHub). I couldn’t find any that implement leftmost semantics like this. Perl of course needs leftmost-first semantics, but they implement it with a seeming hack at search time instead of encoding it into the automaton. There are also a couple Java libraries that support leftmost longest semantics, but they do it by building a queue of matches at search time, which is even worse than what Perl is doing. —AG
fn shuffle(&mut self)
Shuffle the states so that they appear in this sequence:
DEAD, FAIL, MATCH…, START, START, NON-MATCH…
The idea here is that if we know how special states are laid out in our transition table, then we can determine what “kind” of state we’re in just by comparing our current state ID with a particular value. In this way, we avoid doing extra memory lookups.
Before shuffling begins, our states look something like this:
DEAD, FAIL, START, START, (MATCH | NON-MATCH)…
So all we need to do is move all of the MATCH states so that they all appear before any NON-MATCH state, like so:
DEAD, FAIL, START, START, MATCH… NON-MATCH…
Then it’s just a simple matter of swapping the two START states with the last two MATCH states.
(This is the same technique used for fully compiled DFAs in regex-automata.)
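With the layout fixed, classifying a state reduces to a few integer comparisons against precomputed boundaries. A sketch, with hypothetical field names (the crate's actual bookkeeping differs):

```rust
// Hypothetical boundaries recorded after shuffling. With the layout
// DEAD, FAIL, MATCH..., START, START, NON-MATCH..., a state's kind
// follows from comparisons alone, with no extra memory lookups.
struct Special {
    // ID of the last match state (matches occupy 2..=max_match_id).
    max_match_id: u32,
    // The two start states immediately follow the match states.
    start_unanchored_id: u32,
    start_anchored_id: u32,
}

impl Special {
    fn is_dead(&self, sid: u32) -> bool { sid == 0 }
    fn is_fail(&self, sid: u32) -> bool { sid == 1 }
    fn is_match(&self, sid: u32) -> bool { 2 <= sid && sid <= self.max_match_id }
    fn is_start(&self, sid: u32) -> bool {
        sid == self.start_unanchored_id || sid == self.start_anchored_id
    }
    // Anything special sits at or below the last start state.
    fn is_special(&self, sid: u32) -> bool { sid <= self.start_anchored_id }
}

fn main() {
    // Say a shuffled NFA ended up with 3 match states (IDs 2, 3, 4).
    let special = Special { max_match_id: 4, start_unanchored_id: 5, start_anchored_id: 6 };
    assert!(special.is_match(3));
    assert!(!special.is_match(5));
    assert!(special.is_start(6));
    assert!(special.is_special(6) && !special.is_special(7));
    println!("ok");
}
```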
fn densify(&mut self) -> Result<(), BuildError>
Attempts to convert the transition representation of a subset of states in this NFA from sparse to dense. This can greatly improve search performance since states with a higher number of transitions tend to correlate with very active states.
We generally only densify states that are close to the start state. These tend to be the most active states and thus benefit from a dense representation more than other states.
This tends to best balance between memory usage and performance. In particular, the vast majority of all states in a typical Aho-Corasick automaton have only 1 transition and are usually farther from the start state and thus don’t get densified.
Note that this doesn’t remove the sparse representation of transitions for states that are densified. It could be done, but actually removing entries from NFA::sparse is likely more expensive than it’s worth.
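The difference between the two representations can be sketched as follows, under the assumption that state ID 0 stands in for an undefined transition (the real NFA's encoding differs):

```rust
// Two transition representations for a single state. Sparse stores
// only the defined (byte, state) pairs; dense indexes all 256 bytes
// directly. Here 0 stands in for "undefined" (illustrative only).
enum Transitions {
    Sparse(Vec<(u8, u32)>),
    Dense(Box<[u32; 256]>),
}

impl Transitions {
    // Sparse lookup scans a short list; dense lookup is a single index.
    fn next(&self, byte: u8) -> Option<u32> {
        match self {
            Transitions::Sparse(pairs) => pairs
                .iter()
                .find(|&&(b, _)| b == byte)
                .map(|&(_, sid)| sid),
            Transitions::Dense(table) => match table[usize::from(byte)] {
                0 => None,
                sid => Some(sid),
            },
        }
    }

    // Densify: copy the sparse pairs into a full 256-entry table.
    fn densify(&mut self) {
        if let Transitions::Sparse(pairs) = self {
            let mut table = Box::new([0u32; 256]);
            for &(b, sid) in pairs.iter() {
                table[usize::from(b)] = sid;
            }
            *self = Transitions::Dense(table);
        }
    }
}

fn main() {
    let mut state = Transitions::Sparse(vec![(b'a', 1), (b'c', 5)]);
    assert_eq!(state.next(b'a'), Some(1));
    assert_eq!(state.next(b'z'), None);
    state.densify();
    // Same answers, but now each lookup is one array index.
    assert_eq!(state.next(b'c'), Some(5));
    assert_eq!(state.next(b'z'), None);
    println!("ok");
}
```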
fn queued_set(&self) -> QueuedSet
Returns a set that tracks queued states.
This is only necessary when ASCII case insensitivity is enabled, since it is the only way to visit the same state twice. Otherwise, this returns an inert set that never adds anything and always reports false for every member test.
fn init_unanchored_start_state(&mut self) -> Result<(), BuildError>
Initializes the unanchored start state by making it dense. This is achieved by explicitly setting every transition to the FAIL state. This isn’t necessary for correctness, since any missing transition is automatically assumed to be mapped to the FAIL state. We do this to make the unanchored starting state dense, and thus in turn make transition lookups on it faster. (Which is worth doing because it’s the most active state.)
fn set_anchored_start_state(&mut self) -> Result<(), BuildError>
Set up the anchored start state by copying all of the transitions and matches from the unanchored starting state, with one change: the failure transition is changed to the DEAD state, so that for any undefined transitions, the search will stop.
fn add_unanchored_start_state_loop(&mut self)
Set the failure transitions on the start state to loop back to the start state. This effectively permits the Aho-Corasick automaton to match at any position. It is also required for next-state lookups to terminate: finding the next state should never return a fail_id.
This must be done after building the initial trie, since trie construction depends on transitions to fail_id to determine whether a state already exists or not.
fn close_start_state_loop_for_leftmost(&mut self)
Remove the start state loop by rewriting any transitions on the start state back to the start state with transitions to the dead state.
The loop is only closed when two conditions are met: the start state is a match state and the match kind is leftmost-first or leftmost-longest.
The reason for this is that under leftmost semantics, a start state that is also a match implies that we should never restart the search process. We allow normal transitions out of the start state, but if none exist, we transition to the dead state, which signals that searching should stop.
fn add_dead_state_loop(&mut self) -> Result<(), BuildError>
Sets all transitions on the dead state to point back to the dead state. Normally, missing transitions map back to the failure state, but the point of the dead state is to act as a sink that can never be escaped.