Struct aho_corasick::packed::rabinkarp::RabinKarp
source · pub(crate) struct RabinKarp {
patterns: Arc<Patterns>,
buckets: Vec<Vec<(usize, PatternID)>>,
hash_len: usize,
hash_2pow: usize,
}
Expand description
An implementation of the Rabin-Karp algorithm. The main idea of this algorithm is to maintain a rolling hash as it moves through the input, and then check whether that hash corresponds to the same hash for any of the patterns we’re looking for.
A draw back of naively scaling Rabin-Karp to multiple patterns is that it requires all of the patterns to be the same length, which in turn corresponds to the number of bytes to hash. We adapt this to work for multiple patterns of varying size by fixing the number of bytes to hash to be the length of the smallest pattern. We also split the patterns into several buckets to hopefully make the confirmation step faster.
Wikipedia has a decent explanation, if a bit heavy on the theory: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
But ESMAJ provides something a bit more concrete: https://www-igm.univ-mlv.fr/~lecroq/string/node5.html
Fields§
§patterns: Arc<Patterns>
The patterns we’re searching for.
buckets: Vec<Vec<(usize, PatternID)>>
The order of patterns in each bucket is significant. Namely, they are arranged such that the first one to match is the correct match. This may not necessarily correspond to the order provided by the caller. For example, if leftmost-longest semantics are used, then the patterns are sorted by their length in descending order. If leftmost-first semantics are used, then the patterns are sorted by their pattern ID in ascending order (which corresponds to the caller’s order).
hash_len: usize
The length of the hashing window. Generally, this corresponds to the length of the smallest pattern.
hash_2pow: usize
The factor to subtract out of a hash before updating it with a new byte.
Implementations§
source§impl RabinKarp
impl RabinKarp
sourcepub(crate) fn new(patterns: &Arc<Patterns>) -> RabinKarp
pub(crate) fn new(patterns: &Arc<Patterns>) -> RabinKarp
Compile a new Rabin-Karp matcher from the patterns given.
This panics if any of the patterns in the collection are empty, or if the collection is itself empty.
sourcepub(crate) fn find_at(&self, haystack: &[u8], at: usize) -> Option<Match>
pub(crate) fn find_at(&self, haystack: &[u8], at: usize) -> Option<Match>
Return the first matching pattern in the given haystack, begining the
search at at
.
sourcepub(crate) fn memory_usage(&self) -> usize
pub(crate) fn memory_usage(&self) -> usize
Returns the approximate total amount of heap used by this searcher, in units of bytes.
sourcefn verify(&self, id: PatternID, haystack: &[u8], at: usize) -> Option<Match>
fn verify(&self, id: PatternID, haystack: &[u8], at: usize) -> Option<Match>
Verify whether the pattern with the given id matches at
haystack[at..]
.
We tag this function as cold
because it helps improve codegen.
Intuitively, it would seem like inlining it would be better. However,
the only time this is called and a match is not found is when there
there is a hash collision, or when a prefix of a pattern matches but
the entire pattern doesn’t match. This is hopefully fairly rare, and
if it does occur a lot, it’s going to be slow no matter what we do.