Struct regex_automata::nfa::thompson::map::Utf8BoundedMap
source · pub struct Utf8BoundedMap {
version: u16,
capacity: usize,
map: Vec<Utf8BoundedEntry>,
}
Expand description
A bounded hash map where the key is a sequence of NFA transitions and the value is a pre-existing NFA state ID.
std’s hashmap can be used for this, however, this map has two important advantages. Firstly, it has lower overhead. Secondly, it permits us to control our memory usage by limited the number of slots. In general, the cost here is that this map acts as a cache. That is, inserting a new entry may remove an old entry. We are okay with this, since it does not impact correctness in the cases where it is used. The only effect that dropping states from the cache has is that the resulting NFA generated may be bigger than it otherwise would be.
This improves benchmarks that compile large Unicode character classes, since it makes the generation of (almost) minimal UTF-8 automaton faster. Specifically, one could observe the difference with std’s hashmap via something like the following benchmark:
hyperfine “regex-cli debug thompson -qr –captures none ‘\w{90} ecurB’”
But to observe that difference, you’d have to modify the code to use std’s hashmap.
It is quite possible that there is a better way to approach this problem. For example, if there happens to be a very common state that collides with a lot of less frequent states, then we could wind up with very poor caching behavior. Alas, the effectiveness of this cache has not been measured. Instead, ad hoc experiments suggest that it is “good enough.” Additional smarts (such as an LRU eviction policy) have to be weighed against the amount of extra time they cost.
Fields§
§version: u16
The current version of this map. Only entries with matching versions are considered during lookups. If an entry is found with a mismatched version, then the map behaves as if the entry does not exist.
This makes it possible to clear the map by simply incrementing the version number instead of actually deallocating any storage.
capacity: usize
The total number of entries this map can store.
map: Vec<Utf8BoundedEntry>
The actual entries, keyed by hash. Collisions between different states result in the old state being dropped.
Implementations§
source§impl Utf8BoundedMap
impl Utf8BoundedMap
sourcepub fn new(capacity: usize) -> Utf8BoundedMap
pub fn new(capacity: usize) -> Utf8BoundedMap
Create a new bounded map with the given capacity. The map will never grow beyond the given size.
Note that this does not allocate. Instead, callers must call clear
before using this map. clear
will allocate space if necessary.
This avoids the need to pay for the allocation of this map when compiling regexes that lack large Unicode character classes.
sourcepub fn clear(&mut self)
pub fn clear(&mut self)
Clear this map of all entries, but permit the reuse of allocation if possible.
This must be called before the map can be used.
sourcepub fn hash(&self, key: &[Transition]) -> usize
pub fn hash(&self, key: &[Transition]) -> usize
Return a hash of the given transitions.
sourcepub fn get(&mut self, key: &[Transition], hash: usize) -> Option<StateID>
pub fn get(&mut self, key: &[Transition], hash: usize) -> Option<StateID>
Retrieve the cached state ID corresponding to the given key. The hash
given must have been computed with hash
using the same key value.
If there is no cached state with the given transitions, then None is returned.
sourcepub fn set(&mut self, key: Vec<Transition>, hash: usize, state_id: StateID)
pub fn set(&mut self, key: Vec<Transition>, hash: usize, state_id: StateID)
Add a cached state to this map with the given key. Callers should
ensure that state_id
points to a state that contains precisely the
NFA transitions given.
hash
must have been computed using the hash
method with the same
key.
Trait Implementations§
source§impl Clone for Utf8BoundedMap
impl Clone for Utf8BoundedMap
source§fn clone(&self) -> Utf8BoundedMap
fn clone(&self) -> Utf8BoundedMap
1.0.0 · source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source
. Read more