Struct regex_automata::hybrid::dfa::Cache
source · pub struct Cache {
trans: Vec<LazyStateID>,
starts: Vec<LazyStateID>,
states: Vec<State>,
states_to_id: HashMap<State, LazyStateID>,
sparses: SparseSets,
stack: Vec<StateID>,
scratch_state_builder: StateBuilderEmpty,
state_saver: StateSaver,
memory_usage_state: usize,
clear_count: usize,
bytes_searched: usize,
progress: Option<SearchProgress>,
}
Expand description
A cache represents a partially computed DFA.
A cache is the key component that differentiates a classical DFA and a
hybrid NFA/DFA (also called a “lazy DFA”). Where a classical DFA builds a
complete transition table that can handle all possible inputs, a hybrid
NFA/DFA starts with an empty transition table and builds only the parts
required during search. The parts that are built are stored in a cache. For
this reason, a cache is a required parameter for nearly every operation on
a DFA
.
Caches can be created from their corresponding DFA via
DFA::create_cache
. A cache can only be used with either the DFA that
created it, or the DFA that was most recently used to reset it with
Cache::reset
. Using a cache with any other DFA may result in panics
or incorrect results.
Fields§
§trans: Vec<LazyStateID>
The transition table.
Given a current
LazyStateID and an input
byte, the next state can
be computed via trans[untagged(current) + equiv_class(input)]
. Notice
that no multiplication is used. That’s because state identifiers are
“premultiplied.”
Note that the next state may be the “unknown” state. In this case, the
next state is not known and determinization for current
on input
must be performed.
starts: Vec<LazyStateID>
The starting states for this DFA.
These are computed lazily. Initially, these are all set to “unknown” lazy state IDs.
When ‘starts_for_each_pattern’ is disabled (the default), then the size of this is constrained to the possible starting configurations based on the search parameters. (At time of writing, that’s 4.) However, when starting states for each pattern is enabled, then there are N additional groups of starting states, where each group reflects the different possible configurations and N is the number of patterns.
states: Vec<State>
A sequence of NFA/DFA powerset states that have been computed for this lazy DFA. This sequence is indexable by untagged LazyStateIDs. (Every tagged LazyStateID can be used to index this sequence by converting it to its untagged form.)
states_to_id: HashMap<State, LazyStateID>
A map from states to their corresponding IDs. This map may be accessed
via the raw byte representation of a state, which means that a State
does not need to be allocated to determine whether it already exists
in this map. Indeed, the existence of such a state is what determines
whether we allocate a new State
or not.
The higher level idea here is that we do just enough determinization for a state to check whether we’ve already computed it. If we have, then we can save a little (albeit not much) work. The real savings is in memory usage. If we never checked for trivially duplicate states, then our memory usage would explode to unreasonable levels.
sparses: SparseSets
Sparse sets used to track which NFA states have been visited during various traversals.
stack: Vec<StateID>
Scratch space for traversing the NFA graph. (We use space on the heap instead of the call stack.)
scratch_state_builder: StateBuilderEmpty
Scratch space for building a NFA/DFA powerset state. This is used to
help amortize allocation since not every powerset state generated is
added to the cache. In particular, if it already exists in the cache,
then there is no need to allocate a new State
for it.
state_saver: StateSaver
A simple abstraction for handling the saving of at most a single state across a cache clearing. This is required for correctness. Namely, if adding a new state after clearing the cache fails, then the caller must retain the ability to continue using the state ID given. The state corresponding to the state ID is what we preserve across cache clearings.
memory_usage_state: usize
The memory usage, in bytes, used by ‘states’ and ‘states_to_id’. We track this as new states are added since states use a variable amount of heap. Tracking this as we add states makes it possible to compute the total amount of memory used by the determinizer in constant time.
clear_count: usize
The number of times the cache has been cleared. When a minimum cache clear count is set, then the cache will return an error instead of clearing the cache if the count has been exceeded.
bytes_searched: usize
The total number of bytes searched since the last time this cache was cleared, not including the current search.
This can be added to the length of the current search to get the true total number of bytes searched.
This is generally only non-zero when the
Cache::search_{start,update,finish}
APIs are used to track search
progress.
progress: Option<SearchProgress>
The progress of the current search.
This is only non-None
when callers utlize the Cache::search_start
,
Cache::search_update
and Cache::search_finish
APIs.
The purpose of recording search progress is to be able to make a determination about the efficiency of the cache. Namely, by keeping track of the
Implementations§
source§impl Cache
impl Cache
sourcepub fn new(dfa: &DFA) -> Cache
pub fn new(dfa: &DFA) -> Cache
Create a new cache for the given lazy DFA.
The cache returned should only be used for searches for the given DFA.
If you want to reuse the cache for another DFA, then you must call
Cache::reset
with that DFA.
sourcepub fn reset(&mut self, dfa: &DFA)
pub fn reset(&mut self, dfa: &DFA)
Reset this cache such that it can be used for searching with the given lazy DFA (and only that DFA).
A cache reset permits reusing memory already allocated in this cache with a different lazy DFA.
Resetting a cache sets its “clear count” to 0. This is relevant if the lazy DFA has been configured to “give up” after it has cleared the cache a certain number of times.
Any lazy state ID generated by the cache prior to resetting it is invalid after the reset.
§Example
This shows how to re-purpose a cache for use with a different DFA.
use regex_automata::{hybrid::dfa::DFA, HalfMatch, Input};
let dfa1 = DFA::new(r"\w")?;
let dfa2 = DFA::new(r"\W")?;
let mut cache = dfa1.create_cache();
assert_eq!(
Some(HalfMatch::must(0, 2)),
dfa1.try_search_fwd(&mut cache, &Input::new("Δ"))?,
);
// Using 'cache' with dfa2 is not allowed. It may result in panics or
// incorrect results. In order to re-purpose the cache, we must reset
// it with the DFA we'd like to use it with.
//
// Similarly, after this reset, using the cache with 'dfa1' is also not
// allowed.
cache.reset(&dfa2);
assert_eq!(
Some(HalfMatch::must(0, 3)),
dfa2.try_search_fwd(&mut cache, &Input::new("☃"))?,
);
sourcepub fn search_start(&mut self, at: usize)
pub fn search_start(&mut self, at: usize)
Initializes a new search starting at the given position.
If a previous search was unfinished, then it is finished automatically and a new search is begun.
Note that keeping track of search progress is not necessary
for correct implementations of search using a lazy DFA. Keeping
track of search progress is only necessary if you want the
Config::minimum_bytes_per_state
configuration knob to work.
sourcepub fn search_update(&mut self, at: usize)
pub fn search_update(&mut self, at: usize)
Updates the current search to indicate that it has search to the current position.
No special care needs to be taken for reverse searches. Namely, the position given may be less than the starting position of the search.
§Panics
This panics if no search has been started by Cache::search_start
.
sourcepub fn search_finish(&mut self, at: usize)
pub fn search_finish(&mut self, at: usize)
Indicates that a search has finished at the given position.
§Panics
This panics if no search has been started by Cache::search_start
.
sourcepub fn search_total_len(&self) -> usize
pub fn search_total_len(&self) -> usize
Returns the total number of bytes that have been searched since this cache was last cleared.
This is useful for determining the efficiency of the cache. For
example, the lazy DFA uses this value in conjunction with the
Config::minimum_bytes_per_state
knob to help determine whether it
should quit searching.
This always returns 0
if search progress isn’t being tracked. Note
that the lazy DFA search routines in this crate always track search
progress.
sourcepub fn clear_count(&self) -> usize
pub fn clear_count(&self) -> usize
Returns the total number of times this cache has been cleared since it was either created or last reset.
This is useful for informational purposes or if you want to change search strategies based on the number of times the cache has been cleared.
sourcepub fn memory_usage(&self) -> usize
pub fn memory_usage(&self) -> usize
Returns the heap memory usage, in bytes, of this cache.
This does not include the stack size used up by this cache. To
compute that, use std::mem::size_of::<Cache>()
.