Function regex_automata::hybrid::dfa::minimum_cache_capacity

source ·
fn minimum_cache_capacity(
    nfa: &NFA,
    classes: &ByteClasses,
    starts_for_each_pattern: bool,
) -> usize
Expand description

Based on the minimum number of states required for a useful lazy DFA cache, this returns a heuristic minimum number of bytes of heap space required.

This is a “heuristic” because the minimum it returns is likely bigger than the true minimum. Namely, it assumes that each powerset NFA/DFA state uses the maximum number of NFA states (all of them). This is likely bigger than what is required in practice. Computing the true minimum effectively requires determinization, which is probably too much work to do for a simple check like this.

One of the issues with this approach IMO is that it requires that this be in sync with the calculation above for computing how much heap memory the DFA cache uses. If we get it wrong, it’s possible for example for the minimum to be smaller than the computed heap memory, and thus, it may be the case that we can’t add the required minimum number of states. That in turn will make lazy DFA panic because we assume that we can add at least a minimum number of states.

Another approach would be to always allow the minimum number of states to be added to the lazy DFA cache, even if it exceeds the configured cache limit. This does mean that the limit isn’t really a limit in all cases, which is unfortunate. But it does at least guarantee that the lazy DFA can always make progress, even if it is slow. (This approach is very similar to enabling the ‘skip_cache_capacity_check’ config knob, except it wouldn’t rely on cache size calculation. Instead, it would just always permit a minimum number of states to be added.)