Expand description

This module contains types and routines for implementing determinization.

In this crate, there are at least two places where we implement determinization: fully ahead-of-time compiled DFAs in the dfa module and lazily compiled DFAs in the hybrid module. The stuff in this module corresponds to the things that are in common between these implementations.

There are three broad things that our implementations of determinization have in common, as defined by this module:

  • The classification of start states. That is, whether we’re dealing with word boundaries, line boundaries, etc., is all the same. This also includes the look-behind assertions that are satisfied by each starting state classification.
  • The representation of DFA states as sets of NFA states, including convenience types for building these DFA states that are amenable to reusing allocations.
  • Routines for the “classical” parts of determinization: computing the epsilon closure, tracking match states (with corresponding pattern IDs, since we support multi-pattern finite automata) and, of course, computing the transition function between states for units of input.

I did consider a couple of alternatives to this particular form of code reuse:

  1. Don’t do any code reuse. The problem here is that we really want both forms of determinization to do exactly identical things when it comes to their handling of NFA states. While our tests generally ensure this, the code is tricky and large enough where not reusing code is a pretty big bummer.

  2. Implement all of determinization once and make it generic over fully compiled DFAs and lazily compiled DFAs. While I didn’t actually try this approach, my instinct is that it would be more complex than is needed here. And the interface required would be pretty hairy. Instead, I think splitting it into logical sub-components works better.

Modules

  • state 🔒
    This module defines a DFA state representation and builders for constructing DFA states.

Functions

  • Add the NFA state IDs in the given set to the given DFA builder state. The order in which states are added corresponds to the order in which they were added to set.
  • Compute the epsilon closure for the given NFA state. The epsilon closure consists of all NFA state IDs, including start_nfa_id, that can be reached from start_nfa_id without consuming any input. These state IDs are written to set in the order they are visited, but only if they are not already in set. start_nfa_id must be a valid state ID for the NFA given.
  • next 🔒
    Compute the set of all reachable NFA states, including the full epsilon closure, from a DFA state for a single unit of input. The set of reachable states is returned as a StateBuilderNFA. The StateBuilderNFA returned also includes any look-behind assertions satisfied by unit, in addition to whether it is a match state. For multi-pattern DFAs, the builder will also include the pattern IDs that match (in the order seen).
  • Sets the appropriate look-behind assertions on the given state based on this starting configuration.