Module regex_automata::util::determinize::state

source ·
Expand description

This module defines a DFA state representation and builders for constructing DFA states.

This representation is specifically for use in implementations of NFA-to-DFA conversion via powerset construction. (Also called “determinization” in this crate.)

The term “DFA state” is somewhat overloaded in this crate. In some cases, it refers to the set of transitions over an alphabet for a particular state. In other cases, it refers to a set of NFA states. The former is really about the final representation of a state in a DFA’s transition table, where as the latter—what this module is focused on—is closer to an intermediate form that is used to help eventually build the transition table.

This module exports four types. All four types represent the same idea: an ordered set of NFA states. This ordered set represents the epsilon closure of a particular NFA state, where the “epsilon closure” is the set of NFA states that can be transitioned to without consuming any input. i.e., Follow all of the NFA state’s epsilon transitions. In addition, this implementation of DFA states cares about two other things: the ordered set of pattern IDs corresponding to the patterns that match if the state is a match state, and the set of look-behind assertions that were true when the state was created.

The first, State, is a frozen representation of a state that cannot be modified. It may be cheaply cloned without copying the state itself and can be accessed safely from multiple threads simultaneously. This type is useful for when one knows that the DFA state being constructed is distinct from any other previously constructed states. Namely, powerset construction, in practice, requires one to keep a cache of previously created DFA states. Otherwise, the number of DFA states created in memory balloons to an impractically large number. For this reason, equivalent states should endeavor to have an equivalent byte-level representation. (In general, “equivalency” here means, “equivalent assertions, pattern IDs and NFA state IDs.” We do not require that full DFA minimization be implemented here. This form of equivalency is only surface deep and is more-or-less a practical necessity.)

The other three types represent different phases in the construction of a DFA state. Internally, these three types (and State) all use the same byte-oriented representation. That means one can use any of the builder types to check whether the state it represents already exists or not. If it does, then there is no need to freeze it into a State (which requires an alloc and a copy). Here are the three types described succinctly:

  • StateBuilderEmpty represents a state with no pattern IDs, no assertions and no NFA states. Creating a StateBuilderEmpty performs no allocs. A StateBuilderEmpty can only be used to query its underlying memory capacity, or to convert into a builder for recording pattern IDs and/or assertions.

  • StateBuilderMatches represents a state with zero or more pattern IDs, zero or more satisfied assertions and zero NFA state IDs. A StateBuilderMatches can only be used for adding pattern IDs and recording assertions.

  • StateBuilderNFA represents a state with zero or more pattern IDs, zero or more satisfied assertions and zero or more NFA state IDs. A StateBuilderNFA can only be used for adding NFA state IDs and recording some assertions.

The expected flow here is to use the above builders to construct a candidate DFA state to check if it already exists. If it does, then there’s no need to freeze it into a State. If it doesn’t exist, then StateBuilderNFA::to_state can be called to freeze the builder into an immutable State. In either case, clear should be called on the builder to turn it back into a StateBuilderEmpty that reuses the underlying memory.

The main purpose for splitting the builder into these distinct types is to make it impossible to do things like adding a pattern ID after adding an NFA state ID. Namely, this makes it simpler to use a space-and-time efficient binary representation for the state. (The format is documented on the Repr type below.) If we just used one type for everything, it would be possible for callers to use an incorrect interleaving of calls and thus result in a corrupt representation. I chose to use more type machinery to make this impossible to do because 1) determinization is itself pretty complex and it wouldn’t be too hard to foul this up and 2) there isn’t too much machinery involved and it’s well contained.

As an optimization, sometimes states won’t have certain things set. For example, if the underlying NFA has no word boundary assertions, then there is no reason to set a state’s look-behind assertion as to whether it was generated from a word byte or not. Similarly, if a state has no NFA states corresponding to look-around assertions, then there is no reason to set look_have to a non-empty set. Finally, callers usually omit unconditional epsilon transitions when adding NFA state IDs since they aren’t discriminatory.

Finally, the binary representation used by these states is, thankfully, not serialized anywhere. So any kind of change can be made with reckless abandon, as long as everything in this module agrees.

Structs§

  • Repr 🔒
    Repr is a read-only view into the representation of a DFA state.
  • ReprVec 🔒
    ReprVec is a write-only view into the representation of a DFA state.
  • State 🔒
    A DFA state that, at its core, is represented by an ordered set of NFA states.
  • A state builder that represents an empty state.
  • A state builder that collects assertions and pattern IDs.
  • A state builder that collects some assertions and NFA state IDs.

Functions§

  • Read a signed 32-bit integer using zig-zag encoding. Also, return the number of bytes read.
  • Read an unsigned 32-bit varint. Also, return the number of bytes read.
  • write_u32 🔒
    Push a native-endian encoded n on to dst.
  • Write a signed 32-bit integer using zig-zag encoding.
  • Write an unsigned 32-bit integer as a varint. In essence, n is written as a sequence of bytes where all bytes except for the last one have the most significant bit set. The least significant 7 bits correspond to the actual bits of n. So in the worst case, a varint uses 5 bytes, but in very common cases, it uses fewer than 4.