Module regex_automata::util::determinize::state
source · Expand description
This module defines a DFA state representation and builders for constructing DFA states.
This representation is specifically for use in implementations of NFA-to-DFA conversion via powerset construction. (Also called “determinization” in this crate.)
The term “DFA state” is somewhat overloaded in this crate. In some cases, it refers to the set of transitions over an alphabet for a particular state. In other cases, it refers to a set of NFA states. The former is really about the final representation of a state in a DFA’s transition table, where as the latter—what this module is focused on—is closer to an intermediate form that is used to help eventually build the transition table.
This module exports four types. All four types represent the same idea: an ordered set of NFA states. This ordered set represents the epsilon closure of a particular NFA state, where the “epsilon closure” is the set of NFA states that can be transitioned to without consuming any input. i.e., Follow all of the NFA state’s epsilon transitions. In addition, this implementation of DFA states cares about two other things: the ordered set of pattern IDs corresponding to the patterns that match if the state is a match state, and the set of look-behind assertions that were true when the state was created.
The first, State
, is a frozen representation of a state that cannot be
modified. It may be cheaply cloned without copying the state itself and can be
accessed safely from multiple threads simultaneously. This type is useful for
when one knows that the DFA state being constructed is distinct from any other
previously constructed states. Namely, powerset construction, in practice,
requires one to keep a cache of previously created DFA states. Otherwise,
the number of DFA states created in memory balloons to an impractically
large number. For this reason, equivalent states should endeavor to have an
equivalent byte-level representation. (In general, “equivalency” here means,
“equivalent assertions, pattern IDs and NFA state IDs.” We do not require that
full DFA minimization be implemented here. This form of equivalency is only
surface deep and is more-or-less a practical necessity.)
The other three types represent different phases in the construction of a
DFA state. Internally, these three types (and State
) all use the same
byte-oriented representation. That means one can use any of the builder types
to check whether the state it represents already exists or not. If it does,
then there is no need to freeze it into a State
(which requires an alloc and
a copy). Here are the three types described succinctly:
-
StateBuilderEmpty
represents a state with no pattern IDs, no assertions and no NFA states. Creating aStateBuilderEmpty
performs no allocs. AStateBuilderEmpty
can only be used to query its underlying memory capacity, or to convert into a builder for recording pattern IDs and/or assertions. -
StateBuilderMatches
represents a state with zero or more pattern IDs, zero or more satisfied assertions and zero NFA state IDs. AStateBuilderMatches
can only be used for adding pattern IDs and recording assertions. -
StateBuilderNFA
represents a state with zero or more pattern IDs, zero or more satisfied assertions and zero or more NFA state IDs. AStateBuilderNFA
can only be used for adding NFA state IDs and recording some assertions.
The expected flow here is to use the above builders to construct a candidate
DFA state to check if it already exists. If it does, then there’s no need to
freeze it into a State
. If it doesn’t exist, then StateBuilderNFA::to_state
can be called to freeze the builder into an immutable State
. In either
case, clear
should be called on the builder to turn it back into a
StateBuilderEmpty
that reuses the underlying memory.
The main purpose for splitting the builder into these distinct types is to
make it impossible to do things like adding a pattern ID after adding an NFA
state ID. Namely, this makes it simpler to use a space-and-time efficient
binary representation for the state. (The format is documented on the Repr
type below.) If we just used one type for everything, it would be possible for
callers to use an incorrect interleaving of calls and thus result in a corrupt
representation. I chose to use more type machinery to make this impossible to
do because 1) determinization is itself pretty complex and it wouldn’t be too
hard to foul this up and 2) there isn’t too much machinery involved and it’s
well contained.
As an optimization, sometimes states won’t have certain things set. For
example, if the underlying NFA has no word boundary assertions, then there is
no reason to set a state’s look-behind assertion as to whether it was generated
from a word byte or not. Similarly, if a state has no NFA states corresponding
to look-around assertions, then there is no reason to set look_have
to a
non-empty set. Finally, callers usually omit unconditional epsilon transitions
when adding NFA state IDs since they aren’t discriminatory.
Finally, the binary representation used by these states is, thankfully, not serialized anywhere. So any kind of change can be made with reckless abandon, as long as everything in this module agrees.
Structs§
- Repr 🔒Repr is a read-only view into the representation of a DFA state.
- ReprVec 🔒ReprVec is a write-only view into the representation of a DFA state.
- State 🔒A DFA state that, at its core, is represented by an ordered set of NFA states.
- A state builder that represents an empty state.
- A state builder that collects assertions and pattern IDs.
- A state builder that collects some assertions and NFA state IDs.
Functions§
- Read a signed 32-bit integer using zig-zag encoding. Also, return the number of bytes read.
- Read an unsigned 32-bit varint. Also, return the number of bytes read.
- Push a native-endian encoded
n
on todst
. - Write a signed 32-bit integer using zig-zag encoding.
- Write an unsigned 32-bit integer as a varint. In essence,
n
is written as a sequence of bytes where all bytes except for the last one have the most significant bit set. The least significant 7 bits correspond to the actual bits ofn
. So in the worst case, a varint uses 5 bytes, but in very common cases, it uses fewer than 4.