enum State {
Empty {
next: StateID,
},
ByteRange {
trans: Transition,
},
Sparse {
transitions: Vec<Transition>,
},
Look {
look: Look,
next: StateID,
},
CaptureStart {
pattern_id: PatternID,
group_index: SmallIndex,
next: StateID,
},
CaptureEnd {
pattern_id: PatternID,
group_index: SmallIndex,
next: StateID,
},
Union {
alternates: Vec<StateID>,
},
UnionReverse {
alternates: Vec<StateID>,
},
Fail,
Match {
pattern_id: PatternID,
},
}
Expand description
An intermediate NFA state used during construction.
During construction of an NFA, it is often convenient to work with states that are amenable to mutation and other carry more information than we otherwise need once an NFA has been built. This type represents those needs.
Once construction is finished, the builder will convert these states to a
nfa::thompson::State
. This conversion not
only results in a simpler representation, but in some cases, entire classes
of states are completely removed (such as State::Empty
).
Variants§
Empty
An empty state whose only purpose is to forward the automaton to another state via an unconditional epsilon transition.
Unconditional epsilon transitions are quite useful during the construction of an NFA, as they permit the insertion of no-op placeholders that make it easier to compose NFA sub-graphs. When the Thompson NFA builder produces a final NFA, all unconditional epsilon transitions are removed, and state identifiers are remapped accordingly.
ByteRange
A state that only transitions to another state if the current input byte is in a particular range of bytes.
Fields
trans: Transition
Sparse
A state with possibly many transitions, represented in a sparse
fashion. Transitions must be ordered lexicographically by input range
and be non-overlapping. As such, this may only be used when every
transition has equal priority. (In practice, this is only used for
encoding large UTF-8 automata.) In contrast, a Union
state has each
alternate in order of priority. Priority is used to implement greedy
matching and also alternations themselves, e.g., abc|a
where abc
has priority over a
.
To clarify, it is possible to remove Sparse
and represent all things
that Sparse
is used for via Union
. But this creates a more bloated
NFA with more epsilon transitions than is necessary in the special case
of character classes.
Fields
transitions: Vec<Transition>
Look
A conditional epsilon transition satisfied via some sort of look-around.
CaptureStart
An empty state that records the start of a capture location. This is an
unconditional epsilon transition like Empty
, except it can be used to
record position information for a capture group when using the NFA for
search.
Fields
group_index: SmallIndex
The capture group index that this capture state corresponds to. The capture group index is always relative to its corresponding pattern. Therefore, in the presence of multiple patterns, both the pattern ID and the capture group index are required to uniquely identify a capturing group.
CaptureEnd
An empty state that records the end of a capture location. This is an
unconditional epsilon transition like Empty
, except it can be used to
record position information for a capture group when using the NFA for
search.
Fields
group_index: SmallIndex
The capture group index that this capture state corresponds to. The capture group index is always relative to its corresponding pattern. Therefore, in the presence of multiple patterns, both the pattern ID and the capture group index are required to uniquely identify a capturing group.
Union
An alternation such that there exists an epsilon transition to all
states in alternates
, where matches found via earlier transitions
are preferred over later transitions.
UnionReverse
An alternation such that there exists an epsilon transition to all
states in alternates
, where matches found via later transitions are
preferred over earlier transitions.
This “reverse” state exists for convenience during compilation that permits easy construction of non-greedy combinations of NFA states. At the end of compilation, Union and UnionReverse states are merged into one Union type of state, where the latter has its epsilon transitions reversed to reflect the priority inversion.
The “convenience” here arises from the fact that as new states are
added to the list of alternates
, we would like that add operation
to be amortized constant time. But if we used a Union
, we’d need to
prepend the state, which takes O(n) time. There are other approaches we
could use to solve this, but this seems simple enough.
Fail
A state that cannot be transitioned out of. This is useful for cases
where you want to prevent matching from occurring. For example, if your
regex parser permits empty character classes, then one could choose a
Fail
state to represent it.
Match
A match state. There is at most one such occurrence of this state in an NFA for each pattern compiled into the NFA. At time of writing, a match state is always produced for every pattern given, but in theory, if a pattern can never lead to a match, then the match state could be omitted.
pattern_id
refers to the ID of the pattern itself, which corresponds
to the pattern’s index (starting at 0).