Module regex_automata::util::alphabet
source · Expand description
This module provides APIs for dealing with the alphabets of finite state machines.
There are two principal types in this module, ByteClasses
and Unit
.
The former defines the alphabet of a finite state machine while the latter
represents an element of that alphabet.
To a first approximation, the alphabet of all automata in this crate is just
a u8
. Namely, every distinct byte value. All 256 of them. In practice, this
can be quite wasteful when building a transition table for a DFA, since it
requires storing a state identifier for each element in the alphabet. Instead,
we collapse the alphabet of an automaton down into equivalence classes, where
every byte in the same equivalence class never discriminates between a match or
a non-match from any other byte in the same class. For example, in the regex
[a-z]+
, then you could consider it having an alphabet consisting of two
equivalence classes: a-z
and everything else. In terms of the transitions on
an automaton, it doesn’t actually require representing every distinct byte.
Just the equivalence classes.
The downside of equivalence classes is that, of course, searching a haystack
deals with individual byte values. Those byte values need to be mapped to
their corresponding equivalence class. This is what ByteClasses
does. In
practice, doing this for every state transition has negligible impact on modern
CPUs. Moreover, it helps make more efficient use of the CPU cache by (possibly
considerably) shrinking the size of the transition table.
One last hiccup concerns Unit
. Namely, because of look-around and how the
DFAs in this crate work, we need to add a sentinel value to our alphabet
of equivalence classes that represents the “end” of a search. We call that
sentinel Unit::eoi
or “end of input.” Thus, a Unit
is either an
equivalence class corresponding to a set of bytes, or it is a special “end of
input” sentinel.
In general, you should not expect to need either of these types unless you’re
doing lower level shenanigans with DFAs, or even building your own DFAs.
(Although, you don’t have to use these types to build your own DFAs of course.)
For example, if you’re walking a DFA’s state graph, it’s probably useful to
make use of ByteClasses
to visit each element in the DFA’s alphabet instead
of just visiting every distinct u8
value. The latter isn’t necessarily wrong,
but it could be potentially very wasteful.
Structs§
- BitSet 🔒The representation of a byte set. Split out so that we can define a convenient Debug impl for it while keeping “ByteSet” in the output.
- An iterator over all elements in an equivalence class expressed as a sequence of contiguous ranges.
- An iterator over all elements in an equivalence class.
- An iterator over each equivalence class.
- An iterator over representative bytes from each equivalence class.
- A partitioning of bytes into equivalence classes.
- A representation of byte oriented equivalence classes.
- ByteSet 🔒A simple set of bytes that is reasonably cheap to copy and allocation free.
- Unit represents a single unit of haystack for DFA based regex engines.
Enums§
- UnitKind 🔒