Module regex_automata::util::alphabet

This module provides APIs for dealing with the alphabets of finite state machines.

There are two principal types in this module, ByteClasses and Unit. The former defines the alphabet of a finite state machine while the latter represents an element of that alphabet.

To a first approximation, the alphabet of all automata in this crate is just a u8. Namely, every distinct byte value. All 256 of them. In practice, this can be quite wasteful when building a transition table for a DFA, since it requires storing a state identifier for each element in the alphabet. Instead, we collapse the alphabet of an automaton down into equivalence classes, where no byte in an equivalence class ever discriminates between a match and a non-match differently from any other byte in the same class. For example, the regex [a-z]+ can be thought of as having an alphabet consisting of two equivalence classes: a-z and everything else. The transitions of an automaton don’t actually need to represent every distinct byte. Just the equivalence classes.
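As a rough illustration of the savings, the following sketch builds the two-class alphabet for [a-z]+ by hand and compares transition table sizes. The helpers lowercase_classes, alphabet_len and transition_table_len are hypothetical names made up for this example. They are not part of this crate's API, and the crate's own partitioning may differ in detail.

```rust
/// Builds a 256-entry lookup table mapping each byte to an equivalence
/// class for the regex `[a-z]+`: class 1 for `a-z`, class 0 for the rest.
/// (Hypothetical helper, for illustration only.)
fn lowercase_classes() -> [u8; 256] {
    let mut table = [0u8; 256];
    for b in b'a'..=b'z' {
        table[b as usize] = 1;
    }
    table
}

/// Number of distinct classes in a table, assuming class identifiers are
/// assigned contiguously starting at 0.
fn alphabet_len(table: &[u8; 256]) -> usize {
    table.iter().copied().max().map_or(0, |m| m as usize + 1)
}

/// Number of state identifiers a dense transition table has to store:
/// one per (state, alphabet element) pair.
fn transition_table_len(num_states: usize, alphabet_len: usize) -> usize {
    num_states * alphabet_len
}
```

Under these assumptions, a three-state DFA needs 3 * 256 = 768 entries when indexed by raw bytes, but only 3 * 2 = 6 entries when indexed by equivalence class.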

The downside of equivalence classes is that, of course, searching a haystack deals with individual byte values. Those byte values need to be mapped to their corresponding equivalence class. This is what ByteClasses does. In practice, doing this for every state transition has negligible impact on modern CPUs. Moreover, it helps make more efficient use of the CPU cache by (possibly considerably) shrinking the size of the transition table.
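The per-byte mapping step can be sketched with a tiny hand-written DFA for [a-z]+. Everything here (the state constants, byte_classes, TRANSITIONS and is_full_match) is a hypothetical stand-in for illustration, not how the DFAs in this crate are actually laid out.

```rust
// States of a hand-written DFA for `[a-z]+` (anchored at both ends).
const DEAD: usize = 0;
const START: usize = 1;
const MATCH: usize = 2;

// Two equivalence classes: 0 = everything else, 1 = `a-z`.
const ALPHABET_LEN: usize = 2;

/// Dense transition table: one row per state, one column per class.
const TRANSITIONS: [usize; 3 * ALPHABET_LEN] = [
    DEAD, DEAD,  // from DEAD
    DEAD, MATCH, // from START
    DEAD, MATCH, // from MATCH
];

/// Hypothetical byte-to-class lookup table for `[a-z]+`.
fn byte_classes() -> [u8; 256] {
    let mut table = [0u8; 256];
    for b in b'a'..=b'z' {
        table[b as usize] = 1;
    }
    table
}

/// Reports whether the entire haystack matches `[a-z]+`.
fn is_full_match(haystack: &[u8]) -> bool {
    let classes = byte_classes();
    let mut state = START;
    for &byte in haystack {
        // The extra step: map the raw byte to its equivalence class
        // before indexing into the (much smaller) transition table.
        let class = classes[byte as usize] as usize;
        state = TRANSITIONS[state * ALPHABET_LEN + class];
        if state == DEAD {
            return false;
        }
    }
    state == MATCH
}
```

The cost of the mapping is a single extra array lookup per byte, paid in exchange for a transition table with 2 columns instead of 256.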

One last hiccup concerns Unit. Namely, because of look-around and how the DFAs in this crate work, we need to add a sentinel value to our alphabet of equivalence classes that represents the “end” of a search. We call that sentinel Unit::eoi or “end of input.” Thus, a Unit is either an equivalence class corresponding to a set of bytes, or it is a special “end of input” sentinel.
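One way to picture this is as an enum with two cases. The sketch below uses a hypothetical HaystackUnit type, not the crate's actual Unit definition. Giving the sentinel the column index right after the last byte class is an assumption made for this example.

```rust
/// Hypothetical stand-in for a unit of haystack. This is NOT the crate's
/// `Unit` type, only an illustration of the idea.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum HaystackUnit {
    /// An equivalence class identifier for an ordinary byte.
    Class(u8),
    /// Sentinel fed to the automaton once, after the last byte.
    EndOfInput,
}

impl HaystackUnit {
    /// Column index in a transition table with `alphabet_len` byte classes.
    /// Byte classes use columns `0..alphabet_len`; the sentinel is assumed
    /// to take the one extra column right after them.
    fn as_index(self, alphabet_len: usize) -> usize {
        match self {
            HaystackUnit::Class(c) => c as usize,
            HaystackUnit::EndOfInput => alphabet_len,
        }
    }
}

/// Yields one unit per haystack byte, followed by the end-of-input sentinel.
fn units<'a>(
    haystack: &'a [u8],
    classes: &'a [u8; 256],
) -> impl Iterator<Item = HaystackUnit> + 'a {
    haystack
        .iter()
        .map(move |&b| HaystackUnit::Class(classes[b as usize]))
        .chain(std::iter::once(HaystackUnit::EndOfInput))
}
```

With this shape, a search over n bytes performs n + 1 transitions, and the final one gives the automaton a chance to resolve anything that depends on what follows the last byte.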

In general, you should not expect to need either of these types unless you’re doing lower level shenanigans with DFAs, or even building your own DFAs. (Although, of course, you don’t have to use these types to build your own DFAs.) For example, if you’re walking a DFA’s state graph, it’s probably useful to use ByteClasses to visit each element in the DFA’s alphabet instead of just visiting every distinct u8 value. The latter isn’t necessarily wrong, but it could be very wasteful.
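To make the "visit each element of the alphabet" idea concrete, here is a minimal sketch that picks one representative byte per equivalence class from a plain lookup table. The representatives function below is a hypothetical helper written for this example. Choosing the smallest byte in each class is an arbitrary choice and may not match what the crate's own iterators return.

```rust
/// Returns one representative byte per equivalence class, namely the
/// smallest byte that belongs to each class, in ascending byte order.
/// (Hypothetical helper, for illustration only.)
fn representatives(classes: &[u8; 256]) -> Vec<u8> {
    let mut seen = [false; 256];
    let mut reps = Vec::new();
    for byte in 0..=255u8 {
        let class = classes[byte as usize] as usize;
        // Only the first byte encountered for each class is kept.
        if !seen[class] {
            seen[class] = true;
            reps.push(byte);
        }
    }
    reps
}
```

When walking a state graph, following the transition for each representative visits every distinct outgoing edge of a state. For the two-class alphabet of [a-z]+ that is 2 transitions per state rather than 256.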

Structs§

  • BitSet 🔒
    The representation of a byte set. Split out so that we can define a convenient Debug impl for it while keeping “ByteSet” in the output.
  • An iterator over all elements in an equivalence class expressed as a sequence of contiguous ranges.
  • An iterator over all elements in an equivalence class.
  • An iterator over each equivalence class.
  • An iterator over representative bytes from each equivalence class.
  • A partitioning of bytes into equivalence classes.
  • A representation of byte oriented equivalence classes.
  • ByteSet 🔒
    A simple set of bytes that is reasonably cheap to copy and allocation free.
  • Unit represents a single unit of haystack for DFA based regex engines.

Enums§