Struct regex_automata::nfa::thompson::compiler::Utf8Compiler
source · struct Utf8Compiler<'a> {
builder: &'a mut Builder,
state: &'a mut Utf8State,
target: StateID,
}
Expand description
A UTF-8 compiler based on Daciuk’s algorithm for compilining minimal DFAs from a lexicographically sorted sequence of strings in linear time.
The trick here is that any Unicode codepoint range can be converted to a sequence of byte ranges that form a UTF-8 automaton. Connecting them together via an alternation is trivial, and indeed, it works. However, there is a lot of redundant structure in many UTF-8 automatons. Since our UTF-8 ranges are in lexicographic order, we can use Daciuk’s algorithm to build nearly minimal DFAs in linear time. (They are guaranteed to be minimal because we use a bounded cache of previously build DFA states.)
The drawback is that this sadly doesn’t work for reverse automata, since the ranges are no longer in lexicographic order. For that, we invented the range trie (which gets its own module). Once a range trie is built, we then use this same Utf8Compiler to build a reverse UTF-8 automaton.
The high level idea is described here: https://blog.burntsushi.net/transducers/#finite-state-machines-as-data-structures
There is also another implementation of this in the fst
crate.
Fields§
§builder: &'a mut Builder
§state: &'a mut Utf8State
§target: StateID