regex_automata::nfa::thompson::compiler

Struct Utf8Compiler

source
struct Utf8Compiler<'a> {
    builder: &'a mut Builder,
    state: &'a mut Utf8State,
    target: StateID,
}
Expand description

A UTF-8 compiler based on Daciuk’s algorithm for compilining minimal DFAs from a lexicographically sorted sequence of strings in linear time.

The trick here is that any Unicode codepoint range can be converted to a sequence of byte ranges that form a UTF-8 automaton. Connecting them together via an alternation is trivial, and indeed, it works. However, there is a lot of redundant structure in many UTF-8 automatons. Since our UTF-8 ranges are in lexicographic order, we can use Daciuk’s algorithm to build nearly minimal DFAs in linear time. (They are guaranteed to be minimal because we use a bounded cache of previously build DFA states.)

The drawback is that this sadly doesn’t work for reverse automata, since the ranges are no longer in lexicographic order. For that, we invented the range trie (which gets its own module). Once a range trie is built, we then use this same Utf8Compiler to build a reverse UTF-8 automaton.

The high level idea is described here: https://blog.burntsushi.net/transducers/#finite-state-machines-as-data-structures

There is also another implementation of this in the fst crate.

Fields§

§builder: &'a mut Builder§state: &'a mut Utf8State§target: StateID

Implementations§

source§

impl<'a> Utf8Compiler<'a>

source

fn new( builder: &'a mut Builder, state: &'a mut Utf8State, ) -> Result<Utf8Compiler<'a>, BuildError>

source

fn finish(&mut self) -> Result<ThompsonRef, BuildError>

source

fn add(&mut self, ranges: &[Utf8Range]) -> Result<(), BuildError>

source

fn compile_from(&mut self, from: usize) -> Result<(), BuildError>

source

fn compile(&mut self, node: Vec<Transition>) -> Result<StateID, BuildError>

source

fn add_suffix(&mut self, ranges: &[Utf8Range])

source

fn add_empty(&mut self)

source

fn pop_freeze(&mut self, next: StateID) -> Vec<Transition>

source

fn pop_root(&mut self) -> Vec<Transition>

source

fn top_last_freeze(&mut self, next: StateID)

Trait Implementations§

source§

impl<'a> Debug for Utf8Compiler<'a>

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl<'a> Freeze for Utf8Compiler<'a>

§

impl<'a> RefUnwindSafe for Utf8Compiler<'a>

§

impl<'a> Send for Utf8Compiler<'a>

§

impl<'a> Sync for Utf8Compiler<'a>

§

impl<'a> Unpin for Utf8Compiler<'a>

§

impl<'a> !UnwindSafe for Utf8Compiler<'a>

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

source§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.