regex_automata::nfa::thompson::literal_trie

Struct LiteralTrie

source
pub(crate) struct LiteralTrie {
    states: Vec<State>,
    rev: bool,
}
Expand description

A trie that preserves leftmost-first match semantics.

This is a purpose-built data structure for optimizing ‘lit1|lit2|..|litN’ patterns. It can only handle alternations of literals, which makes it somewhat restricted in its scope, but literal alternations are fairly common.

At a 5,000 foot level, the main idea of this trie is make an alternation of literals look more like a DFA than an NFA via epsilon removal.

More precisely, the main issue is in how alternations are compiled into a Thompson NFA. Namely, each alternation gets a single NFA “union” state with an epsilon transition for every branch of the alternation pointing to an NFA state corresponding to the start of that branch. The main problem with this representation is the cost of computing an epsilon closure. Once you hit the alternation’s start state, it acts as a sort of “clog” that requires you to traverse all of the epsilon transitions to compute the full closure.

While fixing such clogs in the general case is pretty tricky without going to a DFA (or perhaps a Glushkov NFA, but that comes with other problems). But at least in the case of an alternation of literals, we can convert that to a prefix trie without too much cost. In theory, that’s all you really need to do: build the trie and then compile it to a Thompson NFA. For example, if you have the pattern ‘bar|baz|foo’, then using a trie, it is transformed to something like ‘b(a(r|z))|f’. This reduces the clog by reducing the number of epsilon transitions out of the alternation’s start state from 3 to 2 (it actually gets down to 1 when you use a sparse state, which we do below). It’s a small effect here, but when your alternation is huge, the savings is also huge.

And that is… essentially what a LiteralTrie does. But there is one hiccup. Consider a regex like ‘sam|samwise’. How does a prefix trie compile that when leftmost-first semantics are used? If ‘sam|samwise’ was the entire regex, then you could just drop the ‘samwise’ branch entirely since it is impossible to match (‘sam’ will always take priority, and since it is a prefix of ‘samwise’, ‘samwise’ will never match). But what about the regex ‘\b(sam|samwise)\b’? In that case, you can’t remove ‘samwise’ because it might match when ‘sam’ doesn’t fall on a word boundary.

The main idea is that ‘sam|samwise’ can be translated to ‘sam(?:|wise)’, which is a precisely equivalent regex that also gets rid of the clog.

Another example is ‘zapper|z|zap’. That gets translated to ‘z(?:apper||ap)’.

We accomplish this by giving each state in the trie multiple “chunks” of transitions. Each chunk barrier represents a match. The idea is that once you know a match occurs, none of the transitions after the match can be re-ordered and mixed in with the transitions before the match. Otherwise, the match semantics could be changed.

See the ‘State’ data type for a bit more detail.

Future work:

  • In theory, it would be nice to generalize the idea of removing clogs and apply it to the NFA graph itself. Then this could in theory work for case insensitive alternations of literals, or even just alternations where each branch starts with a non-epsilon transition.
  • Could we instead use the Aho-Corasick algorithm here? The aho-corasick crate deals with leftmost-first matches correctly, but I think this implies encoding failure transitions into a Thompson NFA somehow. Which seems fine, because failure transitions are just unconditional epsilon transitions?
  • Or perhaps even better, could we use an aho_corasick::AhoCorasick directly? At time of writing, 0.7 is the current version of the aho-corasick crate, and that definitely cannot be used as-is. But if we expose the underlying finite state machine API, then could we use it? That would be super. If we could figure that out, it might also lend itself to more general composition of finite state machines.

Fields§

§states: Vec<State>

The set of trie states. Each state contains one or more chunks, where each chunk is a sparse set of transitions to other states. A leaf state is always a match state that contains only empty chunks (i.e., no transitions).

§rev: bool

Whether to add literals in reverse to the trie. Useful when building a reverse NFA automaton.

Implementations§

source§

impl LiteralTrie

source

pub(crate) fn forward() -> LiteralTrie

Create a new literal trie that adds literals in the forward direction.

source

pub(crate) fn reverse() -> LiteralTrie

Create a new literal trie that adds literals in reverse.

source

pub(crate) fn add(&mut self, bytes: &[u8]) -> Result<(), BuildError>

Add the given literal to this trie.

If the literal could not be added because the StateID space was exhausted, then an error is returned. If an error returns, the trie is in an unspecified state.

source

fn get_or_add_state( &mut self, from: StateID, byte: u8, ) -> Result<StateID, BuildError>

If the given transition is defined, then return the next state ID. Otherwise, add the transition to from and point it to a new state.

If a new state ID could not be allocated, then an error is returned.

source

pub(crate) fn compile( &self, builder: &mut Builder, ) -> Result<ThompsonRef, BuildError>

Compile this literal trie to the NFA builder given.

This forwards any errors that may occur while using the given builder.

Trait Implementations§

source§

impl Clone for LiteralTrie

source§

fn clone(&self) -> LiteralTrie

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for LiteralTrie

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> CloneToUninit for T
where T: Clone,

source§

unsafe fn clone_to_uninit(&self, dst: *mut T)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

source§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

source§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.