regex_automata::nfa::thompson::literal_trie

Struct LiteralTrie

pub(crate) struct LiteralTrie {
    states: Vec<State>,
    rev: bool,
}

Expand description

A trie that preserves leftmost-first match semantics.

This is a purpose-built data structure for optimizing ‘lit1|lit2|..|litN’ patterns. It can only handle alternations of literals, which makes it somewhat restricted in its scope, but literal alternations are fairly common.

At a 5,000 foot level, the main idea of this trie is make an alternation of literals look more like a DFA than an NFA via epsilon removal.

More precisely, the main issue is in how alternations are compiled into a Thompson NFA. Namely, each alternation gets a single NFA “union” state with an epsilon transition for every branch of the alternation pointing to an NFA state corresponding to the start of that branch. The main problem with this representation is the cost of computing an epsilon closure. Once you hit the alternation’s start state, it acts as a sort of “clog” that requires you to traverse all of the epsilon transitions to compute the full closure.

While fixing such clogs in the general case is pretty tricky without going to a DFA (or perhaps a Glushkov NFA, but that comes with other problems). But at least in the case of an alternation of literals, we can convert that to a prefix trie without too much cost. In theory, that’s all you really need to do: build the trie and then compile it to a Thompson NFA. For example, if you have the pattern ‘bar|baz|foo’, then using a trie, it is transformed to something like ‘b(a(r|z))|f’. This reduces the clog by reducing the number of epsilon transitions out of the alternation’s start state from 3 to 2 (it actually gets down to 1 when you use a sparse state, which we do below). It’s a small effect here, but when your alternation is huge, the savings is also huge.

And that is… essentially what a LiteralTrie does. But there is one hiccup. Consider a regex like ‘sam|samwise’. How does a prefix trie compile that when leftmost-first semantics are used? If ‘sam|samwise’ was the entire regex, then you could just drop the ‘samwise’ branch entirely since it is impossible to match (‘sam’ will always take priority, and since it is a prefix of ‘samwise’, ‘samwise’ will never match). But what about the regex ‘\b(sam|samwise)\b’? In that case, you can’t remove ‘samwise’ because it might match when ‘sam’ doesn’t fall on a word boundary.

The main idea is that ‘sam|samwise’ can be translated to ‘sam(?:|wise)’, which is a precisely equivalent regex that also gets rid of the clog.

Another example is ‘zapper|z|zap’. That gets translated to ‘z(?:apper||ap)’.

We accomplish this by giving each state in the trie multiple “chunks” of transitions. Each chunk barrier represents a match. The idea is that once you know a match occurs, none of the transitions after the match can be re-ordered and mixed in with the transitions before the match. Otherwise, the match semantics could be changed.

See the ‘State’ data type for a bit more detail.

Future work:

In theory, it would be nice to generalize the idea of removing clogs and apply it to the NFA graph itself. Then this could in theory work for case insensitive alternations of literals, or even just alternations where each branch starts with a non-epsilon transition.
Could we instead use the Aho-Corasick algorithm here? The aho-corasick crate deals with leftmost-first matches correctly, but I think this implies encoding failure transitions into a Thompson NFA somehow. Which seems fine, because failure transitions are just unconditional epsilon transitions?
Or perhaps even better, could we use an aho_corasick::AhoCorasick directly? At time of writing, 0.7 is the current version of the aho-corasick crate, and that definitely cannot be used as-is. But if we expose the underlying finite state machine API, then could we use it? That would be super. If we could figure that out, it might also lend itself to more general composition of finite state machines.

Fields§

§states: Vec<State>

The set of trie states. Each state contains one or more chunks, where each chunk is a sparse set of transitions to other states. A leaf state is always a match state that contains only empty chunks (i.e., no transitions).

§rev: bool

Whether to add literals in reverse to the trie. Useful when building a reverse NFA automaton.

Struct LiteralTrieCopy item path

Fields§

Implementations§

impl LiteralTrie

pub(crate) fn forward() -> LiteralTrie

pub(crate) fn reverse() -> LiteralTrie

pub(crate) fn add(&mut self, bytes: &[u8]) -> Result<(), BuildError>

fn get_or_add_state( &mut self, from: StateID, byte: u8, ) -> Result<StateID, BuildError>

pub(crate) fn compile( &self, builder: &mut Builder, ) -> Result<ThompsonRef, BuildError>

Trait Implementations§

impl Clone for LiteralTrie

fn clone(&self) -> LiteralTrie

fn clone_from(&mut self, source: &Self)

impl Debug for LiteralTrie

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Auto Trait Implementations§

impl Freeze for LiteralTrie

impl RefUnwindSafe for LiteralTrie

impl Send for LiteralTrie

impl Sync for LiteralTrie

impl Unpin for LiteralTrie

impl UnwindSafe for LiteralTrie

Blanket Implementations§

impl<T> Any for Twhere T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for Twhere T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for Twhere T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for Twhere T: Clone,

unsafe fn clone_to_uninit(&self, dst: *mut T)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for Twhere U: From<T>,

fn into(self) -> U

impl<T> ToOwned for Twhere T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for Twhere U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for Twhere U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Struct LiteralTrie

impl<T> Any for T
where T: 'static + ?Sized,

impl<T> Borrow<T> for T
where T: ?Sized,

impl<T> BorrowMut<T> for T
where T: ?Sized,

impl<T> CloneToUninit for T
where T: Clone,

impl<T, U> Into<U> for T
where U: From<T>,

impl<T> ToOwned for T
where T: Clone,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,