Struct regex_automata::nfa::thompson::Compiler

source ·
pub struct Compiler {
    parser: ParserBuilder,
    config: Config,
    builder: RefCell<Builder>,
    utf8_state: RefCell<Utf8State>,
    trie_state: RefCell<RangeTrie>,
    utf8_suffix: RefCell<Utf8SuffixMap>,
}
Expand description

A builder for compiling an NFA from a regex’s high-level intermediate representation (HIR).

This compiler provides a way to translate a parsed regex pattern into an NFA state graph. The NFA state graph can either be used directly to execute a search (e.g., with a Pike VM), or it can be further used to build a DFA.

This compiler provides APIs both for compiling regex patterns directly from their concrete syntax, or via a regex_syntax::hir::Hir.

This compiler has various options that may be configured via thompson::Config.

Note that a compiler is not the same as a thompson::Builder. A Builder provides a lower level API that is uncoupled from a regex pattern’s concrete syntax or even its HIR. Instead, it permits stitching together an NFA by hand. See its docs for examples.

§Example: compilation from concrete syntax

This shows how to compile an NFA from a pattern string while setting a size limit on how big the NFA is allowed to be (in terms of bytes of heap used).

use regex_automata::{
    nfa::thompson::{NFA, pikevm::PikeVM},
    Match,
};

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build(r"(?-u)\w")?;

let re = PikeVM::new_from_nfa(nfa)?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
let expected = Some(Match::must(0, 3..4));
re.captures(&mut cache, "!@#A#@!", &mut caps);
assert_eq!(expected, caps.get_match());

§Example: compilation from HIR

This shows how to hand assemble a regular expression via its HIR, and then compile an NFA directly from it.

use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};
use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};

let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![
    ClassBytesRange::new(b'0', b'9'),
    ClassBytesRange::new(b'A', b'Z'),
    ClassBytesRange::new(b'_', b'_'),
    ClassBytesRange::new(b'a', b'z'),
])));

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;

let re = PikeVM::new_from_nfa(nfa)?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
let expected = Some(Match::must(0, 3..4));
re.captures(&mut cache, "!@#A#@!", &mut caps);
assert_eq!(expected, caps.get_match());

Fields§

§parser: ParserBuilder

A regex parser, used when compiling an NFA directly from a pattern string.

§config: Config

The compiler configuration.

§builder: RefCell<Builder>

The builder for actually constructing an NFA. This provides a convenient abstraction for writing a compiler.

§utf8_state: RefCell<Utf8State>

State used for compiling character classes to UTF-8 byte automata. State is not retained between character class compilations. This just serves to amortize allocation to the extent possible.

§trie_state: RefCell<RangeTrie>

State used for arranging character classes in reverse into a trie.

§utf8_suffix: RefCell<Utf8SuffixMap>

State used for caching common suffixes when compiling reverse UTF-8 automata (for Unicode character classes).

Implementations§

source§

impl Compiler

source

pub fn new() -> Compiler

Create a new NFA builder with its default configuration.

source

pub fn build(&self, pattern: &str) -> Result<NFA, BuildError>

Compile the given regular expression pattern into an NFA.

If there was a problem parsing the regex, then that error is returned.

Otherwise, if there was a problem building the NFA, then an error is returned. The only error that can occur is if the compiled regex would exceed the size limits configured on this builder, or if any part of the NFA would exceed the integer representations used. (For example, too many states might plausibly occur on a 16-bit target.)

§Example
use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build(r"(?-u)\w")?;

let re = PikeVM::new_from_nfa(nfa)?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
let expected = Some(Match::must(0, 3..4));
re.captures(&mut cache, "!@#A#@!", &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn build_many<P: AsRef<str>>( &self, patterns: &[P], ) -> Result<NFA, BuildError>

Compile the given regular expression patterns into a single NFA.

When matches are returned, the pattern ID corresponds to the index of the pattern in the slice given.

§Example
use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_many(&[
    r"(?-u)\s",
    r"(?-u)\w",
])?;

let re = PikeVM::new_from_nfa(nfa)?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
let expected = Some(Match::must(1, 1..2));
re.captures(&mut cache, "!A! !A!", &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn build_from_hir(&self, expr: &Hir) -> Result<NFA, BuildError>

Compile the given high level intermediate representation of a regular expression into an NFA.

If there was a problem building the NFA, then an error is returned. The only error that can occur is if the compiled regex would exceed the size limits configured on this builder, or if any part of the NFA would exceed the integer representations used. (For example, too many states might plausibly occur on a 16-bit target.)

§Example
use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};
use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};

let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![
    ClassBytesRange::new(b'0', b'9'),
    ClassBytesRange::new(b'A', b'Z'),
    ClassBytesRange::new(b'_', b'_'),
    ClassBytesRange::new(b'a', b'z'),
])));

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;

let re = PikeVM::new_from_nfa(nfa)?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
let expected = Some(Match::must(0, 3..4));
re.captures(&mut cache, "!@#A#@!", &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn build_many_from_hir<H: Borrow<Hir>>( &self, exprs: &[H], ) -> Result<NFA, BuildError>

Compile the given high level intermediate representations of regular expressions into a single NFA.

When matches are returned, the pattern ID corresponds to the index of the pattern in the slice given.

§Example
use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};
use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};

let hirs = &[
    Hir::class(Class::Bytes(ClassBytes::new(vec![
        ClassBytesRange::new(b'\t', b'\r'),
        ClassBytesRange::new(b' ', b' '),
    ]))),
    Hir::class(Class::Bytes(ClassBytes::new(vec![
        ClassBytesRange::new(b'0', b'9'),
        ClassBytesRange::new(b'A', b'Z'),
        ClassBytesRange::new(b'_', b'_'),
        ClassBytesRange::new(b'a', b'z'),
    ]))),
];

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_many_from_hir(hirs)?;

let re = PikeVM::new_from_nfa(nfa)?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
let expected = Some(Match::must(1, 1..2));
re.captures(&mut cache, "!A! !A!", &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn configure(&mut self, config: Config) -> &mut Compiler

Apply the given NFA configuration options to this builder.

§Example
use regex_automata::nfa::thompson::NFA;

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build(r"(?-u)\w")?;
assert_eq!(nfa.pattern_len(), 1);
source

pub fn syntax(&mut self, config: Config) -> &mut Compiler

Set the syntax configuration for this builder using syntax::Config.

This permits setting things like case insensitivity, Unicode and multi line mode.

This syntax configuration only applies when an NFA is built directly from a pattern string. If an NFA is built from an HIR, then all syntax settings are ignored.

§Example
use regex_automata::{nfa::thompson::NFA, util::syntax};

let syntax_config = syntax::Config::new().unicode(false);
let nfa = NFA::compiler().syntax(syntax_config).build(r"\w")?;
// If Unicode were enabled, the number of states would be much bigger.
assert!(nfa.states().len() < 15);
source§

impl Compiler

source

fn compile<H: Borrow<Hir>>(&self, exprs: &[H]) -> Result<NFA, BuildError>

Compile the sequence of HIR expressions given. Pattern IDs are allocated starting from 0, in correspondence with the slice given.

It is legal to provide an empty slice. In that case, the NFA returned has no patterns and will never match anything.

source

fn c(&self, expr: &Hir) -> Result<ThompsonRef, BuildError>

Compile an arbitrary HIR expression.

source

fn c_concat<I>(&self, it: I) -> Result<ThompsonRef, BuildError>

Compile a concatenation of the sub-expressions yielded by the given iterator. If the iterator yields no elements, then this compiles down to an “empty” state that always matches.

If the compiler is in reverse mode, then the expressions given are automatically compiled in reverse.

source

fn c_alt_slice(&self, exprs: &[Hir]) -> Result<ThompsonRef, BuildError>

Compile an alternation of the given HIR values.

This is like ‘c_alt_iter’, but it accepts a slice of HIR values instead of an iterator of compiled NFA subgraphs. The point of accepting a slice here is that it opens up some optimization opportunities. For example, if all of the HIR values are literals, then this routine might re-shuffle them to make NFA epsilon closures substantially faster.

source

fn c_alt_iter<I>(&self, it: I) -> Result<ThompsonRef, BuildError>

Compile an alternation, where each element yielded by the given iterator represents an item in the alternation. If the iterator yields no elements, then this compiles down to a “fail” state.

In an alternation, expressions appearing earlier are “preferred” at match time over expressions appearing later. At least, this is true when using “leftmost first” match semantics. (If “leftmost longest” are ever added in the future, then this preference order of priority would not apply in that mode.)

source

fn c_cap( &self, index: u32, name: Option<&str>, expr: &Hir, ) -> Result<ThompsonRef, BuildError>

Compile the given capture sub-expression. expr should be the sub-expression contained inside the capture. If “capture” states are enabled, then they are added as appropriate.

This accepts the pieces of a capture instead of a hir::Capture so that it’s easy to manufacture a “fake” group when necessary, e.g., for adding the entire pattern as if it were a group in order to create appropriate “capture” states in the NFA.

source

fn c_repetition(&self, rep: &Repetition) -> Result<ThompsonRef, BuildError>

Compile the given repetition expression. This handles all types of repetitions and greediness.

source

fn c_bounded( &self, expr: &Hir, greedy: bool, min: u32, max: u32, ) -> Result<ThompsonRef, BuildError>

Compile the given expression such that it matches at least min times, but no more than max times.

When greedy is true, then the preference is for the expression to match as much as possible. Otherwise, it will match as little as possible.

source

fn c_at_least( &self, expr: &Hir, greedy: bool, n: u32, ) -> Result<ThompsonRef, BuildError>

Compile the given expression such that it may be matched n or more times, where n can be any integer. (Although a particularly large integer is likely to run afoul of any configured size limits.)

When greedy is true, then the preference is for the expression to match as much as possible. Otherwise, it will match as little as possible.

source

fn c_zero_or_one( &self, expr: &Hir, greedy: bool, ) -> Result<ThompsonRef, BuildError>

Compile the given expression such that it may be matched zero or one times.

When greedy is true, then the preference is for the expression to match as much as possible. Otherwise, it will match as little as possible.

source

fn c_exactly(&self, expr: &Hir, n: u32) -> Result<ThompsonRef, BuildError>

Compile the given HIR expression exactly n times.

source

fn c_byte_class(&self, cls: &ClassBytes) -> Result<ThompsonRef, BuildError>

Compile the given byte oriented character class.

This uses “sparse” states to represent an alternation between ranges in this character class. We can use “sparse” states instead of stitching together a “union” state because all ranges in a character class have equal priority and are non-overlapping (thus, only one can match, so there’s never a question of priority in the first place). This saves a fair bit of overhead when traversing an NFA.

This routine compiles an empty character class into a “fail” state.

source

fn c_unicode_class(&self, cls: &ClassUnicode) -> Result<ThompsonRef, BuildError>

Compile the given Unicode character class.

This routine specifically tries to use various types of compression, since UTF-8 automata of large classes can get quite large. The specific type of compression used depends on forward vs reverse compilation, and whether NFA shrinking is enabled or not.

Aside from repetitions causing lots of repeat group, this is like the single most expensive part of regex compilation. Therefore, a large part of the expense of compilation may be reduce by disabling Unicode in the pattern.

This routine compiles an empty character class into a “fail” state.

source

fn c_unicode_class_reverse_with_suffix( &self, cls: &ClassUnicode, ) -> Result<ThompsonRef, BuildError>

Compile the given Unicode character class in reverse with suffix caching.

This is a “quick” way to compile large Unicode classes into reverse UTF-8 automata while doing a small amount of compression on that automata by reusing common suffixes.

A more comprehensive compression scheme can be accomplished by using a range trie to efficiently sort a reverse sequence of UTF-8 byte rqanges, and then use Daciuk’s algorithm via Utf8Compiler.

This is the technique used when “NFA shrinking” is disabled.

(This also tries to use “sparse” states where possible, just like c_byte_class does.)

source

fn c_look(&self, anchor: &Look) -> Result<ThompsonRef, BuildError>

Compile the given HIR look-around assertion to an NFA look-around assertion.

source

fn c_literal(&self, bytes: &[u8]) -> Result<ThompsonRef, BuildError>

Compile the given byte string to a concatenation of bytes.

source

fn c_range(&self, start: u8, end: u8) -> Result<ThompsonRef, BuildError>

Compile a “range” state with one transition that may only be followed if the input byte is in the (inclusive) range given.

Both the start and end locations point to the state created. Callers will likely want to keep the start, but patch the end to point to some other state.

source

fn c_empty(&self) -> Result<ThompsonRef, BuildError>

Compile an “empty” state with one unconditional epsilon transition.

Both the start and end locations point to the state created. Callers will likely want to keep the start, but patch the end to point to some other state.

source

fn c_fail(&self) -> Result<ThompsonRef, BuildError>

Compile a “fail” state that can never have any outgoing transitions.

source

fn patch(&self, from: StateID, to: StateID) -> Result<(), BuildError>

source

fn start_pattern(&self) -> Result<PatternID, BuildError>

source

fn finish_pattern(&self, start_id: StateID) -> Result<PatternID, BuildError>

source

fn add_empty(&self) -> Result<StateID, BuildError>

source

fn add_range(&self, start: u8, end: u8) -> Result<StateID, BuildError>

source

fn add_sparse(&self, ranges: Vec<Transition>) -> Result<StateID, BuildError>

source

fn add_look(&self, look: Look) -> Result<StateID, BuildError>

source

fn add_union(&self) -> Result<StateID, BuildError>

source

fn add_union_reverse(&self) -> Result<StateID, BuildError>

source

fn add_capture_start( &self, capture_index: u32, name: Option<&str>, ) -> Result<StateID, BuildError>

source

fn add_capture_end(&self, capture_index: u32) -> Result<StateID, BuildError>

source

fn add_fail(&self) -> Result<StateID, BuildError>

source

fn add_match(&self) -> Result<StateID, BuildError>

source

fn is_reverse(&self) -> bool

Trait Implementations§

source§

impl Clone for Compiler

source§

fn clone(&self) -> Compiler

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for Compiler

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.