Struct regex_syntax::hir::literal::Extractor

source ·
pub struct Extractor {
    kind: ExtractKind,
    limit_class: usize,
    limit_repeat: usize,
    limit_literal_len: usize,
    limit_total: usize,
}
Expand description

Extracts prefix or suffix literal sequences from Hir expressions.

Literal extraction is based on the following observations:

  • Many regexes start with one or a small number of literals.
  • Substring search for literals is often much faster (sometimes by an order of magnitude) than a regex search.

Thus, in many cases, one can search for literals to find candidate starting locations of a match, and then only run the full regex engine at each such location instead of over the full haystack.

The main downside of literal extraction is that it can wind up causing a search to be slower overall. For example, if there are many matches or if there are many candidates that don’t ultimately lead to a match, then a lot of overhead will be spent in shuffing back-and-forth between substring search and the regex engine. This is the fundamental reason why literal optimizations for regex patterns is sometimes considered a “black art.”

§Look-around assertions

Literal extraction treats all look-around assertions as-if they match every empty string. So for example, the regex \bquux\b will yield a sequence containing a single exact literal quux. However, not all occurrences of quux correspond to a match a of the regex. For example, \bquux\b does not match ZquuxZ anywhere because quux does not fall on a word boundary.

In effect, if your regex contains look-around assertions, then a match of an exact literal does not necessarily mean the regex overall matches. So you may still need to run the regex engine in such cases to confirm the match.

The precise guarantee you get from a literal sequence is: if every literal in the sequence is exact and the original regex contains zero look-around assertions, then a preference-order multi-substring search of those literals will precisely match a preference-order search of the original regex.

§Example

This shows how to extract prefixes:

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"(a|b|c)(x|y|z)[A-Z]+foo")?;
let got = Extractor::new().extract(&hir);
// All literals returned are "inexact" because none of them reach the
// match state.
let expected = Seq::from_iter([
    Literal::inexact("ax"),
    Literal::inexact("ay"),
    Literal::inexact("az"),
    Literal::inexact("bx"),
    Literal::inexact("by"),
    Literal::inexact("bz"),
    Literal::inexact("cx"),
    Literal::inexact("cy"),
    Literal::inexact("cz"),
]);
assert_eq!(expected, got);

This shows how to extract suffixes:

use regex_syntax::{
    hir::literal::{Extractor, ExtractKind, Literal, Seq},
    parse,
};

let hir = parse(r"foo|[A-Z]+bar")?;
let got = Extractor::new().kind(ExtractKind::Suffix).extract(&hir);
// Since 'foo' gets to a match state, it is considered exact. But 'bar'
// does not because of the '[A-Z]+', and thus is marked inexact.
let expected = Seq::from_iter([
    Literal::exact("foo"),
    Literal::inexact("bar"),
]);
assert_eq!(expected, got);

Fields§

§kind: ExtractKind§limit_class: usize§limit_repeat: usize§limit_literal_len: usize§limit_total: usize

Implementations§

source§

impl Extractor

source

pub fn new() -> Extractor

Create a new extractor with a default configuration.

The extractor can be optionally configured before calling Extractor::extract to get a literal sequence.

source

pub fn extract(&self, hir: &Hir) -> Seq

Execute the extractor and return a sequence of literals.

source

pub fn kind(&mut self, kind: ExtractKind) -> &mut Extractor

Set the kind of literal sequence to extract from an Hir expression.

The default is to extract prefixes, but suffixes can be selected instead. The contract for prefixes is that every match of the corresponding Hir must start with one of the literals in the sequence returned. Moreover, the order of the sequence returned corresponds to the preference order.

Suffixes satisfy a similar contract in that every match of the corresponding Hir must end with one of the literals in the sequence returned. However, there is no guarantee that the literals are in preference order.

Remember that a sequence can be infinite. For example, unless the limits are configured to be impractically large, attempting to extract prefixes (or suffixes) for the pattern [A-Z] will return an infinite sequence. Generally speaking, if the sequence returned is infinite, then it is presumed to be unwise to do prefix (or suffix) optimizations for the pattern.

source

pub fn limit_class(&mut self, limit: usize) -> &mut Extractor

Configure a limit on the length of the sequence that is permitted for a character class. If a character class exceeds this limit, then the sequence returned for it is infinite.

This prevents classes like [A-Z] or \pL from getting turned into huge and likely unproductive sequences of literals.

§Example

This example shows how this limit can be lowered to decrease the tolerance for character classes being turned into literal sequences.

use regex_syntax::{hir::literal::{Extractor, Seq}, parse};

let hir = parse(r"[0-9]")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new([
    "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
]);
assert_eq!(expected, got);

// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_class(4).extract(&hir);
let expected = Seq::infinite();
assert_eq!(expected, got);
source

pub fn limit_repeat(&mut self, limit: usize) -> &mut Extractor

Configure a limit on the total number of repetitions that is permitted before literal extraction is stopped.

This is useful for limiting things like (abcde){50}, or more insidiously, (?:){1000000000}. This limit prevents any one single repetition from adding too much to a literal sequence.

With this limit set, repetitions that exceed it will be stopped and any literals extracted up to that point will be made inexact.

§Example

This shows how to decrease the limit and compares it with the default.

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"(abc){8}")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
assert_eq!(expected, got);

// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_repeat(4).extract(&hir);
let expected = Seq::from_iter([
    Literal::inexact("abcabcabcabc"),
]);
assert_eq!(expected, got);
source

pub fn limit_literal_len(&mut self, limit: usize) -> &mut Extractor

Configure a limit on the maximum length of any literal in a sequence.

This is useful for limiting things like (abcde){5}{5}{5}{5}. While each repetition or literal in that regex is small, when all the repetitions are applied, one ends up with a literal of length 5^4 = 625.

With this limit set, literals that exceed it will be made inexact and thus prevented from growing.

§Example

This shows how to decrease the limit and compares it with the default.

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"(abc){2}{2}{2}")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new(["abcabcabcabcabcabcabcabc"]);
assert_eq!(expected, got);

// Now let's shrink the limit and see how that changes things.
let got = Extractor::new().limit_literal_len(14).extract(&hir);
let expected = Seq::from_iter([
    Literal::inexact("abcabcabcabcab"),
]);
assert_eq!(expected, got);
source

pub fn limit_total(&mut self, limit: usize) -> &mut Extractor

Configure a limit on the total number of literals that will be returned.

This is useful as a practical measure for avoiding the creation of large sequences of literals. While the extractor will automatically handle local creations of large sequences (for example, [A-Z] yields an infinite sequence by default), large sequences can be created through non-local means as well.

For example, [ab]{3}{3} would yield a sequence of length 512 = 2^9 despite each of the repetitions being small on their own. This limit thus represents a “catch all” for avoiding locally small sequences from combining into large sequences.

§Example

This example shows how reducing the limit will change the literal sequence returned.

use regex_syntax::{hir::literal::{Extractor, Literal, Seq}, parse};

let hir = parse(r"[ab]{2}{2}")?;

let got = Extractor::new().extract(&hir);
let expected = Seq::new([
    "aaaa", "aaab", "aaba", "aabb",
    "abaa", "abab", "abba", "abbb",
    "baaa", "baab", "baba", "babb",
    "bbaa", "bbab", "bbba", "bbbb",
]);
assert_eq!(expected, got);

// The default limit is not too big, but big enough to extract all
// literals from '[ab]{2}{2}'. If we shrink the limit to less than 16,
// then we'll get a truncated set. Notice that it returns a sequence of
// length 4 even though our limit was 10. This is because the sequence
// is difficult to increase without blowing the limit. Notice also
// that every literal in the sequence is now inexact because they were
// stripped of some suffix.
let got = Extractor::new().limit_total(10).extract(&hir);
let expected = Seq::from_iter([
    Literal::inexact("aa"),
    Literal::inexact("ab"),
    Literal::inexact("ba"),
    Literal::inexact("bb"),
]);
assert_eq!(expected, got);
source

fn extract_concat<'a, I: Iterator<Item = &'a Hir>>(&self, it: I) -> Seq

Extract a sequence from the given concatenation. Sequences from each of the child HIR expressions are combined via cross product.

This short circuits once the cross product turns into a sequence containing only inexact literals.

source

fn extract_alternation<'a, I: Iterator<Item = &'a Hir>>(&self, it: I) -> Seq

Extract a sequence from the given alternation.

This short circuits once the union turns into an infinite sequence.

source

fn extract_repetition(&self, rep: &Repetition) -> Seq

Extract a sequence of literals from the given repetition. We do our best, Some examples:

‘a*’ => [inexact(a), exact(“”)] ‘a*?’ => [exact(“”), inexact(a)] ‘a+’ => [inexact(a)] ‘a{3}’ => [exact(aaa)] ’a{3,5} => [inexact(aaa)]

The key here really is making sure we get the ‘inexact’ vs ‘exact’ attributes correct on each of the literals we add. For example, the fact that ‘a*’ gives us an inexact ‘a’ and an exact empty string means that a regex like ‘ab*c’ will result in [inexact(ab), exact(ac)] literals being extracted, which might actually be a better prefilter than just ‘a’.

source

fn extract_class_unicode(&self, cls: &ClassUnicode) -> Seq

Convert the given Unicode class into a sequence of literals if the class is small enough. If the class is too big, return an infinite sequence.

source

fn extract_class_bytes(&self, cls: &ClassBytes) -> Seq

Convert the given byte class into a sequence of literals if the class is small enough. If the class is too big, return an infinite sequence.

source

fn class_over_limit_unicode(&self, cls: &ClassUnicode) -> bool

Returns true if the given Unicode class exceeds the configured limits on this extractor.

source

fn class_over_limit_bytes(&self, cls: &ClassBytes) -> bool

Returns true if the given byte class exceeds the configured limits on this extractor.

source

fn cross(&self, seq1: Seq, seq2: &mut Seq) -> Seq

Compute the cross product of the two sequences if the result would be within configured limits. Otherwise, make seq2 infinite and cross the infinite sequence with seq1.

source

fn union(&self, seq1: Seq, seq2: &mut Seq) -> Seq

Union the two sequences if the result would be within configured limits. Otherwise, make seq2 infinite and union the infinite sequence with seq1.

source

fn enforce_literal_len(&self, seq: &mut Seq)

Applies the literal length limit to the given sequence. If none of the literals in the sequence exceed the limit, then this is a no-op.

Trait Implementations§

source§

impl Clone for Extractor

source§

fn clone(&self) -> Extractor

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for Extractor

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Default for Extractor

source§

fn default() -> Extractor

Returns the “default value” for a type. Read more

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> CloneToUninit for T
where T: Clone,

source§

unsafe fn clone_to_uninit(&self, dst: *mut T)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

source§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

source§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.