Struct regex_automata::hybrid::LazyStateID

pub struct LazyStateID(u32);

A state identifier specifically tailored for lazy DFAs.

A lazy state ID logically represents a pointer to a DFA state. In practice, by limiting the number of DFA states it can address, it reserves some bits of its representation to encode some additional information. That additional information is called a “tag.” That tag is used to record whether the state it points to is an unknown, dead, quit, start or match state.

When implementing a low level search routine with a lazy DFA, it is necessary to query the type of the current state to know what to do:

  • Unknown - The state has not yet been computed. The parameters used to get this state ID must be re-passed to DFA::next_state, which will never return an unknown state ID.
  • Dead - A dead state only has transitions to itself. It indicates that the search cannot do anything else and should stop with whatever result it has.
  • Quit - A quit state indicates that the automaton could not answer whether a match exists or not. Correct search implementations must return a MatchError::quit when a DFA enters a quit state.
  • Start - A start state is a state in which a search can begin. Lazy DFAs usually have more than one start state. Branching on this isn’t required for correctness, but a common optimization is to run a prefilter when a search enters a start state. Note that start states are not tagged automatically: one must enable the Config::specialize_start_states setting for start states to be tagged. The reason is that a DFA search loop is usually written to execute a prefilter once it enters a start state. But if there is no prefilter, this handling can be quite disastrous, as the DFA may ping-pong between the special handling code and a possible optimized hot path for handling untagged states. When start states aren’t specialized, they are untagged and remain in the hot path.
  • Match - A match state indicates that a match has been found. Depending on the semantics of your search implementation, it may either continue until the end of the haystack or a dead state, or it might quit and return the match immediately.

As an optimization, the is_tagged predicate can be used to determine if a tag exists at all. This is useful to avoid branching on all of the above types for every byte searched.
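The tagging scheme and the is_tagged fast path described above can be modelled without the crate at all. The following is a minimal sketch under stated assumptions, not the crate's implementation: ToyStateID, Kind and classify are hypothetical names introduced for illustration, the mask values are copied from the constants documented for LazyStateID, and the order in which the tag bits are tested is a choice made for this sketch.

```rust
// Hypothetical stand-in for LazyStateID; this is not the crate's type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ToyStateID(pub u32);

// Tag bits, mirroring the mask constants documented for LazyStateID.
pub const MASK_UNKNOWN: u32 = 1 << 31;
pub const MASK_DEAD: u32 = 1 << 30;
pub const MASK_QUIT: u32 = 1 << 29;
pub const MASK_START: u32 = 1 << 28;
pub const MASK_MATCH: u32 = 1 << 27;
// Largest value that carries no tag bit at all.
pub const MAX: u32 = MASK_MATCH - 1;

// Hypothetical classification of a state ID, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Kind {
    Untagged,
    Unknown,
    Dead,
    Quit,
    Start,
    Match,
}

impl ToyStateID {
    /// A single comparison tells us whether any tag bit is set.
    pub fn is_tagged(self) -> bool {
        self.0 > MAX
    }

    /// Inspect individual tag bits only after the cheap `is_tagged` check.
    /// The order of the checks below is an assumption of this sketch.
    pub fn classify(self) -> Kind {
        if !self.is_tagged() {
            // Hot path: the vast majority of states in a typical search.
            return Kind::Untagged;
        }
        if self.0 & MASK_UNKNOWN != 0 {
            Kind::Unknown
        } else if self.0 & MASK_DEAD != 0 {
            Kind::Dead
        } else if self.0 & MASK_QUIT != 0 {
            Kind::Quit
        } else if self.0 & MASK_MATCH != 0 {
            Kind::Match
        } else {
            // The only remaining tag bit is MASK_START.
            Kind::Start
        }
    }
}
```

In a search loop, only the is_tagged comparison runs for every byte; the chain of bit tests is reached only when a state actually needs special handling.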

§Example

This example shows how LazyStateID can be used to implement a correct search routine with minimal branching. In particular, this search routine implements “leftmost” matching, which means that it doesn’t immediately stop once a match is found. Instead, it continues until it reaches a dead state.

Notice also how a correct search implementation deals with CacheErrors returned by some of the lazy DFA routines. When a CacheError occurs, it returns MatchError::gave_up.

use regex_automata::{
    hybrid::dfa::{Cache, DFA},
    HalfMatch, MatchError, Input,
};

fn find_leftmost_first(
    dfa: &DFA,
    cache: &mut Cache,
    haystack: &[u8],
) -> Result<Option<HalfMatch>, MatchError> {
    // The start state is determined by inspecting the position and the
    // initial bytes of the haystack. Note that start states can never
    // be match states (since DFAs in this crate delay matches by 1
    // byte), so we don't need to check if the start state is a match.
    let mut sid = dfa.start_state_forward(
        cache,
        &Input::new(haystack),
    )?;
    let mut last_match = None;
    // Walk all the bytes in the haystack. We can quit early if we see
    // a dead or a quit state. The former means the automaton will
    // never transition to any other state. The latter means that the
    // automaton entered a condition in which its search failed.
    for (i, &b) in haystack.iter().enumerate() {
        sid = dfa
            .next_state(cache, sid, b)
            .map_err(|_| MatchError::gave_up(i))?;
        if sid.is_tagged() {
            if sid.is_match() {
                last_match = Some(HalfMatch::new(
                    dfa.match_pattern(cache, sid, 0),
                    i,
                ));
            } else if sid.is_dead() {
                return Ok(last_match);
            } else if sid.is_quit() {
                // It is possible to enter into a quit state after
                // observing a match has occurred. In that case, we
                // should return the match instead of an error.
                if last_match.is_some() {
                    return Ok(last_match);
                }
                return Err(MatchError::quit(b, i));
            }
            // Implementors may also want to check for start states and
            // handle them differently for performance reasons. But it is
            // not necessary for correctness. Note that in order to check
            // for start states, you'll need to enable the
            // 'specialize_start_states' config knob, otherwise start
            // states will not be tagged.
        }
    }
    // Matches are always delayed by 1 byte, so we must explicitly walk
    // the special "EOI" transition at the end of the search.
    sid = dfa
        .next_eoi_state(cache, sid)
        .map_err(|_| MatchError::gave_up(haystack.len()))?;
    if sid.is_match() {
        last_match = Some(HalfMatch::new(
            dfa.match_pattern(cache, sid, 0),
            haystack.len(),
        ));
    }
    Ok(last_match)
}

// We use a greedy '+' operator to show how the search doesn't just stop
// once a match is detected. It continues extending the match. Using
// '[a-z]+?' would also work as expected and stop the search early.
// Greediness is built into the automaton.
let dfa = DFA::new(r"[a-z]+")?;
let mut cache = dfa.create_cache();
let haystack = "123 foobar 4567".as_bytes();
let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
assert_eq!(mat.pattern().as_usize(), 0);
assert_eq!(mat.offset(), 10);

// Here's another example that tests our handling of the special
// EOI transition. This will fail to find a match if we don't call
// 'next_eoi_state' at the end of the search since the match isn't found
// until the final byte in the haystack.
let dfa = DFA::new(r"[0-9]{4}")?;
let mut cache = dfa.create_cache();
let haystack = "123 foobar 4567".as_bytes();
let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
assert_eq!(mat.pattern().as_usize(), 0);
assert_eq!(mat.offset(), 15);

// And note that our search implementation above automatically works
// with multi-DFAs. Namely, `dfa.match_pattern(cache, sid, 0)` selects
// the appropriate pattern ID for us.
let dfa = DFA::new_many(&[r"[a-z]+", r"[0-9]+"])?;
let mut cache = dfa.create_cache();
let haystack = "123 foobar 4567".as_bytes();
let mat = find_leftmost_first(&dfa, &mut cache, haystack)?.unwrap();
assert_eq!(mat.pattern().as_usize(), 1);
assert_eq!(mat.offset(), 3);
let mat = find_leftmost_first(&dfa, &mut cache, &haystack[3..])?.unwrap();
assert_eq!(mat.pattern().as_usize(), 0);
assert_eq!(mat.offset(), 7);
let mat = find_leftmost_first(&dfa, &mut cache, &haystack[10..])?.unwrap();
assert_eq!(mat.pattern().as_usize(), 1);
assert_eq!(mat.offset(), 5);

Tuple Fields§

§0: u32

Implementations§

impl LazyStateID

const MAX_BIT: usize = 31usize

const MASK_UNKNOWN: usize = 2_147_483_648usize

const MASK_DEAD: usize = 1_073_741_824usize

const MASK_QUIT: usize = 536_870_912usize

const MASK_START: usize = 268_435_456usize

const MASK_MATCH: usize = 134_217_728usize

const MAX: usize = 134_217_727usize
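
The decimal values above are easier to read as shifts: each mask is a single bit, from bit 31 down to bit 27, and MAX is everything below the lowest mask bit, which leaves 27 bits for the state index itself. The sketch below re-derives the documented values from MAX_BIT. It uses free-standing constants rather than the crate's private ones, picks u64 only so that the shifts cannot overflow on any target, and introduces tag_masks and untagged_capacity as hypothetical helpers.

```rust
// Free-standing copies of the documented constants, derived from MAX_BIT.
pub const MAX_BIT: u32 = 31;
pub const MASK_UNKNOWN: u64 = 1 << MAX_BIT;
pub const MASK_DEAD: u64 = 1 << (MAX_BIT - 1);
pub const MASK_QUIT: u64 = 1 << (MAX_BIT - 2);
pub const MASK_START: u64 = 1 << (MAX_BIT - 3);
pub const MASK_MATCH: u64 = 1 << (MAX_BIT - 4);
// Every bit below the lowest tag bit is available for the state index.
pub const MAX: u64 = MASK_MATCH - 1;

/// Hypothetical helper: the five tag masks, highest bit first.
pub fn tag_masks() -> [u64; 5] {
    [MASK_UNKNOWN, MASK_DEAD, MASK_QUIT, MASK_START, MASK_MATCH]
}

/// Hypothetical helper: how many distinct untagged IDs fit in 0..=MAX.
pub fn untagged_capacity() -> u64 {
    MAX + 1
}
```

Under these definitions the five masks never overlap with MAX, so setting or clearing a tag cannot disturb the untagged part of an ID.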

pub(crate) fn new(id: usize) -> Result<LazyStateID, LazyStateIDError>

Create a new lazy state ID.

If the given identifier exceeds LazyStateID::MAX, then this returns an error.

const fn new_unchecked(id: usize) -> LazyStateID

Create a new lazy state ID without checking whether the given value exceeds LazyStateID::MAX.

While this is unchecked, providing an incorrect value must never sacrifice memory safety.

pub(crate) fn as_usize_untagged(&self) -> usize

Return this lazy state ID as an untagged usize.

If this lazy state ID is tagged, then the usize returned is the state ID without the tag. If the ID was not tagged, then the usize returned is equivalent to the state ID.

pub(crate) const fn as_usize_unchecked(&self) -> usize

Return this lazy state ID as its raw internal usize value, which may be tagged (and thus greater than LazyStateID::MAX).
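
The relationship between the checked constructor, the untagged view and the raw view can be sketched on a stand-in type. This is an illustration under stated assumptions, not the crate's code: ToyStateID is a hypothetical type, its error is a plain String rather than LazyStateIDError, and the method bodies are one plausible way to satisfy the descriptions above.

```rust
// Hypothetical stand-in for LazyStateID; this is not the crate's type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ToyStateID(u32);

// Values mirror the documented MASK_MATCH and MAX constants.
pub const MASK_MATCH: usize = 1 << 27;
pub const MAX: usize = MASK_MATCH - 1;

impl ToyStateID {
    /// Checked constructor: rejects values that would collide with tag bits.
    /// A String error is used here purely to keep the sketch self-contained.
    pub fn new(id: usize) -> Result<ToyStateID, String> {
        if id > MAX {
            return Err(format!("state ID {} exceeds maximum {}", id, MAX));
        }
        Ok(ToyStateID(id as u32))
    }

    /// Set the match tag while leaving the low bits untouched.
    pub fn to_match(self) -> ToyStateID {
        ToyStateID(self.0 | MASK_MATCH as u32)
    }

    /// Strip all tag bits by keeping only the bits covered by MAX.
    pub fn as_usize_untagged(self) -> usize {
        self.0 as usize & MAX
    }

    /// Raw value, tag bits included; may be greater than MAX.
    pub fn as_usize_unchecked(self) -> usize {
        self.0 as usize
    }
}
```

With this model, tagging an ID changes its raw value but not its untagged value, which is why the untagged form is the one suitable for use as an index.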

pub(crate) const fn to_unknown(&self) -> LazyStateID

pub(crate) const fn to_dead(&self) -> LazyStateID

pub(crate) const fn to_quit(&self) -> LazyStateID

pub(crate) const fn to_start(&self) -> LazyStateID

Return this lazy state ID as a state ID that is tagged as a start state.

pub(crate) const fn to_match(&self) -> LazyStateID

Return this lazy state ID as a lazy state ID that is tagged as a match state.

pub const fn is_tagged(&self) -> bool

Return true if and only if this lazy state ID is tagged.

When a lazy state ID is tagged, then one can conclude that it is one of a match, start, dead, quit or unknown state.

pub const fn is_unknown(&self) -> bool

Return true if and only if this represents a lazy state ID that is “unknown.” That is, the state has not yet been created. When a caller sees this state ID, it generally means that a state has to be computed in order to proceed.

pub const fn is_dead(&self) -> bool

Return true if and only if this represents a dead state. A dead state is a state that can never transition to any other state except the dead state. When a dead state is seen, it generally indicates that a search should stop.

pub const fn is_quit(&self) -> bool

Return true if and only if this represents a quit state. A quit state is a state that is representationally equivalent to a dead state, except it indicates the automaton has reached a point at which it can no longer determine whether a match exists or not. In general, this indicates an error during search and the caller must either pass this error up or use a different search technique.

pub const fn is_start(&self) -> bool

Return true if and only if this lazy state ID has been tagged as a start state.

Note that if Config::specialize_start_states is disabled (which is the default), then this will always return false since start states won’t be tagged.

pub const fn is_match(&self) -> bool

Return true if and only if this lazy state ID has been tagged as a match state.

Trait Implementations§

impl Clone for LazyStateID

fn clone(&self) -> LazyStateID

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Stable since Rust 1.0.0.

impl Debug for LazyStateID

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for LazyStateID

fn default() -> LazyStateID

Returns the “default value” for a type.

impl Hash for LazyStateID

fn hash<__H: Hasher>(&self, state: &mut __H)

Feeds this value into the given Hasher.

fn hash_slice<H>(data: &[Self], state: &mut H)
where H: Hasher, Self: Sized,

Feeds a slice of this type into the given Hasher. Stable since Rust 1.3.0.

impl Ord for LazyStateID

fn cmp(&self, other: &LazyStateID) -> Ordering

This method returns an Ordering between self and other.

fn max(self, other: Self) -> Self
where Self: Sized,

Compares and returns the maximum of two values. Stable since Rust 1.21.0.

fn min(self, other: Self) -> Self
where Self: Sized,

Compares and returns the minimum of two values. Stable since Rust 1.21.0.

fn clamp(self, min: Self, max: Self) -> Self
where Self: Sized + PartialOrd,

Restrict a value to a certain interval. Stable since Rust 1.50.0.

impl PartialEq for LazyStateID

fn eq(&self, other: &LazyStateID) -> bool

This method tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason. Stable since Rust 1.0.0.

impl PartialOrd for LazyStateID

fn partial_cmp(&self, other: &LazyStateID) -> Option<Ordering>

This method returns an ordering between self and other values if one exists.

fn lt(&self, other: &Rhs) -> bool

This method tests less than (for self and other) and is used by the < operator. Stable since Rust 1.0.0.

fn le(&self, other: &Rhs) -> bool

This method tests less than or equal to (for self and other) and is used by the <= operator. Stable since Rust 1.0.0.

fn gt(&self, other: &Rhs) -> bool

This method tests greater than (for self and other) and is used by the > operator. Stable since Rust 1.0.0.

fn ge(&self, other: &Rhs) -> bool

This method tests greater than or equal to (for self and other) and is used by the >= operator. Stable since Rust 1.0.0.

impl Copy for LazyStateID

impl Eq for LazyStateID

impl StructuralPartialEq for LazyStateID

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.