Struct regex_automata::util::determinize::state::Repr

source ·
struct Repr<'a>(&'a [u8]);
Expand description

Repr is a read-only view into the representation of a DFA state.

Primarily, a Repr is how we achieve DRY: we implement decoding the format in one place, and then use a Repr to implement the various methods on the public state types.

The format is as follows:

The first three bytes correspond to bitsets.

Byte 0 is a bitset corresponding to miscellaneous flags associated with the state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1 if the state has pattern IDs explicitly written to it. (This is a flag that is not meant to be set by determinization, but rather, is used as part of an internal space-saving optimization.) Bit 2 is set to 1 if the state was generated by a transition over a “word” byte. (Callers may not always set this. For example, if the NFA has no word boundary assertion, then needing to track whether a state came from a word byte or not is superfluous and wasteful.) Bit 3 is set to 1 if the state was generated by a transition from a \r (forward search) or a \n (reverse search) when CRLF mode is enabled.

Bytes 1..5 correspond to the look-behind assertions that were satisfied by the transition that created this state. (Look-ahead assertions are not tracked as part of states. Instead, these are applied by re-computing the epsilon closure of a state when computing the transition function. See next in the parent module.)

Bytes 5..9 correspond to the set of look-around assertions (including both look-behind and look-ahead) that appear somewhere in this state’s set of NFA state IDs. This is used to determine whether this state’s epsilon closure should be re-computed when computing the transition function. Namely, look-around assertions are “just” conditional epsilon transitions, so if there are new assertions available when computing the transition function, we should only re-compute the epsilon closure if those new assertions are relevant to this particular state.

Bytes 9..13 correspond to a 32-bit native-endian encoded integer corresponding to the number of patterns encoded in this state. If the state is not a match state (byte 0 bit 0 is 0) or if it’s only pattern ID is PatternID::ZERO, then no integer is encoded at this position. Instead, byte offset 3 is the position at which the first NFA state ID is encoded.

For a match state with at least one non-ZERO pattern ID, the next bytes correspond to a sequence of 32-bit native endian encoded integers that represent each pattern ID, in order, that this match state represents.

After the pattern IDs (if any), NFA state IDs are delta encoded as varints.[1] The first NFA state ID is encoded as itself, and each subsequent NFA state ID is encoded as the difference between itself and the previous NFA state ID.

[1] - https://developers.google.com/protocol-buffers/docs/encoding#varints

Tuple Fields§

§0: &'a [u8]

Implementations§

source§

impl<'a> Repr<'a>

source

fn is_match(&self) -> bool

Returns true if and only if this is a match state.

If callers have added pattern IDs to this state, then callers MUST set this state as a match state explicitly. However, as a special case, states that are marked as match states but with no pattern IDs, then the state is treated as if it had a single pattern ID equivalent to PatternID::ZERO.

source

fn has_pattern_ids(&self) -> bool

Returns true if and only if this state has had at least one pattern ID added to it.

This is an internal-only flag that permits the representation to save space in the common case of an NFA with one pattern in it. In that case, a match state can only ever have exactly one pattern ID: PatternID::ZERO. So there’s no need to represent it.

source

fn is_from_word(&self) -> bool

Returns true if and only if this state is marked as having been created from a transition over a word byte. This is useful for checking whether a word boundary assertion is true or not, which requires look-behind (whether the current state came from a word byte or not) and look-ahead (whether the transition byte is a word byte or not).

Since states with this set are distinct from states that don’t have this set (even if they are otherwise equivalent), callers should not set this assertion unless the underlying NFA has at least one word boundary assertion somewhere. Otherwise, a superfluous number of states may be created.

source

fn is_half_crlf(&self) -> bool

Returns true if and only if this state is marked as being inside of a CRLF terminator. In the forward direction, this means the state was created after seeing a \r. In the reverse direction, this means the state was created after seeing a \n.

source

fn look_have(&self) -> LookSet

The set of look-behind assertions that were true in the transition that created this state.

Generally, this should be empty if ‘look_need’ is empty, since there is no reason to track which look-behind assertions are true if the state has no conditional epsilon transitions.

Satisfied look-ahead assertions are not tracked in states. Instead, these are re-computed on demand via epsilon closure when computing the transition function.

source

fn look_need(&self) -> LookSet

The set of look-around (both behind and ahead) assertions that appear at least once in this state’s set of NFA states.

This is used to determine whether the epsilon closure needs to be re-computed when computing the transition function. Namely, if the state has no conditional epsilon transitions, then there is no need to re-compute the epsilon closure.

source

fn match_len(&self) -> usize

Returns the total number of match pattern IDs in this state.

If this state is not a match state, then this always returns 0.

source

fn match_pattern(&self, index: usize) -> PatternID

Returns the pattern ID for this match state at the given index.

If the given index is greater than or equal to match_len() for this state, then this could panic or return incorrect results.

source

fn match_pattern_ids(&self) -> Option<Vec<PatternID>>

Returns a copy of all match pattern IDs in this state. If this state is not a match state, then this returns None.

source

fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F)

Calls the given function on every pattern ID in this state.

source

fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F)

Calls the given function on every NFA state ID in this state.

source

fn pattern_offset_end(&self) -> usize

Returns the offset into this state’s representation where the pattern IDs end and the NFA state IDs begin.

source

fn encoded_pattern_len(&self) -> usize

Returns the total number of encoded pattern IDs in this state.

This may return 0 even when this is a match state, since the pattern ID PatternID::ZERO is not encoded when it’s the only pattern ID in the match state (the overwhelming common case).

Trait Implementations§

source§

impl<'a> Debug for Repr<'a>

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl<'a> Freeze for Repr<'a>

§

impl<'a> RefUnwindSafe for Repr<'a>

§

impl<'a> Send for Repr<'a>

§

impl<'a> Sync for Repr<'a>

§

impl<'a> Unpin for Repr<'a>

§

impl<'a> UnwindSafe for Repr<'a>

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

source§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.