Struct regex_automata::util::determinize::state::Repr
source · struct Repr<'a>(&'a [u8]);
Expand description
Repr is a read-only view into the representation of a DFA state.
Primarily, a Repr is how we achieve DRY: we implement decoding the format in one place, and then use a Repr to implement the various methods on the public state types.
The format is as follows:
The first three bytes correspond to bitsets.
Byte 0 is a bitset corresponding to miscellaneous flags associated with the
state. Bit 0 is set to 1 if the state is a match state. Bit 1 is set to 1
if the state has pattern IDs explicitly written to it. (This is a flag that
is not meant to be set by determinization, but rather, is used as part of
an internal space-saving optimization.) Bit 2 is set to 1 if the state was
generated by a transition over a “word” byte. (Callers may not always set
this. For example, if the NFA has no word boundary assertion, then needing
to track whether a state came from a word byte or not is superfluous and
wasteful.) Bit 3 is set to 1 if the state was generated by a transition
from a \r
(forward search) or a \n
(reverse search) when CRLF mode is
enabled.
Bytes 1..5 correspond to the look-behind assertions that were satisfied
by the transition that created this state. (Look-ahead assertions are not
tracked as part of states. Instead, these are applied by re-computing the
epsilon closure of a state when computing the transition function. See
next
in the parent module.)
Bytes 5..9 correspond to the set of look-around assertions (including both look-behind and look-ahead) that appear somewhere in this state’s set of NFA state IDs. This is used to determine whether this state’s epsilon closure should be re-computed when computing the transition function. Namely, look-around assertions are “just” conditional epsilon transitions, so if there are new assertions available when computing the transition function, we should only re-compute the epsilon closure if those new assertions are relevant to this particular state.
Bytes 9..13 correspond to a 32-bit native-endian encoded integer corresponding to the number of patterns encoded in this state. If the state is not a match state (byte 0 bit 0 is 0) or if it’s only pattern ID is PatternID::ZERO, then no integer is encoded at this position. Instead, byte offset 3 is the position at which the first NFA state ID is encoded.
For a match state with at least one non-ZERO pattern ID, the next bytes correspond to a sequence of 32-bit native endian encoded integers that represent each pattern ID, in order, that this match state represents.
After the pattern IDs (if any), NFA state IDs are delta encoded as varints.[1] The first NFA state ID is encoded as itself, and each subsequent NFA state ID is encoded as the difference between itself and the previous NFA state ID.
[1] - https://developers.google.com/protocol-buffers/docs/encoding#varints
Tuple Fields§
§0: &'a [u8]
Implementations§
source§impl<'a> Repr<'a>
impl<'a> Repr<'a>
sourcefn is_match(&self) -> bool
fn is_match(&self) -> bool
Returns true if and only if this is a match state.
If callers have added pattern IDs to this state, then callers MUST set this state as a match state explicitly. However, as a special case, states that are marked as match states but with no pattern IDs, then the state is treated as if it had a single pattern ID equivalent to PatternID::ZERO.
sourcefn has_pattern_ids(&self) -> bool
fn has_pattern_ids(&self) -> bool
Returns true if and only if this state has had at least one pattern ID added to it.
This is an internal-only flag that permits the representation to save space in the common case of an NFA with one pattern in it. In that case, a match state can only ever have exactly one pattern ID: PatternID::ZERO. So there’s no need to represent it.
sourcefn is_from_word(&self) -> bool
fn is_from_word(&self) -> bool
Returns true if and only if this state is marked as having been created from a transition over a word byte. This is useful for checking whether a word boundary assertion is true or not, which requires look-behind (whether the current state came from a word byte or not) and look-ahead (whether the transition byte is a word byte or not).
Since states with this set are distinct from states that don’t have this set (even if they are otherwise equivalent), callers should not set this assertion unless the underlying NFA has at least one word boundary assertion somewhere. Otherwise, a superfluous number of states may be created.
sourcefn is_half_crlf(&self) -> bool
fn is_half_crlf(&self) -> bool
Returns true if and only if this state is marked as being inside of a
CRLF terminator. In the forward direction, this means the state was
created after seeing a \r
. In the reverse direction, this means the
state was created after seeing a \n
.
sourcefn look_have(&self) -> LookSet
fn look_have(&self) -> LookSet
The set of look-behind assertions that were true in the transition that created this state.
Generally, this should be empty if ‘look_need’ is empty, since there is no reason to track which look-behind assertions are true if the state has no conditional epsilon transitions.
Satisfied look-ahead assertions are not tracked in states. Instead, these are re-computed on demand via epsilon closure when computing the transition function.
sourcefn look_need(&self) -> LookSet
fn look_need(&self) -> LookSet
The set of look-around (both behind and ahead) assertions that appear at least once in this state’s set of NFA states.
This is used to determine whether the epsilon closure needs to be re-computed when computing the transition function. Namely, if the state has no conditional epsilon transitions, then there is no need to re-compute the epsilon closure.
sourcefn match_len(&self) -> usize
fn match_len(&self) -> usize
Returns the total number of match pattern IDs in this state.
If this state is not a match state, then this always returns 0.
sourcefn match_pattern(&self, index: usize) -> PatternID
fn match_pattern(&self, index: usize) -> PatternID
Returns the pattern ID for this match state at the given index.
If the given index is greater than or equal to match_len()
for this
state, then this could panic or return incorrect results.
sourcefn match_pattern_ids(&self) -> Option<Vec<PatternID>>
fn match_pattern_ids(&self) -> Option<Vec<PatternID>>
Returns a copy of all match pattern IDs in this state. If this state is not a match state, then this returns None.
sourcefn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F)
fn iter_match_pattern_ids<F: FnMut(PatternID)>(&self, f: F)
Calls the given function on every pattern ID in this state.
sourcefn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F)
fn iter_nfa_state_ids<F: FnMut(StateID)>(&self, f: F)
Calls the given function on every NFA state ID in this state.
sourcefn pattern_offset_end(&self) -> usize
fn pattern_offset_end(&self) -> usize
Returns the offset into this state’s representation where the pattern IDs end and the NFA state IDs begin.
sourcefn encoded_pattern_len(&self) -> usize
fn encoded_pattern_len(&self) -> usize
Returns the total number of encoded pattern IDs in this state.
This may return 0 even when this is a match state, since the pattern
ID PatternID::ZERO
is not encoded when it’s the only pattern ID in
the match state (the overwhelming common case).