Struct regex_automata::util::alphabet::ByteClasses
source · pub struct ByteClasses([u8; 256]);
Expand description
A representation of byte oriented equivalence classes.
This is used in a DFA to reduce the size of the transition table. This can have a particularly large impact not only on the total size of a dense DFA, but also on compile times.
The essential idea here is that the alphabet of a DFA is shrunk from the usual 256 distinct byte values down to a set of equivalence classes. The guarantee you get is that any byte belonging to the same equivalence class can be treated as if it were any other byte in the same class, and the result of a search wouldn’t change.
§Example
This example shows how to get byte classes from an
NFA
and ask for the class of various bytes.
use regex_automata::nfa::thompson::NFA;
let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
// 'a' and 'z' are in the same class for this regex.
assert_eq!(classes.get(b'a'), classes.get(b'z'));
// But 'a' and 'A' are not.
assert_ne!(classes.get(b'a'), classes.get(b'A'));
Tuple Fields§
§0: [u8; 256]
Implementations§
source§impl ByteClasses
impl ByteClasses
sourcepub fn empty() -> ByteClasses
pub fn empty() -> ByteClasses
Creates a new set of equivalence classes where all bytes are mapped to the same class.
sourcepub fn singletons() -> ByteClasses
pub fn singletons() -> ByteClasses
Creates a new set of equivalence classes where each byte belongs to its own equivalence class.
sourcepub(crate) fn from_bytes(
slice: &[u8],
) -> Result<(ByteClasses, usize), DeserializeError>
pub(crate) fn from_bytes( slice: &[u8], ) -> Result<(ByteClasses, usize), DeserializeError>
Deserializes a byte class map from the given slice. If the slice is of insufficient length or otherwise contains an impossible mapping, then an error is returned. Upon success, the number of bytes read along with the map are returned. The number of bytes read is always a multiple of 8.
sourcepub(crate) fn write_to(&self, dst: &mut [u8]) -> Result<usize, SerializeError>
pub(crate) fn write_to(&self, dst: &mut [u8]) -> Result<usize, SerializeError>
Writes this byte class map to the given byte buffer. if the given buffer is too small, then an error is returned. Upon success, the total number of bytes written is returned. The number of bytes written is guaranteed to be a multiple of 8.
sourcepub(crate) fn write_to_len(&self) -> usize
pub(crate) fn write_to_len(&self) -> usize
Returns the total number of bytes written by write_to
.
sourcepub fn get_by_unit(&self, unit: Unit) -> usize
pub fn get_by_unit(&self, unit: Unit) -> usize
Get the equivalence class for the given haystack unit and return the
class as a usize
.
sourcepub fn eoi(&self) -> Unit
pub fn eoi(&self) -> Unit
Create a unit that represents the “end of input” sentinel based on the number of equivalence classes.
sourcepub fn alphabet_len(&self) -> usize
pub fn alphabet_len(&self) -> usize
Return the total number of elements in the alphabet represented by these equivalence classes. Equivalently, this returns the total number of equivalence classes.
sourcepub fn stride2(&self) -> usize
pub fn stride2(&self) -> usize
Returns the stride, as a base-2 exponent, required for these equivalence classes.
The stride is always the smallest power of 2 that is greater than or
equal to the alphabet length, and the stride2
returned here is the
exponent applied to 2
to get the smallest power. This is done so that
converting between premultiplied state IDs and indices can be done with
shifts alone, which is much faster than integer division.
sourcepub fn is_singleton(&self) -> bool
pub fn is_singleton(&self) -> bool
Returns true if and only if every byte in this class maps to its own equivalence class. Equivalently, there are 257 equivalence classes and each class contains either exactly one byte or corresponds to the singleton class containing the “end of input” sentinel.
sourcepub fn iter(&self) -> ByteClassIter<'_> ⓘ
pub fn iter(&self) -> ByteClassIter<'_> ⓘ
Returns an iterator over all equivalence classes in this set.
sourcepub fn representatives<R: RangeBounds<u8>>(
&self,
range: R,
) -> ByteClassRepresentatives<'_> ⓘ
pub fn representatives<R: RangeBounds<u8>>( &self, range: R, ) -> ByteClassRepresentatives<'_> ⓘ
Returns an iterator over a sequence of representative bytes from each equivalence class within the range of bytes given.
When the given range is unbounded on both sides, the iterator yields exactly N items, where N is equivalent to the number of equivalence classes. Each item is an arbitrary byte drawn from each equivalence class.
This is useful when one is determinizing an NFA and the NFA’s alphabet hasn’t been converted to equivalence classes. Picking an arbitrary byte from each equivalence class then permits a full exploration of the NFA instead of using every possible byte value and thus potentially saves quite a lot of redundant work.
§Example
This shows an example of what a complete sequence of representatives might look like from a real example.
use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};
let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let reps: Vec<Unit> = classes.representatives(..).collect();
// Note that the specific byte values yielded are not guaranteed!
let expected = vec![
Unit::u8(b'\x00'),
Unit::u8(b'a'),
Unit::u8(b'{'),
Unit::eoi(3),
];
assert_eq!(expected, reps);
Note though, that you can ask for an arbitrary range of bytes, and only representatives for that range will be returned:
use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};
let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let reps: Vec<Unit> = classes.representatives(b'A'..=b'z').collect();
// Note that the specific byte values yielded are not guaranteed!
let expected = vec![
Unit::u8(b'A'),
Unit::u8(b'a'),
];
assert_eq!(expected, reps);
sourcepub fn elements(&self, class: Unit) -> ByteClassElements<'_> ⓘ
pub fn elements(&self, class: Unit) -> ByteClassElements<'_> ⓘ
Returns an iterator of the bytes in the given equivalence class.
This is useful when one needs to know the actual bytes that belong to an equivalence class. For example, conceptually speaking, accelerating a DFA state occurs when a state only has a few outgoing transitions. But in reality, what is required is that there are only a small number of distinct bytes that can lead to an outgoing transition. The difference is that any one transition can correspond to an equivalence class which may contains many bytes. Therefore, DFA state acceleration considers the actual elements in each equivalence class of each outgoing transition.
§Example
This shows an example of how to get all of the elements in an equivalence class.
use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};
let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let elements: Vec<Unit> = classes.elements(Unit::u8(1)).collect();
let expected: Vec<Unit> = (b'a'..=b'z').map(Unit::u8).collect();
assert_eq!(expected, elements);
sourcefn element_ranges(&self, class: Unit) -> ByteClassElementRanges<'_> ⓘ
fn element_ranges(&self, class: Unit) -> ByteClassElementRanges<'_> ⓘ
Returns an iterator of byte ranges in the given equivalence class.
That is, a sequence of contiguous ranges are returned. Typically, every class maps to a single contiguous range.
Trait Implementations§
source§impl Clone for ByteClasses
impl Clone for ByteClasses
source§fn clone(&self) -> ByteClasses
fn clone(&self) -> ByteClasses
1.0.0 · source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source
. Read moresource§impl Debug for ByteClasses
impl Debug for ByteClasses
source§impl Default for ByteClasses
impl Default for ByteClasses
source§fn default() -> ByteClasses
fn default() -> ByteClasses
impl Copy for ByteClasses
Auto Trait Implementations§
impl Freeze for ByteClasses
impl RefUnwindSafe for ByteClasses
impl Send for ByteClasses
impl Sync for ByteClasses
impl Unpin for ByteClasses
impl UnwindSafe for ByteClasses
Blanket Implementations§
source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
source§impl<T> CloneToUninit for Twhere
T: Clone,
impl<T> CloneToUninit for Twhere
T: Clone,
source§unsafe fn clone_to_uninit(&self, dst: *mut T)
unsafe fn clone_to_uninit(&self, dst: *mut T)
clone_to_uninit
)