Struct regex_automata::util::alphabet::ByteClasses

source ·
pub struct ByteClasses([u8; 256]);
Expand description

A representation of byte oriented equivalence classes.

This is used in a DFA to reduce the size of the transition table. This can have a particularly large impact not only on the total size of a dense DFA, but also on compile times.

The essential idea here is that the alphabet of a DFA is shrunk from the usual 256 distinct byte values down to a set of equivalence classes. The guarantee you get is that any byte belonging to the same equivalence class can be treated as if it were any other byte in the same class, and the result of a search wouldn’t change.

§Example

This example shows how to get byte classes from an NFA and ask for the class of various bytes.

use regex_automata::nfa::thompson::NFA;

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
// 'a' and 'z' are in the same class for this regex.
assert_eq!(classes.get(b'a'), classes.get(b'z'));
// But 'a' and 'A' are not.
assert_ne!(classes.get(b'a'), classes.get(b'A'));

Tuple Fields§

§0: [u8; 256]

Implementations§

source§

impl ByteClasses

source

pub fn empty() -> ByteClasses

Creates a new set of equivalence classes where all bytes are mapped to the same class.

source

pub fn singletons() -> ByteClasses

Creates a new set of equivalence classes where each byte belongs to its own equivalence class.

source

pub(crate) fn from_bytes( slice: &[u8], ) -> Result<(ByteClasses, usize), DeserializeError>

Deserializes a byte class map from the given slice. If the slice is of insufficient length or otherwise contains an impossible mapping, then an error is returned. Upon success, the number of bytes read along with the map are returned. The number of bytes read is always a multiple of 8.

source

pub(crate) fn write_to(&self, dst: &mut [u8]) -> Result<usize, SerializeError>

Writes this byte class map to the given byte buffer. if the given buffer is too small, then an error is returned. Upon success, the total number of bytes written is returned. The number of bytes written is guaranteed to be a multiple of 8.

source

pub(crate) fn write_to_len(&self) -> usize

Returns the total number of bytes written by write_to.

source

pub fn set(&mut self, byte: u8, class: u8)

Set the equivalence class for the given byte.

source

pub fn get(&self, byte: u8) -> u8

Get the equivalence class for the given byte.

source

pub fn get_by_unit(&self, unit: Unit) -> usize

Get the equivalence class for the given haystack unit and return the class as a usize.

source

pub fn eoi(&self) -> Unit

Create a unit that represents the “end of input” sentinel based on the number of equivalence classes.

source

pub fn alphabet_len(&self) -> usize

Return the total number of elements in the alphabet represented by these equivalence classes. Equivalently, this returns the total number of equivalence classes.

source

pub fn stride2(&self) -> usize

Returns the stride, as a base-2 exponent, required for these equivalence classes.

The stride is always the smallest power of 2 that is greater than or equal to the alphabet length, and the stride2 returned here is the exponent applied to 2 to get the smallest power. This is done so that converting between premultiplied state IDs and indices can be done with shifts alone, which is much faster than integer division.

source

pub fn is_singleton(&self) -> bool

Returns true if and only if every byte in this class maps to its own equivalence class. Equivalently, there are 257 equivalence classes and each class contains either exactly one byte or corresponds to the singleton class containing the “end of input” sentinel.

source

pub fn iter(&self) -> ByteClassIter<'_>

Returns an iterator over all equivalence classes in this set.

source

pub fn representatives<R: RangeBounds<u8>>( &self, range: R, ) -> ByteClassRepresentatives<'_>

Returns an iterator over a sequence of representative bytes from each equivalence class within the range of bytes given.

When the given range is unbounded on both sides, the iterator yields exactly N items, where N is equivalent to the number of equivalence classes. Each item is an arbitrary byte drawn from each equivalence class.

This is useful when one is determinizing an NFA and the NFA’s alphabet hasn’t been converted to equivalence classes. Picking an arbitrary byte from each equivalence class then permits a full exploration of the NFA instead of using every possible byte value and thus potentially saves quite a lot of redundant work.

§Example

This shows an example of what a complete sequence of representatives might look like from a real example.

use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let reps: Vec<Unit> = classes.representatives(..).collect();
// Note that the specific byte values yielded are not guaranteed!
let expected = vec![
    Unit::u8(b'\x00'),
    Unit::u8(b'a'),
    Unit::u8(b'{'),
    Unit::eoi(3),
];
assert_eq!(expected, reps);

Note though, that you can ask for an arbitrary range of bytes, and only representatives for that range will be returned:

use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let reps: Vec<Unit> = classes.representatives(b'A'..=b'z').collect();
// Note that the specific byte values yielded are not guaranteed!
let expected = vec![
    Unit::u8(b'A'),
    Unit::u8(b'a'),
];
assert_eq!(expected, reps);
source

pub fn elements(&self, class: Unit) -> ByteClassElements<'_>

Returns an iterator of the bytes in the given equivalence class.

This is useful when one needs to know the actual bytes that belong to an equivalence class. For example, conceptually speaking, accelerating a DFA state occurs when a state only has a few outgoing transitions. But in reality, what is required is that there are only a small number of distinct bytes that can lead to an outgoing transition. The difference is that any one transition can correspond to an equivalence class which may contains many bytes. Therefore, DFA state acceleration considers the actual elements in each equivalence class of each outgoing transition.

§Example

This shows an example of how to get all of the elements in an equivalence class.

use regex_automata::{nfa::thompson::NFA, util::alphabet::Unit};

let nfa = NFA::new("[a-z]+")?;
let classes = nfa.byte_classes();
let elements: Vec<Unit> = classes.elements(Unit::u8(1)).collect();
let expected: Vec<Unit> = (b'a'..=b'z').map(Unit::u8).collect();
assert_eq!(expected, elements);
source

fn element_ranges(&self, class: Unit) -> ByteClassElementRanges<'_>

Returns an iterator of byte ranges in the given equivalence class.

That is, a sequence of contiguous ranges are returned. Typically, every class maps to a single contiguous range.

Trait Implementations§

source§

impl Clone for ByteClasses

source§

fn clone(&self) -> ByteClasses

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for ByteClasses

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Default for ByteClasses

source§

fn default() -> ByteClasses

Returns the “default value” for a type. Read more
source§

impl Copy for ByteClasses

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> CloneToUninit for T
where T: Clone,

source§

unsafe fn clone_to_uninit(&self, dst: *mut T)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

source§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

source§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.