Struct regex_automata::nfa::thompson::nfa::NFA


pub struct NFA(Arc<Inner>);


A byte oriented Thompson non-deterministic finite automaton (NFA).

A Thompson NFA is a finite state machine that permits unconditional epsilon transitions, but guarantees that there exists at most one non-epsilon transition for each element in the alphabet for each state.

An NFA may be used directly for searching, for analysis or to build a deterministic finite automaton (DFA).

Cheap clones

Since an NFA is a core data type in this crate that many other regex engines are built on, it is convenient to give ownership of an NFA to those engines. For this reason, an NFA uses reference counting internally, so cloning one is cheap and encouraged.
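The following is a minimal, standard-library-only sketch of why a type shaped like NFA(Arc<Inner>) is cheap to clone. The names Handle, Inner, share_count, shares_with and state_len are hypothetical stand-ins for illustration and are not part of this crate; the point is only that a clone bumps a reference count and does not copy the underlying data.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the private data behind an NFA.
#[derive(Debug)]
struct Inner {
    states: Vec<u32>,
}

/// Hypothetical handle mirroring the `NFA(Arc<Inner>)` layout.
/// Deriving `Clone` here clones the `Arc`, not the `Inner` value.
#[derive(Clone, Debug)]
struct Handle(Arc<Inner>);

impl Handle {
    /// Allocates the shared data exactly once.
    fn new(states: Vec<u32>) -> Handle {
        Handle(Arc::new(Inner { states }))
    }

    /// Number of handles currently sharing the same allocation.
    fn share_count(&self) -> usize {
        Arc::strong_count(&self.0)
    }

    /// Whether two handles point at the same underlying data.
    fn shares_with(&self, other: &Handle) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }

    /// Number of states stored in the shared data.
    fn state_len(&self) -> usize {
        self.0.states.len()
    }
}
```

With this layout, handing a clone to each regex engine costs one atomic increment regardless of how large the shared data is.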

Capabilities

Using an NFA for searching via the PikeVM provides the most “power” of any regex engine in this crate. Namely, it supports the following in all cases:

1. Detection of a match.
2. Location of a match, including both the start and end offset, in a single pass of the haystack.
3. Location of matching capturing groups.
4. Handles multiple patterns, including (1)-(3) when multiple patterns are present.

Capturing Groups

Groups refer to parenthesized expressions inside a regex pattern. They look like this, where exp is an arbitrary regex:

(exp) - An unnamed capturing group.
(?P<name>exp) or (?<name>exp) - A named capturing group.
(?:exp) - A non-capturing group.
(?i:exp) - A non-capturing group that sets flags.

Only the first two forms are said to be capturing. Capturing means that the last position at which they match is reportable. The Captures type provides convenient access to the match positions of capturing groups, which includes looking up capturing groups by their name.

Byte oriented

This NFA is byte oriented, which means that all of its transitions are defined on bytes. In other words, the alphabet of an NFA consists of the 256 different byte values.
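One consequence of a byte alphabet is that a non-ASCII codepoint is consumed as a sequence of byte transitions over its UTF-8 encoding, not as a single symbol. The sketch below uses only the standard library; the helper names utf8_bytes and byte_transitions are hypothetical and exist purely to illustrate the idea.

```rust
/// Returns the UTF-8 bytes of `c`. A byte oriented automaton matching `c`
/// literally consumes these bytes one transition at a time.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Counts the byte transitions needed to match every codepoint in `text`
/// literally, which equals the length of its UTF-8 encoding.
fn byte_transitions(text: &str) -> usize {
    text.chars().map(|c| utf8_bytes(c).len()).sum()
}
```

For instance, utf8_bytes('δ') yields the two bytes 0xCE and 0xB4, so matching that single codepoint takes two byte transitions.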

While DFAs nearly demand that they be byte oriented for performance reasons, an NFA could conceivably be Unicode codepoint oriented. Indeed, a previous version of this NFA supported both byte and codepoint oriented modes. A codepoint oriented mode can work because an NFA fundamentally uses a sparse representation of transitions, which works well with the large sparse space of Unicode codepoints.

Nevertheless, this NFA is only byte oriented. This choice is primarily driven by implementation simplicity, and in part by memory usage. In practice, performance between the two is roughly comparable. However, building a DFA (including a hybrid DFA) really wants a byte oriented NFA. So if we had a codepoint oriented NFA, we would also need to generate a byte oriented NFA in order to build a hybrid NFA/DFA. Thus, by only generating byte oriented NFAs, we can produce one less NFA. In other words, if we made our NFA codepoint oriented, we’d need to also make it support a byte oriented mode, which is more complicated. But a byte oriented mode can support everything.

Differences with DFAs

At the theoretical level, the precise difference between an NFA and a DFA is that, in a DFA, for every state, an input symbol unambiguously refers to a single transition and that an input symbol is required for each transition. At a practical level, this permits the core of a DFA search to be implemented with a small constant number of CPU instructions for each byte of input searched. In practice, this makes DFAs quite a bit faster than NFAs in general. Namely, in order to execute a search for any Thompson NFA, one needs to keep track of a set of states, and execute the possible transitions on all of those states for each input symbol. Overall, this results in much more overhead. To a first approximation, one can expect DFA searches to be about an order of magnitude faster.
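The state-set bookkeeping described above can be sketched with a toy simulation. This is an illustrative sketch only, written against the standard library; the State enum, add_state, is_full_match and example_nfa are hypothetical names and do not reflect this crate's actual State type or the PikeVM implementation. It performs an anchored check that the whole haystack matches.

```rust
/// A state in a tiny, hypothetical byte oriented Thompson-style NFA.
#[derive(Debug, Clone)]
enum State {
    /// Consume one byte in `start..=end`, then move to `next`.
    ByteRange { start: u8, end: u8, next: usize },
    /// Unconditional epsilon transitions to each listed state.
    Split(Vec<usize>),
    /// Accepting state.
    Match,
}

/// Follows epsilon transitions from `id` and records every reachable
/// non-epsilon state in `set`. `seen` prevents revisiting a state.
fn add_state(states: &[State], id: usize, set: &mut Vec<usize>, seen: &mut [bool]) {
    if seen[id] {
        return;
    }
    seen[id] = true;
    match &states[id] {
        State::Split(alts) => {
            for &alt in alts {
                add_state(states, alt, set, seen);
            }
        }
        _ => set.push(id),
    }
}

/// Returns true if the entire haystack matches, starting at `start_id`.
/// Every input byte is tried against every state in the current set,
/// which is the per-byte overhead a DFA avoids.
fn is_full_match(states: &[State], start_id: usize, haystack: &[u8]) -> bool {
    let mut current = Vec::new();
    let mut seen = vec![false; states.len()];
    add_state(states, start_id, &mut current, &mut seen);
    for &byte in haystack {
        let mut next_set = Vec::new();
        let mut seen_next = vec![false; states.len()];
        for &id in &current {
            if let State::ByteRange { start, end, next } = &states[id] {
                if *start <= byte && byte <= *end {
                    add_state(states, *next, &mut next_set, &mut seen_next);
                }
            }
        }
        current = next_set;
    }
    current.iter().any(|&id| matches!(states[id], State::Match))
}

/// Hand-built states for the pattern `a[0-9]+`.
fn example_nfa() -> Vec<State> {
    vec![
        State::ByteRange { start: b'a', end: b'a', next: 1 }, // 0: 'a'
        State::ByteRange { start: b'0', end: b'9', next: 2 }, // 1: one digit
        State::Split(vec![1, 3]),                             // 2: repeat or finish
        State::Match,                                         // 3: accept
    ]
}
```

A DFA for the same pattern would track exactly one current state per byte, whereas this loop visits each state in the set, which is where the extra overhead comes from.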

So why use an NFA at all? The main advantage of an NFA is that it takes linear time and linear memory (in the size of the pattern string after repetitions have been expanded) to build. A DFA, on the other hand, may take exponential time and/or space to build. Even in non-pathological cases, DFAs often take quite a bit more memory than their NFA counterparts, especially if large Unicode character classes are involved. Of course, an NFA also provides additional capabilities. For example, it can match Unicode word boundaries on non-ASCII text and resolve the positions of capturing groups.

Note that a hybrid::regex::Regex strikes a good balance between an NFA and a DFA. It avoids the exponential build time of a DFA while maintaining its fast search time. The downside of a hybrid NFA/DFA is that in some cases it can be slower at search time than the NFA. (It also has less functionality than a pure NFA. It cannot handle Unicode word boundaries on non-ASCII text and cannot resolve capturing groups.)

Example

This shows how to build an NFA with the default configuration and execute a search using the Pike VM.

use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::new(r"foo[0-9]+")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();

let expected = Some(Match::must(0, 0..8));
re.captures(&mut cache, b"foo12345", &mut caps);
assert_eq!(expected, caps.get_match());

Example: resolving capturing groups

This example shows how to parse some simple dates and extract the components of each date via capturing groups.

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    util::captures::Captures,
};

let vm = PikeVM::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})")?;
let mut cache = vm.create_cache();

let haystack = "2012-03-14, 2013-01-01 and 2014-07-05";
let all: Vec<Captures> = vm.captures_iter(
    &mut cache, haystack.as_bytes()
).collect();
// There should be a total of 3 matches.
assert_eq!(3, all.len());
// The year from the second match is '2013'.
let span = all[1].get_group_by_name("y").unwrap();
assert_eq!("2013", &haystack[span]);

This example shows that only the last match of a capturing group is reported, even if it had to match multiple times for an overall match to occur.

use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};

let re = PikeVM::new(r"([a-z]){4}")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();

let haystack = b"quux";
re.captures(&mut cache, haystack, &mut caps);
assert!(caps.is_match());
assert_eq!(Some(Span::from(3..4)), caps.get_group(1));

Tuple Fields

0: Arc<Inner>

Implementations

impl NFA

pub fn new(pattern: &str) -> Result<NFA, BuildError>

pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<NFA, BuildError>

pub fn always_match() -> NFA

pub fn never_match() -> NFA

pub fn config() -> Config

pub fn compiler() -> Compiler

pub fn patterns(&self) -> PatternIter<'_>

pub fn pattern_len(&self) -> usize

pub fn start_anchored(&self) -> StateID

pub fn start_unanchored(&self) -> StateID

pub fn start_pattern(&self, pid: PatternID) -> Option<StateID>

pub(crate) fn byte_class_set(&self) -> &ByteClassSet

pub fn byte_classes(&self) -> &ByteClasses

pub fn state(&self, id: StateID) -> &State

pub fn states(&self) -> &[State]

pub fn group_info(&self) -> &GroupInfo

pub fn has_capture(&self) -> bool

pub fn has_empty(&self) -> bool

pub fn is_utf8(&self) -> bool

pub fn is_reverse(&self) -> bool

pub fn is_always_start_anchored(&self) -> bool

pub fn look_matcher(&self) -> &LookMatcher

pub fn look_set_any(&self) -> LookSet

pub fn look_set_prefix_any(&self) -> LookSet

pub fn memory_usage(&self) -> usize

Trait Implementations

impl Clone for NFA

fn clone(&self) -> NFA

fn clone_from(&mut self, source: &Self)

impl Debug for NFA

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Auto Trait Implementations

impl RefUnwindSafe for NFA

impl Send for NFA

impl Sync for NFA

impl Unpin for NFA

impl UnwindSafe for NFA

Blanket Implementations

impl<T> Any for T where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T where U: From<T>,

fn into(self) -> U

impl<T> ToOwned for T where T: Clone,

type Owned = T

fn to_owned(&self) -> T

fn clone_into(&self, target: &mut T)

impl<T, U> TryFrom<U> for T where U: Into<T>,

type Error = Infallible

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

impl<T, U> TryInto<U> for T where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>