Module regex_syntax::hir::literal
source · Expand description
Provides literal extraction from Hir
expressions.
An Extractor
pulls literals out of Hir
expressions and returns a
Seq
of Literal
s.
The purpose of literal extraction is generally to provide avenues for optimizing regex searches. The main idea is that substring searches can be an order of magnitude faster than a regex search. Therefore, if one can execute a substring search to find candidate match locations and only run the regex search at those locations, then it is possible for huge improvements in performance to be realized.
With that said, literal optimizations are generally a black art because even though substring search is generally faster, if the number of candidates produced is high, then it can create a lot of overhead by ping-ponging between the substring search and the regex search.
Here are some heuristics that might be used to help increase the chances of effective literal optimizations:
- Stick to small
Seq
s. If you search for too many literals, it’s likely to lead to substring search that is only a little faster than a regex search, and thus the overhead of using literal optimizations in the first place might make things slower overall. - The literals in your
Seq
shouldn’t be too short. In general, longer is better. A sequence corresponding to single bytes that occur frequently in the haystack, for example, is probably a bad literal optimization because it’s likely to produce many false positive candidates. Longer literals are less likely to match, and thus probably produce fewer false positives. - If it’s possible to estimate the approximate frequency of each byte according
to some pre-computed background distribution, it is possible to compute a score
of how “good” a
Seq
is. If aSeq
isn’t good enough, you might consider skipping the literal optimization and just use the regex engine.
(It should be noted that there are always pathological cases that can make any kind of literal optimization be a net slower result. This is why it might be a good idea to be conservative, or to even provide a means for literal optimizations to be dynamically disabled if they are determined to be ineffective according to some measure.)
You’re encouraged to explore the methods on Seq
, which permit shrinking
the size of sequences in a preference-order preserving fashion.
Finally, note that it isn’t strictly necessary to use an Extractor
. Namely,
an Extractor
only uses public APIs of the Seq
and Literal
types,
so it is possible to implement your own extractor. For example, for n-grams
or “inner” literals (i.e., not prefix or suffix literals). The Extractor
is mostly responsible for the case analysis over Hir
expressions. Much of
the “trickier” parts are how to combine literal sequences, and that is all
implemented on Seq
.
Structs§
- Extracts prefix or suffix literal sequences from
Hir
expressions. - A single literal extracted from an
Hir
expression. - A “preference” trie that rejects literals that will never match when executing a leftmost first or “preference” search.
- A sequence of literals.
- State 🔒A single state in a trie. Uses a sparse representation for its transitions.
Enums§
- The kind of literals to extract from an
Hir
expression.
Functions§
- Returns the “rank” of the given byte.