Module regex_automata::meta::reverse_inner

source ·
Expand description

A module dedicated to plucking inner literals out of a regex pattern, and then constructing a prefilter for them. We also include a regex pattern “prefix” that corresponds to the bits of the regex that need to match before the literals do. The reverse inner optimization then proceeds by looking for matches of the inner literal(s), and then doing a reverse search of the prefix from the start of the literal match to find the overall start position of the match.

The essential invariant we want to uphold here is that the literals we return reflect a set where at least one of them must match in order for the overall regex to match. We also need to maintain the invariant that the regex prefix returned corresponds to the entirety of the regex up until the literals we return.

This somewhat limits what we can do. That is, if we a regex like \w+(@!|%%)\w+, then we can pluck the {@!, %%} out and build a prefilter from it. Then we just need to compile \w+ in reverse. No fuss no muss. But if we have a regex like \d+@!|\w+%%, then we get kind of stymied. Technically, we could still extract {@!, %%}, and it is true that at least of them must match. But then, what is our regex prefix? Again, in theory, that could be \d+|\w+, but that's not quite right, because the \d+only matches when@!matches, and\w+only matches when%%` matches.

All of that is technically possible to do, but it seemingly requires a lot of sophistication and machinery. Probably the way to tackle that is with some kind of formalism and approach this problem more generally.

For now, the code below basically just looks for a top-level concatenation. And if it can find one, it looks for literals in each of the direct child sub-expressions of that concatenation. If some good ones are found, we return those and a concatenation of the Hir expressions seen up to that point.

Functions§

  • extract 🔒
    Attempts to extract an “inner” prefilter from the given HIR expressions. If one was found, then a concatenation of the HIR expressions that precede it is returned.
  • flatten 🔒
    Returns a copy of the given HIR but with all capturing groups removed.
  • prefilter 🔒
    Attempt to extract a prefilter from an HIR expression.
  • top_concat 🔒
    Looks for a “top level” HirKind::Concat item in the given HIR. This will try to return one even if it’s embedded in a capturing group, but is otherwise pretty conservative in what is returned.