Constant regex_automata::util::pool::inner::MAX_POOL_STACKS

source ·
const MAX_POOL_STACKS: usize = 8;
Expand description

The number of stacks we use inside of the pool. These are only used for non-owners. That is, these represent the “slow” path.

In the original implementation of this pool, we only used a single stack. While this might be okay for a couple threads, the prevalence of 32, 64 and even 128 core CPUs has made it untenable. The contention such an environment introduces when threads are doing a lot of searches on short haystacks (a not uncommon use case) is palpable and leads to huge slowdowns.

This constant reflects a change from using one stack to the number of stacks that this constant is set to. The stack for a particular thread is simply chosen by thread_id % MAX_POOL_STACKS. The idea behind this setup is that there should be a good chance that accesses to the pool will be distributed over several stacks instead of all of them converging to one.

This is not a particularly smart or dynamic strategy. Fixing this to a specific number has at least two downsides. First is that it will help, say, an 8 core CPU more than it will a 128 core CPU. (But, crucially, it will still help the 128 core case.) Second is that this may wind up being a little wasteful with respect to memory usage. Namely, if a regex is used on one thread and then moved to another thread, then it could result in creating a new copy of the data in the pool even though only one is actually needed.

And that memory usage bit is why this is set to 8 and not, say, 64. Keeping it at 8 limits, to an extent, how much unnecessary memory can be allocated.

In an ideal world, we’d be able to have something like this:

  • Grow the number of stacks as the number of concurrent callers increases. I spent a little time trying this, but even just adding an atomic addition/subtraction for each pop/push for tracking concurrent callers led to a big perf hit. Since even more work would seemingly be required than just an addition/subtraction, I abandoned this approach.
  • The maximum amount of memory used should scale with respect to the number of concurrent callers and not the total number of existing threads. This is primarily why the thread_local crate isn’t used, as as some environments spin up a lot of threads. This led to multiple reports of extremely high memory usage (often described as memory leaks).
  • Even more ideally, the pool should contract in size. That is, it should grow with bursts and then shrink. But this is a pretty thorny issue to tackle and it might be better to just not.
  • It would be nice to explore the use of, say, a lock-free stack instead of using a mutex to guard a Vec that is ultimately just treated as a stack. The main thing preventing me from exploring this is the ABA problem. The crossbeam crate has tools for dealing with this sort of problem (via its epoch based memory reclamation strategy), but I can’t justify bringing in all of crossbeam as a dependency of regex for this.

See this issue for more context and discussion: https://github.com/rust-lang/regex/issues/934