const MAX_POOL_STACKS: usize = 8;
The number of stacks we use inside of the pool. These are only used for non-owners. That is, these represent the “slow” path.
In the original implementation of this pool, we only used a single stack. While this might be okay for a couple threads, the prevalence of 32, 64 and even 128 core CPUs has made it untenable. The contention such an environment introduces when threads are doing a lot of searches on short haystacks (a not uncommon use case) is palpable and leads to huge slowdowns.
This constant reflects a change from using one stack to the number of stacks given by this constant. The stack for a particular thread is simply chosen by thread_id % MAX_POOL_STACKS. The idea behind this setup is that there should be a good chance that accesses to the pool will be distributed over several stacks instead of all of them converging to one.
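The selection scheme is nothing more than a modulo over the thread id. A minimal sketch of just that mapping (the constant matches the one documented here, but stack_index is a hypothetical helper name, not part of the actual pool):

```rust
const MAX_POOL_STACKS: usize = 8;

// Hypothetical helper: map an arbitrary thread id to one of the fixed
// stacks. Thread ids that differ modulo MAX_POOL_STACKS land on
// different stacks, which is what spreads contention across them.
fn stack_index(thread_id: usize) -> usize {
    thread_id % MAX_POOL_STACKS
}
```

Note that this assigns stacks statically: two threads whose ids collide modulo 8 will always share a stack, which is part of why a fixed count helps an 8 core CPU more than a 128 core one.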
This is not a particularly smart or dynamic strategy. Fixing this to a specific number has at least two downsides. First is that it will help, say, an 8 core CPU more than it will a 128 core CPU. (But, crucially, it will still help the 128 core case.) Second is that this may wind up being a little wasteful with respect to memory usage. Namely, if a regex is used on one thread and then moved to another thread, then it could result in creating a new copy of the data in the pool even though only one is actually needed.
And that memory usage bit is why this is set to 8 and not, say, 64. Keeping it at 8 limits, to an extent, how much unnecessary memory can be allocated.
In an ideal world, we’d be able to have something like this:
- Grow the number of stacks as the number of concurrent callers increases. I spent a little time trying this, but even just adding an atomic addition/subtraction for each pop/push for tracking concurrent callers led to a big perf hit. Since even more work would seemingly be required than just an addition/subtraction, I abandoned this approach.
- The maximum amount of memory used should scale with the number of concurrent callers and not the total number of existing threads. This is primarily why the thread_local crate isn't used, as some environments spin up a lot of threads. This led to multiple reports of extremely high memory usage (often described as memory leaks).
- Even more ideally, the pool should contract in size. That is, it should grow with bursts and then shrink. But this is a pretty thorny issue to tackle and it might be better to just not.
- It would be nice to explore the use of, say, a lock-free stack instead of using a mutex to guard a Vec that is ultimately just treated as a stack. The main thing preventing me from exploring this is the ABA problem. The crossbeam crate has tools for dealing with this sort of problem (via its epoch based memory reclamation strategy), but I can't justify bringing in all of crossbeam as a dependency of regex for this.
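For reference, the current layout described above amounts to a fixed array of mutex-guarded Vecs, each used purely as a stack. This is a simplified sketch under that description, not the actual pool implementation; ShardedStacks and its methods are hypothetical names:

```rust
use std::sync::Mutex;

const MAX_POOL_STACKS: usize = 8;

// A minimal sketch of the sharded layout: each slot is a Mutex-guarded
// Vec treated purely as a stack (push/pop at the end).
struct ShardedStacks<T> {
    stacks: Vec<Mutex<Vec<T>>>,
}

impl<T> ShardedStacks<T> {
    fn new() -> ShardedStacks<T> {
        ShardedStacks {
            stacks: (0..MAX_POOL_STACKS)
                .map(|_| Mutex::new(Vec::new()))
                .collect(),
        }
    }

    // Pop a cached value from the stack chosen by this thread's id,
    // if one is available.
    fn pop(&self, thread_id: usize) -> Option<T> {
        self.stacks[thread_id % MAX_POOL_STACKS].lock().unwrap().pop()
    }

    // Return a value to the same stack for later reuse.
    fn push(&self, thread_id: usize, value: T) {
        self.stacks[thread_id % MAX_POOL_STACKS].lock().unwrap().push(value);
    }
}
```

Each mutex here guards only one of the eight stacks, so two threads whose ids differ modulo MAX_POOL_STACKS never contend on the same lock; a lock-free stack would remove the locks entirely, at the cost of having to solve the ABA problem.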
See this issue for more context and discussion: https://github.com/rust-lang/regex/issues/934