Expand description
§Internal layout of ZeroTrie
A ZeroTrie is composed of a series of nodes stored in sequence in a byte slice.
There are 4 types of nodes:
- ASCII (
0xxxxxxx
): matches a literal ASCII byte. - Span (
101xxxxx
): matches a span of non-ASCII bytes. - Value (
100xxxxx
): associates a value with a string - Branch (
11xxxxxx
): matches one of a set of bytes.
Span, Value, and Branch nodes contain a varint, which has different semantics for each:
- Span varint: length of the span
- Value varint: value associated with the string
- Branch varint: number of edges in the branch and width of the offset table
If reading an ASCII, Span, or Branch node, one or more bytes are consumed from the input
string. If the next byte(s) in the input string do not match the node, we return None
.
If reading a Value node, if the string is empty, return Some(value)
; otherwise, we skip
the Value node and continue on to the next node.
When a node is consumed, a shorter, well-formed ZeroTrie remains.
§Basic Example
Here is an example ZeroTrie without branch nodes:
use zerotrie::ZeroTriePerfectHash;
let bytes = [
b'a', // ASCII literal
0b10001010, // value 10
b'b', // ASCII literal
0b10100011, // span of 3
0x81, // first byte in span
0x91, // second byte in span
0xA1, // third and final byte in span
0b10000100, // value 4
];
let trie = ZeroTriePerfectHash::from_bytes(&bytes);
// First value: "a" → 10
assert_eq!(trie.get(b"a"), Some(10));
// Second value: "ab\x81\x91\xA1" → 4
assert_eq!(trie.get(b"ab\x81\x91\xA1"), Some(4));
// A few examples of strings that do NOT have values in the trie:
assert_eq!(trie.get(b"ab"), None);
assert_eq!(trie.get(b"b"), None);
assert_eq!(trie.get(b"b\x81\x91\xA1"), None);
§Branch Nodes
There are two types of branch nodes: binary search and perfect hash. ZeroTrieSimpleAscii
contains only binary search nodes, whereas ZeroTriePerfectHash
can contain either.
The head node of the branch has a varint that encodes two things:
- Bottom 8 bits: number of edges in the branch (
N
); if N = 0, set N to 256 - Bits 9 and 10: width of the offset table (
W
)
Note that N is always in the range [1, 256]. There can’t be more than 256 edges because there are only 256 unique u8 values.
A few examples of the head node of the branch:
0b11000000
: varint bits0
: N = 0 which means N = 256; W = 00b11000110
: varint bits110
: N = 6; W = 00b11100000 0b00000101
: varint bits1000101
: N = 69; W = 00b11100010 0b00000000
: varint bits101000000
: N = 64; W = 1
In ZeroTriePerfectHash
, if N <= 15, the branch is assumed to be a binary search, and if
N > 15, the branch is assumed to be a perfect hash.
§Binary Search Branch Nodes
A binary search branch node is used when:
- The trie is a
ZeroTrieSimpleAscii
, OR - There are 15 or fewer items in the branch.
The head branch node is followed by N sorted bytes. When evaluating a branch node, one byte
is consumed from the input. If it is one of the N sorted bytes (scanned using binary search),
the index i
of the byte within the list is used to index into the offset table (described
below). If the byte is not in the list, the string is not in the trie, so return None
.
§Perfect Hash Branch Nodes
A perfect hash branch node is used when:
- The trie is NOT a
ZeroTrieSimpleAscii
, AND - There are 16 or more items in the branch.
The head branch node is followed by 1 byte containing parameter p
, N bytes containing
parameters q
, and N bytes containing the bytes to match. From these parameters, either an
index within the hash table i
is resolved and used as input to index into the offset
table (described below), or the value is determined to not be present and None
is
returned. For more detail on resolving the perfect hash function, see crate::byte_phf
.
§Offset Tables
The offset table encodes the range of the remaining buffer containing the trie reachable
from the byte matched in the branch node. Both types of branch nodes include an offset
table followig the key lookup. Given the index i
from the first step, the range
[s_i, s_(i+1))
brackets the next step in the trie.
Offset tables utilize the W
parameter stored in the branch head node. The special case
when W == 0
, with N - 1
bytes, is easiest to understand:
Offset table, W = 0: [s_1, s_2, ..., s_(N-1)]
Note that s_0
is always 0 and s_N
is always the length of the remaining slice, so those
values are not explicitly included in the offset table.
When W > 0, the high and low bits of the offsets are in separate bytes, arranged as follows:
Generalized offset table: [a_1, a_2, ..., a_(N-1), b_1, b_2, ..., b_(N-1), c_1, ...]
where s_i = (a_i << 8 + b_i) << 8 + c_i ...
(high bits first, low bits last)
§Advanced Example
The following trie encodes the following map. It has multiple varints and branch nodes, which are all binary search with W = 0. Note that there is a value for the empty string.
- “” → 0
- “axb” → 100
- “ayc” → 2
- “azd” → 3
- “bxe” → 4
- “bxefg” → 500
- “bxefh” → 6
- “bxei” → 7
- “bxeikl” → 8
use zerotrie::ZeroTrieSimpleAscii;
let bytes = [
0b10000000, // value 0
0b11000010, // branch of 2
b'a', //
b'b', //
13, //
0b11000011, // start of 'a' subtree: branch of 3
b'x', //
b'y', //
b'z', //
3, //
5, //
b'b', //
0b10010000, // value 100 (lead)
0x54, // value 100 (trail)
b'c', //
0b10000010, // value 2
b'd', //
0b10000011, // value 3
b'x', // start of 'b' subtree
b'e', //
0b10000100, // value 4
0b11000010, // branch of 2
b'f', //
b'i', //
7, //
0b11000010, // branch of 2
b'g', //
b'h', //
2, //
0b10010011, // value 500 (lead)
0x64, // value 500 (trail)
0b10000110, // value 6
0b10000111, // value 7
b'k', //
b'l', //
0b10001000, // value 8
];
let trie = ZeroTrieSimpleAscii::from_bytes(&bytes);
// Assert that the specified items are in the map
assert_eq!(trie.get(b""), Some(0));
assert_eq!(trie.get(b"axb"), Some(100));
assert_eq!(trie.get(b"ayc"), Some(2));
assert_eq!(trie.get(b"azd"), Some(3));
assert_eq!(trie.get(b"bxe"), Some(4));
assert_eq!(trie.get(b"bxefg"), Some(500));
assert_eq!(trie.get(b"bxefh"), Some(6));
assert_eq!(trie.get(b"bxei"), Some(7));
assert_eq!(trie.get(b"bxeikl"), Some(8));
// Assert that some other items are not in the map
assert_eq!(trie.get(b"a"), None);
assert_eq!(trie.get(b"bx"), None);
assert_eq!(trie.get(b"xba"), None);
Structs§
- Internal iterator type for walking the strings contained in a ZeroTrie.
Enums§
- NodeType 🔒The node type. See the module-level docs for more explanation of the four node types.
Functions§
- Given a slice starting with an offset table, returns the trie for the given index.
- Version of
get_branch()
specialized for the casew == 0
for performance - Panics
- Steps one node into the trie, assuming all branch nodes are binary search and that there are no span nodes, using an index.
- Steps one node into the trie assuming all branch nodes are binary search and that there are no span nodes.
- Steps one node into the trie if the head node is a value node, returning the value. If the head node is not a value node, no change is made.