Expand description
§ZeroTrie Builder
There are two implementations of the ZeroTrie Builder:
- konst::ZeroTrieBuilderConst allows for human-readable const construction
- nonconst::ZeroTrieBuilder has the full feaure set but requires
alloc
The two builders follow the same algorithm but have different capabilities.
§Builder Algorithm Overview
The tries are built backwards, from the last node to the first node. The key step of the algorithm is determining what is the next node to prepend.
In the simple case of ZeroTrieSimpleAscii
, all nodes are binary-search, so if the input
strings are provided in lexicographic order, there is a simple, deterministic method for
identifying the next node. This insight is what enables us to make the const builder.
The builder works with the following intermediate state variables:
prefix_len
indicates the byte index we are currently processing.i
andj
bracket a window of strings in the input that share the same prefix.current_len
is the length in bytes of the current self-contained trie.lengths_stack
contains metadata for branch nodes.
What follows is a verbal explanation of the build steps for a trie containing:
- “” → 11
- “ad” → 22
- “adef” → 33
- “adghk” → 44
When a node is prepended, it is shown in boldface.
- Initialize the builder by setting
i=3
,j=4
,prefix_len=5
(the last string),current_len=0
, andlengths_stack
empty. Start the main loop. - Top of loop. The string at
i
is equal in length toprefix_len
, so we prepend our first node: a value node 44, which requires a 2-byte varint. Increasecurrent_len
to 2. - Reduce
prefix_len
to 4, read ourkey_ascii="k"
, and recalculatei
andj
(this calculation is a long chunk of code in the builder impls). Since there is no other string with the prefix “adgh”,i
andj
stay the same, we prepend an ASCII node “k”, increasecurrent_len
to 3, and continue the main loop. - Top of loop. The string at
i
is of length 5, butprefix_len
is 4, so there is no value node to prepend. - Reduce
prefix_len
to 3, read ourkey_ascii="h"
, and recalculatei
andj
. There are no other strings sharing the prefix “abg”, so we prepend an ASCII node “h”, increasecurrent_len
to 4, and continue the main loop. - Top of loop. There is still no value node to prepend.
- Reduce
prefix_len
to 2, read ourkey_ascii="g"
, and recalculatei
andj
. We find thati=1
andj=4
, the range of strings sharing the prefix “ad”. Sincei
orj
changed, proceed to evaluate the branch node. - The last branch byte
ascii_j
for this prefix is “g”, which is the same askey_ascii
, so we are the last target of a branch node. Push an entry ontolengths_stack
:BranchMeta { ascii: "g", cumulative_length: 4, local_length: 4, count: 1 }
. - The first branch byte
ascii_i
for this prefix is “e”, which is NOT equal tokey_ascii
, so we are not the first target of a branch node. We therefore start evaluating the string preceding where we were at the top of the current loop. We seti=2
,j=3
,prefix_len=4
(length of the string ati
), and continue the main loop. - Top of loop. Since the string at
i
is equal in length toprefix_len
, we prepend a value node 33 (which requires a 2-byte varint) and increasecurrent_len
to 2. - Reduce
prefix_len
to 3, read ourkey_ascii="f"
, and recalculatei
andj
. They stay the same, so we prepend an ASCII node “f”, increasecurrent_len
to 3, and continue the main loop. - Top of loop. No value node this time.
- Reduce
prefix_len
to 2, read ourkey_ascii="e"
, and recalculatei
andj
. They go back toi=1
andj=4
. - The last branch byte
ascii_j
for this prefix is “g”, which is NOT equal tokey_ascii
, so we are not the last target of a branch node. We peek at the entry at the front of the lengths stack and use it to push another entry onto the stack:BranchMeta { ascii: "e", cumulative_length: 7, local_length: 3, count: 2 }
- The first branch byte
ascii_i
for this prefix is “e”, which is the same askey_ascii
, wo we are the first target of a branch node. We can therefore proceed to prepend the metadata for the branch node. We peek at the top of the stack and find that there are 2 tries reachable from this branch and they have a total byte length of 5. We then pull off 2 entries from the stack into a local variablebranch_metas
. From here, we write out the offset table, lookup table, and branch head node, which are determined from the metadata entries. We setcurrent_len
to the length of the two tries plus the metadata, which happens to be 11. Then we return to the top of the main loop. - Top of loop. The string at
i
is length 2, which is the same asprefix_len
, so we prepend a value node 22 (2-byte varint) and increasecurrent_len
to 13. - Reduce
prefix_len
to 1, read ourkey_ascii="d"
, and recalculatei
andj
. They stay the same, so we prepend an ASCII node “d”, increasecurrent_len
to 14, and continue the main loop. - Top of loop. No value node this time.
- Reduce
prefix_len
to 0, read ourkey_ascii="a"
, and recalculatei
andj
. They change toi=0
andj=4
, since all strings have the empty string as a prefix. However,ascii_i
andascii_j
both equalkey_ascii
, so we prepend ASCII node “a”, increasecurrent_len
to 15, and continue the main loop. - Top of loop. The string at
i
is length 0, which is the same asprefix_len
, so we prepend a value node 11 and increasecurrent_len
to 16. - We can no longer reduce
prefix_len
, so our trie is complete.
§Perfect Hash Reordering
When the PHF is added to the mix, the main change is that the strings are no longer in sorted order when they are in the trie. To resolve this issue, when adding a branch node, the target tries are rearranged in-place in the buffer to be in the correct order for the PHF.
§Example
Here is the output of the trie described above.
use zerotrie::ZeroTrieSimpleAscii;
const DATA: [(&str, usize); 4] =
[("", 11), ("ad", 22), ("adef", 33), ("adghk", 44)];
// As demonstrated above, the required capacity for this trie is 16 bytes
const TRIE: ZeroTrieSimpleAscii<[u8; 16]> =
ZeroTrieSimpleAscii::from_sorted_str_tuples(&DATA);
assert_eq!(
TRIE.as_bytes(),
&[
0x8B, // value node 11
b'a', // ASCII node 'a'
b'd', // ASCII node 'd'
0x90, // value node 22 lead byte
0x06, // value node 22 trail byte
0xC2, // branch node 2
b'e', // first target of branch
b'g', // second target of branch
3, // offset
b'f', // ASCII node 'f'
0x90, // value node 33 lead byte
0x11, // value node 33 trail byte
b'h', // ASCII node 'h'
b'k', // ASCII node 'k'
0x90, // value node 44 lead byte
0x1C, // value node 44 trail byte
]
);
assert_eq!(TRIE.get(b""), Some(11));
assert_eq!(TRIE.get(b"ad"), Some(22));
assert_eq!(TRIE.get(b"adef"), Some(33));
assert_eq!(TRIE.get(b"adghk"), Some(44));
assert_eq!(TRIE.get(b"unknown"), None);