Module zerotrie::builder

source ·
Expand description

§ZeroTrie Builder

There are two implementations of the ZeroTrie Builder:

The two builders follow the same algorithm but have different capabilities.

§Builder Algorithm Overview

The tries are built backwards, from the last node to the first node. The key step of the algorithm is determining what is the next node to prepend.

In the simple case of ZeroTrieSimpleAscii, all nodes are binary-search, so if the input strings are provided in lexicographic order, there is a simple, deterministic method for identifying the next node. This insight is what enables us to make the const builder.

The builder works with the following intermediate state variables:

  • prefix_len indicates the byte index we are currently processing.
  • i and j bracket a window of strings in the input that share the same prefix.
  • current_len is the length in bytes of the current self-contained trie.
  • lengths_stack contains metadata for branch nodes.

What follows is a verbal explanation of the build steps for a trie containing:

  • “” → 11
  • “ad” → 22
  • “adef” → 33
  • “adghk” → 44

When a node is prepended, it is shown in boldface.

  1. Initialize the builder by setting i=3, j=4, prefix_len=5 (the last string), current_len=0, and lengths_stack empty. Start the main loop.
  2. Top of loop. The string at i is equal in length to prefix_len, so we prepend our first node: a value node 44, which requires a 2-byte varint. Increase current_len to 2.
  3. Reduce prefix_len to 4, read our key_ascii="k", and recalculate i and j (this calculation is a long chunk of code in the builder impls). Since there is no other string with the prefix “adgh”, i and j stay the same, we prepend an ASCII node “k”, increase current_len to 3, and continue the main loop.
  4. Top of loop. The string at i is of length 5, but prefix_len is 4, so there is no value node to prepend.
  5. Reduce prefix_len to 3, read our key_ascii="h", and recalculate i and j. There are no other strings sharing the prefix “abg”, so we prepend an ASCII node “h”, increase current_len to 4, and continue the main loop.
  6. Top of loop. There is still no value node to prepend.
  7. Reduce prefix_len to 2, read our key_ascii="g", and recalculate i and j. We find that i=1 and j=4, the range of strings sharing the prefix “ad”. Since i or j changed, proceed to evaluate the branch node.
  8. The last branch byte ascii_j for this prefix is “g”, which is the same as key_ascii, so we are the last target of a branch node. Push an entry onto lengths_stack: BranchMeta { ascii: "g", cumulative_length: 4, local_length: 4, count: 1 }.
  9. The first branch byte ascii_i for this prefix is “e”, which is NOT equal to key_ascii, so we are not the first target of a branch node. We therefore start evaluating the string preceding where we were at the top of the current loop. We set i=2, j=3, prefix_len=4 (length of the string at i), and continue the main loop.
  10. Top of loop. Since the string at i is equal in length to prefix_len, we prepend a value node 33 (which requires a 2-byte varint) and increase current_len to 2.
  11. Reduce prefix_len to 3, read our key_ascii="f", and recalculate i and j. They stay the same, so we prepend an ASCII node “f”, increase current_len to 3, and continue the main loop.
  12. Top of loop. No value node this time.
  13. Reduce prefix_len to 2, read our key_ascii="e", and recalculate i and j. They go back to i=1 and j=4.
  14. The last branch byte ascii_j for this prefix is “g”, which is NOT equal to key_ascii, so we are not the last target of a branch node. We peek at the entry at the front of the lengths stack and use it to push another entry onto the stack: BranchMeta { ascii: "e", cumulative_length: 7, local_length: 3, count: 2 }
  15. The first branch byte ascii_i for this prefix is “e”, which is the same as key_ascii, wo we are the first target of a branch node. We can therefore proceed to prepend the metadata for the branch node. We peek at the top of the stack and find that there are 2 tries reachable from this branch and they have a total byte length of 5. We then pull off 2 entries from the stack into a local variable branch_metas. From here, we write out the offset table, lookup table, and branch head node, which are determined from the metadata entries. We set current_len to the length of the two tries plus the metadata, which happens to be 11. Then we return to the top of the main loop.
  16. Top of loop. The string at i is length 2, which is the same as prefix_len, so we prepend a value node 22 (2-byte varint) and increase current_len to 13.
  17. Reduce prefix_len to 1, read our key_ascii="d", and recalculate i and j. They stay the same, so we prepend an ASCII node “d”, increase current_len to 14, and continue the main loop.
  18. Top of loop. No value node this time.
  19. Reduce prefix_len to 0, read our key_ascii="a", and recalculate i and j. They change to i=0 and j=4, since all strings have the empty string as a prefix. However, ascii_i and ascii_j both equal key_ascii, so we prepend ASCII node “a”, increase current_len to 15, and continue the main loop.
  20. Top of loop. The string at i is length 0, which is the same as prefix_len, so we prepend a value node 11 and increase current_len to 16.
  21. We can no longer reduce prefix_len, so our trie is complete.

§Perfect Hash Reordering

When the PHF is added to the mix, the main change is that the strings are no longer in sorted order when they are in the trie. To resolve this issue, when adding a branch node, the target tries are rearranged in-place in the buffer to be in the correct order for the PHF.

§Example

Here is the output of the trie described above.

use zerotrie::ZeroTrieSimpleAscii;

const DATA: [(&str, usize); 4] =
    [("", 11), ("ad", 22), ("adef", 33), ("adghk", 44)];

// As demonstrated above, the required capacity for this trie is 16 bytes
const TRIE: ZeroTrieSimpleAscii<[u8; 16]> =
    ZeroTrieSimpleAscii::from_sorted_str_tuples(&DATA);

assert_eq!(
    TRIE.as_bytes(),
    &[
        0x8B, // value node 11
        b'a', // ASCII node 'a'
        b'd', // ASCII node 'd'
        0x90, // value node 22 lead byte
        0x06, // value node 22 trail byte
        0xC2, // branch node 2
        b'e', // first target of branch
        b'g', // second target of branch
        3,    // offset
        b'f', // ASCII node 'f'
        0x90, // value node 33 lead byte
        0x11, // value node 33 trail byte
        b'h', // ASCII node 'h'
        b'k', // ASCII node 'k'
        0x90, // value node 44 lead byte
        0x1C, // value node 44 trail byte
    ]
);

assert_eq!(TRIE.get(b""), Some(11));
assert_eq!(TRIE.get(b"ad"), Some(22));
assert_eq!(TRIE.get(b"adef"), Some(33));
assert_eq!(TRIE.get(b"adghk"), Some(44));
assert_eq!(TRIE.get(b"unknown"), None);

Modules§