Module zerotrie::reader

source ·
Expand description

§Internal layout of ZeroTrie

A ZeroTrie is composed of a series of nodes stored in sequence in a byte slice.

There are 4 types of nodes:

  1. ASCII (0xxxxxxx): matches a literal ASCII byte.
  2. Span (101xxxxx): matches a span of non-ASCII bytes.
  3. Value (100xxxxx): associates a value with a string
  4. Branch (11xxxxxx): matches one of a set of bytes.

Span, Value, and Branch nodes contain a varint, which has different semantics for each:

  • Span varint: length of the span
  • Value varint: value associated with the string
  • Branch varint: number of edges in the branch and width of the offset table

If reading an ASCII, Span, or Branch node, one or more bytes are consumed from the input string. If the next byte(s) in the input string do not match the node, we return None. If reading a Value node, if the string is empty, return Some(value); otherwise, we skip the Value node and continue on to the next node.

When a node is consumed, a shorter, well-formed ZeroTrie remains.

§Basic Example

Here is an example ZeroTrie without branch nodes:

use zerotrie::ZeroTriePerfectHash;

let bytes = [
    b'a',       // ASCII literal
    0b10001010, // value 10
    b'b',       // ASCII literal
    0b10100011, // span of 3
    0x81,       // first byte in span
    0x91,       // second byte in span
    0xA1,       // third and final byte in span
    0b10000100, // value 4
];

let trie = ZeroTriePerfectHash::from_bytes(&bytes);

// First value: "a" → 10
assert_eq!(trie.get(b"a"), Some(10));

// Second value: "ab\x81\x91\xA1" → 4
assert_eq!(trie.get(b"ab\x81\x91\xA1"), Some(4));

// A few examples of strings that do NOT have values in the trie:
assert_eq!(trie.get(b"ab"), None);
assert_eq!(trie.get(b"b"), None);
assert_eq!(trie.get(b"b\x81\x91\xA1"), None);

§Branch Nodes

There are two types of branch nodes: binary search and perfect hash. ZeroTrieSimpleAscii contains only binary search nodes, whereas ZeroTriePerfectHash can contain either.

The head node of the branch has a varint that encodes two things:

  • Bottom 8 bits: number of edges in the branch (N); if N = 0, set N to 256
  • Bits 9 and 10: width of the offset table (W)

Note that N is always in the range [1, 256]. There can’t be more than 256 edges because there are only 256 unique u8 values.

A few examples of the head node of the branch:

  • 0b11000000: varint bits 0: N = 0 which means N = 256; W = 0
  • 0b11000110: varint bits 110: N = 6; W = 0
  • 0b11100000 0b00000101: varint bits 1000101: N = 69; W = 0
  • 0b11100010 0b00000000: varint bits 101000000: N = 64; W = 1

In ZeroTriePerfectHash, if N <= 15, the branch is assumed to be a binary search, and if N > 15, the branch is assumed to be a perfect hash.

§Binary Search Branch Nodes

A binary search branch node is used when:

  1. The trie is a ZeroTrieSimpleAscii, OR
  2. There are 15 or fewer items in the branch.

The head branch node is followed by N sorted bytes. When evaluating a branch node, one byte is consumed from the input. If it is one of the N sorted bytes (scanned using binary search), the index i of the byte within the list is used to index into the offset table (described below). If the byte is not in the list, the string is not in the trie, so return None.

§Perfect Hash Branch Nodes

A perfect hash branch node is used when:

  1. The trie is NOT a ZeroTrieSimpleAscii, AND
  2. There are 16 or more items in the branch.

The head branch node is followed by 1 byte containing parameter p, N bytes containing parameters q, and N bytes containing the bytes to match. From these parameters, either an index within the hash table i is resolved and used as input to index into the offset table (described below), or the value is determined to not be present and None is returned. For more detail on resolving the perfect hash function, see crate::byte_phf.

§Offset Tables

The offset table encodes the range of the remaining buffer containing the trie reachable from the byte matched in the branch node. Both types of branch nodes include an offset table followig the key lookup. Given the index i from the first step, the range [s_i, s_(i+1)) brackets the next step in the trie.

Offset tables utilize the W parameter stored in the branch head node. The special case when W == 0, with N - 1 bytes, is easiest to understand:

Offset table, W = 0: [s_1, s_2, ..., s_(N-1)]

Note that s_0 is always 0 and s_N is always the length of the remaining slice, so those values are not explicitly included in the offset table.

When W > 0, the high and low bits of the offsets are in separate bytes, arranged as follows:

Generalized offset table: [a_1, a_2, ..., a_(N-1), b_1, b_2, ..., b_(N-1), c_1, ...]

where s_i = (a_i << 8 + b_i) << 8 + c_i ... (high bits first, low bits last)

§Advanced Example

The following trie encodes the following map. It has multiple varints and branch nodes, which are all binary search with W = 0. Note that there is a value for the empty string.

  • “” → 0
  • “axb” → 100
  • “ayc” → 2
  • “azd” → 3
  • “bxe” → 4
  • “bxefg” → 500
  • “bxefh” → 6
  • “bxei” → 7
  • “bxeikl” → 8
use zerotrie::ZeroTrieSimpleAscii;

let bytes = [
    0b10000000, // value 0
    0b11000010, // branch of 2
    b'a',       //
    b'b',       //
    13,         //
    0b11000011, // start of 'a' subtree: branch of 3
    b'x',       //
    b'y',       //
    b'z',       //
    3,          //
    5,          //
    b'b',       //
    0b10010000, // value 100 (lead)
    0x54,       // value 100 (trail)
    b'c',       //
    0b10000010, // value 2
    b'd',       //
    0b10000011, // value 3
    b'x',       // start of 'b' subtree
    b'e',       //
    0b10000100, // value 4
    0b11000010, // branch of 2
    b'f',       //
    b'i',       //
    7,          //
    0b11000010, // branch of 2
    b'g',       //
    b'h',       //
    2,          //
    0b10010011, // value 500 (lead)
    0x64,       // value 500 (trail)
    0b10000110, // value 6
    0b10000111, // value 7
    b'k',       //
    b'l',       //
    0b10001000, // value 8
];

let trie = ZeroTrieSimpleAscii::from_bytes(&bytes);

// Assert that the specified items are in the map
assert_eq!(trie.get(b""), Some(0));
assert_eq!(trie.get(b"axb"), Some(100));
assert_eq!(trie.get(b"ayc"), Some(2));
assert_eq!(trie.get(b"azd"), Some(3));
assert_eq!(trie.get(b"bxe"), Some(4));
assert_eq!(trie.get(b"bxefg"), Some(500));
assert_eq!(trie.get(b"bxefh"), Some(6));
assert_eq!(trie.get(b"bxei"), Some(7));
assert_eq!(trie.get(b"bxeikl"), Some(8));

// Assert that some other items are not in the map
assert_eq!(trie.get(b"a"), None);
assert_eq!(trie.get(b"bx"), None);
assert_eq!(trie.get(b"xba"), None);

Structs§

  • Internal iterator type for walking the strings contained in a ZeroTrie.

Enums§

  • NodeType 🔒
    The node type. See the module-level docs for more explanation of the four node types.

Functions§

  • byte_type 🔒
  • get_branch 🔒
    Given a slice starting with an offset table, returns the trie for the given index.
  • Version of get_branch() specialized for the case w == 0 for performance
  • Panics
  • Steps one node into the trie, assuming all branch nodes are binary search and that there are no span nodes, using an index.
  • Steps one node into the trie assuming all branch nodes are binary search and that there are no span nodes.
  • take_value 🔒
    Steps one node into the trie if the head node is a value node, returning the value. If the head node is not a value node, no change is made.