pub struct Builder<W> {
wtr: CountingWriter<W>,
unfinished: UnfinishedNodes,
registry: Registry,
last: Option<Vec<u8>>,
last_addr: CompiledAddr,
len: usize,
}
Expand description
A builder for creating a finite state transducer.
This is not your average everyday builder. It has two important qualities that make it a bit unique from what you might expect:
- All keys must be added in lexicographic order. Adding a key out of order will result in an error. Additionally, adding a duplicate key with an output value will also result in an error. That is, once a key is associated with a value, that association can never be modified or deleted.
- The representation of an fst is streamed to any
io::Write
as it is built. For an in memory representation, this can be aVec<u8>
.
Point (2) is especially important because it means that an fst can be
constructed without storing the entire fst in memory. Namely, since it
works with any io::Write
, it can be streamed directly to a file.
With that said, the builder does use memory, but memory usage is bounded to a constant size. The amount of memory used trades off with the compression ratio. Currently, the implementation hard codes this trade off which can result in about 5-20MB of heap usage during construction. (N.B. Guaranteeing a maximal compression ratio requires memory proportional to the size of the fst, which defeats some of the benefit of streaming it to disk. In practice, a small bounded amount of memory achieves close-to-minimal compression ratios.)
The algorithmic complexity of fst construction is O(n)
where n
is the
number of elements added to the fst.
Fields§
§wtr: CountingWriter<W>
The FST raw data is written directly to wtr
.
No internal buffering is done.
unfinished: UnfinishedNodes
The stack of unfinished nodes.
An unfinished node is a node that could potentially have a new transition added to it when a new word is added to the dictionary.
registry: Registry
A map of finished nodes.
A finished node is one that has been compiled and written to wtr
.
After this point, the node is considered immutable and will never
change.
last: Option<Vec<u8>>
The last word added.
This is used to enforce the invariant that words are added in sorted order.
last_addr: CompiledAddr
The address of the last compiled node.
This is used to optimize states with one transition that point to the previously compiled node. (The previously compiled node in this case actually corresponds to the next state for the transition, since states are compiled in reverse.)
len: usize
The number of keys added.
Implementations§
Source§impl<W: Write> Builder<W>
impl<W: Write> Builder<W>
Sourcepub fn new(wtr: W) -> Result<Builder<W>>
pub fn new(wtr: W) -> Result<Builder<W>>
Create a builder that builds an fst by writing it to wtr
in a
streaming fashion.
Sourcepub fn new_type(wtr: W, ty: FstType) -> Result<Builder<W>>
pub fn new_type(wtr: W, ty: FstType) -> Result<Builder<W>>
The same as new
, except it sets the type of the fst to the type
given.
Sourcepub fn add<B>(&mut self, bs: B) -> Result<()>
pub fn add<B>(&mut self, bs: B) -> Result<()>
Adds a byte string to this FST with a zero output value.
Sourcepub fn insert<B>(&mut self, bs: B, val: u64) -> Result<()>
pub fn insert<B>(&mut self, bs: B, val: u64) -> Result<()>
Insert a new key-value pair into the fst.
Keys must be convertible to byte strings. Values must be a u64
, which
is a restriction of the current implementation of finite state
transducers. (Values may one day be expanded to other types.)
If a key is inserted that is less than or equal to any previous key added, then an error is returned. Similarly, if there was a problem writing to the underlying writer, an error is returned.
Sourcepub fn extend_iter<T, I>(&mut self, iter: I) -> Result<()>
pub fn extend_iter<T, I>(&mut self, iter: I) -> Result<()>
Calls insert on each item in the iterator.
If an error occurred while adding an element, processing is stopped and the error is returned.
If a key is inserted that is less than or equal to any previous key added, then an error is returned. Similarly, if there was a problem writing to the underlying writer, an error is returned.
Sourcepub fn extend_stream<'f, I, S>(&mut self, stream: I) -> Result<()>
pub fn extend_stream<'f, I, S>(&mut self, stream: I) -> Result<()>
Calls insert on each item in the stream.
Note that unlike extend_iter
, this is not generic on the items in
the stream.
If a key is inserted that is less than or equal to any previous key added, then an error is returned. Similarly, if there was a problem writing to the underlying writer, an error is returned.
Sourcepub fn finish(self) -> Result<()>
pub fn finish(self) -> Result<()>
Finishes the construction of the fst and flushes the underlying
writer. After completion, the data written to W
may be read using
one of Fst
’s constructor methods.
Sourcepub fn into_inner(self) -> Result<W>
pub fn into_inner(self) -> Result<W>
Just like finish
, except it returns the underlying writer after
flushing it.
fn insert_output<B>(&mut self, bs: B, out: Option<Output>) -> Result<()>
fn compile_from(&mut self, istate: usize) -> Result<()>
fn compile(&mut self, node: &BuilderNode) -> Result<CompiledAddr>
fn check_last_key(&mut self, bs: &[u8], check_dupe: bool) -> Result<()>
Sourcepub fn bytes_written(&self) -> u64
pub fn bytes_written(&self) -> u64
Returns the number of bytes written to the underlying writer