Struct Builder

Source
pub struct Builder<W> {
    wtr: CountingWriter<W>,
    unfinished: UnfinishedNodes,
    registry: Registry,
    last: Option<Vec<u8>>,
    last_addr: CompiledAddr,
    len: usize,
}
Expand description

A builder for creating a finite state transducer.

This is not your average everyday builder. It has two important qualities that make it a bit unique from what you might expect:

  1. All keys must be added in lexicographic order. Adding a key out of order will result in an error. Additionally, adding a duplicate key with an output value will also result in an error. That is, once a key is associated with a value, that association can never be modified or deleted.
  2. The representation of an fst is streamed to any io::Write as it is built. For an in memory representation, this can be a Vec<u8>.

Point (2) is especially important because it means that an fst can be constructed without storing the entire fst in memory. Namely, since it works with any io::Write, it can be streamed directly to a file.

With that said, the builder does use memory, but memory usage is bounded to a constant size. The amount of memory used trades off with the compression ratio. Currently, the implementation hard codes this trade off which can result in about 5-20MB of heap usage during construction. (N.B. Guaranteeing a maximal compression ratio requires memory proportional to the size of the fst, which defeats some of the benefit of streaming it to disk. In practice, a small bounded amount of memory achieves close-to-minimal compression ratios.)

The algorithmic complexity of fst construction is O(n) where n is the number of elements added to the fst.

Fields§

§wtr: CountingWriter<W>

The FST raw data is written directly to wtr.

No internal buffering is done.

§unfinished: UnfinishedNodes

The stack of unfinished nodes.

An unfinished node is a node that could potentially have a new transition added to it when a new word is added to the dictionary.

§registry: Registry

A map of finished nodes.

A finished node is one that has been compiled and written to wtr. After this point, the node is considered immutable and will never change.

§last: Option<Vec<u8>>

The last word added.

This is used to enforce the invariant that words are added in sorted order.

§last_addr: CompiledAddr

The address of the last compiled node.

This is used to optimize states with one transition that point to the previously compiled node. (The previously compiled node in this case actually corresponds to the next state for the transition, since states are compiled in reverse.)

§len: usize

The number of keys added.

Implementations§

Source§

impl Builder<Vec<u8>>

Source

pub fn memory() -> Builder<Vec<u8>>

Create a builder that builds an fst in memory.

Source

pub fn into_fst(self) -> Fst<Vec<u8>>

Finishes construction of the FST and returns it.

Source§

impl<W: Write> Builder<W>

Source

pub fn new(wtr: W) -> Result<Builder<W>>

Create a builder that builds an fst by writing it to wtr in a streaming fashion.

Source

pub fn new_type(wtr: W, ty: FstType) -> Result<Builder<W>>

The same as new, except it sets the type of the fst to the type given.

Source

pub fn add<B>(&mut self, bs: B) -> Result<()>
where B: AsRef<[u8]>,

Adds a byte string to this FST with a zero output value.

Source

pub fn insert<B>(&mut self, bs: B, val: u64) -> Result<()>
where B: AsRef<[u8]>,

Insert a new key-value pair into the fst.

Keys must be convertible to byte strings. Values must be a u64, which is a restriction of the current implementation of finite state transducers. (Values may one day be expanded to other types.)

If a key is inserted that is less than or equal to any previous key added, then an error is returned. Similarly, if there was a problem writing to the underlying writer, an error is returned.

Source

pub fn extend_iter<T, I>(&mut self, iter: I) -> Result<()>
where T: AsRef<[u8]>, I: IntoIterator<Item = (T, Output)>,

Calls insert on each item in the iterator.

If an error occurred while adding an element, processing is stopped and the error is returned.

If a key is inserted that is less than or equal to any previous key added, then an error is returned. Similarly, if there was a problem writing to the underlying writer, an error is returned.

Source

pub fn extend_stream<'f, I, S>(&mut self, stream: I) -> Result<()>
where I: for<'a> IntoStreamer<'a, Into = S, Item = (&'a [u8], Output)>, S: 'f + for<'a> Streamer<'a, Item = (&'a [u8], Output)>,

Calls insert on each item in the stream.

Note that unlike extend_iter, this is not generic on the items in the stream.

If a key is inserted that is less than or equal to any previous key added, then an error is returned. Similarly, if there was a problem writing to the underlying writer, an error is returned.

Source

pub fn finish(self) -> Result<()>

Finishes the construction of the fst and flushes the underlying writer. After completion, the data written to W may be read using one of Fst’s constructor methods.

Source

pub fn into_inner(self) -> Result<W>

Just like finish, except it returns the underlying writer after flushing it.

Source

fn insert_output<B>(&mut self, bs: B, out: Option<Output>) -> Result<()>
where B: AsRef<[u8]>,

Source

fn compile_from(&mut self, istate: usize) -> Result<()>

Source

fn compile(&mut self, node: &BuilderNode) -> Result<CompiledAddr>

Source

fn check_last_key(&mut self, bs: &[u8], check_dupe: bool) -> Result<()>

Source

pub fn get_ref(&self) -> &W

Gets a reference to the underlying writer.

Source

pub fn bytes_written(&self) -> u64

Returns the number of bytes written to the underlying writer

Auto Trait Implementations§

§

impl<W> Freeze for Builder<W>
where W: Freeze,

§

impl<W> RefUnwindSafe for Builder<W>
where W: RefUnwindSafe,

§

impl<W> Send for Builder<W>
where W: Send,

§

impl<W> Sync for Builder<W>
where W: Sync,

§

impl<W> Unpin for Builder<W>
where W: Unpin,

§

impl<W> UnwindSafe for Builder<W>
where W: UnwindSafe,

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.