Struct xml5ever::tendril::encoding_rs::Encoder

pub struct Encoder {
    pub(crate) encoding: &'static Encoding,
    pub(crate) variant: VariantEncoder,
}

A converter that encodes a Unicode stream into bytes according to a character encoding in a streaming (incremental) manner.

The various encode_* methods take an input buffer (src) and an output buffer (dst), both of which are caller-allocated. There are variants for both UTF-8 and UTF-16 input buffers.

An encode_* method encodes characters from src into bytes stored into dst until one of the following three things happens:

  1. An unmappable character is encountered (*_without_replacement variants only).

  2. The output buffer has been filled so near capacity that the encoder cannot be sure that processing an additional character of input wouldn’t cause so much output that the output buffer would overflow.

  3. All the input characters have been processed.

The encode_* method then returns a tuple of a status indicating which one of the three reasons to return happened, how many input code units (u8 when encoding from UTF-8 and u16 when encoding from UTF-16) were read, how many output bytes were written (except when encoding into Vec<u8>, whose length change indicates this), and in the case of the variants that perform replacement, a boolean indicating whether an unmappable character was replaced with a numeric character reference during the call.

The number of bytes “written” is what’s logically written. Garbage may be written in the output buffer beyond the point logically written to.

In the case of the methods whose name ends with *_without_replacement, the status is an EncoderResult enumeration (possibilities Unmappable, OutputFull and InputEmpty corresponding to the three cases listed above).

In the case of methods whose name does not end with *_without_replacement, unmappable characters are automatically replaced with the corresponding numeric character references and unmappable characters do not cause the methods to return early.

When encoding from UTF-8 without replacement, the methods are guaranteed not to return indicating that more output space is needed if the length of the output buffer is at least the length returned by max_buffer_length_from_utf8_without_replacement(). When encoding from UTF-8 with replacement, the length of the output buffer that guarantees the methods not to return indicating that more output space is needed in the absence of unmappable characters is given by max_buffer_length_from_utf8_if_no_unmappables(). When encoding from UTF-16 without replacement, the methods are guaranteed not to return indicating that more output space is needed if the length of the output buffer is at least the length returned by max_buffer_length_from_utf16_without_replacement(). When encoding from UTF-16 with replacement, the length of the output buffer that guarantees the methods not to return indicating that more output space is needed in the absence of unmappable characters is given by max_buffer_length_from_utf16_if_no_unmappables(). When encoding with replacement, applications are not expected to size the buffer for the worst case ahead of time but to resize the buffer if there are unmappable characters. This is why max length queries are only available for the case where there are no unmappable characters.

When encoding from UTF-8, each src buffer must be valid UTF-8. (When calling from Rust, the type system takes care of this.) When encoding from UTF-16, unpaired surrogates in the input are treated as U+FFFD REPLACEMENT CHARACTERS. Therefore, in order for astral characters not to turn into a pair of REPLACEMENT CHARACTERS, the caller must ensure that surrogate pairs are not split across input buffer boundaries.

After an encode_* call returns, the output produced so far, taken as a whole from the start of the stream, is guaranteed to consist of a valid byte sequence in the target encoding. (I.e. the code unit sequence for a character is guaranteed not to be split across output buffers. However, due to the stateful nature of ISO-2022-JP, the stream needs to be considered from the start for it to be valid. For other encodings, the validity holds on a per-output buffer basis.)

The boolean argument last indicates that the end of the stream is reached when all the characters in src have been consumed. This argument is needed for ISO-2022-JP and is ignored for other encodings.

An Encoder object can be used to incrementally encode a stream of characters into bytes.

During the processing of a single stream, the caller must call encode_* zero or more times with last set to false and then call encode_* at least once with last set to true. If an encode_* call with last set to true returns InputEmpty, the processing of the stream has ended. Otherwise, the caller must call encode_* again with last set to true (or treat an Unmappable result as a fatal error).

Once the stream has ended, the Encoder object must not be used anymore. That is, you need to create another one to process another stream.

When the encoder returns OutputFull or the encoder returns Unmappable and the caller does not wish to treat it as a fatal error, the input buffer src may not have been completely consumed. In that case, the caller must pass the unconsumed contents of src to encode_* again upon the next call.

§Infinite loops

When converting with a fixed-size output buffer whose size is too small to accommodate one character of output, an infinite loop ensues. When converting with a fixed-size output buffer, it generally makes sense to make the buffer fairly large (e.g. a couple of kilobytes).

Fields§

§encoding: &'static Encoding

§variant: VariantEncoder

Implementations§

impl Encoder

pub fn encoding(&self) -> &'static Encoding

The Encoding this Encoder is for.

pub fn has_pending_state(&self) -> bool

Returns true if this is an ISO-2022-JP encoder that’s not in the ASCII state and false otherwise.

pub fn max_buffer_length_from_utf8_if_no_unmappables(&self, byte_length: usize) -> Option<usize>

Query the worst-case output size when encoding from UTF-8 with replacement.

Returns the size of the output buffer in bytes that will not overflow given the current state of the encoder and byte_length number of additional input code units if there are no unmappable characters in the input or None if usize would overflow.

Available via the C wrapper.

pub fn max_buffer_length_from_utf8_without_replacement(&self, byte_length: usize) -> Option<usize>

Query the worst-case output size when encoding from UTF-8 without replacement.

Returns the size of the output buffer in bytes that will not overflow given the current state of the encoder and byte_length number of additional input code units or None if usize would overflow.

Available via the C wrapper.

pub fn encode_from_utf8(&mut self, src: &str, dst: &mut [u8], last: bool) -> (CoderResult, usize, usize, bool)

Incrementally encode into a byte stream from UTF-8 with unmappable characters replaced with HTML (decimal) numeric character references.

See the documentation of the struct for documentation for encode_* methods collectively.

Available via the C wrapper.

pub fn encode_from_utf8_to_vec(&mut self, src: &str, dst: &mut Vec<u8>, last: bool) -> (CoderResult, usize, bool)

Incrementally encode into a byte stream from UTF-8 with unmappable characters replaced with HTML (decimal) numeric character references.

See the documentation of the struct for documentation for encode_* methods collectively.

Available to Rust only and only with the alloc feature enabled (enabled by default).

pub fn encode_from_utf8_without_replacement(&mut self, src: &str, dst: &mut [u8], last: bool) -> (EncoderResult, usize, usize)

Incrementally encode into a byte stream from UTF-8 without replacement.

See the documentation of the struct for documentation for encode_* methods collectively.

Available via the C wrapper.

pub fn encode_from_utf8_to_vec_without_replacement(&mut self, src: &str, dst: &mut Vec<u8>, last: bool) -> (EncoderResult, usize)

Incrementally encode into a byte stream from UTF-8 without replacement.

See the documentation of the struct for documentation for encode_* methods collectively.

Available to Rust only and only with the alloc feature enabled (enabled by default).

pub fn max_buffer_length_from_utf16_if_no_unmappables(&self, u16_length: usize) -> Option<usize>

Query the worst-case output size when encoding from UTF-16 with replacement.

Returns the size of the output buffer in bytes that will not overflow given the current state of the encoder and u16_length number of additional input code units if there are no unmappable characters in the input or None if usize would overflow.

Available via the C wrapper.

pub fn max_buffer_length_from_utf16_without_replacement(&self, u16_length: usize) -> Option<usize>

Query the worst-case output size when encoding from UTF-16 without replacement.

Returns the size of the output buffer in bytes that will not overflow given the current state of the encoder and u16_length number of additional input code units or None if usize would overflow.

Available via the C wrapper.

pub fn encode_from_utf16(&mut self, src: &[u16], dst: &mut [u8], last: bool) -> (CoderResult, usize, usize, bool)

Incrementally encode into a byte stream from UTF-16 with unmappable characters replaced with HTML (decimal) numeric character references.

See the documentation of the struct for documentation for encode_* methods collectively.

Available via the C wrapper.

pub fn encode_from_utf16_without_replacement(&mut self, src: &[u16], dst: &mut [u8], last: bool) -> (EncoderResult, usize, usize)

Incrementally encode into a byte stream from UTF-16 without replacement.

See the documentation of the struct for documentation for encode_* methods collectively.

Available via the C wrapper.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.