Struct html5ever::tendril::encoding_rs::Decoder

pub struct Decoder {
    pub(crate) encoding: &'static Encoding,
    pub(crate) variant: VariantDecoder,
    pub(crate) life_cycle: DecoderLifeCycle,
}

A converter that decodes a byte stream into Unicode according to a character encoding in a streaming (incremental) manner.

The various decode_* methods take an input buffer (src) and an output buffer (dst), both of which are caller-allocated. There are variants for both UTF-8 and UTF-16 output buffers.

A decode_* method decodes bytes from src into Unicode characters stored into dst until one of the following three things happens:

  1. A malformed byte sequence is encountered (*_without_replacement variants only).

  2. The output buffer has been filled so near capacity that the decoder cannot be sure that processing an additional byte of input wouldn’t cause so much output that the output buffer would overflow.

  3. All the input bytes have been processed.

The decode_* method then returns a tuple of a status indicating which one of the three reasons to return happened, how many input bytes were read, how many output code units (u8 when decoding into UTF-8 and u16 when decoding to UTF-16) were written (except when decoding into String, whose length change indicates this), and, in the case of the variants performing replacement, a boolean indicating whether an error was replaced with the REPLACEMENT CHARACTER during the call.

The number of bytes “written” is what’s logically written. Garbage may be written in the output buffer beyond the point logically written to. Therefore, if you wish to decode into an &mut str, you should use the methods that take an &mut str argument instead of the ones that take an &mut [u8] argument. The former take care of overwriting the trailing garbage to ensure the UTF-8 validity of the &mut str as a whole, but the latter don’t.

In the case of the *_without_replacement variants, the status is a DecoderResult enumeration (possibilities Malformed, OutputFull and InputEmpty corresponding to the three cases listed above).

In the case of methods whose name does not end with _without_replacement, malformed sequences are automatically replaced with the REPLACEMENT CHARACTER and errors do not cause the methods to return early. The status of these methods is a CoderResult enumeration (possibilities OutputFull and InputEmpty).

When decoding to UTF-8, the output buffer must have at least 4 bytes of space. When decoding to UTF-16, the output buffer must have at least two UTF-16 code units (u16) of space.

When decoding to UTF-8 without replacement, the methods are guaranteed not to return indicating that more output space is needed if the length of the output buffer is at least the length returned by max_utf8_buffer_length_without_replacement(). When decoding to UTF-8 with replacement, the length of the output buffer that guarantees the methods not to return indicating that more output space is needed is given by max_utf8_buffer_length(). When decoding to UTF-16 with or without replacement, the length of the output buffer that guarantees the methods not to return indicating that more output space is needed is given by max_utf16_buffer_length().

The output written into dst is guaranteed to be valid UTF-8 or UTF-16, and the output after each decode_* call is guaranteed to consist of complete characters. (I.e. the code unit sequence for the last character is guaranteed not to be split across output buffers.)

The boolean argument last indicates that the end of the stream is reached when all the bytes in src have been consumed.

A Decoder object can be used to incrementally decode a byte stream.

During the processing of a single stream, the caller must call decode_* zero or more times with last set to false and then call decode_* at least once with last set to true. If a call with last set to true returns InputEmpty, the processing of the stream has ended. Otherwise, the caller must call decode_* again with last set to true (or treat a Malformed result as a fatal error).

Once the stream has ended, the Decoder object must not be used anymore. That is, you need to create another one to process another stream.

When the decoder returns OutputFull or the decoder returns Malformed and the caller does not wish to treat it as a fatal error, the input buffer src may not have been completely consumed. In that case, the caller must pass the unconsumed contents of src to decode_* again upon the next call.

§Infinite loops

When converting with a fixed-size output buffer whose size is too small to accommodate one character or (when applicable) one numeric character reference of output, an infinite loop ensues. When converting with a fixed-size output buffer, it generally makes sense to make the buffer fairly large (e.g. a couple of kilobytes).

Fields§

§encoding: &'static Encoding

§variant: VariantDecoder

§life_cycle: DecoderLifeCycle

Implementations§

source§

impl Decoder

source

pub fn encoding(&self) -> &'static Encoding

The Encoding this Decoder is for.

BOM sniffing can change the return value of this method during the life of the decoder.

Available via the C wrapper.

source

pub fn max_utf8_buffer_length(&self, byte_length: usize) -> Option<usize>

Query the worst-case UTF-8 output size with replacement.

Returns the size of the output buffer in UTF-8 code units (u8) that will not overflow given the current state of the decoder and byte_length number of additional input bytes when decoding with errors handled by outputting a REPLACEMENT CHARACTER for each malformed sequence or None if usize would overflow.

Available via the C wrapper.

source

pub fn max_utf8_buffer_length_without_replacement( &self, byte_length: usize, ) -> Option<usize>

Query the worst-case UTF-8 output size without replacement.

Returns the size of the output buffer in UTF-8 code units (u8) that will not overflow given the current state of the decoder and byte_length number of additional input bytes when decoding without replacement error handling or None if usize would overflow.

Note that this value may be too small when decoding with replacement. Use max_utf8_buffer_length() for that case.

Available via the C wrapper.

source

pub fn decode_to_utf8( &mut self, src: &[u8], dst: &mut [u8], last: bool, ) -> (CoderResult, usize, usize, bool)

Incrementally decode a byte stream into UTF-8 with malformed sequences replaced with the REPLACEMENT CHARACTER.

See the documentation of the struct for documentation for decode_* methods collectively.

Available via the C wrapper.

source

pub fn decode_to_str( &mut self, src: &[u8], dst: &mut str, last: bool, ) -> (CoderResult, usize, usize, bool)

Incrementally decode a byte stream into UTF-8 with malformed sequences replaced with the REPLACEMENT CHARACTER with type system signaling of UTF-8 validity.

This method calls decode_to_utf8 and then zeroes out up to three bytes that aren’t logically part of the write in order to retain the UTF-8 validity even for the unwritten part of the buffer.

See the documentation of the struct for documentation for decode_* methods collectively.

Available to Rust only.

source

pub fn decode_to_string( &mut self, src: &[u8], dst: &mut String, last: bool, ) -> (CoderResult, usize, bool)

Incrementally decode a byte stream into UTF-8 with malformed sequences replaced with the REPLACEMENT CHARACTER using a String receiver.

Like the others, this method follows the logic that the output buffer is caller-allocated. This method treats the capacity of the String as the output limit. That is, this method guarantees not to cause a reallocation of the backing buffer of String.

The return value is a tuple that contains the CoderResult, the number of bytes read and a boolean indicating whether replacements were done. The number of bytes written is signaled via the length of the String changing.

See the documentation of the struct for documentation for decode_* methods collectively.

Available to Rust only and only with the alloc feature enabled (enabled by default).

source

pub fn decode_to_utf8_without_replacement( &mut self, src: &[u8], dst: &mut [u8], last: bool, ) -> (DecoderResult, usize, usize)

Incrementally decode a byte stream into UTF-8 without replacement.

See the documentation of the struct for documentation for decode_* methods collectively.

Available via the C wrapper.

source

pub fn decode_to_str_without_replacement( &mut self, src: &[u8], dst: &mut str, last: bool, ) -> (DecoderResult, usize, usize)

Incrementally decode a byte stream into UTF-8 with type system signaling of UTF-8 validity.

This method calls decode_to_utf8_without_replacement and then zeroes out up to three bytes that aren’t logically part of the write in order to retain the UTF-8 validity even for the unwritten part of the buffer.

See the documentation of the struct for documentation for decode_* methods collectively.

Available to Rust only.

source

pub fn decode_to_string_without_replacement( &mut self, src: &[u8], dst: &mut String, last: bool, ) -> (DecoderResult, usize)

Incrementally decode a byte stream into UTF-8 using a String receiver.

Like the others, this method follows the logic that the output buffer is caller-allocated. This method treats the capacity of the String as the output limit. That is, this method guarantees not to cause a reallocation of the backing buffer of String.

The return value is a pair that contains the DecoderResult and the number of bytes read. The number of bytes written is signaled via the length of the String changing.

See the documentation of the struct for documentation for decode_* methods collectively.

Available to Rust only and only with the alloc feature enabled (enabled by default).

source

pub fn max_utf16_buffer_length(&self, byte_length: usize) -> Option<usize>

Query the worst-case UTF-16 output size (with or without replacement).

Returns the size of the output buffer in UTF-16 code units (u16) that will not overflow given the current state of the decoder and byte_length number of additional input bytes or None if usize would overflow.

Since the REPLACEMENT CHARACTER fits into one UTF-16 code unit, the return value of this method applies also in the _without_replacement case.

Available via the C wrapper.

source

pub fn decode_to_utf16( &mut self, src: &[u8], dst: &mut [u16], last: bool, ) -> (CoderResult, usize, usize, bool)

Incrementally decode a byte stream into UTF-16 with malformed sequences replaced with the REPLACEMENT CHARACTER.

See the documentation of the struct for documentation for decode_* methods collectively.

Available via the C wrapper.

source

pub fn decode_to_utf16_without_replacement( &mut self, src: &[u8], dst: &mut [u16], last: bool, ) -> (DecoderResult, usize, usize)

Incrementally decode a byte stream into UTF-16 without replacement.

See the documentation of the struct for documentation for decode_* methods collectively.

Available via the C wrapper.

source

pub fn latin1_byte_compatible_up_to(&self, bytes: &[u8]) -> Option<usize>

Checks for compatibility with storing Unicode scalar values as unsigned bytes taking into account the state of the decoder.

Returns None if the decoder is not in a neutral state, including waiting for the BOM, or if the encoding is never Latin1-byte-compatible.

Otherwise returns the index of the first byte whose unsigned value doesn’t directly correspond to the decoded Unicode scalar value, or the length of the input if all bytes in the input decode directly to scalar values corresponding to the unsigned byte values.

Does not change the state of the decoder.

Do not use this unless you are supporting SpiderMonkey/V8-style string storage optimizations.

Available via the C wrapper.

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

source§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.