Struct icu_segmenter::grapheme::GraphemeClusterSegmenter

pub struct GraphemeClusterSegmenter {
    payload: DataPayload<GraphemeClusterBreakDataV1Marker>,
}

Segments a string into grapheme clusters.

Supports loading grapheme cluster break data, and creating grapheme cluster break iterators for different string encodings.

§Examples

Segment a string:

use icu::segmenter::GraphemeClusterSegmenter;
let segmenter = GraphemeClusterSegmenter::new();

let breakpoints: Vec<usize> = segmenter.segment_str("Hello 🗺").collect();
// World Map (U+1F5FA) is encoded in four bytes in UTF-8.
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 10]);

Segment a Latin-1 byte string:

use icu::segmenter::GraphemeClusterSegmenter;
let segmenter = GraphemeClusterSegmenter::new();

let breakpoints: Vec<usize> =
    segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]);

Successive boundaries can be used to retrieve the grapheme clusters. In particular, the first boundary is always 0, and the last one is the length of the segmented text in code units.

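use icu::segmenter::GraphemeClusterSegmenter;
use itertools::Itertools;
let segmenter = GraphemeClusterSegmenter::new();

let text = "मांजर";
// Pair each boundary with its successor and slice the text between them.
let grapheme_clusters: Vec<&str> = segmenter
    .segment_str(text)
    .tuple_windows()
    .map(|(i, j)| &text[i..j])
    .collect();
assert_eq!(&grapheme_clusters, &["मां", "ज", "र"]);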
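If pulling in itertools is not desirable, the same pairing of successive boundaries can be sketched with `slice::windows` from the standard library. The helper name `clusters_from_boundaries` is hypothetical, and the boundary list is hard-coded from the "Hello 🗺" example above so that the sketch compiles without the icu crate; in real use the boundaries would be collected from `segment_str`.

```rust
/// Slices `text` between each pair of successive boundaries.
/// Hypothetical helper: assumes `boundaries` is sorted and that every
/// offset falls on a UTF-8 character boundary of `text`.
fn clusters_from_boundaries<'a>(text: &'a str, boundaries: &[usize]) -> Vec<&'a str> {
    boundaries
        .windows(2)
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}

fn main() {
    // Boundaries copied from the "Hello 🗺" example; normally they would
    // come from `segmenter.segment_str(text).collect::<Vec<usize>>()`.
    let text = "Hello 🗺";
    let boundaries = [0, 1, 2, 3, 4, 5, 6, 10];

    let clusters = clusters_from_boundaries(text, &boundaries);
    assert_eq!(clusters, ["H", "e", "l", "l", "o", " ", "🗺"]);

    // An empty string has the single boundary 0, hence no clusters.
    assert!(clusters_from_boundaries("", &[0]).is_empty());
}
```

Because the first boundary is 0 and the last is the text length, the windows cover the whole string with no gaps.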

This segmenter applies all rules provided to the constructor. If the data supplied by the provider comprises all grapheme cluster boundary rules from Unicode Standard Annex #29, Unicode Text Segmentation, as the default data does (both test data and data produced by icu_datagen), the segment_* functions return extended grapheme cluster boundaries rather than legacy grapheme cluster boundaries. See Section 3, Grapheme Cluster Boundaries, and Table 1a, Sample Grapheme Clusters, in that annex.

use icu::segmenter::GraphemeClusterSegmenter;
let segmenter = GraphemeClusterSegmenter::new();

// நி (TAMIL LETTER NA, TAMIL VOWEL SIGN I) is an extended grapheme cluster,
// but not a legacy grapheme cluster.
let ni = "நி";
let egc_boundaries: Vec<usize> = segmenter.segment_str(ni).collect();
assert_eq!(&egc_boundaries, &[0, ni.len()]);
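To make the assertion above concrete, the following std-only sketch inspects the same string without the segmenter. It shows that நி consists of two code points and six UTF-8 bytes, so the single extended grapheme cluster reported as `[0, ni.len()]` spans both code points. The helper name `code_points_and_bytes` is hypothetical.

```rust
/// Returns (number of code points, number of UTF-8 bytes) for `s`.
/// Hypothetical helper used only for illustration.
fn code_points_and_bytes(s: &str) -> (usize, usize) {
    (s.chars().count(), s.len())
}

fn main() {
    // U+0BA8 TAMIL LETTER NA followed by U+0BBF TAMIL VOWEL SIGN I.
    let ni = "\u{0BA8}\u{0BBF}";

    // Two code points, each encoded in three UTF-8 bytes.
    assert_eq!(code_points_and_bytes(ni), (2, 6));

    // The boundaries [0, ni.len()] from the example are therefore [0, 6].
    assert_eq!([0, ni.len()], [0, 6]);
}
```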

Fields§

§payload: DataPayload<GraphemeClusterBreakDataV1Marker>

Implementations§

impl GraphemeClusterSegmenter

pub fn new() -> Self

Constructs a GraphemeClusterSegmenter with an invariant locale from compiled data.

Enabled with the compiled_data Cargo feature.

📚 Help choosing a constructor

pub fn try_new_with_any_provider( provider: &(impl AnyProvider + ?Sized), ) -> Result<Self, SegmenterError>

A version of Self::new that uses custom data provided by an AnyProvider.

📚 Help choosing a constructor

pub fn try_new_unstable<D>(provider: &D) -> Result<Self, SegmenterError>

A version of Self::new that uses custom data provided by a DataProvider.

📚 Help choosing a constructor

⚠️ The bounds on provider may change over time, including in SemVer minor releases.

pub fn segment_str<'l, 's>( &'l self, input: &'s str, ) -> GraphemeClusterBreakIteratorUtf8<'l, 's>

Creates a grapheme cluster break iterator for a str (a UTF-8 string).

pub(crate) fn new_and_segment_str<'l, 's>( input: &'s str, payload: &'l RuleBreakDataV1<'l>, ) -> GraphemeClusterBreakIteratorUtf8<'l, 's>

Creates a grapheme cluster break iterator from grapheme cluster rule payload.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.

pub fn segment_utf8<'l, 's>( &'l self, input: &'s [u8], ) -> GraphemeClusterBreakIteratorPotentiallyIllFormedUtf8<'l, 's>

Creates a grapheme cluster break iterator for a potentially ill-formed UTF-8 string.

Invalid byte sequences are treated as U+FFFD REPLACEMENT CHARACTER.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
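The replacement behaviour can be illustrated with the standard library's `String::from_utf8_lossy`, which also substitutes U+FFFD for invalid sequences. This is only an analogy and not a description of how the segmenter is implemented. Since the final breakpoint is documented as the length of the input, the offsets presumably refer to the original byte slice, so they should not be used to index a lossily converted copy, whose length differs. The helper name `lossy_len` is hypothetical.

```rust
/// Length in bytes of the lossily decoded form of `bytes`.
/// Hypothetical helper used only for illustration.
fn lossy_len(bytes: &[u8]) -> usize {
    String::from_utf8_lossy(bytes).len()
}

fn main() {
    // 0xFF can never appear in well-formed UTF-8.
    let input: &[u8] = b"a\xFFb";

    // The standard library substitutes U+FFFD for the invalid byte.
    assert_eq!(String::from_utf8_lossy(input), "a\u{FFFD}b");

    // U+FFFD takes three bytes in UTF-8, so the copy is longer than the input.
    assert_eq!(input.len(), 3);
    assert_eq!(lossy_len(input), 5);
}
```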

pub fn segment_latin1<'l, 's>( &'l self, input: &'s [u8], ) -> GraphemeClusterBreakIteratorLatin1<'l, 's>

Creates a grapheme cluster break iterator for a Latin-1 (8-bit) string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
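In Latin-1 every byte is one code point in the range U+0000 to U+00FF, so breakpoints are byte offsets into the `[u8]` input, and they differ from UTF-8 offsets as soon as a non-ASCII character appears. The std-only sketch below shows the difference in lengths; the helper name `latin1_to_string` is hypothetical and not part of the icu API.

```rust
/// Decodes Latin-1 bytes into a `String`: each byte maps to the code
/// point with the same numeric value. Hypothetical helper.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    // 0xE9 is LATIN SMALL LETTER E WITH ACUTE in Latin-1.
    let latin1: &[u8] = b"caf\xE9";
    let utf8 = latin1_to_string(latin1);

    assert_eq!(utf8, "café");
    // Four code units in Latin-1, but five bytes once re-encoded as UTF-8.
    assert_eq!(latin1.len(), 4);
    assert_eq!(utf8.len(), 5);
}
```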

pub fn segment_utf16<'l, 's>( &'l self, input: &'s [u16], ) -> GraphemeClusterBreakIteratorUtf16<'l, 's>

Creates a grapheme cluster break iterator for a UTF-16 string.

There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
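For UTF-16 input the "string length" is counted in 16-bit code units, not bytes or code points. The std-only sketch below builds a `[u16]` buffer with `str::encode_utf16` and shows that "Hello 🗺" is eight code units long, because World Map (U+1F5FA) needs a surrogate pair; the final breakpoint for that buffer would therefore be expected at 8 rather than the 10 seen for UTF-8. The helper name `to_utf16` is hypothetical.

```rust
/// Encodes `text` as a vector of UTF-16 code units. Hypothetical helper.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

fn main() {
    let text = "Hello 🗺";
    let units = to_utf16(text);

    // Six BMP characters plus one surrogate pair.
    assert_eq!(units.len(), 8);
    assert_eq!(&units[6..], &[0xD83D, 0xDDFA]);

    // The same text occupies ten bytes in UTF-8.
    assert_eq!(text.len(), 10);

    // A buffer like `units` is what would be passed to `segment_utf16(&units)`.
}
```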

pub(crate) fn new_and_segment_utf16<'l, 's>( input: &'s [u16], payload: &'l RuleBreakDataV1<'l>, ) -> GraphemeClusterBreakIteratorUtf16<'l, 's>

Creates a grapheme cluster break iterator from grapheme cluster rule payload.

Trait Implementations§

impl Debug for GraphemeClusterSegmenter

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for GraphemeClusterSegmenter

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> MaybeSendSync for T