Struct icu_segmenter::line::LineSegmenter
pub struct LineSegmenter {
    options: LineBreakOptions,
    payload: DataPayload<LineBreakDataV1Marker>,
    complex: ComplexPayloads,
}
Supports loading line break data, and creating line break iterators for different string encodings.
The segmenter returns mandatory breaks (as defined by definition LD7 of Unicode Standard Annex #14, Unicode Line Breaking Algorithm) as well as line break opportunities (definition LD3). It does not distinguish them. Callers requiring that distinction can check the Line_Break property of the code point preceding the break against those listed in rules LB4 and LB5, special-casing the end of text according to LB3.
For consistency with the grapheme, word, and sentence segmenters, there is always a breakpoint returned at index 0, but this breakpoint is not a meaningful line break opportunity.
use icu::segmenter::LineSegmenter;
let segmenter = LineSegmenter::new_auto();
let text = "Summary\r\nThis annex…";
let breakpoints: Vec<usize> = segmenter.segment_str(text).collect();
// 9 and 22 are mandatory breaks, 14 is a line break opportunity.
assert_eq!(&breakpoints, &[0, 9, 14, 22]);
// There is a break opportunity between emoji, but not within the ZWJ sequence 🏳️‍🌈.
let flag_equation = "🏳️➕🌈🟰🏳️\u{200D}🌈";
let possible_first_lines: Vec<&str> =
    segmenter.segment_str(flag_equation).skip(1).map(|i| &flag_equation[..i]).collect();
assert_eq!(
    &possible_first_lines,
    &[
        "🏳️",
        "🏳️➕",
        "🏳️➕🌈",
        "🏳️➕🌈🟰",
        "🏳️➕🌈🟰🏳️‍🌈"
    ]
);
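The breakpoints above are byte offsets into the input, so each consecutive pair delimits one unbreakable segment. The following is a minimal sketch of that slicing using only the standard library. The helper split_at_breakpoints is hypothetical and not part of icu_segmenter; it assumes the offsets are sorted and fall on char boundaries, as in the [0, 9, 14, 22] result above.

```rust
/// Slices `text` into the segments delimited by consecutive breakpoints.
/// Hypothetical helper: assumes `breakpoints` is sorted and that every
/// offset lies on a `char` boundary (string slicing panics otherwise).
fn split_at_breakpoints<'a>(text: &'a str, breakpoints: &[usize]) -> Vec<&'a str> {
    breakpoints
        .windows(2) // each window is one [start, end] pair of offsets
        .map(|pair| &text[pair[0]..pair[1]])
        .collect()
}
```

Applied to "Summary\r\nThis annex…" with the breakpoints [0, 9, 14, 22], this yields the segments "Summary\r\n", "This ", and "annex…".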
Examples
Segment a string with default options:
use icu::segmenter::LineSegmenter;
let segmenter = LineSegmenter::new_auto();
let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 6, 11]);
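A typical consumer of these breakpoints is a line-wrapping routine. The sketch below is one possible greedy strategy written against the standard library only, so the breakpoints are passed in rather than computed. The function wrap_greedy and its max_chars parameter are hypothetical; the sketch measures width in chars (counting trailing spaces), ignores mandatory breaks, and assumes sorted offsets on char boundaries.

```rust
/// Greedily packs the segments between breakpoints into lines of at most
/// `max_chars` characters. A single segment longer than the limit is kept
/// whole on its own line. Hypothetical helper, not part of icu_segmenter.
fn wrap_greedy<'a>(text: &'a str, breakpoints: &[usize], max_chars: usize) -> Vec<&'a str> {
    let mut lines = Vec::new();
    let mut line_start = 0; // byte offset where the current line begins
    let mut line_end = 0; // last breakpoint accepted into the current line

    // Index 0 is always reported but is not a real break opportunity.
    for &bp in breakpoints.iter().filter(|&&bp| bp > 0) {
        let candidate_len = text[line_start..bp].chars().count();
        // If extending to `bp` overflows, close the line at the previous breakpoint.
        if candidate_len > max_chars && line_end > line_start {
            lines.push(&text[line_start..line_end]);
            line_start = line_end;
        }
        line_end = bp;
    }
    // Flush whatever remains as the final line.
    if line_end > line_start {
        lines.push(&text[line_start..line_end]);
    }
    lines
}
```

With "Hello World", the breakpoints [0, 6, 11], and a limit of 8, this produces the two lines "Hello " and "World"; with a limit of 20 the text stays on one line.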
Segment a string with CSS option overrides:
use icu::segmenter::{
    LineBreakOptions, LineBreakStrictness, LineBreakWordOption,
    LineSegmenter,
};
let mut options = LineBreakOptions::default();
options.strictness = LineBreakStrictness::Strict;
options.word_option = LineBreakWordOption::BreakAll;
options.ja_zh = false;
let segmenter = LineSegmenter::new_auto_with_options(options);
let breakpoints: Vec<usize> =
    segmenter.segment_str("Hello World").collect();
assert_eq!(&breakpoints, &[0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11]);
Segment a Latin1 byte string:
use icu::segmenter::LineSegmenter;
let segmenter = LineSegmenter::new_auto();
let breakpoints: Vec<usize> =
    segmenter.segment_latin1(b"Hello World").collect();
assert_eq!(&breakpoints, &[0, 6, 11]);
Separate mandatory breaks from the break opportunities:
use icu::properties::{maps, LineBreak};
use icu::segmenter::LineSegmenter;
let segmenter = LineSegmenter::new_auto();
let text = "Summary\r\nThis annex…";
let mandatory_breaks: Vec<usize> = segmenter
    .segment_str(text)
    .filter(|&i| {
        // A break is mandatory if it follows a BK, CR, LF, or NL code point
        // (rules LB4 and LB5) or falls at the end of the text (LB3).
        text[..i].chars().next_back().map_or(false, |c| {
            matches!(
                maps::line_break().get(c),
                LineBreak::MandatoryBreak
                    | LineBreak::CarriageReturn
                    | LineBreak::LineFeed
                    | LineBreak::NextLine
            ) || i == text.len()
        })
    })
    .collect();
assert_eq!(&mandatory_breaks, &[9, 22]);
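The same classification can be sketched without the property lookup, which may help when reading the rule references above. This version uses only the standard library and hard-codes the code points that, as far as the Unicode line break data is understood here, carry the Line_Break values BK, CR, LF, and NL. Both that list and the helper names are assumptions made for illustration; the property lookup shown above is the more robust choice.

```rust
/// Code points assumed to have Line_Break = BK, CR, LF, or NL
/// (an illustrative hard-coded list, not taken from icu_properties).
fn is_hard_break_char(c: char) -> bool {
    matches!(
        c,
        '\u{000A}' // LF
            | '\u{000B}' | '\u{000C}' // BK: vertical tab, form feed
            | '\u{000D}' // CR
            | '\u{0085}' // NL
            | '\u{2028}' | '\u{2029}' // BK: line and paragraph separators
    )
}

/// Splits breakpoints into (mandatory breaks, break opportunities).
/// The breakpoint at index 0 is dropped because it is not a real break.
fn partition_breaks(text: &str, breakpoints: &[usize]) -> (Vec<usize>, Vec<usize>) {
    let mut mandatory = Vec::new();
    let mut opportunities = Vec::new();
    for &i in breakpoints {
        if i == 0 {
            continue;
        }
        let after_hard_break = text[..i]
            .chars()
            .next_back()
            .map_or(false, is_hard_break_char);
        // The end of text is always treated as a mandatory break (LB3).
        if after_hard_break || i == text.len() {
            mandatory.push(i);
        } else {
            opportunities.push(i);
        }
    }
    (mandatory, opportunities)
}
```

For "Summary\r\nThis annex…" and the breakpoints [0, 9, 14, 22], the mandatory list is [9, 22] and the opportunity list is [14].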
Fields
options: LineBreakOptions
payload: DataPayload<LineBreakDataV1Marker>
complex: ComplexPayloads
Implementations
impl LineSegmenter
pub fn new_auto() -> Self
Constructs a LineSegmenter with an invariant locale and the best available compiled data for complex scripts (Khmer, Lao, Myanmar, and Thai).
The current behavior, which is subject to change, is to use the LSTM model when available.
See also Self::new_auto_with_options.
✨ Enabled with the compiled_data and auto Cargo features.
pub fn try_new_auto_with_any_provider(
    provider: &(impl AnyProvider + ?Sized),
) -> Result<Self, SegmenterError>
A version of Self::new_auto that uses custom data provided by an AnyProvider.
pub fn try_new_auto_unstable<D>(provider: &D) -> Result<Self, SegmenterError>
A version of Self::new_auto that uses custom data provided by a DataProvider.
📚 Help choosing a constructor
pub fn new_lstm() -> Self
Constructs a LineSegmenter with an invariant locale and compiled LSTM data for complex scripts (Khmer, Lao, Myanmar, and Thai).
The LSTM, or Long Short-Term Memory, is a machine learning model. It is smaller than the full dictionary but more expensive during segmentation (inference).
See also Self::new_lstm_with_options.
✨ Enabled with the compiled_data and lstm Cargo features.
pub fn try_new_lstm_with_any_provider(
    provider: &(impl AnyProvider + ?Sized),
) -> Result<Self, SegmenterError>
A version of Self::new_lstm that uses custom data provided by an AnyProvider.
pub fn try_new_lstm_unstable<D>(provider: &D) -> Result<Self, SegmenterError>
A version of Self::new_lstm that uses custom data provided by a DataProvider.
📚 Help choosing a constructor
pub fn new_dictionary() -> Self
Constructs a LineSegmenter with an invariant locale and compiled dictionary data for complex scripts (Khmer, Lao, Myanmar, and Thai).
The dictionary model uses a list of words to determine appropriate breakpoints. It is faster than the LSTM model but requires more data.
See also Self::new_dictionary_with_options.
✨ Enabled with the compiled_data Cargo feature.
pub fn try_new_dictionary_with_any_provider(
    provider: &(impl AnyProvider + ?Sized),
) -> Result<Self, SegmenterError>
A version of Self::new_dictionary that uses custom data provided by an AnyProvider.
pub fn try_new_dictionary_unstable<D>(
    provider: &D,
) -> Result<Self, SegmenterError>
A version of Self::new_dictionary that uses custom data provided by a DataProvider.
📚 Help choosing a constructor
pub fn new_auto_with_options(options: LineBreakOptions) -> Self
Constructs a LineSegmenter with an invariant locale, custom LineBreakOptions, and the best available compiled data for complex scripts (Khmer, Lao, Myanmar, and Thai).
The current behavior, which is subject to change, is to use the LSTM model when available.
See also Self::new_auto.
✨ Enabled with the compiled_data and auto Cargo features.
pub fn try_new_auto_with_options_with_any_provider(
    provider: &(impl AnyProvider + ?Sized),
    options: LineBreakOptions,
) -> Result<Self, SegmenterError>
A version of Self::new_auto_with_options that uses custom data provided by an AnyProvider.
pub fn try_new_auto_with_options_unstable<D>(
    provider: &D,
    options: LineBreakOptions,
) -> Result<Self, SegmenterError>
A version of Self::new_auto_with_options that uses custom data provided by a DataProvider.
📚 Help choosing a constructor
pub fn new_lstm_with_options(options: LineBreakOptions) -> Self
Constructs a LineSegmenter with an invariant locale, custom LineBreakOptions, and compiled LSTM data for complex scripts (Khmer, Lao, Myanmar, and Thai).
The LSTM, or Long Short-Term Memory, is a machine learning model. It is smaller than the full dictionary but more expensive during segmentation (inference).
See also Self::new_lstm.
✨ Enabled with the compiled_data and lstm Cargo features.
pub fn try_new_lstm_with_options_with_any_provider(
    provider: &(impl AnyProvider + ?Sized),
    options: LineBreakOptions,
) -> Result<Self, SegmenterError>
A version of Self::new_lstm_with_options that uses custom data provided by an AnyProvider.
pub fn try_new_lstm_with_options_unstable<D>(
    provider: &D,
    options: LineBreakOptions,
) -> Result<Self, SegmenterError>
A version of Self::new_lstm_with_options that uses custom data provided by a DataProvider.
📚 Help choosing a constructor
pub fn new_dictionary_with_options(options: LineBreakOptions) -> Self
Constructs a LineSegmenter with an invariant locale, custom LineBreakOptions, and compiled dictionary data for complex scripts (Khmer, Lao, Myanmar, and Thai).
The dictionary model uses a list of words to determine appropriate breakpoints. It is faster than the LSTM model but requires more data.
See also Self::new_dictionary.
✨ Enabled with the compiled_data Cargo feature.
pub fn try_new_dictionary_with_options_with_any_provider(
    provider: &(impl AnyProvider + ?Sized),
    options: LineBreakOptions,
) -> Result<Self, SegmenterError>
A version of Self::new_dictionary_with_options that uses custom data provided by an AnyProvider.
pub fn try_new_dictionary_with_options_unstable<D>(
    provider: &D,
    options: LineBreakOptions,
) -> Result<Self, SegmenterError>
A version of Self::new_dictionary_with_options that uses custom data provided by a DataProvider.
📚 Help choosing a constructor
pub fn segment_str<'l, 's>(
    &'l self,
    input: &'s str,
) -> LineBreakIteratorUtf8<'l, 's>
Creates a line break iterator for a str (a UTF-8 string).
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
pub fn segment_utf8<'l, 's>(
    &'l self,
    input: &'s [u8],
) -> LineBreakIteratorPotentiallyIllFormedUtf8<'l, 's>
Creates a line break iterator for a potentially ill-formed UTF-8 string.
Invalid characters are treated as U+FFFD REPLACEMENT CHARACTER.
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
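To illustrate what replacement-character substitution looks like, the sketch below uses only the standard library's lossy decoder. It is not how segment_utf8 is implemented: that method takes the raw bytes directly and needs no pre-decoding, and whether it substitutes at the same granularity as the standard library is not stated here. The helper names are hypothetical.

```rust
use std::borrow::Cow;

/// Decodes possibly ill-formed UTF-8 with the standard library, replacing
/// each invalid sequence with U+FFFD REPLACEMENT CHARACTER.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Returns the length in bytes of the longest well-formed UTF-8 prefix.
/// Equal to `bytes.len()` when the whole input is valid.
fn well_formed_prefix_len(bytes: &[u8]) -> usize {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    }
}
```

A check such as well_formed_prefix_len(bytes) == bytes.len() is one way to decide whether the input could instead be passed to segment_str as a str.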
pub fn segment_latin1<'l, 's>(
    &'l self,
    input: &'s [u8],
) -> LineBreakIteratorLatin1<'l, 's>
Creates a line break iterator for a Latin-1 (8-bit) string.
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
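In Latin-1 every byte is one code point with the same numeric value, so an index into the input stops matching a UTF-8 byte offset as soon as the text contains a byte of 0x80 or above. The sketch below shows the conversion with the standard library only. It assumes the breakpoints index the input byte slice, which is consistent with the [0, 6, 11] result for b"Hello World" above; both helper names are hypothetical.

```rust
/// Decodes Latin-1 bytes: each byte maps to the code point of the same value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Maps an index into the Latin-1 input to the byte offset of the same
/// position in the UTF-8 string produced by `latin1_to_string`.
/// Panics if `index` is greater than `bytes.len()`.
fn latin1_index_to_utf8_offset(bytes: &[u8], index: usize) -> usize {
    bytes[..index]
        .iter()
        // ASCII bytes stay one byte in UTF-8; 0x80..=0xFF become two bytes.
        .map(|&b| if b < 0x80 { 1 } else { 2 })
        .sum()
}
```

For the 12-byte input b"caf\xE9 au lait", index 5 (just after the first space) corresponds to UTF-8 offset 6 in the decoded string "café au lait".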
pub fn segment_utf16<'l, 's>(
    &'l self,
    input: &'s [u16],
) -> LineBreakIteratorUtf16<'l, 's>
Creates a line break iterator for a UTF-16 string.
There are always breakpoints at 0 and the string length, or only at 0 for the empty string.
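A caller holding a str can obtain the &[u16] input with the standard library, and may then need to map results back to UTF-8 byte offsets. The sketch below assumes that the breakpoints for UTF-16 input are counted in 16-bit code units, which matches the "string length" wording above but is not spelled out here. The helper names are hypothetical.

```rust
/// Encodes a `str` as UTF-16 code units, suitable as `&[u16]` input.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

/// Converts an index in UTF-16 code units to a UTF-8 byte offset in `text`.
/// Returns `None` if the index is past the end or splits a surrogate pair.
fn utf16_index_to_utf8_offset(text: &str, index16: usize) -> Option<usize> {
    let mut units = 0; // UTF-16 code units consumed so far
    for (byte_offset, c) in text.char_indices() {
        if units == index16 {
            return Some(byte_offset);
        }
        if units > index16 {
            return None; // the index fell inside the previous character
        }
        units += c.len_utf16();
    }
    if units == index16 {
        Some(text.len())
    } else {
        None
    }
}
```

For "a🌈b" the UTF-16 form has 4 code units while the UTF-8 form has 6 bytes, so code-unit index 3 maps to byte offset 5.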