Normalizing text into Unicode Normalization Forms.
This module is published as its own crate (icu_normalizer) and as part of the icu crate. See the latter for more details on the ICU4X project.
§Implementation notes
The normalizer operates on a lazy iterator over Unicode scalar values (Rust char) internally, and iterating over guaranteed-valid UTF-8, potentially-invalid UTF-8, and potentially-invalid UTF-16 is a step that doesn’t leak into the normalizer internals. Ill-formed sequences (invalid UTF-8 bytes or unpaired UTF-16 surrogates) are treated as U+FFFD.
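As a rough illustration of that boundary, the following sketch uses only the Rust standard library to turn potentially-invalid input into a stream of char with U+FFFD substituted for ill-formed sequences. The helper names chars_from_utf16 and chars_from_utf8 are hypothetical and are not part of this crate; the UTF-8 helper is eager for brevity, whereas the crate iterates lazily.

```rust
/// Hypothetical helper: lazily yields `char`s from potentially-invalid UTF-16,
/// mapping each unpaired surrogate to U+FFFD.
fn chars_from_utf16(units: &[u16]) -> impl Iterator<Item = char> + '_ {
    char::decode_utf16(units.iter().copied())
        .map(|result| result.unwrap_or(char::REPLACEMENT_CHARACTER))
}

/// Hypothetical helper: collects `char`s from potentially-invalid UTF-8,
/// with ill-formed byte sequences replaced by U+FFFD.
/// (Eager for simplicity; this is not how the crate does it internally.)
fn chars_from_utf8(bytes: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(bytes).chars().collect()
}
```

Either helper produces plain char values, so code downstream of this step never needs to know which encoding the input arrived in.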
The normalizer data layout is not based on the ICU4C design at all. Instead, the normalization data layout is a clean-slate design optimized for the concept of fusing the NFD decomposition into the collator. That is, the decomposing normalizer is a by-product of the collator-motivated data layout.
Notably, the decomposition data structure is optimized for a starter decomposing to itself, which is the most common case, and for a starter decomposing to a starter and a non-starter on the Basic Multilingual Plane. In the latter case, the collator makes use of the knowledge that the second character of such a decomposition is a non-starter. Therefore, decomposition into two starters is handled by a generic fallback path that looks up the decomposition from an array by offset and length instead of baking a BMP starter pair directly into a trie value.
The decompositions into non-starters are hard-coded. At present in Unicode, these appear to be special cases falling into three categories:
- Deprecated combining marks.
- Particular Tibetan vowel signs.
- NFKD only: half-width kana voicing marks.
Hopefully Unicode never adds more decompositions into non-starters (other than a character decomposing to itself), but if it does, a code update is needed instead of a mere data update.
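To make the three categories concrete, here is a minimal sketch of what a hard-coded table of this kind could look like. The function name and signature are hypothetical and the crate’s actual code is organized differently; the mappings themselves are taken from the Unicode Character Database as best understood here and should be checked against the current UnicodeData.txt before being relied on.

```rust
/// Hypothetical lookup for characters whose decomposition starts with a
/// non-starter. `compatibility` selects NFKD-style behavior.
fn special_non_starter_decomposition(c: char, compatibility: bool) -> Option<&'static str> {
    match c {
        // Deprecated combining marks.
        '\u{0340}' => Some("\u{0300}"),
        '\u{0341}' => Some("\u{0301}"),
        '\u{0343}' => Some("\u{0313}"),
        '\u{0344}' => Some("\u{0308}\u{0301}"),
        // Particular Tibetan vowel signs.
        '\u{0F73}' => Some("\u{0F71}\u{0F72}"),
        '\u{0F75}' => Some("\u{0F71}\u{0F74}"),
        '\u{0F81}' => Some("\u{0F71}\u{0F80}"),
        // NFKD only: half-width kana voicing marks.
        '\u{FF9E}' if compatibility => Some("\u{3099}"),
        '\u{FF9F}' if compatibility => Some("\u{309A}"),
        // Everything else goes through the data-driven path.
        _ => None,
    }
}
```

Because the set is expressed as code rather than data, a future Unicode version that added another such character would require a new match arm, which is the code update mentioned above.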
The composing normalizer builds on the decomposing normalizer by performing the canonical composition post-processing per spec. As an optimization, though, the composing normalizer attempts to pass through already-normalized text consisting of starters that never combine backwards and that map to themselves if followed by a character whose decomposition starts with a starter that never combines backwards.
As a difference with ICU4C, the composing normalizer has only the simplest possible passthrough (only one inversion list lookup per character in the best case) and the full decompose-then-canonically-compose behavior, whereas ICU4C has other paths between these extremes. The ICU4X collator doesn’t make use of the FCD concept at all in order to avoid doing the work of checking whether the FCD condition holds.
Re-exports§
pub use NormalizerError as Error;
Modules§
- error 🔒
- Normalizer-specific error
- properties
- Access to the Unicode properties or property-based operations that are required for NFC and NFD.
- provider
- 🚧 [Unstable] Data provider struct definitions for this ICU4X component.
- uts46
- Bundles the part of UTS 46 that makes sense to implement as a normalization.
Macros§
Structs§
- CharacterAndClass 🔒
- Pack a char and a CanonicalCombiningClass in 32 bits (the former in the lower 24 bits and the latter in the high 8 bits). The latter can be initialized to 0xFF upon creation, in which case it can be actually set later by calling set_ccc_from_trie_if_not_already_set. This is a micro optimization to avoid the Canonical Combining Class trie lookup when there is only one combining character in a sequence. This type is intentionally non-Copy to get compiler help in making sure that the class is set on the instance on which it is intended to be set and not on a temporary copy.
- CharacterAndTrieValue 🔒
- Struct for holding together a character and the value looked up for it from the NFD trie in a more explicit way than an anonymous pair. Also holds a flag about the supplementary-trie provenance.
- ComposingNormalizer
- A normalizer for performing composing normalization.
- Composition
- An iterator adaptor that turns an Iterator over char into a lazily-decomposed and then canonically composed char sequence.
- DecomposingNormalizer
- A normalizer for performing decomposing normalization.
- Decomposition
- An iterator adaptor that turns an Iterator over char into a lazily-decomposed char sequence.
- IsNormalizedSinkStr 🔒
- IsNormalizedSinkUtf8 🔒
- IsNormalizedSinkUtf16 🔒
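The packing described for CharacterAndClass can be sketched as follows. This is an illustrative stand-in, not the crate’s private type: the name PackedCharAndClass, its methods, and the closure that stands in for the trie lookup are all assumptions made for the example.

```rust
/// Hypothetical stand-in: a `char` in the low 24 bits and a canonical
/// combining class in the high 8 bits. Deliberately neither `Copy` nor
/// `Clone`, so the class cannot be set on an accidental temporary copy.
struct PackedCharAndClass(u32);

impl PackedCharAndClass {
    /// Placeholder meaning "class not looked up yet".
    const CCC_PLACEHOLDER: u8 = 0xFF;

    fn new(c: char, ccc: u8) -> Self {
        Self(u32::from(c) | (u32::from(ccc) << 24))
    }

    fn new_with_placeholder(c: char) -> Self {
        Self::new(c, Self::CCC_PLACEHOLDER)
    }

    fn character(&self) -> char {
        // The low 24 bits always came from a valid `char` in this sketch.
        char::from_u32(self.0 & 0x00FF_FFFF).unwrap_or(char::REPLACEMENT_CHARACTER)
    }

    fn ccc(&self) -> u8 {
        (self.0 >> 24) as u8
    }

    /// Performs the (potentially expensive) lookup only if the class is
    /// still the placeholder. `lookup` stands in for the trie access.
    fn set_ccc_if_not_already_set(&mut self, lookup: impl FnOnce(char) -> u8) {
        if self.ccc() == Self::CCC_PLACEHOLDER {
            let c = self.character();
            *self = Self::new(c, lookup(c));
        }
    }
}
```

Since every Unicode scalar value fits in 21 bits, the high 8 bits are free for the class, and deferring the lookup means a lone combining character never pays for it.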
Enums§
- IgnorableBehavior 🔒
- Treatment of the ignorable marker (0xFFFFFFFF) in data.
- NormalizerError
- A list of error outcomes for various operations in this module.
- SupplementPayloadHolder 🔒
Constants§
- BACKWARD_COMBINING_STARTER_MARKER 🔒
- Marker for starters that decompose to themselves but may combine backwards under canonical composition. (Main trie only; not used in the supplementary trie.)
- EMPTY_CHAR 🔒
- EMPTY_U16 🔒
- FDFA_MARKER 🔒
- Marker value for U+FDFA in NFKD
- HANGUL_JAMO_LIMIT 🔒
- One past the conjoining jamo block
- HANGUL_L_BASE 🔒
- Lead jamo base
- HANGUL_L_COUNT 🔒
- Lead jamo count
- HANGUL_N_COUNT 🔒
- Vowel jamo count times trail jamo count
- HANGUL_S_BASE 🔒
- Syllable base
- HANGUL_S_COUNT 🔒
- Syllable count
- HANGUL_T_BASE 🔒
- Trail jamo base (deliberately off by one to account for the absence of a trail)
- HANGUL_T_COUNT 🔒
- Trail jamo count (deliberately off by one to account for the absence of a trail)
- HANGUL_V_BASE 🔒
- Vowel jamo base
- HANGUL_V_COUNT 🔒
- Vowel jamo count
- IGNORABLE_MARKER 🔒
- Marker for UTS 46 ignorables.
- NON_ROUND_TRIP_MARKER 🔒
- Marker that a complex decomposition isn’t round-trippable under re-composition.
- SPECIAL_NON_STARTER_DECOMPOSITION_MARKER 🔒
- Magic marker trie value for characters whose decomposition starts with a non-starter. The actual decomposition is hard-coded.
- SPECIAL_NON_STARTER_DECOMPOSITION_MARKER_U16 🔒
- u16 version of the previous marker value.
- UTF16_FAST_PATH_FLUSH_THRESHOLD 🔒
- Number of iterations allowed on the fast path before flushing. Since a typical UTF-16 iteration advances over a 2-byte BMP character, this means two memory pages. Intel Core i7-4770 had the best results between 2 and 4 pages when testing powers of two. Apple M1 didn’t seem to care about 1, 2, 4, or 8 pages.
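The HANGUL_* constants correspond to the arithmetic Hangul syllable decomposition from the Unicode Standard. The sketch below shows how such constants fit together; the numeric values are the ones given in the Unicode Standard (with the trail base and count off by one, as described above) and are assumed rather than copied from this crate, and decompose_hangul is a hypothetical function name.

```rust
// Values per the Unicode Standard's Hangul syllable algorithm (assumed here).
const HANGUL_S_BASE: u32 = 0xAC00; // Syllable base
const HANGUL_L_BASE: u32 = 0x1100; // Lead jamo base
const HANGUL_V_BASE: u32 = 0x1161; // Vowel jamo base
const HANGUL_T_BASE: u32 = 0x11A7; // Trail jamo base, off by one: index 0 means "no trail"
const HANGUL_T_COUNT: u32 = 28; // Trail jamo count, including the "no trail" slot
const HANGUL_N_COUNT: u32 = 588; // Vowel jamo count (21) times trail jamo count (28)
const HANGUL_S_COUNT: u32 = 11172; // Lead jamo count (19) times HANGUL_N_COUNT

/// Hypothetical helper: splits a precomposed Hangul syllable into lead,
/// vowel, and optional trail jamo. Returns `None` for other characters.
fn decompose_hangul(c: char) -> Option<(char, char, Option<char>)> {
    // Wrapping subtraction folds the lower-bound check into the upper-bound check.
    let s_index = u32::from(c).wrapping_sub(HANGUL_S_BASE);
    if s_index >= HANGUL_S_COUNT {
        return None;
    }
    let lead = char::from_u32(HANGUL_L_BASE + s_index / HANGUL_N_COUNT)?;
    let vowel = char::from_u32(HANGUL_V_BASE + (s_index % HANGUL_N_COUNT) / HANGUL_T_COUNT)?;
    let t_index = s_index % HANGUL_T_COUNT;
    let trail = if t_index == 0 {
        None
    } else {
        char::from_u32(HANGUL_T_BASE + t_index)
    };
    Some((lead, vowel, trail))
}
```

The off-by-one trail base is what lets a trail index of zero stand for "no trail jamo" without a separate flag.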
Statics§
- FDFA_NFKD 🔒
- The tail (everything after the first character) of the NFKD form of U+FDFA as 16-bit units.
Functions§
- ccc_from_trie_value 🔒
- Extracts a canonical combining class (possibly zero) from a trie value.
- char_from_u16 🔒
- Convert a u16 obtained from data provider data to char.
- char_from_u32 🔒
- Convert a u32 obtained from data provider data to char.
- compose 🔒
- Performs canonical composition (including Hangul) on a pair of characters or returns None if these characters don’t compose. Composition exclusions are taken into account.
- compose_non_hangul 🔒
- Performs (non-Hangul) canonical composition on a pair of characters or returns None if these characters don’t compose. Composition exclusions are taken into account.
- decomposition_starts_with_non_starter 🔒
- Checks if a trie value signifies a character whose decomposition starts with a non-starter.
- in_inclusive_range 🔒
- in_inclusive_range16 🔒
- in_inclusive_range32 🔒
- sort_slice_by_ccc 🔒
- trie_value_has_ccc 🔒
- Checks if a trie value carries a (non-zero) canonical combining class.
- trie_value_indicates_special_non_starter_decomposition 🔒
- Checks if the trie signifies a special non-starter decomposition.
- unwrap_or_gigo 🔒
- If opt is Some, unwrap it. If None, panic if debug assertions are enabled and return default if debug assertions are not enabled.
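The behavior described for unwrap_or_gigo ("garbage in, garbage out") can be approximated as below. The signature is inferred from the description rather than taken from the crate source, and char_from_u32_or_replacement is a hypothetical caller showing how such a helper might back a data-to-char conversion; the choice of U+FFFD as the fallback is likewise an assumption.

```rust
/// Sketch of a GIGO unwrap: bad data trips an assertion in debug builds
/// and silently degrades to `default` in release builds.
fn unwrap_or_gigo<T>(opt: Option<T>, default: T) -> T {
    if let Some(value) = opt {
        value
    } else {
        // Compiled out when debug assertions are disabled.
        debug_assert!(false, "unexpected None from data");
        default
    }
}

/// Hypothetical caller: converts a `u32` read from data into a `char`,
/// falling back to U+FFFD if the data is not a valid scalar value.
fn char_from_u32_or_replacement(u: u32) -> char {
    unwrap_or_gigo(char::from_u32(u), char::REPLACEMENT_CHARACTER)
}
```

This pattern keeps release builds panic-free on corrupt data while still surfacing the problem loudly during development and testing.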