Crate unicode_width

Determine displayed width of char and str types according to Unicode Standard Annex #11 and other portions of the Unicode standard. See the Rules for determining width section for the exact rules.

This crate is #![no_std].

use unicode_width::UnicodeWidthStr;

let teststr = "Hello, world!";
let width = UnicodeWidthStr::width(teststr);
println!("{}", teststr);
println!("The above string is {} columns wide.", width);
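
A common use for the computed width is aligning text in a terminal. The following is a minimal sketch of that idea, not part of this crate: `pad_to_columns` is a hypothetical helper, and the measuring function is passed in as a parameter so the block compiles on its own. In real use you would likely pass a closure such as `|s| UnicodeWidthStr::width(s)`.

```rust
/// Pads `s` on the right with spaces until it occupies `columns` columns.
///
/// `width_of` is the measuring function (an assumption of this sketch); it is
/// expected to return the displayed width of its argument in columns.
/// Strings that are already wider than `columns` are returned unchanged.
fn pad_to_columns(s: &str, columns: usize, width_of: impl Fn(&str) -> usize) -> String {
    let used = width_of(s);
    // saturating_sub avoids underflow when the string is already too wide.
    let padding = columns.saturating_sub(used);
    let mut out = String::with_capacity(s.len() + padding);
    out.push_str(s);
    out.extend(std::iter::repeat(' ').take(padding));
    out
}
```

Padding by displayed width rather than by `char` count or byte length keeps columns aligned when a string contains wide or zero-width characters.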

§"cjk" feature flag

This crate has one Cargo feature flag, "cjk" (enabled by default). It enables the UnicodeWidthChar::width_cjk and UnicodeWidthStr::width_cjk methods, which perform an alternate width calculation more suited to CJK contexts. The flag also unseals the UnicodeWidthChar and UnicodeWidthStr traits.

Disabling the flag (with default-features = false in Cargo.toml) will reduce the amount of static data needed by the crate.
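
As a sketch of what that looks like in practice, a dependency entry that opts out of the default features might read as follows. The version requirement shown is an assumption for illustration only; use whichever release you actually depend on.

```toml
[dependencies]
# Version is illustrative. default-features = false turns off "cjk".
unicode-width = { version = "0.2", default-features = false }
```

With this entry the width_cjk methods are not available, so any call to them needs to be guarded with #[cfg(feature = "cjk")] or removed.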

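use unicode_width::UnicodeWidthStr;

// The two curly quotes are East Asian Ambiguous: 1 column each by default,
// 2 columns each under the CJK rules. The character between them is Wide
// (2 columns) in both modes.
let teststr = "“𘀀”";
assert_eq!(teststr.width(), 4);

// width_cjk is only available when the "cjk" feature is enabled.
#[cfg(feature = "cjk")]
assert_eq!(teststr.width_cjk(), 6);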

§Rules for determining width

This crate currently uses the following rules to determine the width of a character or string, in order of decreasing precedence. These may be tweaked in the future.

  1. In the following cases, the width of a string differs from the sum of the widths of its constituent characters:
    • The sequence "\r\n" has width 1.
    • Emoji-specific ligatures:
    • '\u{2018}', '\u{2019}', '\u{201C}', and '\u{201D}' always have width 1 when followed by ‘\u{FE00}’, and width 2 when followed by ‘\u{FE01}’.
    • Script-specific ligatures:
      • For all the following ligatures, the insertion of any number of default-ignorable combining marks anywhere in the sequence will not change the total width. In addition, for all non-Arabic ligatures, the insertion of any number of '\u{200D}' ZERO WIDTH JOINERs will not affect the width.
      • Arabic: A character sequence consisting of one character with Joining_Group=Lam, followed by any number of characters with Joining_Type=Transparent, followed by one character with Joining_Group=Alef, has total width 1. For example: لا‎, لآ‎, ڸا‎, لٟٞأ
      • Buginese: "\u{1A15}\u{1A17}\u{200D}\u{1A10}" (<a, -i> ya, ᨕᨗ‍ᨐ) has total width 1.
      • Hebrew: "א\u{200D}ל" (Alef-Lamed, א‍ל) has total width 1.
      • Khmer: Coeng signs consisting of '\u{17D2}' followed by a character in '\u{1780}'..='\u{1782}' | '\u{1784}'..='\u{1787}' | '\u{1789}'..='\u{178C}' | '\u{178E}'..='\u{1793}' | '\u{1795}'..='\u{1798}' | '\u{179B}'..='\u{179D}' | '\u{17A0}' | '\u{17A2}' | '\u{17A7}' | '\u{17AB}'..='\u{17AC}' | '\u{17AF}' have width 0.
      • Kirat Rai: Any sequence canonically equivalent to '\u{16D68}', '\u{16D69}', or '\u{16D6A}' has total width 1.
      • Lisu: Tone letter combinations consisting of a character in the range '\u{A4F8}'..='\u{A4FB}' followed by a character in the range '\u{A4FC}'..='\u{A4FD}' have width 1. For example: ꓹꓼ
      • Old Turkic: "\u{10C32}\u{200D}\u{10C03}" (𐰲‍𐰃) has total width 1.
      • Tifinagh: A sequence of a Tifinagh consonant in the range '\u{2D31}'..='\u{2D65}' | '\u{2D6F}', followed by either '\u{2D7F}' TIFINAGH CONSONANT JOINER or '\u{200D}', followed by another Tifinagh consonant, has total width 1. For example: ⵏ⵿ⴾ
    • In an East Asian context only, <, =, or > have width 2 when followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. The two characters may be separated by any number of characters whose canonical decompositions consist only of characters meeting one of the following requirements:
  2. In all other cases, the width of the string equals the sum of its character widths:
    1. '\u{2D7F}' TIFINAGH CONSONANT JOINER has width 1 (outside of the ligatures described previously).
    2. '\u{115F}' HANGUL CHOSEONG FILLER and '\u{17A4}' KHMER INDEPENDENT VOWEL QAA have width 2.
    3. '\u{17D8}' KHMER SIGN BEYYAL has width 3.
    4. The following have width 0:
    5. Characters with an East_Asian_Width of Fullwidth or Wide have width 2.
    6. Characters fulfilling all of the following conditions have width 2 in an East Asian context, and width 1 otherwise:
    7. All other characters have width 1.
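
Several of the script-specific rules above are plain code point range checks. The sketch below transcribes three of them (Khmer coeng signs, Lisu tone letter combinations, and Tifinagh ligatures) into standalone predicates. It is an illustration under simplifying assumptions, not this crate's implementation: the function names are hypothetical, the predicates only classify sequences rather than compute widths, and the Tifinagh check ignores the allowance for interspersed default-ignorable combining marks.

```rust
/// Khmer characters that form a coeng sign when they directly follow U+17D2.
/// The ranges are copied from the Khmer rule above.
fn is_khmer_coeng_base(c: char) -> bool {
    matches!(
        c,
        '\u{1780}'..='\u{1782}'
            | '\u{1784}'..='\u{1787}'
            | '\u{1789}'..='\u{178C}'
            | '\u{178E}'..='\u{1793}'
            | '\u{1795}'..='\u{1798}'
            | '\u{179B}'..='\u{179D}'
            | '\u{17A0}'
            | '\u{17A2}'
            | '\u{17A7}'
            | '\u{17AB}'..='\u{17AC}'
            | '\u{17AF}'
    )
}

/// True if the pair is a coeng sign (width 0 under the Khmer rule).
fn is_khmer_coeng_sign(first: char, second: char) -> bool {
    first == '\u{17D2}' && is_khmer_coeng_base(second)
}

/// True if the pair is a Lisu tone letter combination (width 1 under the Lisu rule).
fn is_lisu_tone_combination(first: char, second: char) -> bool {
    matches!(first, '\u{A4F8}'..='\u{A4FB}') && matches!(second, '\u{A4FC}'..='\u{A4FD}')
}

/// Tifinagh consonants as delimited by the Tifinagh rule.
fn is_tifinagh_consonant(c: char) -> bool {
    matches!(c, '\u{2D31}'..='\u{2D65}' | '\u{2D6F}')
}

/// True if `s` is exactly consonant + joiner + consonant, where the joiner is
/// U+2D7F or U+200D. Simplification: extra ignorable marks are not accepted.
fn is_tifinagh_ligature(s: &str) -> bool {
    let mut chars = s.chars();
    match (chars.next(), chars.next(), chars.next(), chars.next()) {
        (Some(a), Some(joiner), Some(b), None) => {
            is_tifinagh_consonant(a)
                && matches!(joiner, '\u{2D7F}' | '\u{200D}')
                && is_tifinagh_consonant(b)
        }
        _ => false,
    }
}
```

For instance, `is_tifinagh_ligature("\u{2D4F}\u{2D7F}\u{2D3E}")` accepts the Tifinagh example given in the list, while a string that ends after the joiner is rejected.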

§Canonical equivalence

Canonically equivalent strings are assigned the same width (CJK and non-CJK).

Constants§

UNICODE_VERSION
The version of Unicode that this version of unicode-width is based on.

Traits§

UnicodeWidthChar
Methods for determining displayed width of Unicode characters.
UnicodeWidthStr
Methods for determining displayed width of Unicode strings.