Standard Code for Information Interchange)
  ◦ Character = 7 bits, e.g., ⟨A⟩ = 65, or 0x41, or 0b1000001
  ◦ Not enough for modern texts (even with Extended ASCII)
• Big-5 (i.e. CP950 on Windows)
  ◦ Character = 16 bits (2 bytes), e.g., ⟨字⟩ (/zì/, “character”) = 0xA672
  ◦ Region-specific encoding/codepage
• Unicode Codepoints
  ◦ Character = 21 bits (logically), e.g., ⟨字⟩ = U+5B57, or No. 23383
  ◦ One standard to rule them all (over the world)
  ◦ Technically, some codepoints do not refer to visible/standalone characters
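A quick check of these numbers in Python 3 (my own sketch, not from the slides; it assumes the interpreter ships the standard big5 codec, which CPython does):
>>> ord('A'), hex(ord('A')), bin(ord('A'))      # ASCII: one 7-bit number
(65, '0x41', '0b1000001')
>>> '字'.encode('big5').hex()                    # Big-5: a 2-byte, region-specific code
'a672'
>>> ord('字'), hex(ord('字'))                    # Unicode codepoint U+5B57
(23383, '0x5b57')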
OK then, How to represent a number in the computer world?
• A Unicode codepoint needs 21 bits (~3 bytes unsigned)
• For example, ⟨字⟩ is at No. 23383 (U+5B57)
• “Let's store each character with 4 bytes”
  ◦ Named UCS-4, or UTF-32 (where UCS = Universal Coded Character Set, and UTF = Unicode Transformation Format)
  ◦ The encoded buffer looks just like the codepoint sequence, easy peasy!
  ◦ ⟨語言⟩ (/yǔ yán/, “language”) = ⟨U+8A9E⟩⟨U+8A00⟩ → ⟨9E 8A 00 00⟩⟨00 8A 00 00⟩
  ◦ ⟨AB⟩ = ⟨U+0041⟩⟨U+0042⟩ → ⟨41 00 00 00⟩⟨42 00 00 00⟩
  ◦ Not efficient in most cases (where ASCII or Extended ASCII was/is enough)
⚠ I'm trying to make it short & simple, so the information may be imprecise or incorrect. Please refer to Wikipedia and the Unicode website for the detailed history.
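The UTF-32 byte layout above can be reproduced with Python's built-in codecs (a sketch assuming a little-endian layout, hence 'utf-32-le'; bytes.hex(' ') needs Python ≥ 3.8):
>>> '語言'.encode('utf-32-le').hex(' ')   # 4 bytes per codepoint, little-endian
'9e 8a 00 00 00 8a 00 00'
>>> 'AB'.encode('utf-32-le').hex(' ')     # ASCII letters waste 3 zero bytes each
'41 00 00 00 42 00 00 00'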
OK then, How to represent a number in the computer world? (cont.)
• “Why not just store the lower 2 bytes?”
  ◦ UCS-2, and later UTF-16 (unlike UCS-4 & UTF-32, these two are different!)
  ◦ ⟨語⟩⟨言⟩ = ⟨U+8A9E⟩⟨U+8A00⟩ → bytes ⟨9E 8A⟩⟨00 8A⟩
  ◦ ⟨A⟩⟨B⟩ = ⟨U+0041⟩⟨U+0042⟩ → ⟨41 00⟩⟨42 00⟩
  ◦ Reduces overall document size by 50% (in comparison to UTF-32)
• “Come on. My ASCII document is still bloated to 200% with a lot of trash.”
  ◦ UTF-8 smartly encodes a character with 1~4 bytes (i.e. variable length)
  ◦ ⟨語言AB⟩ = ⟨U+8A9E⟩⟨U+8A00⟩⟨U+0041⟩⟨U+0042⟩ → ⟨E8 AA 9E⟩⟨E8 A8 80⟩⟨41⟩⟨42⟩
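For comparison, the same text through the 16-bit and UTF-8 codecs (again my sketch with little-endian UTF-16; Python has no separate UCS-2 codec, but for BMP-only text UTF-16 produces the same 2-byte units):
>>> '語言AB'.encode('utf-16-le').hex(' ')   # 2 bytes per BMP codepoint
'9e 8a 00 8a 41 00 42 00'
>>> '語言AB'.encode('utf-8').hex(' ')       # 1 byte for ASCII, 3 bytes for each CJK character
'e8 aa 9e e8 a8 80 41 42'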
“A standard to give every character a unique identifier”
Byte stream ⟨E8 AA 9E E8 A8 80 41 42⟩ decoded step by step:
  ◦ ⟨E8 AA 9E⟩ ↔ U+8A9E ↔ ⟨語⟩
  ◦ ⟨E8 A8 80⟩ ↔ U+8A00 ↔ ⟨言⟩
  ◦ ⟨41⟩ ↔ U+0041 ↔ ⟨A⟩
  ◦ ⟨42⟩ ↔ U+0042 ↔ ⟨B⟩
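Reading the diagram the other way around, i.e. decoding the UTF-8 byte stream back to codepoints (my own check in Python 3):
>>> bytes.fromhex('E8 AA 9E E8 A8 80 41 42').decode('utf-8')
'語言AB'
>>> [hex(ord(c)) for c in '語言AB']          # the unique identifiers behind the glyphs
['0x8a9e', '0x8a00', '0x41', '0x42']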
• Unicode needs 21 bits to represent all characters
  ◦ But UCS-2 only takes 16 bits for each character
  ◦ What if we need to store/display a character that takes more than 16 bits?
  ◦ For example, 💩 = U+1F4A9 = 0b11111010010101001 (17 bits)
• Augment UCS-2 with variable-length capability → UTF-16
• Unicode defines planes
  ◦ Plane = group of 65536 (2¹⁶) codepoints
  ◦ BMP (Basic Multilingual Plane), or simply Plane 0, defines the commonly used characters
  ◦ A codepoint outside the BMP takes more than 16 bits
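How a codepoint outside the BMP ends up as two 16-bit code units (a surrogate pair) can be seen through the utf-16-le codec (my own sketch, not from the slides):
>>> ord('💩'), hex(ord('💩'))                # > 0xFFFF, so it cannot fit in one 16-bit unit
(128169, '0x1f4a9')
>>> '💩'.encode('utf-16-le').hex(' ')        # surrogate pair D83D DCA9, little-endian
'3d d8 a9 dc'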
Unicode Equivalence / Normalization
• Canonical codepoint sequences
  ◦ Same appearance and meaning when printed or displayed
  ◦ For example, the diacritics (accents) and the Hangul syllables
• Compatible codepoint sequences
  ◦ Possibly distinct appearance, but same meaning in some contexts
  ◦ For example, ⟨ﬀ⟩ U+FB00 is compatible with two ordinary Latin ⟨f⟩ U+0066 letters
  ◦ Another example: the ligature ⟨㍿⟩ U+337F (/kabushiki gaisha/)
  ◦ Ref. CJK Compatibility
• Normal form (normalization)
  ◦ Reduce equivalent codepoint sequences to the same codepoints
  ◦ NFD, NFC, NFKD, NFKC
% python3
>>> from unicodedata import normalize
>>> '\uC77C', '\uC774\u11AF'
('일', '일')
>>> '\uC77C' == '\uC774\u11AF'
False
>>> '\uC77C' == normalize('NFC', '\uC774\u11AF')
True
>>> normalize('NFD', '\uC77C') == '\uC774\u11AF'
False
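The compatibility forms ⟨ﬀ⟩ and ⟨㍿⟩ above are only folded by the K (compatibility) normal forms; a sketch continuing the same Python session:
>>> normalize('NFKC', '\uFB00')    # ﬀ → two ordinary 'f' letters
'ff'
>>> normalize('NFKC', '\u337F')    # ㍿ → 株式会社 (kabushiki gaisha)
'株式会社'
>>> normalize('NFC', '\u337F')     # canonical forms keep compatibility characters as-is
'㍿'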
Programming with Unicode (the simple version)
• Python 3
  ◦ len(str) and [x for x in str] operate on codepoints (beware of combining characters ⚠)
  ◦ Rule of 🍔 → ① bytes#decode to text on input, ② process text, ③ str#encode to bytes on output (see the sketch after this list)
  ◦ Use unicodedata#normalize and str#casefold for display/sorting/comparison
  ◦ Sorting (standard sorted) may require locale#setlocale, or just import pyuca
  ◦ Use grapheme#length to count visible characters
  ◦ Ref. Unicode HOWTO and Unicode Objects and Codecs (Python 3 official documentation)
• JavaScript
  ◦ String#length counts 16-bit (UTF-16) code units, while for … of str and [...str] operate on codepoints
  ◦ Beware of surrogate pairs (in some APIs) and combining characters ⚠
  ◦ Consider String#normalize, String#localeCompare, and String#toLocaleLowerCase
  ◦ Use GraphemeSplitter#countGraphemes to count visible characters
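A small Python 3 sketch of these points (the 🍔 rule, combining characters, and casefold); it sticks to the standard library, so the grapheme and pyuca packages are not shown:
>>> data = b'\xe8\xaa\x9e'                       # ① bytes in → decode to text
>>> text = data.decode('utf-8')
>>> len('e\u0301'), 'e\u0301'                    # ② process text: one visible character, two codepoints
(2, 'é')
>>> 'Straße'.casefold() == 'STRASSE'.casefold()  # caseless comparison; lower() would not match here
True
>>> text.encode('utf-8')                         # ③ encode back to bytes on output
b'\xe8\xaa\x9e'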
% python3
Python 3.9.0 (default, Dec 6 2020, 18:02:34)
>>> '\N{PILE OF POO}'
'💩'
>>> import unicodedata
>>> x = '⑧⑥'
>>> int(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '⑧⑥'
>>> sum([unicodedata.numeric(x[i]) * 10**(len(x)-i-1) \
...      for i in range(len(x))])
86.0
Advanced Topics (not covered today)
• Collation (comparison among strings, e.g., by radical 部首 or strokes 筆劃 for Chinese, ref)
• Casefolding (aggressive “lower()” for non-Latin caseless comparison)
• CJK Unified Ideographs (for ⟨㋿⟩, which one to use → ⟨令 U+4EE4⟩ vs ⟨令 U+F9A8⟩)
• IDN homograph attack (phishing scam)
• Regular expressions (e.g., r"\p{InCJK_Unified_Ideographs}")
• Transliteration (letter swap rules, e.g., Greek ⟨α⟩ → ⟨a⟩, sometimes used in SMS)
• Punycode (represent Unicode characters in an ASCII subset, e.g., ⟨中文⟩ → ⟨xn--fiq228c⟩; see the sketch below)
• Right-to-left mark ⟨U+200F⟩ (e.g., ⟨U+0623 U+0647 U+0644 U+0627! U+200F⟩ → ⟨!ﻼھأ⟩)
• Localization (with Unicode CLDR) → Unicode is not just about characters
• Databases: MySQL (utf8mb4), and other languages: C, C++, Swift, Go, HTML, etc.
• Security issues related to Unicode processing (e.g., CVE-2018-4290 – crashes iPhone)
• … (Unicode is too hard to master)
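Of the topics above, Punycode is the easiest to try from Python, via the built-in idna codec (my addition, not from the slides):
>>> '中文'.encode('idna')            # ToASCII: Unicode label → Punycode in the ASCII subset
b'xn--fiq228c'
>>> b'xn--fiq228c'.decode('idna')    # and back again
'中文'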