Ruby Taught Me About Encoding Under the Hood

Ruby Taught Me About Encoding Under the Hood
Mari Imaizumi (@ima1zumi)
RubyKaigi 2025, 2025-04-16

About me
Mari Imaizumi (@ima1zumi) 🍊
Originally from Matsuyama
Working at STORES, Inc.
A member of the IRB team
Ruby committer

I'm back on this stage after 20 years

Nursery Sponsor
Speakers:
LT:
Sponsor booth: Let's try IRB Treasure Hunt 💎

About me
Working at STORES, Inc.
A member of the IRB team 💎
Ruby committer 🆕
Character encoding enthusiast 🖋

Character encoding is still interesting

Agenda
History of Character Encodings
Fell down the rabbit hole of character encodings
Encounter with EBCDIC
The pitfalls of character counting
Upgrading Ruby to Unicode 15.1.0
Future works

History of Character Encodings

Character Encoding
Character set: A, B, C, ..., Z, a, b, c, ..., z, 0, 1, 2, ..., 9, SP, !, ", #, ...
Character encoding scheme: A -> 0x41, B -> 0x42, a -> 0x61, 0 -> 0x30

Why do we need character encodings?

Non-electric long-distance communication methods

Smoke signals

Optical telegraph (semaphore)
Image: Public Domain, https://commons.wikimedia.org/wiki/File:Chappe_semaphore.jpg
Image: Photo by Patrick87 / CC BY-SA 3.0, https://commons.wikimedia.org/wiki/File:Chappe.svg

Morse code
Late 18th to 19th century: practical application of electricity
1837–1844: invention and practical use of Morse code
Image: Public Domain, https://commons.wikimedia.org/w/index.php?curid=3902977

ASCII, EBCDIC
1963: ASCII (American Standard Code for Information Interchange). 7-bit encoding, established by ASA (now known as ANSI).
1964: EBCDIC (Extended Binary Coded Decimal Interchange Code). 8-bit encoding, established by IBM.

ASCII

EBCDIC

Character  ASCII (Hex)  EBCDIC (Hex)
A          \x41         \xC1
B          \x42         \xC2
a          \x61         \x81
b          \x62         \x82
Space      \x20         \x40
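The comparison above can be reproduced from Ruby. This is a minimal sketch, assuming IBM037 as a representative EBCDIC code page and assuming the Ruby build ships the IBM037 transcoder; the helper name `ascii_and_ebcdic` is made up for this illustration.

```ruby
# Return the byte a character is assigned in ASCII and in EBCDIC.
# Assumption: IBM037 stands in for "EBCDIC"; other EBCDIC code pages differ.
def ascii_and_ebcdic(char)
  ascii  = char.encode("US-ASCII").bytes.first
  ebcdic = char.encode("IBM037").bytes.first
  [ascii, ebcdic]
end

["A", "B", "a", "b", " "].each do |char|
  ascii, ebcdic = ascii_and_ebcdic(char)
  # Print each byte in the same \xNN notation the table uses.
  puts format("%-5s ASCII=\\x%02X EBCDIC=\\x%02X", char.inspect, ascii, ebcdic)
end
```

The same character maps to different bytes, so a byte sequence only has meaning once you know which encoding produced it.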

A Maze of Character Encodings
1967: ISO/IEC 646
1969: JIS X 0201
1978: JIS X 0208
1984: EUC-JP
1986: ISO/IEC 2022
1987: ISO/IEC 8859-1 (Latin-1)

Unify to Unicode®
1991: Unicode 1.0

The Unicode Standard
A universal character encoding standard
Developed by the Unicode Consortium

Unicode Design Goals
Universal: use a single character set for all scripts worldwide, e.g., Latin, Chinese, Hiragana, Katakana, Greek, Cyrillic, Arabic, Hangul, Devanagari, Tamil, etc.
Efficient
Unambiguous

Unicode Code Points
Code points (U+xxxx) represent abstract characters
Range: U+0000 to U+10FFFF
Each code point uniquely encodes one abstract character
a U+0061, 愛 U+611B, Å U+00C5, 🍊 U+1F34A

UTF-8
Unicode defines a universal set of characters with unique code points. UTF-8 transforms these code points into a variable-length sequence of 1–4 bytes. In short, Unicode is the “what,” and UTF-8 is the “how.”

Abstract character  Name                  Code point  UTF-8 byte sequence
a                   LATIN SMALL LETTER A  U+0061      \x61
あ                  HIRAGANA LETTER A     U+3042      \xE3\x81\x82
🍊                  TANGERINE             U+1F34A     \xF0\x9F\x8D\x8A
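The "what" and the "how" can be inspected separately in Ruby. A minimal sketch; `describe` is a hypothetical helper name, not part of Ruby.

```ruby
# Return the code point (as U+XXXX) and the UTF-8 bytes (as \xNN) of a
# single-character string.
def describe(char)
  code_point = format("U+%04X", char.ord)                       # the "what"
  utf8_bytes = char.encode("UTF-8").bytes                       # the "how"
                   .map { |byte| format("\\x%02X", byte) }.join
  [code_point, utf8_bytes]
end

# a, HIRAGANA LETTER A, TANGERINE
["a", "\u3042", "\u{1F34A}"].each do |char|
  code_point, utf8_bytes = describe(char)
  puts "#{char}  #{code_point}  #{utf8_bytes}"
end
```

`String#ord` works on the abstract code point, while `String#bytes` exposes the encoded form, which is why the byte count grows from 1 to 4 across the three characters.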

Unicode Specifications
Unicode Standard
The Unicode Character Database (UCD)
Unicode Code Charts
Unicode Standard Annexes (UAX)

Unify to Unicode
1991: Unicode 1.0
1993: UTF-8 was presented
2008: UTF-8 becomes the most common encoding
2010: Unicode 6.0 released (Emoji added)
2024: Unicode 16.0.0 released


Character Encodings and Me
I started programming around 2016, when the world was already using Unicode.
The major issues with character encoding proliferation and incompatibility primarily surfaced in the 1990s.
So why did I become a “character encoding enthusiast” in a Unicode era?

Recap
From pre-electric signals like smoke and semaphore, we moved to ASCII and other encodings.
Unicode became the universal solution.
Even in the Unicode era, deep knowledge of character encodings remains essential.

Fell down the rabbit hole of character encodings

How I Met Character Encodings
2016: My first assignment was on a mainframe
COBOL, Assembler, JCL, z/OS

Our Development Environment
Mainframe: z/OS, EBCDIC
TSO, ISPF EDIT

EBCDIC and Japanese
EBCDIC uses 8 bits, so only 256 characters
It’s impossible to fully represent Japanese:
Hiragana: about 50 characters
Katakana: about 50 characters
Joyo kanji (commonly used kanji): 2,136 characters

EBCDIC Katakana extension
CP290 EBCDIC

Halfwidth Katakana
It's hard to read 🥺

EBCDIC with Kanji
Even with only 8 bits, there was still a need to input kanji
Use the Shift-In (SI) and Shift-Out (SO) control characters

[Figure: example bytes, with the Shift-In and Shift-Out bytes marked] 😵‍💫

Multiple Character Sets: Complexity
Outside alphabets & halfwidth kana, everything was cumbersome in our environment
Constantly checked hex bytes to avoid overwriting SI/SO control chars
Realized that correct character input isn’t guaranteed

Recap
EBCDIC’s limited code space for Japanese required halfwidth kana and SI/SO switching.
Accidental overwriting of control characters caused data corruption. 😢
Showing the characters you typed isn't easy.

A Few Years Later…

Reuniting with Character Encodings
Learned Ruby & Ruby on Rails at Fjord Boot Camp
@igaiga showed me a family emoji 👨‍👩‍👧‍👦 that crashed IRB
Reported the issue
@aycabta (Reline’s author) said, “Fix it yourself!”
So I did

What kind of bug was it?
👨‍👩‍👧‍👦 + Backspace + Backspace => IRB crash
Why did this happen?

Family emoji 👨‍👩‍👧‍👦
"👨‍👩‍👧‍👦".chars.size # => 7
"👨‍👩‍👧‍👦".chars # => ["👨", "‍", "👩", "‍", "👧", "‍", "👦"]
"👨‍👩‍👧‍👦".chars.map { it.ord.to_s(16) } # => ["1f468", "200d", "1f469", "200d", "1f467", "200d", "1f466"]

Paste 👨‍👩‍👧‍👦, then Backspace, Backspace:
/usr/local/lib/ruby/2.7.0/reline/line_editor.rb:1568:in `-': nil can't be coerced into Integer (TypeError)

[Figure: terminal with columns numbered 1 to 12. After pasting 👨‍👩‍👧‍👦, the prompt shows "> 👨‍👩‍👧‍👦 █".]

[Figure: the same terminal with the pasted emoji expanded into its code points: "> 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦 █". Note: ZWJ is Zero Width Joiner (U+200D).]

[Figure: the same terminal after Backspace: "> █".]

Code Points vs Visible Characters
a: 1 code point
👨‍👩‍👧‍👦: 7 code points
Still one character visually
How to handle? Grapheme Cluster

Grapheme cluster: a
  Abstract characters: a
  Name: LATIN SMALL LETTER A
  Code points: U+0061
  UTF-8 byte sequence: \x61
Grapheme cluster: あ
  Abstract characters: あ
  Name: HIRAGANA LETTER A
  Code points: U+3042
  UTF-8 byte sequence: \xE3\x81\x82
Grapheme cluster: 🍊
  Abstract characters: 🍊
  Name: TANGERINE
  Code points: U+1F34A
  UTF-8 byte sequence: \xF0\x9F\x8D\x8A
Grapheme cluster: 👨‍👩‍👧‍👦
  Abstract characters: "👨", "\u200D", "👩", "\u200D", "👧", "\u200D", "👦"
  Name: nil
  Code points: U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466
  UTF-8 byte sequence: \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA7\xE2\x80\x8D\xF0\x9F\x91\xA6

Grapheme Cluster
Multiple code points seen as one character
Defined by Unicode UAX #29: Unicode Text Segmentation
Ensures user-expected cursor movement & deletion
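The difference between deleting by code point and deleting by grapheme cluster can be sketched in a few lines of Ruby. The two `backspace_*` helpers are hypothetical names for this illustration and are not Reline's actual implementation.

```ruby
# The family emoji, written with escapes so that each ZWJ (U+200D) is visible.
FAMILY = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

# Hypothetical helper: remove the last code point (what a naive editor does).
def backspace_by_code_point(text)
  text.chars[0...-1].join
end

# Hypothetical helper: remove the last grapheme cluster (what users expect).
def backspace_by_grapheme(text)
  text.grapheme_clusters[0...-1].join
end

line = "a" + FAMILY
p line.chars.size                            # => 8
p line.grapheme_clusters.size                # => 2
p backspace_by_code_point(line).chars.size   # => 7 (a broken emoji sequence is left behind)
p backspace_by_grapheme(line)                # => "a"
```

An editor that tracks the cursor in grapheme clusters but deletes in code points (or the reverse) ends up with inconsistent positions, which is the kind of mismatch the fix had to remove.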

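Combining characters vs precomposed characters
Å: combining sequence U+0041 U+030A; precomposed character U+00C5
び: combining sequence U+3072 U+3099; precomposed character U+3073
षि: combining sequence U+0937 U+093F; precomposed character: nil
👨‍👩‍👧‍👦: sequence U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466; precomposed character: nil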
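Whether a precomposed form exists can be observed with NFC normalization. A small sketch: the labels are informal descriptions chosen for this example, and the sequences are the ones listed above.

```ruby
# Combining sequences from the list above, keyed by an informal label.
SEQUENCES = {
  "A + combining ring above"       => "\u0041\u030A",
  "hiragana hi + combining dakuten" => "\u3072\u3099",
  "devanagari ssa + vowel sign i"  => "\u0937\u093F",
}

SEQUENCES.each do |label, sequence|
  composed = sequence.unicode_normalize(:nfc)
  puts "#{label}: #{sequence.chars.size} code points, " \
       "#{composed.chars.size} after NFC, " \
       "#{sequence.grapheme_clusters.size} grapheme cluster(s)"
end
```

NFC can only shrink a sequence when Unicode defines a precomposed character for it, yet every sequence here is a single grapheme cluster either way.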

Merged into Reline (Ruby 3.0)
Explored Grapheme Clusters & Unicode depth
Realized text isn’t just simple code points
Found encoding to be fascinating, not just tricky

Recap
Family emoji caused IRB to crash because multiple code points formed a single visible character.
We added Grapheme Cluster handling in Reline to respect Unicode text segmentation.
This fix ensures cursor movement and deletion align with user expectations, revealing the complexity of multi-codepoint characters.

Unicode 15.1.0

A Few Years Later…

Ruby Hackathon at RubyWorld Conference 2024
Noticed stalled Unicode updates in Ruby
Commented on Redmine and took action

Upgrading Ruby to Unicode 15.1.0
Why We Needed an Update
New rule: Indic_Conjunct_Break for Devanagari, e.g. शक्ति (śakti)
Without the update, combined characters aren’t recognized as one
Staying current improves international text handling

Ruby and Unicode
[Diagram labels: Ruby (name2ctype.h, casefold.h); Unicode Character Database (UnicodeData.txt, DerivedCoreProperties.txt, etc.)]

Ruby Unicode Upgrades
New characters added, e.g. Unicode 15.1.0 added 627 characters
Properties added or updated: InCB, Age, etc.
Aligned with Unicode specs

Unicode Character Database
Defines characters and properties in text files
Lists code points, categories, etc.
Machine-readable data
Ruby references UnicodeData.txt, etc.

UnicodeData.txt
0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
...
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
0044;LATIN CAPITAL LETTER D;Lu;0;L;;;;;N;;;;0064;
0045;LATIN CAPITAL LETTER E;Lu;0;L;;;;;N;;;;0065;
0046;LATIN CAPITAL LETTER F;Lu;0;L;;;;;N;;;;0066;
...
304C;HIRAGANA LETTER GA;Lo;0;L;304B 3099;;;;N;;;;;
...
094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
...
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
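To make the record layout concrete, here is a minimal sketch of reading one UnicodeData.txt line. The struct and method names are hypothetical; the field positions follow the semicolon-separated layout shown above (field 0 code point, 1 name, 2 general category, 3 combining class, 5 decomposition, 13 simple lowercase mapping). This is not how Ruby's own tooling parses the file.

```ruby
# Hypothetical record type holding a few of the 15 fields.
UcdRecord = Struct.new(:code_point, :name, :general_category,
                       :combining_class, :decomposition, :simple_lowercase)

def parse_unicode_data_line(line)
  fields = line.chomp.split(";", -1)   # -1 keeps the trailing empty fields
  # Drop compatibility tags such as <compat>; keep only the code points.
  decomposition = fields[5].split.reject { |token| token.start_with?("<") }
  UcdRecord.new(
    fields[0].hex,
    fields[1],
    fields[2],
    fields[3].to_i,
    decomposition.map(&:hex),
    fields[13].empty? ? nil : fields[13].hex
  )
end

record = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
puts "#{record.name}: category #{record.general_category}, " \
     "lowercase U+#{format('%04X', record.simple_lowercase)}"
```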

DerivedCoreProperties.txt
# DerivedCoreProperties-16.0.0.txt
# Date: 2024-05-31, 18:09:32 GMT
# © 2024 Unicode®, Inc.
# ================================================
# Derived Property: Math
#  Generated from: Sm + Other_Math
002B          ; Math # Sm       PLUS SIGN
003C..003E    ; Math # Sm   [3] LESS-THAN SIGN..GREATER-THAN SIGN
# ================================================
# Derived Property: Lowercase
#  Generated from: Ll + Other_Lowercase
0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
# ================================================
# Derived Property: Uppercase
#  Generated from: Lu + Other_Uppercase
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z

Onigmo
A regex engine used by Ruby
Supports Unicode property matches (\p{PropertyName}) and grapheme cluster matches with \X
String#grapheme_clusters also calls \X

Property
"abc".match?(/\p{ASCII}/) # => true
"あいう".match?(/\p{ASCII}/) # => false
"🏯".match?(/\p{Emoji}/) # => true
"1".match?(/\p{Emoji}/) # => true
https://docs.ruby-lang.org/en/3.4/regexp/unicode_properties_rdoc.html
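The \X side of Onigmo can be tried the same way as the property matches. A minimal sketch; `clusters_via_regex` is a hypothetical helper name.

```ruby
# Split text into grapheme clusters by scanning with \X.
def clusters_via_regex(text)
  text.scan(/\X/)
end

# a + COMBINING DIAERESIS, the family emoji (with ZWJ), then CR LF.
sample = "a\u0308" \
         "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}" \
         "\r\n"

p sample.chars.size                                      # => 11
p clusters_via_regex(sample).size                        # => 3
p clusters_via_regex(sample) == sample.grapheme_clusters # => true
```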

Ruby Unicode Upgrades
[Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h; casefold.h; auto-generated test]

Ruby's Unicode Update Process
Increase the specified Unicode version in the build process
Run scripts to auto-generate tables
Add new tests (some manually)

Issues with the Unicode 15.1.0 Update
[Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h; casefold.h; auto-generated test]

Issues with the Unicode 15.1.0 Update
Failed to parse the UCD
/* 'InCB': Derived Property */
#endif /* USE_UNICODE_PROPERTIES */
tool/enc-unicode.rb:52:in 'Object#pair_codepoints': undefined method 'sort!' for nil (NoMethodError)
codepoints.sort!
          ^^^^^^
  from tool/enc-unicode.rb:282:in 'Object#make_const'
  from tool/enc-unicode.rb:441:in 'block (2 levels) in <main>'
  from tool/enc-unicode.rb:434:in 'Array#each'
  from tool/enc-unicode.rb:434:in 'block in <main>'

# DerivedCoreProperties-15.1.0.txt
(snip)
# Derived Property: Indic_Conjunct_Break
#  Generated from the Grapheme_Cluster_Break, Indic_Syllabic_Category,
#  Canonical_Combining_Class, and Script properties as described in UAX #44:
# ================================================
# Indic_Conjunct_Break=Linker
094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
(snip)
# Total code points: 6
# ================================================
# Indic_Conjunct_Break=Consonant
0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
0958..095F    ; InCB; Consonant # Lo   [8] DEVANAGARI LETTER QA..DEVANAGARI LETTER YYA
(snip)
# Total code points: 240
# ================================================
# Indic_Conjunct_Break=Extend
0300..036F    ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
(snip)
# Total code points: 2192

Indic Conjunct Break (InCB)
Preserves consonant + linker + consonant as one unit, e.g. क + ् + त = क्त
Prevents splitting of Indic ligatures (e.g., Devanagari)
Crucial for correct grapheme cluster handling

DerivedCoreProperties.txt
# DerivedCoreProperties-15.1.0.txt
# ================================================
# Derived Property: Math
002B          ; Math # Sm       PLUS SIGN
# ================================================
# Indic_Conjunct_Break=Linker
094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
# ================================================
# Indic_Conjunct_Break=Consonant
0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
# ================================================
# Indic_Conjunct_Break=Extend
0300..036F    ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X

Fixed Parsing Logic
Updated parsing to correctly handle properties like InCB; Consonant
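The format difference is that older entries have two fields (range; property) while InCB entries have three (range; property; value). The following is only a sketch of a parser that accepts both shapes. It is not the actual change made to tool/enc-unicode.rb, and both the method name and the "Property=Value" key format are assumptions for this illustration.

```ruby
# Parse one line of DerivedCoreProperties.txt.
# Returns [code_point_range, property_key], or nil for comments and blank lines.
def parse_derived_property_line(line)
  data = line.sub(/#.*/, "").strip          # drop the trailing comment
  return nil if data.empty?

  range_text, property, value = data.split(";").map(&:strip)
  first, last = range_text.split("..").map(&:hex)
  # Two-field lines yield "Math"; three-field lines yield "InCB=Consonant".
  key = value ? "#{property}=#{value}" : property
  [(first..(last || first)), key]
end

p parse_derived_property_line("002B          ; Math # Sm       PLUS SIGN")
p parse_derived_property_line("0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA")
```

A parser that assumes exactly two fields either drops the value or fails to find the property, which matches the kind of nil error shown in the stack trace above.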

Successfully generated name2ctype.h

Test failed...
[Diagram repeated on both slides, with labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h; casefold.h; auto-generated test]

# GraphemeBreakTest-15.1.0.txt
# Format:
# <string> (# <comment>)?
#  <string> contains hex Unicode code points, with
#   ÷ wherever there is a break opportunity, and
#   × wherever there is not.
#
# These samples may be extended or changed in the future.
#
÷ 0020 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) ÷ [999.0] SPACE (Other) ÷ [0.3]
÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS (Extend_ExtCccZwj) ÷ [999.0] SPACE (Other) ÷ [0.3]
÷ 0020 ÷ 000D ÷ # ÷ [0.2] SPACE (Other) ÷ [5.0] <CARRIAGE RETURN (CR)> (CR) ÷ [0.3]
÷ 0915 × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) × [9.3] DEVANAGARI LETTER TA (ConjunctLinkingScripts_LinkingConsonant) ÷ [0.3]
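A test line in this format can be turned into the expected clusters with a few string operations. A minimal sketch, assuming only the format described in the file header; `expected_clusters` is a hypothetical name and this is not the code in Ruby's test suite.

```ruby
# Convert one GraphemeBreakTest.txt line into the expected grapheme clusters.
def expected_clusters(line)
  data = line.sub(/#.*/, "").strip                 # drop the trailing comment
  data.split("÷").map(&:strip).reject(&:empty?).map do |cluster|
    # Inside a cluster, code points are joined by × (no break allowed).
    cluster.split("×").map { |hex| hex.strip.hex }.pack("U*")
  end
end

line = "÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS"
expected = expected_clusters(line)
actual   = expected.join.grapheme_clusters

p expected.map { |cluster| cluster.codepoints.map { |cp| format("%04X", cp) } }
p expected == actual   # => true
```

Comparing `expected` with `String#grapheme_clusters` of the joined text is the shape of check that fails when the engine's break rules lag behind the data files.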

make check
# Running tests:
2) Failure:
TestGraphemeBreaksFromFile#test_each_grapheme_cluster [/Users/mi/ghq/github.com/ruby/ruby/test/ruby/enc/test_grapheme_breaks.rb:67]:
line 1202, expected '["क्त"]', but got '["क्", "त"]', comment: (snip)
<["क्त"]> expected but was
<["क्", "त"]>.

https://unicode.org/reports/tr29/
Note: Ruby supports extended grapheme clusters

Example: GB11
"👨👩👧👦".match?(/\A\p{Extended_Pictographic}+\z/) # => true
[Figure: break marks (× and ÷) between the code points 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦]

GB9c
\p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* × \p{InCB=Consonant}

GB9c: क्त (U+0915 U+094D U+0924) = क + ् + त
U+0915 DEVANAGARI LETTER KA: InCB=Consonant
U+094D DEVANAGARI SIGN VIRAMA: InCB=Linker
U+0924 DEVANAGARI LETTER TA: InCB=Consonant

Matching क्त against GB9c:
\p{InCB=Consonant}: क
[ \p{InCB=Extend} \p{InCB=Linker} ]*: nil
\p{InCB=Linker}: ्
[ \p{InCB=Extend} \p{InCB=Linker} ]*: nil
× \p{InCB=Consonant}: त
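The rule can be tried out without \p{InCB=...} support by spelling out a few ranges by hand. This is a sketch only: the character classes below cover just the ranges quoted in the DerivedCoreProperties excerpt (a small part of the real property), and all names are hypothetical.

```ruby
# Partial stand-ins for the InCB property values (Devanagari excerpt only).
CONSONANT        = "[\u0915-\u0939\u0958-\u095F]"   # InCB=Consonant (partial)
LINKER           = "\u094D"                         # InCB=Linker (virama only)
EXTEND_OR_LINKER = "[\u0300-\u036F\u094D]"          # InCB=Extend (partial) or Linker

# Left side of GB9c: Consonant [Extend Linker]* Linker [Extend Linker]*
LEFT_CONTEXT  = /#{CONSONANT}#{EXTEND_OR_LINKER}*#{LINKER}#{EXTEND_OR_LINKER}*\z/
# Right side of GB9c: Consonant
RIGHT_CONTEXT = /\A#{CONSONANT}/

# True when GB9c forbids a break just before the code point at `index`.
def gb9c_no_break?(text, index)
  text[0...index].match?(LEFT_CONTEXT) && text[index..].match?(RIGHT_CONTEXT)
end

kta = "\u0915\u094D\u0924"   # KA + VIRAMA + TA
p gb9c_no_break?(kta, 1)     # => false (GB9c does not apply between KA and VIRAMA)
p gb9c_no_break?(kta, 2)     # => true  (no break between VIRAMA and TA)
```

The boundary between KA and VIRAMA is already covered by the older rule that forbids breaks before Extend characters, so GB9c only needs to add the second boundary.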

node_extended_grapheme_cluster
Builds the internal node structure for \X
Implements complex Unicode Grapheme Break rules
Creates ALT/SEQ/CCLASS nodes for CR, LF, Control, etc.
Hard-coded logic that must stay synced with Unicode updates

static int
node_extended_grapheme_cluster(Node** np, ScanEnv* env)
{
  ...
  /* xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})* */
  {
    Node **XP_list = core_alts + 5; /* size: 3 */
    R_ERR(create_property_node(XP_list+0, env, "Extended_Pictographic"));

    /* (Extend* ZWJ \p{Extended_Pictographic})* */
    {
      Node **Ex_list = XP_list + 2; /* size: 4 */
      R_ERR(quantify_property_node(Ex_list+0, env, "Grapheme_Cluster_Break=Extend", '*'));

      /* ZWJ (ZERO WIDTH JOINER) */
      r = ONIGENC_CODE_TO_MBC(env->enc, 0x200D, buf);
      if (r < 0) goto err;
      Ex_list[1] = node_new_str_raw(buf, buf + r);
      if (IS_NULL(Ex_list[1])) goto err;

      R_ERR(create_property_node(Ex_list+2, env, "Extended_Pictographic"));

      R_ERR(create_node_from_array(LIST, XP_list+1, Ex_list));
    }
    R_ERR(quantify_node(XP_list+1, 0, REPEAT_INFINITE)); /* TODO: Check about node freeing */

    R_ERR(create_node_from_array(LIST, core_alts+4, XP_list));
  }

Create nodes in Onigmo to align with the regex engine

/* conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ */
{
  // \p{InCB=Consonant}
  Node **CC_list = core_alts + 6; /* size: 3 */
  R_ERR(create_property_node(CC_list+0, env, "InCB=Consonant"));
  {
    Node **CC_inner_list = CC_list + 2; /* size: 5 */
    {
      // [\p{InCB=Extend} \p{InCB=Linker}]*
      R_ERR(create_property_node(CC_inner_list+0, env, "InCB=Extend"));
      R_ERR(add_property_to_cc(NCCLASS(CC_inner_list[0]), "InCB=Linker", 0, env));
      R_ERR(quantify_node(CC_inner_list+0, 0, REPEAT_INFINITE));
    }
    R_ERR(quantify_node(CC_list+1, 1, REPEAT_INFINITE));
  }
  (snip)

Grapheme Clusters Implementation
Create nodes for \p{InCB=Consonant} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant}
All tests passed!

Unicode 15.1.0
Devanagari consonant clusters no longer split
# Before
"क्त".grapheme_clusters # => ["क्", "त"]
# After
"क्त".grapheme_clusters # => ["क्त"]
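To see which behavior a given Ruby has, the Unicode version it was built with can be read from RbConfig. A small sketch; `at_least?` is a hypothetical helper, and it assumes the \X rule update ships together with the Unicode data update, as described in this talk.

```ruby
require "rbconfig"

# Compare dotted version strings numerically, e.g. "15.0.0" vs "15.1.0".
def at_least?(version, minimum)
  (version.split(".").map(&:to_i) <=> minimum.split(".").map(&:to_i)) >= 0
end

unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
kta = "\u0915\u094D\u0924"   # KA + VIRAMA + TA

puts "Built with Unicode #{unicode_version}"
puts "Grapheme clusters in KA+VIRAMA+TA: #{kta.grapheme_clusters.size}"
puts "GB9c expected to apply: #{at_least?(unicode_version, '15.1.0')}"
```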

Merged!

Recap
Ruby now supports Unicode 15.1.0, adding Indic_Conjunct_Break (InCB) for Devanagari ligatures.
Onigmo’s grapheme cluster logic (\X) was updated with new break rules (GB9c).
Devanagari consonant clusters (e.g., क्त) no longer split.
In Ruby 3.5, Unicode 15.1.0 is available.

Future works: Unicode 16.0.0

Unicode 16.0.0
Ruby's Unicode 16.0.0 update is currently in progress (WIP)
Normalization tests are failing
[Diagram labels: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h; casefold.h; auto-generated test]

Unicode Normalization
Unicode normalization unifies strings that look identical but differ internally.
NFD/NFC use canonical equivalence (e.g., e + U+0301 COMBINING ACUTE ACCENT is equivalent to é).
NFKD/NFKC use compatibility equivalence (e.g., ① → 1).
Normalization reduces search mismatches and security risks.
Prevents garbled text across OS/file systems and boosts data compatibility.
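The four forms are available through String#unicode_normalize. A minimal sketch; `same_text?` is a hypothetical helper showing how normalization removes a search mismatch.

```ruby
composed   = "\u00E9"    # é as a single code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

# Hypothetical helper: compare two strings by their NFC forms.
def same_text?(left, right)
  left.unicode_normalize(:nfc) == right.unicode_normalize(:nfc)
end

p composed == decomposed                             # => false
p same_text?(composed, decomposed)                   # => true
p composed.unicode_normalize(:nfd) == decomposed     # => true
p "\u2460".unicode_normalize(:nfkc)                  # => "1" (CIRCLED DIGIT ONE)
p "\u2460".unicode_normalize(:nfc) == "\u2460"       # => true (canonical forms keep it)
p decomposed.unicode_normalized?(:nfc)               # => false
```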

[Figure: expected vs. actual NFD and NFC results for a test case involving U+1611E, U+1611F, U+16121, U+16123, and U+16126]

Implementation
Rewrote most of the normalization logic
Just for understanding
https://github.com/ruby/ruby/pull/13117
Referenced Rust's unicode-normalization project
https://github.com/unicode-rs/unicode-normalization

Future works
All tests pass, but performance may have regressed. I removed optimizations temporarily.
Plan to maintain performance while upgrading to Unicode 16.0.0.
Considering “Quick Check” for faster validation.

Acknowledgements

Thank you, STORES, Inc. team, for your reviews and scheduling support.
Special thanks to fujimura-san, @ko1, and @mame for repeatedly reviewing my work.
My husband Takuya’s support made this presentation possible.
I’m also grateful to everyone who reviewed my PRs and provided valuable advice.

RubyKaigi 2025 Has Begun!

https://x.com/spikeolaf/status/1909531905747484889

Ask the Speaker
Share Your Thoughts

👨‍👩‍👧‍👦

Value Curiosity
