Ruby Taught Me About Under the Hood

ima1zumi
April 16, 2025


Transcript

  1. About me Mari Imaizumi @ima1zumi 🍊 Originally from Matsuyama Working

    at STORES, Inc. A member of IRB team Ruby committer 2
  2. About me Mari Imaizumi @ima1zumi Originally from Matsuyama 🏯 Working

    at STORES, Inc. A member of IRB team Ruby committer 3
  3. About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at

    STORES, Inc. 💻 A member of IRB team Ruby committer 5
  4. About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at

    STORES, Inc. A member of IRB team 💎 Ruby committer 7
  5. About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at

    STORES, Inc. A member of IRB team Ruby committer 🆕 8
  6. 9

  7. About me Mari Imaizumi @ima1zumi Originally from Matsuyama Working at

    STORES, Inc. A member of IRB team Ruby committer Character encoding enthusiast 🖋 10
  8. Agenda History of Character Encodings Fell down the rabbit hole

    of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 12
  9. Character Encoding Character Set: A, B, C, ..., Z, a, b, c, ..., z, 0, 1, 2, ..., 9, SP, !, ", #, ...

    Character Encoding Scheme: A -> 0x41, B -> 0x42, a -> 0x61, 0 -> 0x30
  10. Morse code Late 18th to 19th century: Practical application of

    electricity 1837–1844: Invention and practical use of Morse code 20
  11. ASCII, EBCDIC 1963 ASCII American Standard Code for Information Interchange

    Established by ASA now known as ANSI 7-bit encoding 1964 EBCDIC Extended Binary Coded Decimal Interchange Code 8-bit encoding, Established by IBM 22
  12. Character / ASCII (Hex) / EBCDIC (Hex)

    A / \x41 / \xC1
    B / \x42 / \xC2
    a / \x61 / \x81
    b / \x62 / \x82
    Space / \x20 / \x40
  13. A Maze of Character Encodings 1967 ISO/IEC 646 1969 JIS

    X 0201 1978 JIS X 0208 1984 EUC-JP 1986 ISO/IEC 2022 1987 ISO/IEC 8859-1 (Latin-1) 26
  14. Unicode Design Goals Universal: use a single character set for all scripts worldwide,

    e.g., Latin, Chinese, Hiragana, Katakana, Greek, Cyrillic, Arabic, Hangul, Devanagari, Tamil, etc. Efficient. Unambiguous. a あ ׽ Ω क Д ن ઑ அ
  15. Unicode Code Points Code points (U+xxxx) to represent abstract characters

    U+0000-10FFFF Each code point uniquely encodes one abstract character: a U+0061, 愛 U+611B, Å U+00C5, 🍊 U+1F34A
  16. UTF-8 Unicode defines a universal set of characters

    with unique code points. UTF-8 transforms these code points into a variable-length sequence of 1–4 bytes. In short, Unicode is the “what,” and UTF-8 is the “how.”
  17. Abstract Character / Name / Code Point / UTF-8 byte sequence

    a / LATIN SMALL LETTER A / U+0061 / \x61
    あ / HIRAGANA LETTER A / U+3042 / \xE3\x81\x82
    🍊 / TANGERINE / U+1F34A / \xF0\x9F\x8D\x8A
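The code-point-to-bytes step in the table above can be sketched by hand. This is a minimal illustration of the UTF-8 bit layout, not how Ruby implements it internally; `utf8_bytes` and `hex` are hypothetical helper names, and the sketch does not reject surrogate code points.

```ruby
# Encode one Unicode code point into UTF-8 bytes by hand.
# `utf8_bytes` is a hypothetical helper, not part of Ruby.
def utf8_bytes(cp)
  case cp
  when 0x0000..0x007F     # 1 byte:  0xxxxxxx
    [cp]
  when 0x0080..0x07FF     # 2 bytes: 110xxxxx 10xxxxxx
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
  when 0x0800..0xFFFF     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  when 0x10000..0x10FFFF  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  else
    raise ArgumentError, "not a Unicode code point: #{cp}"
  end
end

# Format bytes the way the slide does, e.g. \xE3\x81\x82.
def hex(bytes)
  bytes.map { |b| format("\\x%02X", b) }.join
end

[0x0061, 0x3042, 0x1F34A].each do |cp|
  manual  = utf8_bytes(cp)
  builtin = [cp].pack("U").bytes # Ruby's own UTF-8 encoding of the code point
  puts format("U+%04X  %-18s same as built-in: %s", cp, hex(manual), manual == builtin)
end
```

Comparing against `Array#pack("U")` is a quick way to check the hand-written bit shifts against Ruby's own encoder.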
  18. Unicode Specifications Unicode Standard The Unicode Character Database

    (UCD) Unicode Code Charts Unicode Standard Annexes (UAX)
  19. Unify to Unicode 1991: Unicode 1.0 1993: UTF-8 was presented

    2008: UTF-8 becomes the most common encoding 2010: Unicode 6.0 released (Emoji added) 2024: Unicode 16.0.0 released 34
  20. Character Encodings and Me I started programming around 2016, when

    the world was already using Unicode. The major issues with character encoding proliferation and incompatibility primarily surfaced in the 1990s. 37
  21. Character Encodings and Me I started programming around 2016, when

    the world was already using Unicode. The major issues with character encoding proliferation and incompatibility primarily surfaced in the 1990s. So why did I become a “character encoding enthusiast” in a Unicode era? 38
  22. Recap From pre-electric signals like smoke and semaphore, we moved

    to ASCII and other encodings. Unicode became the universal solution. Even in the Unicode era, deep knowledge of character encodings remains essential. 39
  23. Agenda History of Character Encodings Fell down the rabbit hole

    of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 41
  24. How I Met Character Encodings 2016: My first assignment

    was on a mainframe COBOL, Assembler, JCL, z/OS
  25. EBCDIC and Japanese EBCDIC uses 8-bit Only 256 characters It’s

    impossible to fully represent Japanese: Hiragana: about 50 characters Katakana: about 50 characters Joyo kanji (commonly used kanji): 2,136 characters 44
  26. EBCDIC with Kanji Even with only 8 bits, there was

    still a need to input kanji Use Shift-In (SI) and Shift-Out (SO) Control Character 50
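To make the SO/SI mechanism concrete, here is a minimal sketch of a decoder that switches between single-byte and double-byte mode. It assumes SO is 0x0E and SI is 0x0F; `split_by_shift` is a hypothetical helper, and the double-byte values are placeholders rather than real kanji codes.

```ruby
SO = 0x0E # Shift-Out: switch to double-byte (kanji) mode
SI = 0x0F # Shift-In: switch back to single-byte mode

# Split a byte array into [mode, bytes] segments.
# `split_by_shift` is a hypothetical helper for illustration only.
def split_by_shift(bytes)
  segments = []
  mode = :single
  current = []
  bytes.each do |b|
    case b
    when SO, SI
      segments << [mode, current] unless current.empty?
      current = []
      mode = (b == SO ? :double : :single)
    else
      current << b
    end
  end
  segments << [mode, current] unless current.empty?
  segments
end

# EBCDIC "AB", one double-byte character (placeholder bytes), then "C".
stream = [0xC1, 0xC2, SO, 0x45, 0x66, SI, 0xC3]
p split_by_shift(stream)
# => [[:single, [193, 194]], [:double, [69, 102]], [:single, [195]]]

# Overwriting the SI byte leaves the decoder in double-byte mode,
# so the trailing single-byte "C" is misread as kanji data.
broken = stream.dup
broken[5] = 0x40
p split_by_shift(broken)
# => [[:single, [193, 194]], [:double, [69, 102, 64, 195]]]
```

The second call shows why an accidentally overwritten control byte corrupts everything after it: the decoder has no other way to know which mode it is in.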
  27. Multiple Character Sets: Complexity Outside alphabets & halfwidth kana, everything

    was cumbersome in our environment Constantly checked hex bytes to avoid overwriting SI/SO control chars Realized that correct character input isn’t guaranteed 53
  28. Recap EBCDIC’s limited code space for Japanese required halfwidth kana

    and SI/SO switching. Accidental overwriting of control characters caused data corruption. 😢 Showing the characters you typed isn't easy 54
  29. Agenda History of Character Encodings Fell down the rabbit hole

    of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 55
  30. Reuniting with Character Encodings Learned Ruby & Ruby on Rails

    at Fjord Boot Camp @igaiga showed me family emoji 🧑🧑🧒🧒 that crashed IRB 58
  31. Reuniting with Character Encodings Learned Ruby & Ruby on Rails

    at Fjord Boot Camp @igaiga showed me family emoji that crashed IRB Reported the issue 59
  32. Reuniting with Character Encodings Learned Ruby & Ruby on Rails

    at Fjord Boot Camp @igaiga showed me family emoji that crashed IRB Reported the issue @aycabta (Reline’s author) said, “Fix it yourself!” 60
  33. Reuniting with Character Encodings Learned Ruby & Ruby on Rails

    at Fjord Boot Camp @igaiga showed me family emoji that crashed IRB Reported the issue @aycabta (Reline’s author) said, “Fix it yourself!” So I did 61
  34. What kind of bug was it? 🧑🧑🧒🧒 + Backspace +

    Backspace => IRB crash Why did this happen? 62
  35. 63

  36. Family emoji 👨‍👩‍👧‍👦

    "👨‍👩‍👧‍👦".chars.size # => 7
    "👨‍👩‍👧‍👦".chars # => ["👨", "‍", "👩", "‍", "👧", "‍", "👦"]
    "👨‍👩‍👧‍👦".chars.map { it.ord.to_s(16) } # => ["1f468", "200d", "1f469", "200d", "1f467", "200d", "1f466"]
  37. 69 1 2 3 4 5 6 7 8 9

    10 11 12 > █ > 🧑🧑🧒🧒 █ Paste 🧑🧑🧒🧒
  38. 70 1 2 3 4 5 6 7 8 9

    10 11 12 > █ > 🧑🧑🧒🧒 █ > 👨 ZWJ 👩 ZWJ 👧 ZWJ 👦 █ Paste 🧑🧑🧒🧒 ※ ZWJ is Zero Width Joiner (U+200D)
  39. 71 1 2 3 4 5 6 7 8 9

    10 11 12 > █ > 🧑🧑🧒🧒 █ > █ Paste 🧑🧑🧒🧒 Backspace
  40. Code Points vs Visible Characters a: 1 code point. 👨‍👩‍👧‍👦: 7 code points.

    Still one character visually. How to handle? Grapheme Cluster
  41. Abstract Character / Name / Code Point / UTF-8 byte sequence

    a / LATIN SMALL LETTER A / U+0061 / \x61
    あ / HIRAGANA LETTER A / U+3042 / \xE3\x81\x82
    🍊 / TANGERINE / U+1F34A / \xF0\x9F\x8D\x8A
  42. Grapheme Cluster / Abstract Characters / Name / Code Points / UTF-8 byte sequence

    a / a / LATIN SMALL LETTER A / U+0061 / \x61
    あ / あ / HIRAGANA LETTER A / U+3042 / \xE3\x81\x82
    🍊 / 🍊 / TANGERINE / U+1F34A / \xF0\x9F\x8D\x8A
    👨‍👩‍👧‍👦 / "👨", "\u200D", "👩", "\u200D", "👧", "\u200D", "👦" / nil / U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466 / \xF0\x9F\x91\xA8\xE2\x80\x8D\xF0\x9F\x91\xA9\xE2\x80\x8D\xF0\x9F\x91\xA7\xE2\x80\x8D\xF0\x9F\x91\xA6
  43. Grapheme Cluster Multiple code points seen as one character

    Defined by Unicode UAX #29: Unicode Text Segmentation Ensures user-expected cursor movement & deletion
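A short sketch of what grapheme-cluster-aware deletion looks like in plain Ruby. `String#grapheme_clusters` is a real core method; `backspace` is a hypothetical helper and only illustrates the idea, not Reline's actual implementation.

```ruby
# Family emoji written with escapes so the ZWJ (U+200D) is visible in source.
family = "\u{1F468 200D 1F469 200D 1F467 200D 1F466}"

puts family.bytesize               # 25 UTF-8 bytes
puts family.chars.size             # 7 code points
puts family.grapheme_clusters.size # 1 user-perceived character

# Hypothetical helper: delete one user-perceived character from the end.
def backspace(line)
  line.grapheme_clusters[0...-1].join
end

line = "> " + family
p backspace(line)       # => "> "
p line.chop.chars.size  # => 8, chop removed only the final code point
```

Deleting by code point (`chop`) leaves a dangling ZWJ sequence behind, which is the kind of mismatch between the buffer and the screen that a line editor has to avoid.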
  44. Character / Precomposed Characters / Combining Characters

    Å / U+00C5 / U+0041 U+030A
    び / U+3073 / U+3072 U+3099
    षि / nil / U+0937 U+093F
    👨‍👩‍👧‍👦 / nil / U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
  45. 77

  46. Merged into Reline (Ruby 3.0) Explored Grapheme Clusters & Unicode

    depth Realized text isn’t just simple code points Found encoding to be fascinating, not just tricky 78
  47. Recap Family emoji caused IRB to crash because multiple code

    points formed a single visible character. We added Grapheme Cluster handling in Reline to respect Unicode text segmentation. This fix ensures cursor movement and deletion align with user expectations, revealing the complexity of multi-codepoint characters.
  48. Agenda History of Character Encodings Fell down the rabbit hole

    of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 80
  49. Ruby Hackathon at RubyWorld Conference 2024 Noticed stalled Unicode updates

    in Ruby Commented on Redmine and took action 83
  50. Upgrading Ruby to Unicode 15.1.0 Why We Needed an Update

    New rule: Indic_Conjunct_Break for Devanagari e.g. श क् ति (śakti) Without update, combined chars aren’t recognized as one Staying current improves international text handling 84
  51. Ruby Unicode Upgrades New characters added e.g. Unicode 15.1.0 added

    627 characters Properties added or updated InCB, Age, etc Aligned with Unicode specs 86
  52. Unicode Character Database Defines characters and properties in

    text files Lists code points, categories, etc Machine-readable data Ruby references UnicodeData.txt, etc.
  53. UnicodeData.txt

    0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
    ...
    0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
    0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
    0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
    0044;LATIN CAPITAL LETTER D;Lu;0;L;;;;;N;;;;0064;
    0045;LATIN CAPITAL LETTER E;Lu;0;L;;;;;N;;;;0065;
    0046;LATIN CAPITAL LETTER F;Lu;0;L;;;;;N;;;;0066;
    ...
    304C;HIRAGANA LETTER GA;Lo;0;L;304B 3099;;;;N;;;;;
    ...
    094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
    ...
    10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;
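Each UnicodeData.txt record is a semicolon-separated line, so it can be read with a few lines of Ruby. The sketch below is only an illustration of the file format, not the code in `tool/enc-unicode.rb`; `UcdEntry` and `parse_unicode_data` are hypothetical names, and decomposition tags such as `<compat>` are not handled.

```ruby
UCD_SAMPLE = <<~UCD
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  304C;HIRAGANA LETTER GA;Lo;0;L;304B 3099;;;;N;;;;;
  094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
UCD

# Hypothetical record type holding a subset of the 15 fields.
UcdEntry = Struct.new(:code_point, :name, :category, :combining_class,
                      :decomposition, :lowercase)

# Hypothetical parser; assumes untagged (canonical) decompositions only.
def parse_unicode_data(text)
  text.each_line(chomp: true).map do |line|
    f = line.split(";", -1) # -1 keeps trailing empty fields
    UcdEntry.new(
      f[0].hex,                       # code point
      f[1],                           # character name
      f[2],                           # General_Category
      f[3].to_i,                      # Canonical_Combining_Class
      f[5].split.map(&:hex),          # decomposition mapping
      f[13].empty? ? nil : f[13].hex  # simple lowercase mapping
    )
  end
end

parse_unicode_data(UCD_SAMPLE).each do |e|
  puts format("U+%04X %-24s %s ccc=%d", e.code_point, e.name, e.category, e.combining_class)
end
```

The HIRAGANA LETTER GA entry carries the decomposition `304B 3099`, which is the same precomposed/combining relationship that normalization relies on.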
  54. DerivedCoreProperties.txt

    # DerivedCoreProperties-16.0.0.txt
    # Date: 2024-05-31, 18:09:32 GMT
    # © 2024 Unicode®, Inc.
    # ================================================
    # Derived Property: Math
    # Generated from: Sm + Other_Math
    002B ; Math # Sm PLUS SIGN
    003C..003E ; Math # Sm [3] LESS-THAN SIGN..GREATER-THAN SIGN
    # ================================================
    # Derived Property: Lowercase
    # Generated from: Ll + Other_Lowercase
    0061..007A ; Lowercase # L& [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
    # ================================================
    # Derived Property: Uppercase
    # Generated from: Lu + Other_Uppercase
    0041..005A ; Uppercase # L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
  55. Onigmo A regex engine used by Ruby Supports Unicode property

    matches: \p{PropertyName} and grapheme cluster matches with \X String#grapheme_clusters also calls \X 90
  56. Property

    "abc".match?(/\p{ASCII}/) # => true
    "あいう".match?(/\p{ASCII}/) # => false
    "🏯".match?(/\p{Emoji}/) # => true
    "1".match?(/\p{Emoji}/) # => true
    https://docs.ruby-lang.org/en/3.4/regexp/unicode_properties_rdoc.html
  57. Ruby Unicode Upgrades

    [Diagram: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated tests]
  58. Ruby's Unicode Update Process Increase the specified Unicode

    version in the build process Run scripts to auto-generate tables Add new tests (some manually)
  59. 94

  60. 95

  61. 96

  62. Issues with the Unicode 15.1.0 Update

    [Diagram: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated tests]
  63. Issues with the Unicode 15.1.0 Update Failed to parse the

    UCD 98 /* 'InCB': Derived Property */
 #endif /* USE_UNICODE_PROPERTIES */
 tool/enc-unicode.rb:52:in 'Object#pair_codepoints': undefined method 'sort!' for nil (NoMethodError) codepoints.sort!
 ^^^^^^
 from tool/enc-unicode.rb:282:in 'Object#make_const'
 from tool/enc-unicode.rb:441:in 'block (2 levels) in <main>'
 from tool/enc-unicode.rb:434:in 'Array#each'
 from tool/enc-unicode.rb:434:in 'block in <main>'
  64. # DerivedCoreProperties-15.1.0.txt

    (snip)
    # Derived Property: Indic_Conjunct_Break
    # Generated from the Grapheme_Cluster_Break, Indic_Syllabic_Category,
    # Canonical_Combining_Class, and Script properties as described in UAX #44:
    # ================================================
    # Indic_Conjunct_Break=Linker
    094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA
    (snip)
    # Total code points: 6
    # ================================================
    # Indic_Conjunct_Break=Consonant
    0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
    0958..095F ; InCB; Consonant # Lo [8] DEVANAGARI LETTER QA..DEVANAGARI LETTER YYA
    (snip)
    # Total code points: 240
    # ================================================
    # Indic_Conjunct_Break=Extend
    0300..036F ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
    (snip)
    # Total code points: 2192
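The InCB entries differ from the older ones in shape: they have three `;`-separated fields (range, property, value) where entries such as Math have two. A parser that assumes the two-field form can trip over them. The following is a minimal sketch of a reader that accepts both forms; `parse_derived_properties` is a hypothetical name and this is not the fix that went into `tool/enc-unicode.rb`.

```ruby
DCP_SAMPLE = <<~TXT
  # Derived Property: Math
  002B          ; Math # Sm       PLUS SIGN
  003C..003E    ; Math # Sm   [3] LESS-THAN SIGN..GREATER-THAN SIGN
  # Indic_Conjunct_Break=Linker
  094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
  # Indic_Conjunct_Break=Consonant
  0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
TXT

# Hypothetical parser: returns { "Math" => [ranges], "InCB=Linker" => [ranges], ... }.
def parse_derived_properties(text)
  table = Hash.new { |h, k| h[k] = [] }
  text.each_line do |line|
    data = line.sub(/#.*/, "").strip # drop comments and blank lines
    next if data.empty?
    range, property, value = data.split(";").map(&:strip)
    first, last = range.split("..").map(&:hex)
    key = value ? "#{property}=#{value}" : property # third field is optional
    table[key] << (first..(last || first))
  end
  table
end

parse_derived_properties(DCP_SAMPLE).each do |name, ranges|
  puts "#{name}: #{ranges.map { |r| format('%04X..%04X', r.first, r.last) }.join(', ')}"
end
```

Keying the table as `InCB=Linker` mirrors the property names that appear in the Onigmo code later in the talk.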
  65. Indic Conjunct Break(InCB) Preserves consonant + linker + consonant as

    one unit e.g. क + ् + त = क्त Prevents splitting of Indic ligatures (e.g., Devanagari) Crucial for correct grapheme cluster handling 100
  66. DerivedCoreProperties.txt

    # DerivedCoreProperties-15.1.0.txt
    # ================================================
    # Derived Property: Math
    002B ; Math # Sm PLUS SIGN
    # ================================================
    # Indic_Conjunct_Break=Linker
    094D ; InCB; Linker # Mn DEVANAGARI SIGN VIRAMA
    # ================================================
    # Indic_Conjunct_Break=Consonant
    0915..0939 ; InCB; Consonant # Lo [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
    # ================================================
    # Indic_Conjunct_Break=Extend
    0300..036F ; InCB; Extend # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
  67. Test failed...

    [Diagram: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated tests]
  68. # GraphemeBreakTest-15.1.0.txt

    # Format:
    # <string> (# <comment>)?
    # <string> contains hex Unicode code points, with
    # ÷ wherever there is a break opportunity, and
    # × wherever there is not.
    #
    # These samples may be extended or changed in the future.
    #
    ÷ 0020 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) ÷ [999.0] SPACE (Other) ÷ [0.3]
    ÷ 0020 × 0308 ÷ 0020 ÷ # ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS (Extend_ExtCccZwj) ÷ [999.0] SPACE (Other) ÷ [0.3]
    ÷ 0020 ÷ 000D ÷ # ÷ [0.2] SPACE (Other) ÷ [5.0] <CARRIAGE RETURN (CR)> (CR) ÷ [0.3]
    ÷ 0915 × 094D × 0924 ÷ # ÷ [0.2] DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) × [9.3] DEVANAGARI LETTER TA (ConjunctLinkingScripts_LinkingConsonant) ÷ [0.3]
  69. make check

    # Running tests:
    2) Failure:
    TestGraphemeBreaksFromFile#test_each_grapheme_cluster [/Users/mi/ghq/github.com/ruby/ruby/test/ruby/enc/test_grapheme_breaks.rb:67]:
    line 1202, expected '["क्त"]', but got '["क्", "त"]', comment: (snip)
    <["क्त"]> expected but was <["क्", "त"]>.
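A GraphemeBreakTest.txt line can be turned into an expected list of clusters and compared with `String#grapheme_clusters`. This is a minimal sketch of that idea, not the code in `test_grapheme_breaks.rb`; `expected_clusters` is a hypothetical helper. The Devanagari result depends on which Unicode version the running Ruby was built with, so the sketch prints it rather than asserting it.

```ruby
# Hypothetical helper: "÷ 0020 × 0308 ÷ 0020 ÷" => ["\u0020\u0308", "\u0020"]
def expected_clusters(line)
  spec = line.sub(/#.*/, "").strip # drop the trailing comment
  spec.split("÷").map(&:strip).reject(&:empty?).map do |cluster|
    # "×" joins code points that must stay in the same cluster.
    cluster.split("×").map { |h| h.strip.hex }.pack("U*")
  end
end

samples = [
  "÷ 0020 × 0308 ÷ 0020 ÷\t#  ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS",
  "÷ 0915 × 094D × 0924 ÷\t#  ÷ [0.2] DEVANAGARI LETTER KA × [9.0] DEVANAGARI SIGN VIRAMA"
]

samples.each do |line|
  expected = expected_clusters(line)
  actual   = expected.join.grapheme_clusters
  status   = expected == actual ? "ok" : "MISMATCH"
  puts "#{status}: expected #{expected.inspect}, got #{actual.inspect}"
end
```

On a Ruby built against Unicode data older than 15.1.0, the second sample is the one expected to report a mismatch, which corresponds to the failure shown above.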
  70. GB9c क्त U+0915 U+094D U+0924 क ् त U+0915 DEVANAGARI

    LETTER KA InCB= Consonant U+094D DEVANAGARI SIGN VIRAMA InCB= Linker U+0924 DEVANAGARI LETTER TA InCB= Consonant + + 110
  71. node_extended_grapheme_cluster Builds the internal node structure for \X Implements complex

    Unicode Grapheme Break rules Creates ALT/SEQ/CCLASS nodes for CR, LF, Control, etc. Hard-coded logic that must stay synced with Unicode updates 115
  72. static int node_extended_grapheme_cluster(Node** np, ScanEnv* env) {

    ...
    /* xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})* */
    {
      Node **XP_list = core_alts + 5; /* size: 3 */
      R_ERR(create_property_node(XP_list+0, env, "Extended_Pictographic"));
      /* (Extend* ZWJ \p{Extended_Pictographic})* */
      {
        Node **Ex_list = XP_list + 2; /* size: 4 */
        R_ERR(quantify_property_node(Ex_list+0, env, "Grapheme_Cluster_Break=Extend", '*'));
        /* ZWJ (ZERO WIDTH JOINER) */
        r = ONIGENC_CODE_TO_MBC(env->enc, 0x200D, buf);
        if (r < 0) goto err;
        Ex_list[1] = node_new_str_raw(buf, buf + r);
        if (IS_NULL(Ex_list[1])) goto err;
        R_ERR(create_property_node(Ex_list+2, env, "Extended_Pictographic"));
        R_ERR(create_node_from_array(LIST, XP_list+1, Ex_list));
      }
      R_ERR(quantify_node(XP_list+1, 0, REPEAT_INFINITE)); /* TODO: Check about node freeing */
      R_ERR(create_node_from_array(LIST, core_alts+4, XP_list));
    }
  73. Create nodes in Onigmo to align with the regex engine

    /* conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+ */
    {
      // \p{InCB=Consonant}
      Node **CC_list = core_alts + 6; /* size: 3 */
      R_ERR(create_property_node(CC_list+0, env, "InCB=Consonant"));
      {
        Node **CC_inner_list = CC_list + 2; /* size: 5 */
        {
          // [\p{InCB=Extend} \p{InCB=Linker}]*
          R_ERR(create_property_node(CC_inner_list+0, env, "InCB=Extend"));
          R_ERR(add_property_to_cc(NCCLASS(CC_inner_list[0]), "InCB=Linker", 0, env));
          R_ERR(quantify_node(CC_inner_list+0, 0, REPEAT_INFINITE));
        }
        R_ERR(quantify_node(CC_list+1, 1, REPEAT_INFINITE));
      }
      (snip)
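The `conjunctCluster` grammar in the comment can be tried out as an ordinary Ruby regex. The sketch below hard-codes a small subset of the InCB classes (only the ranges quoted from DerivedCoreProperties.txt in this talk) instead of using real property lookups, so it illustrates the shape of GB9c and is not a replacement for `\X`. All constant names are hypothetical.

```ruby
# Subsets of the InCB classes, limited to ranges quoted in the talk.
CONSONANT     = /[\u0915-\u0939\u0958-\u095F]/ # part of InCB=Consonant
LINKER        = /\u094D/                       # DEVANAGARI SIGN VIRAMA (InCB=Linker)
EXT_OR_LINKER = /[\u0300-\u036F\u094D]/        # part of InCB=Extend, plus the linker

# conjunctCluster := Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
CONJUNCT_CLUSTER =
  /#{CONSONANT}(?:#{EXT_OR_LINKER}*#{LINKER}#{EXT_OR_LINKER}*#{CONSONANT})+/

kta = "\u0915\u094D\u0924" # KA + VIRAMA + TA (the conjunct kta)

p kta.scan(CONJUNCT_CLUSTER).size         # => 1, the whole conjunct is one unit
p kta[CONJUNCT_CLUSTER] == kta            # => true
p "\u0915\u0924".match?(CONJUNCT_CLUSTER) # => false, no linker between consonants
```

The requirement of at least one linker between consonants is what separates a conjunct from two consonants that just happen to be adjacent.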
  74. Unicode 15.1.0 Devanagari consonant clusters no longer split

    # Before
    "क्त".grapheme_clusters # => ["क्", "त"]
    # After
    "क्त".grapheme_clusters # => ["क्त"]
  75. Recap 121 Ruby now supports Unicode 15.1.0, adding Indic_Conjunct_Break (InCB)

    for Devanagari ligatures. Onigmo’s grapheme cluster logic (\X) was updated with new break rules (GB9c). Devanagari consonant clusters (e.g., क्त ) no longer split In Ruby 3.5, Unicode 15.1.0 is available.
  76. Agenda History of Character Encodings Fell down the rabbit hole

    of character encodings Encounter with EBCDIC The pitfalls of character counting Upgrading Ruby to Unicode 15.1.0 Future works 122
  77. Unicode 16.0.0 Ruby's Unicode 16.0.0 update is currently in progress (WIP). Normalization tests are failing.

    [Diagram: UnicodeData.txt, DerivedCoreProperties.txt, GraphemeBreakTest.txt, etc.; enc-unicode.rb; name2ctype.h, casefold.h; auto-generated tests]
  78. Unicode Normalization Unicode normalization unifies strings that look

    identical but differ internally. NFD/NFC use canonical equivalence (e.g., e + ◌́ → é). NFKD/NFKC use compatibility equivalence (e.g., ① → 1). Normalization reduces search mismatches and security risks. Prevents garbled text across OS/file systems and boosts data compatibility.
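The four normalization forms are available in Ruby through `String#unicode_normalize`. A short sketch of the two equivalence types described above, using escapes so the code points are explicit:

```ruby
composed   = "\u00E9"  # é as one precomposed code point
decomposed = "e\u0301" # e + COMBINING ACUTE ACCENT

puts composed == decomposed                          # false: different code points
puts composed.unicode_normalize(:nfd) == decomposed  # true:  NFD decomposes
puts decomposed.unicode_normalize(:nfc) == composed  # true:  NFC recomposes

circled_one = "\u2460" # ① CIRCLED DIGIT ONE
puts circled_one.unicode_normalize(:nfkc)                 # 1
puts circled_one.unicode_normalize(:nfc) == circled_one   # true: canonical forms keep ①

puts decomposed.unicode_normalized?(:nfc) # false: not yet in NFC
```

Comparing strings only after normalizing both sides to the same form is the usual way to avoid the search mismatches mentioned above.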
  79. 126 1611E 16123 1611E 1611E 1611F NFD NFC 16121 1611F

    16126 Expected 1611E 16123 1611E 1611E 1611F NFD 1611E 16123 Actual NFC
  80. Implementation Rewrote most of the normalization logic Just for understanding

    https://github.com/ruby/ruby/pull/13117 Referenced Rust's unicode-normalization project https://github.com/unicode-rs/unicode-normalization 127
  81. Future works All tests pass, but performance may have regressed.

    I removed optimizations temporarily. Plan to maintain performance while upgrading to Unicode 16.0.0. Considering “Quick Check” for faster validation. 128
  82. Acknowledgments Thank you, STORES, Inc. team, for your reviews and

    scheduling support. Special thanks to fujimura-san, @ko1, and @mame for repeatedly reviewing my work. My husband Takuya’s support made this presentation possible. I’m also grateful to everyone who reviewed my PRs and provided valuable advice. 130