Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
続・mruby/cにUTF-8 を実装する
Search
ima1zumi
August 19, 2023
1
27
続・mruby/cにUTF-8 を実装する
ima1zumi
August 19, 2023
Tweet
Share
More Decks by ima1zumi
See All by ima1zumi
Ruby Taught Me About Under the Hood
ima1zumi
3
7k
Exploring Reline: Enhancing Command Line Usability
ima1zumi
0
71
10年物のRailsアプリにキャッチアップ!〜コードを読まずに理解したかった〜
ima1zumi
0
79
RubyKaigiの登壇者一覧ページを作った
ima1zumi
0
340
Relineのその後の生活
ima1zumi
0
210
IRB and Reline Kaigi 2024
ima1zumi
0
11
Exploring Reline: Enhancing Command Line Usability
ima1zumi
3
14k
Reline 1分 Cooking
ima1zumi
0
35
UTF-8 is coming to mruby/c
ima1zumi
4
5.4k
Featured
See All Featured
Art, The Web, and Tiny UX
lynnandtonic
298
20k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
129
19k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Automating Front-end Workflow
addyosmani
1370
200k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
5
570
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
770
jQuery: Nuts, Bolts and Bling
dougneiner
63
7.7k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
Large-scale JavaScript Application Architecture
addyosmani
512
110k
VelocityConf: Rendering Performance Case Studies
addyosmani
329
24k
Building a Scalable Design System with Sketch
lauravandoore
462
33k
Measuring & Analyzing Core Web Vitals
bluesmoon
7
400
Transcript
ଓɾmruby/cʹUTF-8 Λ࣮͢Δ 2023-08-19 ima1zumi
Introduction • @ima1zumi (Mari Imaizumi) • Character encoding lover •
IRB and Reline committer • ESM, inc. 2
ఏڙ
RubyKaigi 2023Ͱͨ͜͠ͱ • mruby/cʹUTF-8Λ࣮ͯ͠·͢
mruby/cͱ • ϝϞϦ༻ྔ͕গͳ͍ • < 40KB • ϫϯνοϓϚΠίϯ͕λʔήοτ • CͱRubyͰ࣮͞Ε͍ͯΔ
5 [^3]: https://github.com/mrubyc/mrubyc
mruby/c !== mruby 6 Ref: mruby/cͰ࢝ΊΔΦϦδφϧIoTσόΠε࡞Γ https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html
mruby/cͷString • RubyͰ͍͏ͱ͜ΖͷASCII-8BIT͔͑͠ͳ͍ • UTF-8͑ΔΑ͏ʹ࣮த
ରԠ͠ͳ͍͍ͯ͘ϝιου • new, +, *, to_i, to_f, to_s, b, clear,
chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=> • 25ݸ 8
ରԠ͕ඞཁ • ࣮ࡁ: 12 • index, size(length), slice([]), slice!([]=), insert,
inspect, ord • <<, Integer#chr, each_char, encoding, valid_encoding? • Todo: 3 • tr, tr!, split 9
Integer#chr • selfΛίʔυϙΠϯτͱΈͳͯ͠จࣈΛฦ͢ • UTF-8ͷ߹Unicode scalar valueͱΈͳ͢ • Scalar value্Լαϩήʔτ(0xD800͔Β0xDFFF)ؚ·ͳ͍
https://www.unicode.org/versions/ Unicode15.0.0/ch03.pdf
None
0b0011_0000_0100_0010 (U+3042) ↑scalar value
0b0011_0000_0100_0010 (U+3042) 1110zzzz 10yyyyyy 10xxxxxx ↑scalar value
1110zzzz 10yyyyyy 10xxxxxx ↑scalar value 0b0011_0000_0100_0010 (U+3042)
11100011 10yyyyyy 10xxxxxx ↑scalar value 0b0011_0000_0100_0010 (U+3042)
0b0011_0000_0100_0010 (U+3042) 11100011 10yyyyyy 10xxxxxx ↑scalar value
0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10xxxxxx ↑scalar value
0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10xxxxxx ↑scalar value
0b0011_0000_0100_0010 (U+3042) 11100011 10000001 10000010 ↑scalar value
0b00110000_01000010 (U+3042) 0b11100011_10000001_10000010 "あ".bytes.map { _1.to_s(2) } => ["11100011", "10000001",
"10000010"]
None
String#<<
String#encoding • mruby/cͷจࣈίʔυΓସ͑ϏϧυΦϓγϣϯͰ੍ޚ • ϏϧυΦϓγϣϯΛݟͯฦ͚ͩ͢
String#valid_encoding? • ASCII-8BITͳΜͰOK • UTF-8well-formed͔Ͳ͏͔ͷఆ͕ඞཁ • ϏϧυΦϓγϣϯͰStringͷencodingΛΓସ͍͑ͯΔͨΊɺUTF-8ͱͯ͠well-formed͔Ͳ͏͔ Λͯ͢ͷStringͰ֬ೝ͢ΔͱόΠφϦ͕StringʹೖΒͳ͍ͱ͍͏͕͋Δ • ͔ͱ͍ͬͯෆਖ਼UTF-8ڐͨ͘͠ͳ͍
• ηΩϡϦςΟϦεΫ • StringͷߏମΛେ͖ͨ͘͘͠ͳ͍ • ંҊ: valid_encoding?ͰνΣοΫՄೳʹ͢Δ
Well-formed UTF-8 byte sequences • mruby/cϏϧυΦϓγϣϯͰStringͷencodingΛΓସ͑Δ • શStringͰਖ਼͍͔֬͠ೝ͢ΔͱόΠφϦ͕࡞ෆՄೳʹͳΔ • ͔ͱ͍ͬͯෆਖ਼UTF-8ڐͨ͘͠ͳ͍
• ηΩϡϦςΟϦεΫ • StringͷߏମΛେ͖ͨ͘͘͠ͳ͍ • ંҊ: valid_encoding?ͰνΣοΫՄೳʹ͢Δ
None
None
valid_encoding? • UTF-8ͷόϦσʔγϣϯΞϧΰϦζϜ͍Ζ͍Ζ • [ߴͳUTF-8όϦσʔγϣϯɺrangeΞϧΰϦζϜͷհ - ͖ͯͱ͏ ͳ͍͞ͱɻ͐ͨΜ](https://tekitoh-memdhoi.info/views/872) • https://github.com/cyb70289/utf8
• [[2010.03090] Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090) • Θ͔Γ͍ͨ
upcase, downcase • ରԠ͠ͳ͍ • UnicodeͷେจࣈɺখจࣈมϚοϐϯάςʔϒϧ͕ඞਢ • "LJ".downcase == "lj"
• ϚΠίϯͰͦ͜·Ͱ͍ͨ͠Ϣʔεέʔε͋·Γͳͦ͞͏