Implementing UTF-8 in mruby/c, Continued
ima1zumi
August 19, 2023
Transcript
Implementing UTF-8 in mruby/c, Continued 2023-08-19 ima1zumi
Introduction • @ima1zumi (Mari Imaizumi) • Character encoding lover • IRB and Reline committer • ESM, Inc.
Sponsored by
What I talked about at RubyKaigi 2023 • I am implementing UTF-8 in mruby/c
What is mruby/c? • Low memory usage • < 40KB • Targets one-chip microcontrollers • Implemented in C and Ruby
[^3]: https://github.com/mrubyc/mrubyc
mruby/c !== mruby Ref: "mruby/cで始めるオリジナルIoTデバイス作り" (Building an original IoT device with mruby/c) https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html
String in mruby/c • Only what Ruby calls ASCII-8BIT is available • UTF-8 support is being implemented
Methods that need no changes • new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=> • 25 methods
Methods that need changes • Implemented: 12 • index, size(length), slice([]), slice!([]=), insert, inspect, ord • <<, Integer#chr, each_char, encoding, valid_encoding? • Todo: 3 • tr, tr!, split
Integer#chr • Treats self as a code point and returns the corresponding character • For UTF-8, self is treated as a Unicode scalar value • Scalar values exclude the high and low surrogates (0xD800 to 0xDFFF)
https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf
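The scalar value rule on this slide can be sketched as a small range check. This is only an illustration in plain Ruby; `unicode_scalar_value?` is a hypothetical helper name and is not taken from the mruby/c source.

```ruby
# Hypothetical helper (not mruby/c's actual code): returns true when the
# integer is a Unicode scalar value, i.e. within U+0000..U+10FFFF and
# outside the surrogate range U+D800..U+DFFF.
def unicode_scalar_value?(codepoint)
  return false unless codepoint.is_a?(Integer)
  return false if codepoint < 0 || codepoint > 0x10FFFF

  # Surrogate code points are code points, but not scalar values.
  !(0xD800..0xDFFF).cover?(codepoint)
end

puts unicode_scalar_value?(0x3042) # U+3042 is a scalar value
puts unicode_scalar_value?(0xD800) # a high surrogate is not
```

An Integer#chr implementation for UTF-8 would presumably reject any value for which such a check fails before encoding it.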
0b0011_0000_0100_0010 (U+3042) ↑scalar value
1110zzzz 10yyyyyy 10xxxxxx
11100011 10yyyyyy 10xxxxxx
11100011 10000001 10xxxxxx
11100011 10000001 10000010
0b00110000_01000010 (U+3042) 0b11100011_10000001_10000010
"あ".bytes.map { _1.to_s(2) } => ["11100011", "10000001", "10000010"]
None
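The bit packing walked through for U+3042 can be written out as a short Ruby sketch. `utf8_bytes` is a hypothetical name chosen for illustration; the actual mruby/c implementation is written in C and may be structured differently.

```ruby
# Hypothetical helper (not mruby/c's actual code): packs a Unicode scalar
# value into its UTF-8 byte sequence using shifts and masks.
def utf8_bytes(cp)
  if cp < 0 || cp > 0x10FFFF || (0xD800..0xDFFF).cover?(cp)
    raise RangeError, "not a Unicode scalar value: #{cp}"
  end

  if cp <= 0x7F
    [cp]                                   # 0xxxxxxx
  elsif cp <= 0x7FF
    [0b1100_0000 | (cp >> 6),              # 110yyyyy
     0b1000_0000 | (cp & 0x3F)]            # 10xxxxxx
  elsif cp <= 0xFFFF
    [0b1110_0000 | (cp >> 12),             # 1110zzzz
     0b1000_0000 | ((cp >> 6) & 0x3F),     # 10yyyyyy
     0b1000_0000 | (cp & 0x3F)]            # 10xxxxxx
  else
    [0b1111_0000 | (cp >> 18),             # 11110uuu
     0b1000_0000 | ((cp >> 12) & 0x3F),    # 10zzzzzz
     0b1000_0000 | ((cp >> 6) & 0x3F),     # 10yyyyyy
     0b1000_0000 | (cp & 0x3F)]            # 10xxxxxx
  end
end

p utf8_bytes(0x3042).map { |b| b.to_s(2) }
# => ["11100011", "10000001", "10000010"]
```

For U+3042 the three-byte branch applies: the top 4 bits (0011) fill zzzz, the next 6 bits (000001) fill yyyyyy, and the low 6 bits (000010) fill xxxxxx.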
String#<<
String#encoding • In mruby/c, the character encoding is switched by a build option • The method just checks the build option and returns the result
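String#valid_encoding? • ASCII-8BIT: anything is OK • UTF-8: must determine whether the bytes are well-formed • Because String's encoding is switched by a build option, checking every String for well-formed UTF-8 would mean binary data could not be stored in a String • That said, I do not want to allow invalid UTF-8 • Security risk • I do not want to enlarge the String struct • Compromise: make the check available through valid_encoding?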
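Well-formed UTF-8 byte sequences • mruby/c switches String's encoding with a build option • Verifying every String would make it impossible to create binary data • That said, I do not want to allow invalid UTF-8 • Security risk • I do not want to enlarge the String struct • Compromise: make the check available through valid_encoding?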
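A well-formedness check of the kind valid_encoding? needs can be sketched directly from the byte ranges in Table 3-7 of the Unicode Standard (chapter 3). This is a straightforward byte-by-byte version in Ruby for illustration only; `well_formed_utf8?` is a hypothetical name, and it is not the algorithm mruby/c uses.

```ruby
# Hypothetical helper (not mruby/c's actual code): checks an array of byte
# values against the well-formed UTF-8 byte sequences of Unicode Table 3-7.
def well_formed_utf8?(bytes)
  i = 0
  while i < bytes.size
    # For each lead byte: the allowed range of the second byte and the
    # number of continuation bytes. extra = -1 marks an invalid lead byte.
    second_range, extra =
      case bytes[i]
      when 0x00..0x7F             then [nil, 0]
      when 0xC2..0xDF             then [0x80..0xBF, 1]
      when 0xE0                   then [0xA0..0xBF, 2] # excludes overlong forms
      when 0xE1..0xEC, 0xEE..0xEF then [0x80..0xBF, 2]
      when 0xED                   then [0x80..0x9F, 2] # excludes surrogates
      when 0xF0                   then [0x90..0xBF, 3] # excludes overlong forms
      when 0xF1..0xF3             then [0x80..0xBF, 3]
      when 0xF4                   then [0x80..0x8F, 3] # caps at U+10FFFF
      else                             [nil, -1]
      end
    return false if extra < 0
    return false if i + extra >= bytes.size # sequence is truncated

    1.upto(extra) do |k|
      range = k == 1 ? second_range : (0x80..0xBF)
      return false unless range.cover?(bytes[i + k])
    end
    i += extra + 1
  end
  true
end

puts well_formed_utf8?([0xE3, 0x81, 0x82]) # bytes of U+3042
puts well_formed_utf8?([0xED, 0xA0, 0x80]) # encoded surrogate, ill-formed
```

Because the check runs only when asked, a String can still hold arbitrary binary data, which matches the compromise described on the slides.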
valid_encoding? • There are various UTF-8 validation algorithms • [Fast UTF-8 validation: an introduction to the range algorithm - てきとうなさいと。べぇたばん](https://tekitoh-memdhoi.info/views/872) • https://github.com/cyb70289/utf8 • [[2010.03090] Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090) • I want to understand them
upcase, downcase • Not supported • Unicode uppercase/lowercase conversion mapping tables would be required • "LJ".downcase == "lj" • There are probably few use cases on microcontrollers that need this
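To see why mapping tables are needed, here is how CRuby (2.4 or later) behaves. The use of U+01C7 (LATIN CAPITAL LETTER DZ-style digraph "LJ") is an assumption made for illustration, since the transcript shows only ASCII letters; the snippet runs on CRuby, not on mruby/c.

```ruby
# CRuby illustration only; mruby/c does not provide upcase/downcase for UTF-8.
digraph = "\u01C7" # LATIN CAPITAL LETTER LJ (assumed example, a single code point)

# Full Unicode case mapping needs a table entry: U+01C7 maps to U+01C9.
puts digraph.downcase == "\u01C9"        # => true

# ASCII-only mapping needs no table and leaves the letter untouched.
puts digraph.downcase(:ascii) == digraph # => true

# Plain ASCII letters can be mapped with simple arithmetic on A-Z.
puts "LJ".downcase(:ascii) == "lj"       # => true
```

An ASCII-only conversion fits in a few instructions, while the full mapping requires shipping table data, which is costly on a device with under 40KB of memory.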