UTF-8 is coming to mruby/c

UTF-8 is coming to mruby/c 2023-05-11 ima1zumi

Introduction
• @ima1zumi (Mari Imaizumi)
• Character encoding lover
• IRB and Reline committer
• ESM, Inc.

• Paste last year's image

Coffeehouse sponsor

Distribute Ruby Method Karuta

Question

Question Have you ever used UTF-8?

UTF-8 is the de facto standard
• > UTF-8 is used by 97.9% of all the websites whose character encoding they know.
• https://w3techs.com/technologies/overview/character_encoding
• Many programming languages support UTF-8
• Ruby, Python, Rust, C++[^1], PHP, Java[^2]
[^1]: char8_t: A type for UTF-8 characters and strings https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html
[^2]: JEP 400: UTF-8 by Default https://openjdk.org/jeps/400

What is UTF-8

What is UTF-8
• 8-bit UCS Transformation Format
• Variable length
• 1 to 4 bytes per character

UTF-8 implementation in mruby/c (in progress) https://www.s-itoc.jp/support/technical-support/mrubyc/mrubyc-logo/

Before implementation

After implementation

How to implement in mruby/c https://github.com/mrubyc/mrubyc/pull/191

What is mruby/c
• mruby/c is another implementation of mruby
[^3]: https://github.com/mrubyc/mrubyc

What is mruby
• mruby is a lightweight implementation of the Ruby language
• Focus on compatibility with Ruby
• Memory size < 400KB

What is mruby/c
• Small memory consumption: < 40KB
• Concurrent
• Target: one-chip microprocessors
• Written in C or Ruby

mruby/c !== mruby
Ref: "Building original IoT devices with mruby/c" (in Japanese) https://magazine.rubyist.net/articles/0059/0059-original_mrubyc_iot_device.html

Use case
• Microcontrollers
• IoT devices
• Reduction of defect rate in industrial sewing machines[^4]
• prk_firmware
[^4]: https://www.ruby.or.jp/ja/showcase/case80.html

Why I started https://slide.rabbit-shocker.org/authors/hasumikin/RubyWorldConference2022/?page=58

Advantages of using UTF-8 in mruby/c
• prk_firmware allows UTF-8
• Network programming
• Shell

It would be nice if UTF-8 were implemented in mruby/c!

What should we implement?

mruby/c string is just a sequence of bytes

What does String mean as a byte sequence?
• It doesn't have a character encoding.

What is a character encoding
• A label that indicates which character code the byte sequence is in
• Does not affect the byte sequence
• Example: 11100011 10000001 10000010, labeled UTF-8
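To make the "label" idea concrete, here is a minimal sketch in CRuby (not mruby/c, whose strings carry no label at all). It takes the three bytes from the slide and attaches the UTF-8 label with `force_encoding`. The variable names are only for illustration.

```ruby
# The three bytes from the slide, packed with no text encoding (binary).
raw = [0b11100011, 0b10000001, 0b10000010].pack("C*")

# Attach the UTF-8 label to a copy. force_encoding changes only the label.
labeled = raw.dup.force_encoding(Encoding::UTF_8)

puts raw.encoding == Encoding::BINARY    # true: no text encoding attached
puts raw.length                          # 3, counted as bytes
puts labeled.encoding == Encoding::UTF_8 # true
puts labeled.length                      # 1, counted as characters
puts raw.bytes == labeled.bytes          # true: the byte sequence is untouched
```

The bytes are identical in both strings. Only the interpretation of "one character" changes with the label.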

If String is a byte sequence

Why "💎❤🏯".size is 11
• Because "💎❤🏯" is 11 bytes in UTF-8
• mruby/c string has no character encoding, so size returns the byte count
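The same numbers can be reproduced in CRuby, which does attach an encoding to strings. This is a small sketch for comparison only. The string is written with code point escapes (U+1F48E, U+2764, U+1F3EF), and `.b` is used to imitate a string without an encoding label.

```ruby
gems = "\u{1F48E}\u{2764}\u{1F3EF}"   # the three characters from the slide

puts gems.bytesize                    # 11
puts gems.length                      # 3
puts gems.b.length                    # 11: without the UTF-8 label, bytes are counted
p gems.each_char.map(&:bytesize)      # [4, 3, 4]
```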

How can I get `"💎❤🏯".size` to return 3?

How do we know "how many characters" from byte sequences?

In the case of UTF-8, you can tell by looking at the first n bits of a byte
• UTF-8 is variable length from 1 to 4 bytes
• A byte either starts a 1-, 2-, 3-, or 4-byte character or is an ongoing (continuation) byte
• 1 byte: 0xxxxxxx
• 2 bytes: 110xxxxx
• 3 bytes: 1110xxxx
• 4 bytes: 11110xxx
• ongoing: 10xxxxxx
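The bit patterns above can be turned into a character counter. The following is a rough sketch in plain Ruby, not the code from the pull request. `utf8_char_bytes` and `utf8_size` are hypothetical helper names, and the input is assumed to be valid UTF-8.

```ruby
# Hypothetical helper: how many bytes the character starting at this byte uses.
# Returns 0 for an ongoing (continuation) byte or a byte that cannot start a character.
def utf8_char_bytes(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then 1   # 0xxxxxxx
  elsif (byte & 0b1110_0000) == 0b1100_0000 then 2   # 110xxxxx
  elsif (byte & 0b1111_0000) == 0b1110_0000 then 3   # 1110xxxx
  elsif (byte & 0b1111_1000) == 0b1111_0000 then 4   # 11110xxx
  else 0                                             # 10xxxxxx or invalid
  end
end

# Hypothetical helper: count characters by counting the bytes that start one.
# Assumes the byte sequence is valid UTF-8.
def utf8_size(bytes_str)
  bytes_str.each_byte.count { |b| utf8_char_bytes(b) > 0 }
end

gems = "\u{1F48E}\u{2764}\u{1F3EF}".b   # 11 bytes, no encoding label
puts utf8_size(gems)                    # 3
```

Counting only the starting bytes avoids decoding anything: ongoing bytes are simply skipped.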

Sample

Sample
• 1 byte: 0xxxxxxx
• 2 bytes: 110xxxxx
• 3 bytes: 1110xxxx
• 4 bytes: 11110xxx
• ongoing: 10xxxxxx

Implementation

Written in Ruby

Written in C

size (length) is now working

Tests

Implement in the same way

Status of UTF-8 support in mruby/c

Implementation not required
• new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=>
• Byte sequence operations can handle UTF-8
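Why do plain byte operations suffice for these methods? Joining two valid UTF-8 sequences gives another valid one, starting bytes and ongoing bytes never share a bit pattern, and byte order matches code point order. The sketch below checks this in CRuby on binary strings, which stand in for mruby/c strings here.

```ruby
a = "\u{1F48E}".b   # bytes F0 9F 92 8E, no encoding label
b = "\u{2764}".b    # bytes E2 9D A4, no encoding label

joined = a + b      # plain byte concatenation

puts joined.bytes == "\u{1F48E}\u{2764}".bytes   # true
puts joined.start_with?(a)                       # true
puts joined.end_with?(b)                         # true
puts joined.include?(b)                          # true
puts(a <=> b)                                    # 1: F0 > E2, same order as U+1F48E > U+2764
```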

Implementation required
• Implemented: index, size(length), slice([]), slice!([]=), insert, inspect, ord
• To be implemented: <<, tr, tr!, chr, each_char, split

Implementation required
• Handling character count (index, size, slice, insert)
• Handling Unicode code point (ord, chr, <<)
• Only ASCII was supported (inspect)
• Multipurpose (tr, tr!)
• Segmentation fault (split)

What does it mean to count characters in Unicode
• Unicode code point
• Grapheme clusters

Unicode code point
• Any value in the Unicode codespace
• That is, the range of integers from 0 to 10FFFF (hexadecimal).
• Not all code points are assigned to encoded characters.

Grapheme clusters
• Visually perceived text units of combined Unicode code points
• Roughly what a user perceives as a single character

Emoji zero width joiner sequences

Unicode code point / grapheme clusters
• Strings are easier to manipulate by Unicode code point
• grapheme_clusters is slower
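The difference between the two ways of counting can be seen in CRuby, which offers both `codepoints` and `grapheme_clusters`. This is an illustrative sketch. The two sample strings are assumptions chosen for the example: an emoji zero width joiner sequence and a letter with a combining accent.

```ruby
family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"  # man, ZWJ, woman, ZWJ, girl
accent = "e\u0301"                                   # "e" followed by a combining acute accent

puts family.codepoints.size          # 5
puts family.grapheme_clusters.size   # 1 on recent CRuby versions
puts accent.codepoints.size          # 2
puts accent.grapheme_clusters.size   # 1
```

Counting code points needs only the UTF-8 bit patterns, while grapheme clusters need Unicode property data and segmentation rules on top of that.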

Unicode code point / grapheme clusters

Handling character count
• index, size, slice, insert
• Convert character count to byte count or byte count to character count
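One direction of that conversion, character index to byte offset, is what slice-like methods need. Here is a minimal sketch in plain Ruby. `char_index_to_byte_offset` is a hypothetical helper, not a function from mruby/c, and valid UTF-8 input is assumed.

```ruby
# Hypothetical helper: byte offset at which the character with the given index starts.
# Returns the total byte length for the index just past the last character,
# and nil if the index is further out of range. Assumes valid UTF-8.
def char_index_to_byte_offset(bytes_str, char_index)
  offset = 0
  seen = 0
  bytes_str.each_byte do |b|
    starts_char = (b & 0b1100_0000) != 0b1000_0000   # anything but 10xxxxxx
    if starts_char
      return offset if seen == char_index
      seen += 1
    end
    offset += 1
  end
  return offset if seen == char_index
  nil
end

s = "\u{1F48E}\u{2764}\u{1F3EF}".b            # 11 bytes, no encoding label
from = char_index_to_byte_offset(s, 1)        # 4
to   = char_index_to_byte_offset(s, 2)        # 7
heart = s.byteslice(from, to - from)          # the bytes of character 1
p heart.bytes.map { |x| x.to_s(16) }          # ["e2", "9d", "a4"]
```

Once both ends are byte offsets, the actual slicing is an ordinary byte operation.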

String#index before implementation

String#index
byte index       0  1  2  3  4  5  6  7  8  9  10
byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
character index  0           1        2
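The table suggests one way to get a character index: search on bytes first, then convert the byte offset of the match. The following is a sketch under that assumption, written in CRuby with binary strings standing in for mruby/c strings. `byte_offset_to_char_index` is a hypothetical helper name.

```ruby
# Hypothetical helper: number of characters that start before the given byte offset.
# Assumes valid UTF-8 and an offset that falls on a character boundary.
def byte_offset_to_char_index(bytes_str, byte_offset)
  bytes_str.byteslice(0, byte_offset).each_byte.count do |b|
    (b & 0b1100_0000) != 0b1000_0000   # count every byte that is not 10xxxxxx
  end
end

s      = "\u{1F48E}\u{2764}\u{1F3EF}".b   # the 11 bytes from the table
needle = "\u{1F3EF}".b                    # F0 9F 8F AF

byte_pos = s.index(needle)                # 7: on binary strings, index counts bytes
char_pos = byte_offset_to_char_index(s, byte_pos)
puts char_pos                             # 2
```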

String#slice

Handling Unicode code point

Handling Unicode code point
• ord, chr, <<

String#ord
• Returns the integer ordinal of the first character of self

(digression) String#ord
• Useful snippet in .irbrc

Relationship between Unicode code point and UTF-8
• A Unicode code point is a value that uniquely identifies a Unicode character
• UTF-8 encodes Unicode code points
• So do UTF-16 and UTF-32
• One-to-one correspondence between UTF-8 byte sequences and Unicode code points
• Can convert from byte sequences to Unicode code points
• Can also convert from Unicode code points to byte sequences
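The byte-to-code-point direction is what an ord-like method needs. The following is a rough sketch in plain Ruby, not the C code from the pull request. `utf8_ord` is a hypothetical name, and the sketch does not reject overlong forms or surrogates.

```ruby
# Hypothetical helper: code point of the first character in a UTF-8 byte sequence.
# Returns nil for an empty string or a malformed first character.
def utf8_ord(bytes_str)
  first = bytes_str.getbyte(0)
  return nil if first.nil?

  if (first & 0b1000_0000) == 0
    return first                              # 0xxxxxxx: the byte is the code point
  elsif (first & 0b1110_0000) == 0b1100_0000
    length = 2
    code_point = first & 0b0001_1111          # payload bits of 110xxxxx
  elsif (first & 0b1111_0000) == 0b1110_0000
    length = 3
    code_point = first & 0b0000_1111          # payload bits of 1110xxxx
  elsif (first & 0b1111_1000) == 0b1111_0000
    length = 4
    code_point = first & 0b0000_0111          # payload bits of 11110xxx
  else
    return nil                                # ongoing byte or invalid start
  end

  (1...length).each do |i|
    cont = bytes_str.getbyte(i)
    return nil if cont.nil? || (cont & 0b1100_0000) != 0b1000_0000
    code_point = (code_point << 6) | (cont & 0b0011_1111)   # append 6 payload bits
  end
  code_point
end

puts utf8_ord("\u{1F48E}".b).to_s(16)   # 1f48e
```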

UTF-8 Bit Distribution
• https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf D92
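The opposite direction, code point to bytes, is what chr-like methods need. Here is a minimal sketch that follows the UTF-8 bit distribution. `utf8_encode` is a hypothetical helper name, and surrogate code points are not rejected here.

```ruby
# Hypothetical helper: encode one code point as a UTF-8 byte sequence (binary string).
def utf8_encode(code_point)
  bytes =
    if code_point < 0 || code_point > 0x10FFFF
      raise RangeError, "code point out of range: #{code_point}"
    elsif code_point <= 0x7F
      [code_point]                                          # 0xxxxxxx
    elsif code_point <= 0x7FF
      [0xC0 | (code_point >> 6),                            # 110xxxxx
       0x80 | (code_point & 0x3F)]                          # 10xxxxxx
    elsif code_point <= 0xFFFF
      [0xE0 | (code_point >> 12),                           # 1110xxxx
       0x80 | ((code_point >> 6) & 0x3F),
       0x80 | (code_point & 0x3F)]
    else
      [0xF0 | (code_point >> 18),                           # 11110xxx
       0x80 | ((code_point >> 12) & 0x3F),
       0x80 | ((code_point >> 6) & 0x3F),
       0x80 | (code_point & 0x3F)]
    end
  bytes.pack("C*")
end

p utf8_encode(0x1F48E).bytes.map { |b| b.to_s(16) }   # ["f0", "9f", "92", "8e"]
```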

Implementation

ASCII only supported

The status quo
• 695 String tests passed
• 80% of String methods have been implemented to support UTF-8

Future issues
• Support chr, <<, each_char, tr, tr!
• Performance evaluation
• Enable UTF-8 in prk_firmware
• Error handling for strings that are invalid as UTF-8
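For reference on the last point, this is what "invalid as UTF-8" looks like in CRuby, which can check a string with `valid_encoding?`. It is an illustration only. How mruby/c should react to such strings is left open in this talk.

```ruby
complete  = [0xE3, 0x81, 0x82].pack("C*").force_encoding(Encoding::UTF_8)
truncated = [0xE3, 0x81].pack("C*").force_encoding(Encoding::UTF_8)   # last byte missing
stray     = [0x82].pack("C*").force_encoding(Encoding::UTF_8)         # ongoing byte alone

puts complete.valid_encoding?    # true
puts truncated.valid_encoding?   # false
puts stray.valid_encoding?       # false
```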

Happy Binary Hacking!
