
UTF-8 is coming to mruby/c

ima1zumi
May 11, 2023

Transcript

  1. UTF-8 is the de facto standard

    • > UTF-8 is used by 97.9% of all the websites whose character encoding they know. • https://w3techs.com/technologies/overview/character_encoding • Many programming languages support UTF-8 • Ruby, Python, Rust, C++[^1], PHP, Java[^2]
    [^1]: char8_t: A type for UTF-8 characters and strings https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0482r0.html
    [^2]: JEP 400: UTF-8 by Default https://openjdk.org/jeps/400
  2. What is mruby/c

    • mruby/c is another implementation of mruby
    [^3]: https://github.com/mrubyc/mrubyc
  3. What is mruby

    • mruby is a lightweight implementation of the Ruby language • Focus on compatibility with Ruby • Memory size < 400KB
  4. What is mruby/c

    • Small memory consumption • < 40KB • Concurrent • Target • one-chip microprocessors • Written in C or Ruby
    [^3]: https://github.com/mrubyc/mrubyc
  5. Use case

    • Microcontroller • IoT device • Reduction of defect rate in industrial sewing machines[^4] • prk_firmware
    [^4]: https://www.ruby.or.jp/ja/showcase/case80.html
  6. Advantages of using UTF-8 in mruby/c

    • prk_firmware allows UTF-8 • Network programming • Shell
  7. What does String mean as a byte sequence?

    • It doesn't have a character encoding.
  8. What is a character encoding

    • A label that indicates which character code the byte sequence uses • Does not affect the byte sequence • Example: the bytes 11100011 10000001 10000010, labeled UTF-8
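The "label" idea can be tried out in CRuby, whose strings do carry an encoding (unlike the mruby/c strings described here). The sketch below takes the three bytes from the slide, relabels them with `String#force_encoding`, and checks that the byte sequence is unchanged. The variable names are illustrative only.

```ruby
# The three bytes shown on the slide: 11100011 10000001 10000010
raw = [0b11100011, 0b10000001, 0b10000010].pack("C*") # binary-labeled string

# Change only the label; force_encoding never rewrites the bytes.
labeled = raw.dup.force_encoding("UTF-8")

p raw.encoding                # the binary label
p labeled.encoding            # the UTF-8 label
p raw.bytes == labeled.bytes  # true: same byte sequence under both labels
p labeled                     # read as UTF-8, the bytes form the character "あ"
```

Only the interpretation changes: the binary string reports three units, while the UTF-8 string reports one character.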
  11. Why "💎❤🏯".size is 11

    • Because "💎❤🏯" is 11 bytes in UTF-8 • An mruby/c string has no character encoding, so size returns the byte count
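The same effect can be reproduced in CRuby as a rough illustration. `String#b` returns a copy labeled as binary, which behaves like a string with no character encoding. The code points are written as escapes so the example does not depend on how the emoji were typed.

```ruby
s = "\u{1F48E}\u{2764}\u{1F3EF}" # 💎 ❤ 🏯

puts s.size   # 3: CRuby labels the literal UTF-8 and counts characters
puts s.b.size # 11: without a character encoding, size counts bytes
puts s.bytes.map { |b| format("%02X", b) }.join(" ")
# F0 9F 92 8E E2 9D A4 F0 9F 8F AF
```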
  12. In the case of UTF-8, you can tell by looking at the first n bits of a byte

    • UTF-8 is variable length from 1 to 4 bytes • A byte is either the first byte of a 1-, 2-, 3-, or 4-byte sequence or an ongoing (continuation) byte • 1 byte: 0xxxxxxx • 2 bytes: 110xxxxx • 3 bytes: 1110xxxx • 4 bytes: 11110xxx • ongoing: 10xxxxxx
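The bit patterns above translate directly into mask-and-compare checks. The following is a minimal sketch in Ruby, not the mruby/c implementation (which is written in C); `utf8_byte_kind` is a hypothetical helper name.

```ruby
# Classify one byte (0..255) by its leading bits.
# Returns the sequence length (1..4) for a first byte, 0 for an ongoing
# (continuation) byte, and nil for a byte that matches no pattern.
def utf8_byte_kind(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then 1 # 0xxxxxxx
  elsif (byte & 0b1110_0000) == 0b1100_0000 then 2 # 110xxxxx
  elsif (byte & 0b1111_0000) == 0b1110_0000 then 3 # 1110xxxx
  elsif (byte & 0b1111_1000) == 0b1111_0000 then 4 # 11110xxx
  elsif (byte & 0b1100_0000) == 0b1000_0000 then 0 # 10xxxxxx
  end
end

p "\u{1F48E}\u{2764}\u{1F3EF}".bytes.map { |b| utf8_byte_kind(b) }
# [4, 0, 0, 0, 3, 0, 0, 4, 0, 0, 0]
```

Counting the non-zero entries gives the character count (3), and their positions give the byte offsets where characters start.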
  13. Sample

    • 1 byte: 0xxxxxxx • 2 bytes: 110xxxxx • 3 bytes: 1110xxxx • 4 bytes: 11110xxx • ongoing: 10xxxxxx
  17. Implementation not required

    • new, +, *, to_i, to_f, to_s, b, clear, chomp, chomp!, dup, empty?, getbyte, lstrip, lstrip!, rstrip, rstrip!, strip!, to_sym, start_with?, end_with?, include?, bytes, ==, <=> • Byte sequence operations can handle UTF-8
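One way to see why byte-wise methods need no changes: in UTF-8, first bytes and continuation bytes use disjoint bit patterns, so a well-formed search string cannot match starting in the middle of a character. The sketch below uses CRuby with `String#b` to force byte-by-byte comparison; it only illustrates the idea and says nothing about how mruby/c implements these methods.

```ruby
# String#b drops the character semantics, so these calls compare raw bytes,
# as a string without a character encoding would.
greeting = "こんにちは".b

puts greeting.include?("にち".b)    # true
puts greeting.start_with?("こん".b) # true
puts greeting.end_with?("は".b)     # true
puts greeting.include?("ば".b)      # false: differs from は in its last byte
```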
  18. Implementation required

    • Implemented: • index, size(length), slice([]), slice!([]=), insert, inspect, ord • To be implemented: • <<, tr, tr!, chr, each_char, split
  19. Implementation required

    • Handling character count (index, size, slice, insert) • Handling Unicode code point (ord, chr, <<) • Only ASCII was supported (inspect) • Multipurpose (tr, tr!) • Segmentation fault (split)
  20. Unicode code point

    • Any value in the Unicode codespace • That is, the range of integers from 0 to 10FFFF (hexadecimal). • Not all code points are assigned to encoded characters.
  21. Grapheme clusters

    • Visually perceived text units made of combined Unicode code points • Close to what a user perceives as a character
  22. Unicode code point / grapheme clusters

    • Strings are easier to manipulate by Unicode code point • grapheme_clusters is slower
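The difference between the two units shows up as soon as a combining character is involved. This short CRuby example (using `String#codepoints` and `String#grapheme_clusters`, both part of CRuby's String) is only meant to illustrate the distinction:

```ruby
s = "e\u0301" # "e" followed by U+0301 COMBINING ACUTE ACCENT, displayed as é

p s.codepoints.size        # 2: two Unicode code points
p s.grapheme_clusters.size # 1: one user-perceived character
p s.bytesize               # 3: one byte for "e", two for U+0301
```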
  23. Handling character count

    • index, size, slice, insert • Convert character count to byte count or byte count to character count
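A character-to-byte conversion of this kind can be sketched by walking the bytes and skipping continuation bytes. The helper below is hypothetical (the name and the choice to return `nil` past the end are assumptions) and is written in Ruby for readability rather than in the C used by mruby/c.

```ruby
# Byte offset at which character number char_index starts, or nil when the
# string has no such character. `bytes` is an array of integers (0..255).
def char_index_to_byte_offset(bytes, char_index)
  chars_seen = 0
  bytes.each_with_index do |byte, offset|
    next if (byte & 0b1100_0000) == 0b1000_0000 # continuation byte: skip
    return offset if chars_seen == char_index
    chars_seen += 1
  end
  nil
end

bytes = "\u{1F48E}\u{2764}\u{1F3EF}".bytes
p (0..3).map { |i| char_index_to_byte_offset(bytes, i) }
# [0, 4, 7, nil]
```

Methods such as slice and insert could then operate on the byte sequence at the returned offset.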
  24. String#index

    byte index       0  1  2  3  4  5  6  7  8  9  10
    byte             F0 9F 92 8E E2 9D A4 F0 9F 8F AF
    character index  0           1        2
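For `index`, the search itself can run on bytes, and the resulting byte offset then has to be turned into a character index. A minimal sketch of that second step, assuming well-formed UTF-8 and using a hypothetical helper name:

```ruby
# Character index of the character that starts at byte_offset:
# count the bytes before the offset that are not continuation bytes.
def byte_offset_to_char_index(bytes, byte_offset)
  bytes.first(byte_offset).count { |b| (b & 0b1100_0000) != 0b1000_0000 }
end

s = "\u{1F48E}\u{2764}\u{1F3EF}" # 💎 ❤ 🏯
byte_pos = s.b.index("\u{1F3EF}".b) # byte-wise search

p byte_pos                                     # 7
p byte_offset_to_char_index(s.bytes, byte_pos) # 2
p s.index("\u{1F3EF}")                         # 2, CRuby's character index
```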
  27. Relationship between Unicode code point and UTF-8

    • A Unicode code point is a value that uniquely defines a Unicode character • UTF-8 encodes Unicode code points • So do UTF-16 and UTF-32 • One-to-one correspondence between UTF-8 byte sequences and Unicode code points • Can convert from byte sequences to Unicode code points • Can also convert from Unicode code points to byte sequences
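Both directions of the conversion are plain bit shuffling, which is what methods like ord and chr need. The sketch below is an illustration in Ruby with hypothetical helper names; it assumes well-formed input and does no validation (surrogates, overlong forms, and out-of-range values are not rejected).

```ruby
# Encode one Unicode code point (an Integer) as an array of UTF-8 bytes.
def codepoint_to_utf8(cp)
  if cp < 0x80
    [cp]
  elsif cp < 0x800
    [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
  elsif cp < 0x10000
    [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  else
    [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
  end
end

# Decode the bytes of one well-formed UTF-8 character to its code point.
def utf8_to_codepoint(bytes)
  lead = bytes[0]
  # Keep only the payload bits of the first byte.
  cp = case bytes.size
       when 1 then lead
       when 2 then lead & 0x1F
       when 3 then lead & 0x0F
       else        lead & 0x07
       end
  # Each continuation byte contributes its low 6 bits.
  bytes.drop(1).reduce(cp) { |acc, b| (acc << 6) | (b & 0x3F) }
end

p codepoint_to_utf8(0x2764).map { |b| format("%02X", b) } # ["E2", "9D", "A4"]
p utf8_to_codepoint([0xE2, 0x9D, 0xA4]).to_s(16)          # "2764"
```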
  28. The status quo

    • 695 String tests passed • 80% of String methods have been implemented to support UTF-8
  29. Future issues

    • Support chr, <<, each_char, tr, tr! • Performance evaluation • Enable UTF-8 in prk_firmware • Error handling for strings that are invalid as UTF-8
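As a point of comparison for the last item, CRuby exposes a validity check through `String#valid_encoding?`. The example below only shows what "invalid as UTF-8" means for a truncated sequence; how mruby/c will report such strings is not stated in the slides.

```ruby
# E3 81 82 is a complete 3-byte character; E3 81 is cut off after two bytes.
complete  = [0xE3, 0x81, 0x82].pack("C*").force_encoding("UTF-8")
truncated = [0xE3, 0x81].pack("C*").force_encoding("UTF-8")

p complete.valid_encoding?  # true
p truncated.valid_encoding? # false
```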