Is Ruby's Multi-Encoding Overhead Heavy?

ima1zumi
April 23, 2026

Transcript

  1. Introduction
     - Mari Imaizumi @ima1zumi
     - Working at STORES, Inc.
     - Ruby committer 💎
  2. Motivation
     - In 2026, text is mostly UTF-8 (or ASCII)
     - Ruby's M17N is CSI (Character Set Independent): each String carries an encoding tag
     - Most languages are UCS: the internal form is always Unicode (Python / Java)
     - How much does CSI actually cost?
     - Scope: per-char String ops only. String#encode, IO conversion, and Regexp are out of scope.
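To make the "encoding tag" idea concrete, here is a minimal sketch using only core String methods. It assumes the source file is UTF-8 (Ruby's default); the variable names are illustrative only.

```ruby
utf8   = "こんにちは"  # tagged UTF-8 (assumes a UTF-8 source file)
binary = utf8.b         # a copy of the same bytes, tagged ASCII-8BIT (BINARY)

puts utf8.encoding
puts binary.encoding

# The bytes are identical; only the tag differs.
puts utf8.bytes == binary.bytes

# Per-char operations consult the tag, so the answers differ.
puts utf8.length    # 5 characters
puts binary.length  # 15 single-byte "characters"
```

Every per-char operation has to ask the tag how to split the bytes, and the experiments in this deck set out to measure what that question costs.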
  3. Question
     - Keep CSI, poke the encoding layer: does speed change?
     - If faster: what was heavy?
     - If not: why not?
  4. Question
     - Experiments:
       1. Shrink the number of encodings
       2. Replace indirect calls with direct UTF-8 calls
     - All numbers below are ratios vs. master (9dd446f18e).
     - "Nx faster" = N× master speed. "Nx slower" = 1/N of master speed.
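The ratio convention can be written down as a small helper. This is only a sketch of the arithmetic: `ratio_label` and the iterations-per-second inputs are hypothetical names, not part of the benchmark setup used in the deck.

```ruby
# Hypothetical helper: turn two throughput numbers (iterations per second)
# into the "Nx faster" / "Nx slower" labels used on the result slides.
def ratio_label(patched_ips, master_ips)
  ratio = patched_ips.to_f / master_ips
  if ratio >= 1.0
    format("%.2fx faster", ratio)        # N x master speed
  else
    format("%.2fx slower", 1.0 / ratio)  # 1/N of master speed
  end
end

puts ratio_label(200.0, 100.0)  # 2.00x faster
puts ratio_label(75.0, 100.0)   # 1.33x slower
```

Under this convention "1.33x slower" means the patched build runs at 75% of master's throughput, not 67%.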
  5. Benchmark
     prelude: |
       utf8_short = "こんにちは世界!"
       utf8_long = "こんにちは世界!" * 100
       ascii_long = "Hello, World! " * 100
       mixed = "Hello こんにちは " * 100
       # For each_char / chars - iterates character by character through encoding layer
       utf8_chars = "あいうえおかきくけこ" * 100
       # --- Regression-focused cases for rb_str_inspect UTF-8 fast path ---
       # Tiny strings: overhead of the added fast-path branch / coderange check
       str_empty = ""
       str_1ascii = "a"
       str_1utf8 = "あ"
       # Non-UTF-8 encodings: must take the generic path, should NOT regress
       binary_long = ("Hello, World! " * 100).b
       us_ascii_long = ("Hello, World! " * 100).force_encoding("US-ASCII")
       # eucjp_long = ("こんにちは世界!" * 100).encode("EUC-JP")
       # sjis_long = ("こんにちは世界!" * 100).encode("Shift_JIS")
       # utf16le_long = ("こんにちは世界!" * 100).encode("UTF-16LE")
       # Broken UTF-8 (coderange BROKEN → generic path)
       broken_utf8 = ("\xFF".b * 1000).force_encoding("UTF-8")
       ...
  6. Benchmark
     benchmark:
       # Baseline (kept from original)
       inspect_utf8_short: utf8_short.inspect
       inspect_utf8_long: utf8_long.inspect
       inspect_ascii_long: ascii_long.inspect
       inspect_mixed: mixed.inspect
       # Tiny strings (branch overhead)
       inspect_empty: str_empty.inspect
       inspect_1char_ascii: str_1ascii.inspect
       inspect_1char_utf8: str_1utf8.inspect
       # Non-UTF-8 encodings (generic path, regression check)
       inspect_binary_long: binary_long.inspect
       inspect_us_ascii_long: us_ascii_long.inspect
       # inspect_eucjp_long: eucjp_long.inspect
       # inspect_sjis_long: sjis_long.inspect
       # inspect_utf16le_long: utf16le_long.inspect
       # Broken UTF-8 (generic path)
       inspect_broken_utf8: broken_utf8.inspect
       # Escape-heavy (no bulk-skip possible)
       inspect_all_newlines: newlines_long.inspect
       inspect_control_chars: ctrl_long.inspect
       inspect_all_quotes: quotes_long.inspect
       ...
  7. Experiment 1: Shrink encodings
     - Goal: 103 → 3 encodings (UTF-8, ASCII-8BIT, US-ASCII)
     - Benchmarks
       - Within 1.00x to 1.05x, slower or faster
       - No change
     - Why: non-builtin encodings are dynamically loaded. If unused, they are never loaded into memory, so they add zero hot-path cost.
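The lazy loading can be observed from Ruby itself. The sketch below assumes that CRuby's `Encoding#inspect` appends an "(autoload)" marker to encodings that are registered but not yet loaded; that marker is an implementation detail and may vary between versions, so treat the second count as indicative only.

```ruby
# The three encodings the experiment keeps (BINARY is an alias of ASCII-8BIT).
builtin = [Encoding::UTF_8, Encoding::BINARY, Encoding::US_ASCII]

puts "registered encodings: #{Encoding.list.size}"

# Assumption: not-yet-loaded encodings show "(autoload)" in their inspect output.
lazy = Encoding.list.select { |enc| enc.inspect.include?("autoload") }
puts "still marked autoload: #{lazy.size}"

# The builtin three are always resident and never carry the marker.
builtin.each { |enc| puts enc.inspect }
```

If most entries still carry the marker in a process that only handles UTF-8 text, that matches the slide's explanation: removing them cannot speed up code that never touched them.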
  8. Experiment 2: Direct calls
     - Patched `ONIGENC_PRECISE_MBC_ENC_LEN` etc. with `__builtin_expect(encoding_index == 1, 1)` → call `utf8_mbc_enc_len` directly for UTF-8
     - Result:
       - inspect_binary_long (ASCII-8BIT 1400B .inspect): 1.33x slower
       - inspect_us_ascii_long (US-ASCII 1400B .inspect): 1.22x slower
       - valid_encoding_utf8 (.valid_encoding?): 1.16x slower
       - Rest: noise
     - Got slower (especially on non-UTF-8 strings).
  9. Experiment 2: Direct calls
     - Why:
       1. Hot paths already skip the indirect call (String#length, String#+)
       2. The CPU's branch predictor handles the rest, because the encoding is stable
       3. Non-UTF-8 strings pay for a dead compare, then take the indirect call anyway
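The third point is easier to see as control flow. The following is a toy Ruby model of the two dispatch shapes, not the C patch itself: every name in it (`ENC_TABLE`, `char_len_indirect`, `char_len_patched`, the `trace` array) is hypothetical and exists only to record which steps each path takes.

```ruby
# Hypothetical model; none of these names are CRuby APIs.
UTF8_LEN = lambda do |byte|
  case byte
  when 0x00..0x7F then 1
  when 0xC2..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  else 1 # invalid lead byte: treated as one byte in this sketch
  end
end
SINGLE_BYTE_LEN = ->(_byte) { 1 }

# Stand-in for the per-encoding function table.
ENC_TABLE = { utf8: UTF8_LEN, binary: SINGLE_BYTE_LEN, us_ascii: SINGLE_BYTE_LEN }.freeze

# Baseline shape: always one call through the table.
def char_len_indirect(enc, byte, trace)
  trace << :indirect_call
  ENC_TABLE.fetch(enc).call(byte)
end

# Patched shape: compare first, call directly on a hit, fall back on a miss.
def char_len_patched(enc, byte, trace)
  trace << :compare
  if enc == :utf8
    trace << :direct_call
    UTF8_LEN.call(byte)
  else
    trace << :indirect_call
    ENC_TABLE.fetch(enc).call(byte)
  end
end

trace = []
char_len_patched(:binary, 0x48, trace)
p trace # [:compare, :indirect_call]
```

In this model a non-UTF-8 string does strictly more work per character after the patch, which is consistent with the binary and US-ASCII cases being the ones that regressed most.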
  10. So what actually works?
     - The encoding layer itself has no overhead
     - String#inspect's bottleneck = per-char work itself (3 indirect calls per char)
       - Reduce the whole thing
       - Add UTF-8-specific fast paths
     - See also: byroot's blog: https://byroot.github.io/ruby/performance/2026/04/18/faster-paths.html
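As a rough illustration of what a fast path buys, the sketch below models "bulk-skip" in Ruby: runs of printable ASCII that never need escaping are copied in one step, and only the remaining runs go through the generic per-char escaping. This is an assumption about the general shape of such an optimization, not the actual C change; `SAFE_CLASS`, `RUNS`, and `inspect_with_ascii_fast_path` are hypothetical names.

```ruby
# Printable ASCII (0x20..0x7E) minus the characters String#inspect may escape:
# the double quote, the backslash, and the interpolation look-alikes # $ @ {.
SAFE_CLASS = ' !%-?A-\[\]-z|}~'
RUNS = /([#{SAFE_CLASS}]+)|([^#{SAFE_CLASS}]+)/

def inspect_with_ascii_fast_path(str)
  # Non-UTF-8 or broken input takes the generic path untouched.
  return str.inspect unless str.encoding == Encoding::UTF_8 && str.valid_encoding?

  out = +'"'
  str.scan(RUNS) do |safe_run, other_run|
    if safe_run
      out << safe_run                  # fast path: bulk copy, no per-char work
    else
      out << other_run.inspect[1..-2]  # generic escaping, surrounding quotes removed
    end
  end
  out << '"'
end

sample = "Hello, World! " * 3 + "\n"
puts inspect_with_ascii_fast_path(sample)
puts inspect_with_ascii_fast_path(sample) == sample.inspect
```

The more of the string that falls into safe runs, the less per-char work remains, so a pure-ASCII input should benefit most and an escape-heavy input least.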
  11. Results (String#inspect)
     - Pure-ASCII at 1400 bytes: 7–10x faster
       - inspect_ascii_long (`"Hello, World! "` × 100, UTF-8): 9.89x faster
       - inspect_us_ascii_long (same bytes, US-ASCII): 7.74x faster
       - inspect_binary_long (same bytes, ASCII-8BIT): 7.03x faster
       - inspect_sparse_escape (`"a" × 99 + "\n"` × 14, 1400B): 6.26x faster
  12. Results (String#inspect)
     - Mixed ASCII + multibyte: also improves
       - inspect_mixed (`"Hello こんにちは "` × 100): 1.71x faster
       - inspect_utf8_long (`"こんにちは世界!"` × 100): 1.17x faster
     - Non-UTF-8 encodings (inspect_eucjp_long / sjis_long / utf16le_long, Japanese re-encoded): noise (±3%, generic path unchanged, no regression)
  13. Takeaway
     - Ruby's CSI is not a heavy abstraction: the encoding layer itself is cheap
     - But there's still room to bolt UTF-8 / ASCII fast paths on top of CSI: `inspect` gets up to 10x faster
     - The same trick should work anywhere per-char indirect calls still live in the loop