String meets Encoding
ima1zumi
September 11, 2022
https://rubykaigi.org/2022/presentations/ima1zumi.html#day3
Transcript
String meets Encoding RubyKaigi 2022 2022-09-10 Mari Imaizumi
Agenda • Motivation • CSV.read • stackprof • String#split • perf • faster String#split • ruby/ruby #6351
Evaluation environments • MacBook Pro 2020 • macOS 12.4 • 2 GHz Quad-Core Intel Core i5 • 32 GB 3733 MHz LPDDR4X • Vagrant • Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64) • ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]
Introduction • @ima1zumi (Mari Imaizumi) • ESM, inc. • Hamada.rb, Fukuoka.rb • ❤ Character, Character Encoding
https://hamadarb.connpass.com/event/260134/
Dive into Encoding - RubyKaigi Takeout 2021
Motivation • "If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby." (translated with DeepL) • https://twitter.com/ktou/status/1436656477826019329
🙆 String#encode 🤔 CSV.read (String#split)
CSV.read("KEN_ALL.CSV")
KEN_ALL.CSV • Zip code data in Japan • https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html • 16 MB • 15 columns • 124,541 rows • Encoding: CP932 (Windows-31J)
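A minimal sketch of reading a CP932 file with CSV.read, assuming a tiny two-row stand-in for KEN_ALL.CSV (the sample rows and the file name are hypothetical, not the real data). The `encoding:` option takes an "external:internal" pair, so the fields come back as UTF-8.

```ruby
require "csv"
require "tmpdir"

# Hypothetical stand-in for KEN_ALL.CSV: two short rows, stored as CP932 bytes.
sample_utf8 = "01101,0600000,北海道\n47382,9071801,沖縄県\n"

rows = Dir.mktmpdir do |dir|
  path = File.join(dir, "ken_all_sample.csv")
  File.binwrite(path, sample_utf8.encode("Windows-31J"))

  # "Windows-31J:UTF-8" means: the file is CP932, hand back UTF-8 strings.
  CSV.read(path, encoding: "Windows-31J:UTF-8")
end

puts rows.length               # 2
puts rows.first.last           # 北海道
puts rows.first.last.encoding  # UTF-8
```

Leaving out the internal part (`encoding: "Windows-31J"`) would keep the fields in CP932 instead of transcoding them.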
KEN_ALL.CSV
01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
...ｳﾒ),北海道,札幌市中央区,...(２０〜２８丁目),1,0,1,0,0,0
...(about 120000 lines)...
47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0
Benchmark for CSV.read
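The benchmark slides themselves are charts, so the following is only a rough sketch of how such a comparison could be set up, not the script used in the talk. It times String#encode against CSV.read on synthetic CP932 data; the data, the row count, and the file name are assumptions, and the numbers it prints will depend on the machine.

```ruby
require "benchmark"
require "csv"
require "tmpdir"

# Synthetic data (hypothetical): far smaller and more regular than KEN_ALL.CSV.
line = "01101,0600000,北海道,札幌市中央区,0,0,0\n"
cp932_bytes = (line * 20_000).encode("Windows-31J")

timings = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.csv")
  File.binwrite(path, cp932_bytes)

  {
    # Transcoding only: read the CP932 text and convert it to UTF-8.
    encode: Benchmark.realtime { File.read(path, encoding: "Windows-31J").encode("UTF-8") },
    # Transcoding plus parsing.
    csv_read: Benchmark.realtime { CSV.read(path, encoding: "Windows-31J:UTF-8") }
  }
end

timings.each { |name, seconds| puts format("%-10s %.3fs", name, seconds) }
```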
stackprof 🔍
Stackprof • A sampling call-stack profiler for Ruby • https://github.com/tmm1/stackprof • sampling mode • :wall, :cpu, :object, :custom • flamegraph
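As a hedged sketch of the :cpu sampling mode, the snippet below profiles a small String#split workload with `StackProf.run`. The workload and the dump file name are assumptions for illustration, and stackprof is a third-party gem, so the sketch simply runs the workload unprofiled when the gem is not installed.

```ruby
require "tmpdir"

stackprof_available =
  begin
    require "stackprof" # third-party gem: gem install stackprof
    true
  rescue LoadError
    false
  end

# Hypothetical workload standing in for CSV.read("KEN_ALL.CSV").
workload = -> { 50_000.times { |i| "01101,0600000,#{i}".split(",") } }

dump_path = File.join(Dir.tmpdir, "stackprof-cpu-sample.dump")

if stackprof_available
  # raw: true keeps the raw samples, which the flamegraph output needs.
  StackProf.run(mode: :cpu, raw: true, out: dump_path) { workload.call }
  puts "wrote #{dump_path}"
else
  workload.call
  puts "stackprof is not installed; ran the workload without profiling"
end
```

The resulting dump can then be passed to the `stackprof` command-line tool.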
stackprof --d3-flamegraph stackprof-cpu-cp932-csv.dump > stackprof.html
grep split
Summary • Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds. • String#split accounts for 29% of the CSV.read samples
Measure String#split with perf
String#split • split(pattern = nil, limit = 0) • pattern: Regexp, String, nil • limit: maximum number of fields to return • return: Array, or self when a block is given
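A few calls illustrating the signature above. The sample row is made up; the behaviour shown is that of standard String#split (the block form needs Ruby 2.6 or later).

```ruby
row = "01101,0600000,Hokkaido,Sapporo"

by_string = row.split(",")     # String pattern
by_regexp = row.split(/,/)     # Regexp pattern
limited   = row.split(",", 2)  # limit: at most 2 fields
by_space  = " a  b c ".split   # nil pattern: split on whitespace

p by_string  # ["01101", "0600000", "Hokkaido", "Sapporo"]
p by_regexp  # ["01101", "0600000", "Hokkaido", "Sapporo"]
p limited    # ["01101", "0600000,Hokkaido,Sapporo"]
p by_space   # ["a", "b", "c"]

# With a block, each field is yielded and the receiver itself is returned.
yielded = []
returned = row.split(",") { |field| yielded << field }
p returned.equal?(row)  # true
```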
Try perf • performance analysis tool for Linux
perf record String#split
flamegraph (→ alphabetical order) • rb_ary_push • rb_enc_cr_str_copy_for_substr • str_new0
Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%
String#split • 1. Check arguments • 2. Check patterns • 3. Loop (3.1 search substr, 3.2 create substr, 3.3 result << substr) • 4. Return result
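The steps above can be sketched in plain Ruby. This is an illustration of the loop structure only, not the C implementation: `simple_split` is a hypothetical helper that handles just a non-empty String separator with the default limit of 0.

```ruby
# Hypothetical helper mirroring the four steps for a String separator.
def simple_split(str, sep)
  # 1-2. Check arguments and pattern (only non-empty String separators here).
  unless sep.is_a?(String) && !sep.empty?
    raise ArgumentError, "separator must be a non-empty String"
  end

  result = []
  start = 0
  # 3. Loop
  while (found = str.index(sep, start))  # 3.1 search substr
    substr = str[start...found]          # 3.2 create substr (keeps str's encoding)
    result << substr                     # 3.3 result << substr
    start = found + sep.length
  end
  result << str[start..]                 # remainder after the last separator

  # With limit = 0, String#split drops trailing empty fields.
  result.pop while result.last&.empty?

  # 4. Return result
  result
end

p simple_split("01101,0600000,Hokkaido,,", ",")  # ["01101", "0600000", "Hokkaido"]
```

Every pass through the loop allocates a new String and pushes it onto the Array, which corresponds to the substring-creation and push steps in the list.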
rb_str_split_m summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%
rb_str_subseq • create substring from str • set encoding and coderange to str2
rb_enc_cr_str_copy_for_substr • set encoding • set coderange
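The effect of these two steps can be observed from Ruby, as a rough illustration rather than a view into the C code. Each substring produced by split carries the encoding of the original string. The coderange is not exposed directly, but `ascii_only?` and `valid_encoding?` report the same kind of information. The sample line is made up.

```ruby
line = "01101,北海道,0".encode("Windows-31J")
fields = line.split(",")

encodings  = fields.map(&:encoding)
ascii_only = fields.map(&:ascii_only?)
valid      = fields.map(&:valid_encoding?)

p encodings.uniq  # [#<Encoding:Windows-31J>]
p ascii_only      # [true, false, true]
p valid           # [true, true, true]
```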
str_enc_copy
🤔 • Don't get encoding dynamically • just pass the Encoding of the original string
Make rb_enc_set_index_fastpath
Benchmark for String#split • https://github.com/ruby/ruby/pull/6351
[Chart: String#split (UTF-8) and String#split (US-ASCII), comparing ruby, ruby-dev, and built-ruby]
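A comparable measurement could be set up along these lines. This is a sketch under assumptions, not the benchmark from the pull request: the input lines, the field count of 15, and the iteration count are all made up, and the timings vary by Ruby build and machine.

```ruby
require "benchmark"

# Hypothetical inputs: 15 comma-separated fields each, one per encoding.
ascii_line = (["0600000"] * 15).join(",").force_encoding("US-ASCII")
utf8_line  = (["北海道"] * 15).join(",")

iterations = 50_000

results = Benchmark.bm(10) do |x|
  x.report("US-ASCII") { iterations.times { ascii_line.split(",") } }
  x.report("UTF-8")    { iterations.times { utf8_line.split(",") } }
end
```

Running the same script under two Ruby builds gives the kind of before/after comparison the chart shows.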
Conclusion • Encoding checks on String are a bit heavy • must_encindex • mustnot_broken • Skipping or omitting unnecessary checks makes things faster • https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088