Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
String meets Encoding
Search
ima1zumi
September 11, 2022
Programming
2
3k
String meets Encoding
https://rubykaigi.org/2022/presentations/ima1zumi.html#day3
ima1zumi
September 11, 2022
Tweet
Share
More Decks by ima1zumi
See All by ima1zumi
Ruby Taught Me About Under the Hood
ima1zumi
6
17k
Exploring Reline: Enhancing Command Line Usability
ima1zumi
1
100
10年物のRailsアプリにキャッチアップ!〜コードを読まずに理解したかった〜
ima1zumi
0
110
RubyKaigiの登壇者一覧ページを作った
ima1zumi
0
460
Relineのその後の生活
ima1zumi
0
250
IRB and Reline Kaigi 2024
ima1zumi
0
17
Exploring Reline: Enhancing Command Line Usability
ima1zumi
3
15k
Reline 1分 Cooking
ima1zumi
0
40
続・mruby/cにUTF-8 を実装する
ima1zumi
1
35
Other Decks in Programming
See All in Programming
登壇は dynamic! な営みである / speech is dynamic
da1chi
0
310
kiroとCodexで最高のSpec駆動開発を!!数時間で web3ネイティブなミニゲームを作ってみたよ!
mashharuki
0
150
Cursorハンズオン実践!
eltociear
2
1k
技術的負債の正体を知って向き合う / Facing Technical Debt
irof
0
160
デミカツ切り抜きで面倒くさいことはPythonにやらせよう
aokswork3
0
230
Goで実践するドメイン駆動開発 AIと歩み始めた新規プロダクト開発の現在地
imkaoru
4
820
CSC305 Lecture 05
javiergs
PRO
0
210
Pull-Requestの内容を1クリックで動作確認可能にするワークフロー
natmark
2
500
Six and a half ridiculous things to do with Quarkus
hollycummins
0
170
あなたとKaigi on Rails / Kaigi on Rails + You
shimoju
0
130
CI_CD「健康診断」のススメ。現場でのボトルネック特定から、健康診断を通じた組織的な改善手法
teamlab
PRO
0
210
[Kaigi on Rais 2025] 全問正解率3%: RubyKaigiで出題したやりがちな危険コード5選
power3812
0
110
Featured
See All Featured
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
140
34k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
27k
Done Done
chrislema
185
16k
Facilitating Awesome Meetings
lara
56
6.6k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
19
1.2k
Scaling GitHub
holman
463
140k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
132
19k
Thoughts on Productivity
jonyablonski
70
4.9k
Code Review Best Practice
trishagee
72
19k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
37
2.6k
GraphQLの誤解/rethinking-graphql
sonatard
73
11k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
285
14k
Transcript
String meets Encoding RubyKaigi 2022 2022-09-10 Mari Imaizumi
Agenda • Motivation • CSV.read • stackprof • String#split •
perf • faster String#split • ruby/ruby #6351 2
Evaluation environments • MacBook Pro 2020 • macOS 12.4 •
2 GHz Quad-Core Intel Core i5 • 32 GB 3733 MHz LPDDR4X • Vagrant • Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64) • ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21] 3
Introduction @ima1zumi (Mari Imaizumi) ESM, inc. Hamada.rb, Fukuoka.rb ❤ Character,
Character Encoding 4
https://hamadarb.connpass.com/event/260134/ 5
6
Dive into Encoding - RubyKaigi Takeout 2021 7
Motivation • > If you want to do something around
encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (DeepL translate) • https://twitter.com/ktou/status/ 1436656477826019329 8
🙆 String#encode 9
🙆 String#encode 🤔 CSV.read (String#split) 10
CSV.read("KEN_ALL.CSV") 11
KEN_ALL.CSV • Zip code data in Japan • https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html •
16 MB • 15 lines • 124,541 rows • Encoding: CP932 (Windows-31J) 12
KEN_ALL.CSV 01101,060 ,0600000,ŴŕŜŘŪƄř,šŕŴƅƁŢŧœřśřŞ,ŘŜŬşŘšŘŜƄūŘŰƄŗŘ,ւಓ,ࡳຈࢢதԝ۠,ҎԼʹܝࡌ͕ ͳ͍߹,0,0,0,0,0,0 01101,064 ,0640941,ŴŕŜŘŪƄř,šŕŴƅƁŢŧœřśřŞ,ŗšűŜƄśŜ,ւಓ,ࡳຈࢢதԝ۠,Ѵέٰ,0,0,1,0,0,0 řŸ),ւಓ,ࡳຈࢢதԝ۠,Ұʢ̎̌ʙ̎̔ஸʣ,1,0,1,0,0,0 ...(about 120000 lines)...
47382,90718,9071800,śŝūƂşƃ,źŚźŵŞƄƃżūŞƄŬŧŔř,ŘŜŬşŘšŘŜƄūŘŰƄŗŘ,ԭೄݝ,ീॏࢁ܊༩ಹࠃொ,ҎԼʹ ܝࡌ͕ͳ͍߹,0,0,0,0,0,0 47382,90718,9071801,śŝūƂşƃ,źŚźŵŞƄƃżūŞƄŬŧŔř,żūŞƄŬ,ԭೄݝ,ീॏࢁ܊༩ಹࠃொ,༩ಹ ࠃ,0,0,0,0,0,0 13
Benchmark for CSV.read 14
Benchmark for CSV.read 15
stackprof 🔍 16
Stackprof • A sampling call-stack pro fi ler for Ruby
• https://github.com/tmm1/stackprof • sampling mode • :wall, :cpu, :object, :custom • fl amegraph 17
Stackprof 18
19
stackprof --d3- fl amegraph stackprof-cpu-cp932- csv.dump > stackprof.html 20
Stackprof 21
grep split 22
grep split 23
Summary • Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds.
• CSV.read uses 29% for String#split 24
Measure String#split with perf 25
String#split • split(pattern = nil, limit = 0) • pattern:
Regexp, String, nil • limit: number of splits • return: Array or self 26
Try perf 27 • performance analyzing tool in Linux
28
perf record String#split 29
30
fl amegraph 31 →alphabetical order
fl amegraph 32 rb_ary_push rb_enc_cr_str_copy_for_substr →alphabetical order str_new0
Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88% • rb_enc_right_char_head 3.68% 33
String#split • 1. Check arguments • 2. Check patterns •
3. loop • 1. Search substr • 2. create substr • 3. result << substr • 4. return result 34
rb_str_split_m summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr
13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68% 35
Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88% • rb_enc_right_char_head 3.68% 36
rb_str_subseq 37
rb_str_subseq 38 create substring from str
rb_str_subseq 39 set encoding and coderange to str2 create substring
from str
rb_ enc_ cr_ str_ copy_ for_ substr 40
rb_ enc_ cr_ str_ copy_ for_ substr 41 set encoding
rb_ enc_ cr_ str_ copy_ for_ substr 42 set encoding
set coderange
str_enc_copy 43
44
45
🤔 • Don't get encoding dynamically • just pass the
Encoding of the original string 46
Make rb_enc_set_index_fastpath 47
Benchmark for String#split 48
Benchmark for String#split https://github.com/ruby/ruby/pull/6351 49 SVCZ SVCZEFW CVJMUSVCZ 4USJOHTQMJU
65' 4USJOHTQMJU 64"4$** SVCZY SVCZEFWY SVCZY SVCZEFWY
Conclusion • String, Encoding check is a bit heavy •
must_encindex • mustnot_broken • Not checking or omitting unnecessary checks leads to faster speeds • https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088 50