String meets Encoding
ima1zumi
September 11, 2022
https://rubykaigi.org/2022/presentations/ima1zumi.html#day3
Transcript
String meets Encoding RubyKaigi 2022 2022-09-10 Mari Imaizumi
Agenda
• Motivation
• CSV.read
• stackprof
• String#split
• perf
• faster String#split
• ruby/ruby #6351
Evaluation environments
• MacBook Pro 2020
• macOS 12.4
• 2 GHz Quad-Core Intel Core i5
• 32 GB 3733 MHz LPDDR4X
• Vagrant
• Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
• ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]
Introduction
@ima1zumi (Mari Imaizumi)
ESM, inc.
Hamada.rb, Fukuoka.rb
❤ Character, Character Encoding
https://hamadarb.connpass.com/event/260134/
Dive into Encoding - RubyKaigi Takeout 2021
Motivation
• > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (translated with DeepL)
• https://twitter.com/ktou/status/1436656477826019329
🙆 String#encode
🤔 CSV.read (String#split)
CSV.read("KEN_ALL.CSV")
KEN_ALL.CSV
• Zip code data in Japan
• https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html
• 16 MB
• 15 columns
• 124,541 rows
• Encoding: CP932 (Windows-31J)
KEN_ALL.CSV
01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
ｳﾒ),北海道,札幌市中央区,北一条西（２０〜２８丁目）,1,0,1,0,0,0
...(about 120000 lines)...
47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0
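A file in this encoding can be loaded by giving Ruby's csv library a transcoding spec. The following is a minimal sketch, not the code used in the talk: it writes a tiny stand-in file (the name `sample_cp932.csv` and its two rows are made up for illustration) and reads it back as UTF-8.

```ruby
require "csv"
require "tmpdir"

# Illustrative stand-in for KEN_ALL.CSV; the rows are not the real data set.
SAMPLE_ROWS = [
  ["01101", "0600000", "北海道", "札幌市中央区"],
  ["47382", "9071801", "沖縄県", "八重山郡与那国町"],
].freeze

# Write the sample rows to disk as Windows-31J (CP932) bytes.
def write_sample(path)
  text = SAMPLE_ROWS.map { |row| row.join(",") }.join("\n") + "\n"
  File.binwrite(path, text.encode("Windows-31J"))
end

# "Windows-31J:UTF-8" means: the file is Windows-31J, transcode to UTF-8 while reading.
def read_cp932_csv(path)
  CSV.read(path, encoding: "Windows-31J:UTF-8")
end

rows = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample_cp932.csv") # hypothetical file name
  write_sample(path)
  read_cp932_csv(path)
end

puts rows.length            # number of parsed rows
puts rows.first[2].encoding # encoding of a parsed field
```

Transcoding on read means every field goes through the CP932 to UTF-8 conversion mentioned in the motivation, in addition to the parsing work itself.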
Benchmark for CSV.read
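The benchmark slides are images, so the script behind them is not in the transcript. As a rough sketch of how such a measurement could be taken with the standard `benchmark` library, assuming a synthetic CSV in place of KEN_ALL.CSV (`build_csv` and `time_csv_read` are hypothetical helper names):

```ruby
require "benchmark"
require "csv"
require "tempfile"

# Build a synthetic CSV so the sketch is self-contained (not the real KEN_ALL.CSV).
def build_csv(rows: 20_000, cols: 15)
  line = (["value"] * cols).join(",")
  Array.new(rows, line).join("\n") + "\n"
end

# Best-of-N wall-clock time for CSV.read, in seconds.
def time_csv_read(path, runs: 3)
  Array.new(runs) { Benchmark.realtime { CSV.read(path) } }.min
end

Tempfile.create(["bench", ".csv"]) do |file|
  file.write(build_csv)
  file.flush
  puts format("CSV.read: %.3f s", time_csv_read(file.path))
end
```

Taking the minimum of several runs reduces noise from the first, cold run; absolute numbers will differ from the figures in the talk.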
stackprof 🔍
Stackprof
• A sampling call-stack profiler for Ruby
• https://github.com/tmm1/stackprof
• sampling mode
• :wall, :cpu, :object, :custom
• flamegraph
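A profile in `:cpu` mode can be collected by wrapping the code under test in `StackProf.run`. This is a minimal sketch under a few assumptions: the workload is a synthetic `CSV.parse` call rather than the talk's `CSV.read("KEN_ALL.CSV")`, the dump file name is made up, and the script falls back to running unprofiled when the stackprof gem is not installed.

```ruby
require "csv"
require "tmpdir"

# stackprof is a third-party gem (gem install stackprof); detect whether it is available.
def stackprof_loaded?
  require "stackprof"
  true
rescue LoadError
  false
end

# Synthetic workload standing in for the CSV.read call being profiled.
def workload
  data = Array.new(5_000) { |i| "#{i},foo,bar,baz" }.join("\n")
  CSV.parse(data).length
end

def profile_workload(out_path)
  return workload unless stackprof_loaded?

  result = nil
  # mode: :cpu samples on CPU time; raw: true keeps the raw samples used for flamegraphs.
  StackProf.run(mode: :cpu, raw: true, out: out_path) { result = workload }
  result
end

dump_path = File.join(Dir.tmpdir, "stackprof-cpu-sample.dump") # hypothetical file name
puts profile_workload(dump_path)
```

The resulting dump file is what the `stackprof` command-line tool reads when it renders reports and flamegraphs.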
stackprof --d3-flamegraph stackprof-cpu-cp932-csv.dump > stackprof.html
grep split
Summary
• Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds.
• String#split accounts for 29% of the time spent in CSV.read.
Measure String#split with perf
String#split
• split(pattern = nil, limit = 0)
• pattern: Regexp, String, nil
• limit: maximum number of substrings to return
• return: Array, or self when a block is given
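The signature above covers several behaviours, which a few calls make concrete. The sample strings are made up for illustration; the return values follow Ruby's documented String#split behaviour.

```ruby
csv_line = "01101,060,0600000"

p csv_line.split(",")      # String pattern => ["01101", "060", "0600000"]
p csv_line.split(",", 2)   # limit caps the number of substrings => ["01101", "060,0600000"]
p "a1b22c".split(/\d+/)    # Regexp pattern => ["a", "b", "c"]
p " a  b c ".split         # nil pattern splits on whitespace => ["a", "b", "c"]
p "a,b,,".split(",")       # limit 0 drops trailing empty strings => ["a", "b"]
p "a,b,,".split(",", -1)   # negative limit keeps them => ["a", "b", "", ""]

# With a block, each substring is yielded and the receiver itself is returned.
fields = []
returned = csv_line.split(",") { |field| fields << field }
p returned.equal?(csv_line) # => true
p fields                    # => ["01101", "060", "0600000"]
```

For CSV parsing, the String-pattern form is the relevant one, which is why the rest of the talk looks at that code path.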
Try perf
• performance analysis tool for Linux
perf record String#split
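The exact script recorded in the talk is not shown in the transcript. A minimal workload that isolates String#split might look like the sketch below; the file name `split_workload.rb`, the line contents, and the iteration count are all assumptions. On Linux it could be run under perf with the usual subcommands, for example `perf record -g ruby split_workload.rb` followed by `perf report`.

```ruby
# split_workload.rb (hypothetical file name)
# A 15-field comma-separated line, similar in shape to a KEN_ALL.CSV row.
LINE = (["field"] * 15).join(",").freeze

# Call String#split many times so that it dominates the recorded samples.
def run_split_workload(iterations)
  total = 0
  iterations.times { total += LINE.split(",").length }
  total
end

puts run_split_workload(200_000) # total number of fields produced
```

Keeping everything except the split call trivial makes the C functions underneath String#split stand out in the report.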
flamegraph (frames shown in alphabetical order)
• rb_ary_push
• rb_enc_cr_str_copy_for_substr
• str_new0
Summary
• str_new0 40.13%
• rb_ary_push 19.25%
• rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88%
• rb_enc_right_char_head 3.68%
String#split
1. Check arguments
2. Check patterns
3. loop
   1. Search substr
   2. create substr
   3. result << substr
4. return result
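The steps above can be sketched in plain Ruby. This is an illustration of the loop's shape only, not the C implementation in `rb_str_split_m`: the hypothetical `naive_split` handles just a non-empty String separator with the default limit of 0, and it does not reproduce the special whitespace handling that the real method applies to a single-space separator.

```ruby
# Simplified Ruby model of the String-pattern path of String#split.
def naive_split(str, sep)
  # 1-2. Check arguments and pattern (only a non-empty String is supported here).
  unless sep.is_a?(String) && !sep.empty?
    raise ArgumentError, "separator must be a non-empty String"
  end

  result = []
  start = 0
  # 3. Loop: search for the separator, create the substring, append it.
  while (pos = str.index(sep, start))
    result << str[start, pos - start]
    start = pos + sep.length
  end
  result << str[start..]

  # With limit 0, trailing empty strings are dropped.
  result.pop while result.last&.empty?

  # 4. Return the result.
  result
end

p naive_split("01101,060,0600000", ",")
p naive_split("a,b,,", ",")
```

Each pass through the loop allocates one new String and pushes it onto the Array, which matches where the perf samples concentrate: substring creation and `rb_ary_push`.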
rb_str_split_m summary
• str_new0 40.13%
• rb_ary_push 19.25%
• rb_enc_cr_str_copy_for_substr 13%
• rb_mem_search 4.88%
• rb_enc_right_char_head 3.68%
rb_str_subseq
• create substring from str
• set encoding and coderange to str2
rb_enc_cr_str_copy_for_substr
• set encoding
• set coderange
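The effect of copying encoding and coderange onto each substring is visible from Ruby. The sketch below uses a made-up Windows-31J string; `ascii_only?` and `valid_encoding?` are the Ruby-level predicates that reflect the internally cached coderange (7-bit, valid, or broken).

```ruby
# A Windows-31J (CP932) source string with one ASCII field and one non-ASCII field.
SOURCE = "abc,あいう".encode("Windows-31J")
PARTS = SOURCE.split(",")

PARTS.each do |part|
  # Every substring carries the encoding of the original string.
  puts "#{part.encoding} ascii_only=#{part.ascii_only?} valid=#{part.valid_encoding?}"
end
```

Both substrings report Windows-31J, while only the first is ASCII-only, so each substring needs its own coderange even though the encoding is always the same as the source's.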
str_enc_copy
🤔
• Don't get encoding dynamically
• just pass the Encoding of the original string
Make rb_enc_set_index_fastpath
Benchmark for String#split
https://github.com/ruby/ruby/pull/6351
[Chart: String#split on UTF-8 and US-ASCII strings, comparing ruby, ruby dev, and built-ruby; the numeric values are not recoverable from the transcript]
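A comparison along these lines can be approximated with the standard `benchmark` library. This is a sketch under assumptions, not the benchmark from the pull request: the line contents, field count, and iteration count are invented, and `time_split` is a hypothetical helper.

```ruby
require "benchmark"

FIELDS = 15
ITERATIONS = 100_000

# Same shape of input in two encodings: non-ASCII UTF-8 and US-ASCII.
UTF8_LINE  = (["あいう"] * FIELDS).join(",")
ASCII_LINE = (["abc"] * FIELDS).join(",").force_encoding("US-ASCII")

# Wall-clock seconds spent splitting the line the given number of times.
def time_split(line, iterations)
  Benchmark.realtime { iterations.times { line.split(",") } }
end

{ "UTF-8" => UTF8_LINE, "US-ASCII" => ASCII_LINE }.each do |label, line|
  puts format("String#split %-8s %.3f s", label, time_split(line, ITERATIONS))
end
```

Running the same script with two Ruby builds, one before and one after the change, gives a before/after comparison for each encoding.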
Conclusion
• Encoding checks on String are somewhat expensive
• must_encindex
• mustnot_broken
• Skipping unnecessary checks makes String operations faster
• https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088