String meets Encoding
ima1zumi
September 11, 2022
https://rubykaigi.org/2022/presentations/ima1zumi.html#day3
Transcript
String meets Encoding RubyKaigi 2022 2022-09-10 Mari Imaizumi
Agenda • Motivation • CSV.read • stackprof • String#split • perf • faster String#split • ruby/ruby #6351 2
Evaluation environments • MacBook Pro 2020 • macOS 12.4 • 2 GHz Quad-Core Intel Core i5 • 32 GB 3733 MHz LPDDR4X • Vagrant • Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64) • ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21] 3
Introduction • @ima1zumi (Mari Imaizumi) • ESM, inc. • Hamada.rb, Fukuoka.rb • ❤ Character, Character Encoding 4
https://hamadarb.connpass.com/event/260134/ 5
Dive into Encoding - RubyKaigi Takeout 2021 7
Motivation • > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (DeepL translate) • https://twitter.com/ktou/status/1436656477826019329 8
🙆 String#encode 9
🙆 String#encode 🤔 CSV.read (String#split) 10
CSV.read("KEN_ALL.CSV") 11
KEN_ALL.CSV • Zip code data in Japan • https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html • 16 MB • 15 columns • 124,541 rows • Encoding: CP932 (Windows-31J) 12
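A minimal sketch of reading CP932 data with CSV.read, assuming the file is transcoded to UTF-8 while reading. The slides do not show which encoding option the talk used, so the `encoding: "Windows-31J:UTF-8"` argument is an assumption, and the two-row sample is a hypothetical stand-in for the real KEN_ALL.CSV.

```ruby
require "csv"
require "tempfile"

# Hypothetical two-row sample in the KEN_ALL.CSV style; not the real file.
SAMPLE_UTF8 = <<~CSV
  0600000,北海道,札幌市中央区
  9071801,沖縄県,八重山郡与那国町
CSV

# Simulate the on-disk format: the real file is CP932 (Windows-31J) bytes.
sample_cp932 = SAMPLE_UTF8.encode("Windows-31J")

rows = Tempfile.create(["ken_all_sample", ".csv"]) do |file|
  File.binwrite(file.path, sample_cp932)
  # "Windows-31J:UTF-8" = external encoding CP932, transcoded to UTF-8 on read.
  CSV.read(file.path, encoding: "Windows-31J:UTF-8")
end

p rows.first              # fields of the first row, now UTF-8 strings
p rows.first[1].encoding  # UTF-8
```

With this option the cost of String#encode is paid inside CSV.read, which is why both the encoding conversion and the parsing show up in the same profile.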
KEN_ALL.CSV
01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
...ｳﾒ),北海道,札幌市中央区,...（２０～２８丁目）,1,0,1,0,0,0
...(about 120000 lines)...
47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0 13
Benchmark for CSV.read 14
Benchmark for CSV.read 15
stackprof 🔍 16
Stackprof • A sampling call-stack profiler for Ruby • https://github.com/tmm1/stackprof • sampling mode • :wall, :cpu, :object, :custom • flamegraph 17
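A rough sketch of collecting a CPU-mode profile with the stackprof gem. The `workload` method is a hypothetical stand-in for the CSV.read call profiled in the talk, `profile_workload` is an illustrative helper, and the require is guarded so the script still runs when the gem is not installed.

```ruby
require "tmpdir"

# stackprof is a third-party gem; fall back to an unprofiled run without it.
begin
  require "stackprof"
rescue LoadError
  warn "stackprof gem not installed; running the workload unprofiled"
end

# Hypothetical stand-in workload: split a 15-field line many times.
def workload
  line = "01101,060,0600000,a,b,c,d,e,f,0,0,0,0,0,0"
  20_000.times.sum { line.split(",").size }
end

# Runs the workload, writing a dump to out_path when stackprof is available.
def profile_workload(out_path)
  result = nil
  if defined?(StackProf)
    # mode is one of :wall, :cpu, :object, :custom.
    # raw: true keeps the per-sample stacks that flamegraph output needs.
    StackProf.run(mode: :cpu, raw: true, out: out_path) { result = workload }
  else
    result = workload
  end
  result
end

Dir.mktmpdir do |dir|
  puts profile_workload(File.join(dir, "stackprof-cpu-sample.dump"))
end
```

The resulting dump can then be passed to the `stackprof` command-line tool to produce reports or a flamegraph.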
Stackprof 18
stackprof --d3-flamegraph stackprof-cpu-cp932-csv.dump > stackprof.html 20
Stackprof 21
grep split 22
grep split 23
Summary • Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds. • String#split accounts for 29% of the time spent in CSV.read 24
Measure String#split with perf 25
String#split • split(pattern = nil, limit = 0) • pattern: Regexp, String, nil • limit: maximum number of substrings to return • return: Array (or self when a block is given) 26
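To make the signature above concrete, here is a short sketch of each pattern type, the limit argument, and the block form. The sample line is made up for illustration.

```ruby
line = "01101,060,0600000"

by_string = line.split(",")      # String pattern
by_regexp = line.split(/,/)      # Regexp pattern
by_nil    = " a  b c ".split     # nil pattern: split on runs of whitespace
limited   = line.split(",", 2)   # limit 2: at most two substrings

# With a block, each substring is yielded and the receiver itself is returned.
yielded  = []
returned = line.split(",") { |field| yielded << field }

p by_string               # ["01101", "060", "0600000"]
p by_nil                  # ["a", "b", "c"]
p limited                 # ["01101", "060,0600000"]
p returned.equal?(line)   # true
```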
Try perf • performance analysis tool for Linux 27
perf record String#split 29
flamegraph →alphabetical order 31
flamegraph • rb_ary_push • rb_enc_cr_str_copy_for_substr • str_new0 →alphabetical order 32
Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68% 33
String#split • 1. Check arguments • 2. Check patterns • 3. Loop: (1) search substr, (2) create substr, (3) result << substr • 4. Return result 34
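The steps above can be sketched in plain Ruby. This is a simplified model, not the C implementation: `split_sketch` is a hypothetical name, it handles only a non-empty String pattern with limit 0, it works on character indices rather than bytes, and it does not model the special single-space (awk-style) pattern.

```ruby
# Simplified model of the String#split loop for a String pattern, limit = 0.
def split_sketch(str, sep)
  # Steps 1-2: check arguments and pattern (only one case is supported here).
  unless sep.is_a?(String) && !sep.empty?
    raise ArgumentError, "sep must be a non-empty String"
  end

  result = []
  start = 0
  # Step 3: loop until the separator is no longer found.
  while (found = str.index(sep, start))   # 3.1 search substr
    substr = str[start...found]           # 3.2 create substr (one new String each time)
    result << substr                      # 3.3 result << substr
    start = found + sep.length
  end
  result << str[start..]                  # the remainder after the last separator

  # With limit 0, String#split drops trailing empty strings.
  result.pop while result.last&.empty?
  result                                  # Step 4: return result
end

p split_sketch("a,b,,c,,", ",")   # ["a", "b", "", "c"]
```

Every pass through the loop allocates a substring and pushes it onto the array, which matches where the perf samples concentrate: string creation and array push.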
rb_str_split_m summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68% 35
rb_str_subseq 37
rb_str_subseq • create substring from str 38
rb_str_subseq • create substring from str • set encoding and coderange to str2 39
rb_enc_cr_str_copy_for_substr 40
rb_enc_cr_str_copy_for_substr • set encoding 41
rb_enc_cr_str_copy_for_substr • set encoding • set coderange 42
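The effect of copying encoding and coderange onto each substring can be observed from Ruby. The coderange itself is internal to the interpreter, so this sketch only checks the Ruby-level predicates that depend on it (`ascii_only?`, `valid_encoding?`); the sample line is made up for illustration.

```ruby
utf8_line  = "0600000,北海道,札幌市中央区"
cp932_line = utf8_line.encode("Windows-31J")

utf8_fields  = utf8_line.split(",")
cp932_fields = cp932_line.split(",")

# Each substring carries the encoding of the string it was cut from.
p utf8_fields.map(&:encoding).uniq    # only UTF-8
p cp932_fields.map(&:encoding).uniq   # only Windows-31J

# Predicates backed by the coderange still answer correctly per substring.
p cp932_fields.map(&:ascii_only?)     # [true, false, false]
p cp932_fields.all?(&:valid_encoding?)
```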
str_enc_copy 43
🤔 • Don't look up the encoding dynamically for each substring • just pass the Encoding of the original string 46
Make rb_enc_set_index_fastpath 47
Benchmark for String#split 48
Benchmark for String#split https://github.com/ruby/ruby/pull/6351 [Chart: String#split on UTF-8 and US-ASCII strings, comparing ruby, ruby-dev and built-ruby; the numeric values did not survive extraction] 49
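A rough way to time String#split on UTF-8 and US-ASCII input is sketched below. It is not the benchmark script from the talk or the pull request: the `measure` helper, the sample lines, and the iteration count are all assumptions, and absolute numbers will vary by machine and Ruby build.

```ruby
# Times repeated str.split(",") calls using a monotonic clock.
def measure(label, str, iterations: 2_000)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { str.split(",") }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts format("%-10s %.4fs", label, elapsed)
  elapsed
end

# Hypothetical 15-field lines, one per encoding case.
ascii_line = (["0600000"] * 15).join(",").encode("US-ASCII")
utf8_line  = (["北海道"] * 15).join(",")

ascii_time = measure("US-ASCII", ascii_line)
utf8_time  = measure("UTF-8", utf8_line)
```

Running the same script under two Ruby builds (for example a release and a development build) gives a comparison in the spirit of the chart.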
Conclusion • Encoding checks on String are a bit heavy • must_encindex • mustnot_broken • Skipping unnecessary checks makes String operations faster • https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088 50