String meets Encoding

ima1zumi
September 11, 2022

Transcript

  1. Agenda
    • Motivation
    • CSV.read
    • stackprof
    • String#split
    • perf
    • faster String#split
    • ruby/ruby #6351
  2. Evaluation environments
    • MacBook Pro 2020
    • macOS 12.4
    • 2 GHz Quad-Core Intel Core i5
    • 32 GB 3733 MHz LPDDR4X
    • Vagrant
    • Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64)
    • ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]
  4. Motivation
    • > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (DeepL translate)
    • https://twitter.com/ktou/status/1436656477826019329
  5. KEN_ALL.CSV
    • Zip code data in Japan
    • https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html
    • 16 MB
    • 15 columns
    • 124,541 rows
    • Encoding: CP932 (Windows-31J)
  6. KEN_ALL.CSV
    01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
    01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
    ｳﾒ),北海道,札幌市中央区,北一条西（２０～２８丁目）,1,0,1,0,0,0
    ...(about 120000 lines)...
    47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
    47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0
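
Reading CP932 data as UTF-8 can be sketched as follows. This is a minimal illustration, not the code from the talk: it writes a small hypothetical two-row sample (three columns instead of the real fifteen) to a temporary file instead of using the real KEN_ALL.CSV, and the helper names `read_cp932_csv` and `with_sample_file` are made up for this example.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample standing in for KEN_ALL.CSV (not the real file).
SAMPLE_UTF8 = <<~ROWS
  01101,北海道,札幌市中央区
  47382,沖縄県,八重山郡与那国町
ROWS

def read_cp932_csv(path)
  # "CP932:UTF-8": the file is CP932 on disk and is transcoded to UTF-8 on read.
  CSV.read(path, encoding: "CP932:UTF-8")
end

def with_sample_file
  Tempfile.create(["ken_all_sample", ".csv"]) do |file|
    file.binmode
    file.write(SAMPLE_UTF8.encode("CP932")) # String#encode converts UTF-8 to CP932
    file.flush
    yield file.path
  end
end

rows = with_sample_file { |path| read_cp932_csv(path) }
puts rows.first.inspect
```

Every field returned by `CSV.read` here is a UTF-8 String, so both the transcoding cost and the parsing cost are paid in this single call.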
  7. Stackprof
    • A sampling call-stack profiler for Ruby
    • https://github.com/tmm1/stackprof
    • sampling modes: :wall, :cpu, :object, :custom
    • flamegraph
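
A minimal sketch of profiling a split-heavy workload with stackprof. The workload, the iteration count, and the helper names `split_workload` and `profile_split` are assumptions for illustration; stackprof is a third-party gem, so the require is guarded.

```ruby
# stackprof is a third-party gem (gem install stackprof); guard the require
# so this sketch still loads where the gem is absent.
begin
  require "stackprof"
  STACKPROF_AVAILABLE = true
rescue LoadError
  STACKPROF_AVAILABLE = false
end

# Hypothetical workload: split one CSV-like line many times.
def split_workload(iterations)
  line = "01101,060,0600000,a,b,c,d,e,f,0,0,0,0,0,0"
  iterations.times { line.split(",") }
end

def profile_split(iterations)
  return nil unless STACKPROF_AVAILABLE
  # Without an `out:` path, StackProf.run returns the profile as a Hash.
  StackProf.run(mode: :wall, interval: 1000) { split_workload(iterations) }
end

profile = profile_split(200_000)
if profile
  puts "mode=#{profile[:mode]} samples=#{profile[:samples]}"
else
  puts "stackprof is not installed"
end
```

Passing `out: "some/path.dump"` instead writes the profile to a file that the `stackprof` command-line tool can read.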
  9. String#split
    • split(pattern = nil, limit = 0)
    • pattern: Regexp, String, nil
    • limit: maximum number of substrings returned
    • return: Array, or self when a block is given
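
The signature above can be exercised as follows. The input string is an arbitrary example chosen for this sketch.

```ruby
line = "a,b,,c,,"

by_string = line.split(",")      # String pattern; trailing empty fields are dropped
by_regexp = line.split(/,+/)     # Regexp pattern
by_nil    = " a  b c ".split     # nil pattern: split on whitespace
limited   = line.split(",", 2)   # positive limit: at most 2 substrings
keep_all  = line.split(",", -1)  # negative limit: keep trailing empty fields

# Block form: each substring is yielded and the receiver itself is returned.
collected = []
returned = line.split(",") { |field| collected << field }

p by_string, by_regexp, by_nil, limited, keep_all
```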
  12. Summary
    • str_new0 40.13%
    • rb_ary_push 19.25%
    • rb_enc_cr_str_copy_for_substr 13%
    • rb_memsearch 4.88%
    • rb_enc_right_char_head 3.68%
  13. String#split
    • 1. Check arguments
    • 2. Check patterns
    • 3. loop
        • 1. Search substr
        • 2. create substr
        • 3. result << substr
    • 4. return result
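
The steps above can be modeled in pure Ruby. This is a simplified sketch of the String-pattern path only (no limit, no Regexp, no whitespace mode), not the C implementation; `naive_split` is a hypothetical name.

```ruby
# Simplified model of String#split with a non-empty String separator.
def naive_split(str, sep)
  # 1-2. Check arguments and pattern.
  unless sep.is_a?(String) && !sep.empty?
    raise ArgumentError, "separator must be a non-empty String"
  end

  result = []
  start = 0
  # 3. Loop until the separator is no longer found.
  while (found = str.index(sep, start))  # 3.1 search substr
    result << str[start...found]         # 3.2 create substr, 3.3 result << substr
    start = found + sep.length
  end
  result << str[start..]                 # the remainder after the last separator

  # With the default limit, split drops trailing empty strings.
  result.pop while result.last&.empty?
  result                                 # 4. return result
end

p naive_split("a,b,,c,,", ",")
```

Each pass through the loop allocates a new String and appends it to the Array, which corresponds to the substring-creation and push steps in the list.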
  14. rb_str_split_m summary
    • str_new0 40.13%
    • rb_ary_push 19.25%
    • rb_enc_cr_str_copy_for_substr 13%
    • rb_memsearch 4.88%
    • rb_enc_right_char_head 3.68%
  18. Benchmark for String#split
    • https://github.com/ruby/ruby/pull/6351
    • [Chart: String#split on UTF-8 and US-ASCII strings, ruby vs. ruby dev (built-ruby), bars labeled with relative speed (x)]
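
A rough way to compare String#split on UTF-8 and US-ASCII input is sketched below. It is not the benchmark from the pull request: the input line, the iteration count, and the `measure` helper are assumptions, and the timings depend on the machine and the Ruby build.

```ruby
# Hypothetical input: the same ASCII-only bytes tagged with two encodings.
utf8_line  = (["field"] * 15).join(",").force_encoding(Encoding::UTF_8)
ascii_line = utf8_line.dup.force_encoding(Encoding::US_ASCII)

ITERATIONS = 100_000

# Wall-clock time in seconds for the given block, using a monotonic clock.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

utf8_time  = measure { ITERATIONS.times { utf8_line.split(",") } }
ascii_time = measure { ITERATIONS.times { ascii_line.split(",") } }

puts format("String#split UTF-8:    %.3fs", utf8_time)
puts format("String#split US-ASCII: %.3fs", ascii_time)
```

Running the same script under two Ruby builds gives a before and after comparison of the kind the chart presents.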
  19. Conclusion
    • Encoding checks on String are a bit heavy
        • must_encindex
        • mustnot_broken
    • Skipping or omitting unnecessary checks makes String#split faster
    • https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088
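
The kind of check involved can be illustrated at the Ruby level. `must_encindex` and `mustnot_broken` are C functions inside Ruby and are not callable from Ruby code; the sketch below uses the public methods `String#valid_encoding?` and `Encoding.compatible?` as rough analogues, which is an assumption made for illustration, and `splittable?` is a hypothetical helper.

```ruby
# Ruby-level analogues (assumed for illustration) of checks done in C:
#   String#valid_encoding?  roughly "is this string broken?"
#   Encoding.compatible?    roughly "can these two strings be processed together?"
def splittable?(str, sep)
  str.valid_encoding? && sep.valid_encoding? && !Encoding.compatible?(str, sep).nil?
end

ok     = splittable?("a,b", ",")                         # valid and compatible
broken = splittable?("a,\xFFb", ",")                     # invalid UTF-8 byte sequence
mixed  = splittable?("あ,い", "、".encode("UTF-16LE"))     # incompatible encodings

p ok, broken, mixed
```

Each of these checks inspects the string or its encoding before any splitting happens, which is the overhead the conclusion refers to.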