String meets Encoding

ima1zumi
September 11, 2022

Transcript

  1. Agenda • Motivation • CSV.read • stackprof • String#split • perf • faster String#split • ruby/ruby #6351
  2. Evaluation environments • MacBook Pro 2020 • macOS 12.4 • 2 GHz Quad-Core Intel Core i5 • 32 GB 3733 MHz LPDDR4X • Vagrant • Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64) • ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]
  3. 6

  4. Motivation • > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (translated with DeepL) • https://twitter.com/ktou/status/1436656477826019329
  5. KEN_ALL.CSV • Zip code data in Japan • https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html • 16 MB • 15 columns • 124,541 rows • Encoding: CP932 (Windows-31J)
  6. KEN_ALL.CSV

    01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
    01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
    ｳﾒ),北海道,札幌市中央区,北一条西（２０～２８丁目）,1,0,1,0,0,0
    ...(about 120000 lines)...
    47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
    47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0
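A file of this shape can be read and transcoded to UTF-8 in one step with Ruby's csv library. The following is a minimal sketch, not necessarily the exact call measured in the talk: the temporary file stands in for KEN_ALL.CSV, and the names `SAMPLE_ROW`, `read_cp932_csv`, and `demo_rows` are hypothetical helpers introduced for illustration.

```ruby
require "csv"
require "tempfile"

# One row in the shape of KEN_ALL.CSV (15 columns); stand-in data only.
SAMPLE_ROW = "01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ," \
             "北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0\n"

# Read a CP932 (Windows-31J) CSV file and transcode every field to UTF-8.
# "CP932:UTF-8" means external encoding CP932, internal encoding UTF-8.
def read_cp932_csv(path)
  CSV.read(path, encoding: "CP932:UTF-8")
end

# Write the sample row as CP932 bytes to a temporary file, then read it back.
def demo_rows
  Tempfile.create(["ken_all_sample", ".csv"]) do |f|
    f.binmode
    f.write(SAMPLE_ROW.encode("Windows-31J"))
    f.flush
    read_cp932_csv(f.path)
  end
end

rows = demo_rows
puts rows.first.size          # number of columns in the row
puts rows.first[6].encoding   # encoding of a field after transcoding
```

With the real file, `read_cp932_csv("KEN_ALL.CSV")` would perform both the CP932 to UTF-8 conversion and the CSV parsing, which are the two costs the motivation slide compares.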
  7. Stackprof • A sampling call-stack profiler for Ruby • https://github.com/tmm1/stackprof • sampling mode • :wall, :cpu, :object, :custom • flamegraph
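As a rough illustration of how a profile like the ones in this talk might be collected, the sketch below wraps a workload in `StackProf.run`. The workload, the output path, and the helper names `workload` and `profile_workload` are assumptions made for the example; stackprof is a third-party gem, so the code falls back to running the workload unprofiled when it is not installed.

```ruby
require "tmpdir"

# stackprof is a third-party gem (gem install stackprof).
begin
  require "stackprof"
  STACKPROF_AVAILABLE = true
rescue LoadError
  STACKPROF_AVAILABLE = false
end

# Hypothetical workload: split a 15-field comma-separated line many times.
def workload
  line = (["field"] * 15).join(",")
  total = 0
  20_000.times { total += line.split(",").size }
  total
end

# Run the workload under the sampling profiler when it is available.
def profile_workload
  return workload unless STACKPROF_AVAILABLE

  result = nil
  dump = File.join(Dir.tmpdir, "stackprof-cpu.dump")
  # mode may be :wall, :cpu, :object, or :custom; raw: true keeps the
  # per-sample stacks that flamegraph output needs.
  StackProf.run(mode: :cpu, raw: true, out: dump) do
    result = workload
  end
  result
end

puts profile_workload
```

The resulting dump can be summarized with `stackprof --text <dump>` or turned into a flamegraph with `stackprof --d3-flamegraph <dump>`.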
  8. 19

  9. String#split • split(pattern = nil, limit = 0) • pattern: Regexp, String, nil • limit: maximum number of substrings returned (0 means no limit, with trailing empty strings dropped) • return: Array, or self when a block is given
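A few calls illustrating the signature above. The input strings are made up for the example.

```ruby
line = "a,b,,c,,"

p line.split(",")        # => ["a", "b", "", "c"]   (limit 0 drops trailing empties)
p line.split(",", 2)     # => ["a", "b,,c,,"]       (at most 2 substrings)
p line.split(",", -1)    # => ["a", "b", "", "c", "", ""]  (negative: keep trailing empties)
p "a  b c".split         # => ["a", "b", "c"]       (nil pattern splits on whitespace)
p "a1b22c".split(/\d+/)  # => ["a", "b", "c"]       (Regexp pattern)

# With a block, each substring is yielded and the receiver itself is returned.
fields = []
ret = line.split(",") { |s| fields << s }
p ret.equal?(line)       # => true
p fields                 # => ["a", "b", "", "c"]
```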
  10. 28

  11. 30

  12. Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%
  13. String#split • 1. Check arguments • 2. Check patterns • 3. Loop: (3.1) search substr, (3.2) create substr, (3.3) result << substr • 4. Return result
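The steps above can be modelled in plain Ruby. This is only a simplified sketch of the control flow for a non-empty String pattern with the default limit, not the C implementation in `rb_str_split_m`; `naive_split` is a hypothetical name.

```ruby
# Simplified model of String#split for a non-empty String pattern and
# limit 0. Hypothetical helper for illustration only.
def naive_split(str, pattern)
  # 1. Check arguments
  raise TypeError, "pattern must be a String" unless pattern.is_a?(String)
  # 2. Check patterns (only the plain-substring case is modelled here)
  raise ArgumentError, "empty pattern is not modelled" if pattern.empty?

  result = []
  start = 0
  # 3. Loop
  while (found = str.index(pattern, start))  # 3.1 search substr
    substr = str[start...found]              # 3.2 create substr
    result << substr                         # 3.3 result << substr
    start = found + pattern.size
  end
  result << str[start..]                     # remainder after the last match
  result.pop while result.last&.empty?       # limit 0 drops trailing empty strings
  # 4. Return result
  result
end

p naive_split("a,b,,c,,", ",")  # => ["a", "b", "", "c"]
```

Each iteration allocates a new String and pushes it onto the Array, which lines up with `str_new0` and `rb_ary_push` dominating the profile.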
  14. rb_str_split_m summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%
  15. Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%
  16. 44

  17. 45

  18. Benchmark for String#split https://github.com/ruby/ruby/pull/6351

    [Figure: bar chart of String#split on UTF-8 and US-ASCII strings for ruby, ruby dev, and built-ruby]
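A comparison of this kind could be reproduced roughly as follows. This is a hypothetical micro-benchmark, not the one used in the pull request: the input strings, the iteration count, and the helper name `time_split` are all assumptions, and absolute timings will vary by machine and Ruby build.

```ruby
# Time `iterations` calls of String#split(",") on the given string,
# returning elapsed seconds measured with a monotonic clock.
def time_split(str, iterations)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { str.split(",") }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

# Illustrative inputs: 15 fields each, one US-ASCII and one UTF-8.
ascii = (["abc"] * 15).join(",").force_encoding("US-ASCII")
utf8  = (["あいう"] * 15).join(",")

{ "US-ASCII" => ascii, "UTF-8" => utf8 }.each do |label, str|
  printf("%-8s %.3fs\n", label, time_split(str, 50_000))
end
```

Running the same script under two Ruby builds gives a before/after comparison for each encoding.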
  19. Conclusion • Encoding checks on String are a bit heavy • must_encindex • mustnot_broken • Skipping or omitting unnecessary checks makes the code faster • https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088