String meets Encoding

ima1zumi
September 11, 2022

Transcript

  1. String meets Encoding RubyKaigi 2022 2022-09-10 Mari Imaizumi

  2. Agenda • Motivation • CSV.read • stackprof • String#split • perf • faster String#split • ruby/ruby #6351

  3. Evaluation environments • MacBook Pro 2020 • macOS 12.4 • 2 GHz Quad-Core Intel Core i5 • 32 GB 3733 MHz LPDDR4X • Vagrant • Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-46-generic x86_64) • ruby 3.2.0dev (2022-09-05T15:39:37Z 63ed61e322) [x86_64-darwin21]

  4. Introduction @ima1zumi (Mari Imaizumi) ESM, inc. Hamada.rb, Fukuoka.rb ❤ Character, Character Encoding

  5. https://hamadarb.connpass.com/event/260134/

  7. Dive into Encoding - RubyKaigi Takeout 2021

  8. Motivation • > If you want to do something around encoding in Ruby, you need to speed up String#encode. Right now it takes as long to convert CP932 to UTF-8 as it does to parse KEN_ALL.CSV in pure Ruby. (DeepL translate) • https://twitter.com/ktou/status/1436656477826019329

  9. 🙆 String#encode

  10. 🙆 String#encode 🤔 CSV.read (String#split)

  11. CSV.read("KEN_ALL.CSV")

  12. KEN_ALL.CSV • Zip code data in Japan • https://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html • 16 MB • 15 columns • 124,541 rows • Encoding: CP932 (Windows-31J)
  13. KEN_ALL.CSV
    01101,060 ,0600000,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,北海道,札幌市中央区,以下に掲載がない場合,0,0,0,0,0,0
    01101,064 ,0640941,ﾎｯｶｲﾄﾞｳ,ｻｯﾎﾟﾛｼﾁｭｳｵｳｸ,ｱｻﾋｶﾞｵｶ,北海道,札幌市中央区,旭ケ丘,0,0,1,0,0,0
    ｳﾒ),北海道,札幌市中央区,北一条西（２０～２８丁目）,1,0,1,0,0,0
    ...(about 120000 lines)...
    47382,90718,9071800,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ｲｶﾆｹｲｻｲｶﾞﾅｲﾊﾞｱｲ,沖縄県,八重山郡与那国町,以下に掲載がない場合,0,0,0,0,0,0
    47382,90718,9071801,ｵｷﾅﾜｹﾝ,ﾔｴﾔﾏｸﾞﾝﾖﾅｸﾞﾆﾁｮｳ,ﾖﾅｸﾞﾆ,沖縄県,八重山郡与那国町,与那国,0,0,0,0,0,0
  14. Benchmark for CSV.read

  15. Benchmark for CSV.read
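
A measurement like the one on these slides could be sketched as below. This is only an illustration, not the benchmark script used in the talk: the two-row sample and the temporary file are hypothetical stand-ins for KEN_ALL.CSV, and the `"CP932:UTF-8"` option is one common way to transcode while reading.

```ruby
require "benchmark"
require "csv"
require "tempfile"

# Hypothetical stand-in for KEN_ALL.CSV: two short rows, written as CP932.
rows = [
  %w[01101 060 0600000 北海道 札幌市中央区],
  %w[47382 90718 9071801 沖縄県 八重山郡与那国町],
]
sample = rows.map { |row| row.join(",") }.join("\r\n") + "\r\n"

parsed = nil
Tempfile.create(["ken_all_sample", ".csv"]) do |file|
  file.binmode
  file.write(sample.encode("CP932"))
  file.flush

  # Read as CP932 and transcode every field to UTF-8 while parsing.
  elapsed = Benchmark.realtime do
    parsed = CSV.read(file.path, encoding: "CP932:UTF-8")
  end
  puts format("CSV.read took %.4f s for %d rows", elapsed, parsed.size)
end
```

Pointing `CSV.read` at the real KEN_ALL.CSV instead of the temporary file should give timings comparable in spirit to the slides, though the absolute numbers depend on the machine and Ruby version.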

  16. stackprof 🔍

  17. Stackprof • A sampling call-stack profiler for Ruby • https://github.com/tmm1/stackprof • sampling mode • :wall, :cpu, :object, :custom • flamegraph

  18. Stackprof

  20. stackprof --d3-flamegraph stackprof-cpu-cp932-csv.dump > stackprof.html
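
A dump file like the one passed to the command above could be produced roughly as follows. This is a sketch under assumptions: the `stackprof` gem must be installed separately, the helper name `profile_csv_parse`, the sample data, and the output path are all hypothetical, and `raw: true` is included because, as far as I know, the flamegraph output needs raw samples.

```ruby
require "csv"
require "tmpdir"

# Hypothetical helper: profiles CSV.parse in :cpu mode.
# Returns the dump path, or nil when the stackprof gem is not available.
def profile_csv_parse(data, out_path)
  require "stackprof" # third-party gem, assumed installed (gem install stackprof)
  # raw: true keeps the individual sampled stacks in the dump.
  StackProf.run(mode: :cpu, raw: true, out: out_path) do
    CSV.parse(data)
  end
  out_path
rescue LoadError
  warn "stackprof is not installed; skipping the profile"
  nil
end

data = "01101,060,0600000\n" * 5_000 # hypothetical stand-in for KEN_ALL.CSV
dump = profile_csv_parse(data, File.join(Dir.tmpdir, "stackprof-cpu-sample.dump"))
puts(dump ? "wrote #{dump}" : "no profile written")
```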

  21. Stackprof

  22. grep split

  23. grep split

  24. Summary • Reading KEN_ALL.CSV with CSV.read took about 1.8 seconds. • String#split accounts for 29% of the time spent in CSV.read
  25. Measure String#split with perf

  26. String#split • split(pattern = nil, limit = 0) • pattern: Regexp, String, nil • limit: number of splits • return: Array or self
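
The signature above can be illustrated with a few calls. The input strings are made up for the example; the behaviour shown is that of `String#split` in current Ruby versions, where `limit` caps the number of elements returned and the block form returns the receiver.

```ruby
csv_line = "01101,060,0600000,,"

# String pattern, default limit 0: trailing empty fields are dropped.
by_string = csv_line.split(",")

# Positive limit: at most that many elements; the rest stays in the last one.
limited = csv_line.split(",", 2)

# Negative limit: no cap, and trailing empty fields are kept.
keep_empty = csv_line.split(",", -1)

# Regexp pattern.
by_regexp = "a1b22c".split(/\d+/)

# nil pattern (the default): split on runs of whitespace.
by_nil = "  a  b c ".split

# With a block, each field is yielded and the receiver itself is returned.
collected = []
returned = csv_line.split(",") { |field| collected << field }

p by_string, limited, keep_empty, by_regexp, by_nil, returned.equal?(csv_line)
```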
  27. Try perf • Performance analysis tool for Linux

  29. perf record String#split

  31. flamegraph → alphabetical order

  32. flamegraph → alphabetical order • rb_ary_push • rb_enc_cr_str_copy_for_substr • str_new0

  33. Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%
  34. String#split • 1. Check arguments • 2. Check patterns • 3. Loop: (1) search substr, (2) create substr, (3) result << substr • 4. Return result
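
The steps above can be sketched in plain Ruby. This is a simplified illustration, not the C implementation in `rb_str_split_m`: `sketch_split` is a hypothetical name, it accepts only a non-empty String separator, and it ignores `limit`, Regexp patterns, and the special whitespace mode.

```ruby
# Hypothetical, simplified model of the split loop for a String separator.
def sketch_split(str, sep)
  # 1. Check arguments / 2. check the pattern (only non-empty Strings here).
  unless sep.is_a?(String) && !sep.empty?
    raise ArgumentError, "sep must be a non-empty String"
  end

  result = []
  start = 0
  # 3. Loop
  while (found = str.index(sep, start))   # 3.1 search for the next separator
    substr = str[start, found - start]    # 3.2 create the substring
    result << substr                      # 3.3 append it to the result
    start = found + sep.length
  end
  result << str[start..] if start < str.length # remainder after the last separator

  # Like split with limit 0, drop trailing empty fields.
  result.pop while result.last&.empty?

  # 4. Return the result.
  result
end

p sketch_split("a,b,,c,,", ",")
```

In the real implementation, step 3.2 is where a new String object is allocated and its encoding and coderange are set, which is the part the profile points at.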
  35. rb_str_split_m summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%

  36. Summary • str_new0 40.13% • rb_ary_push 19.25% • rb_enc_cr_str_copy_for_substr 13% • rb_mem_search 4.88% • rb_enc_right_char_head 3.68%
  37. rb_str_subseq

  38. rb_str_subseq • create substring from str

  39. rb_str_subseq • create substring from str • set encoding and coderange to str2

  40. rb_enc_cr_str_copy_for_substr

  41. rb_enc_cr_str_copy_for_substr • set encoding

  42. rb_enc_cr_str_copy_for_substr • set encoding • set coderange

  43. str_enc_copy

  46. 🤔 • Don't look up the encoding dynamically • Just pass the encoding of the original string
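
Seen from Ruby, the idea rests on an observable property: every substring returned by `split` carries the receiver's encoding, so there is nothing to discover per substring. The sketch below only demonstrates that property with a made-up CP932 string; it says nothing about how the C code is actually structured.

```ruby
# A made-up CP932 (Windows-31J) line; the "," pattern is ASCII-only,
# so it is compatible with the CP932 receiver.
cp932_line = "北海道,abc".encode("CP932")
fields = cp932_line.split(",")

# Every field has the same encoding as the original string.
same_encoding = fields.all? { |field| field.encoding == cp932_line.encoding }

# The coderange still differs per field: one is ASCII-only, one is not.
ascii_flags = fields.map(&:ascii_only?)

puts cp932_line.encoding.name
p same_encoding, ascii_flags
```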
  47. Make rb_enc_set_index_fastpath

  48. Benchmark for String#split

  49. Benchmark for String#split https://github.com/ruby/ruby/pull/6351 [Benchmark table: String#split on UTF-8 and US-ASCII strings, comparing ruby and ruby dev; numeric values not recoverable]
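
A comparison along these lines could be sketched as below. It is not the benchmark from the pull request: the 15-field line, the iteration count, and the use of `Benchmark.realtime` are assumptions made for illustration, and comparing two Ruby builds would mean running the same script under each of them.

```ruby
require "benchmark"

# Hypothetical input: one 15-field line, once as UTF-8 and once as US-ASCII.
utf8_line = Array.new(15) { |i| "field#{i}" }.join(",")
ascii_line = utf8_line.dup.force_encoding(Encoding::US_ASCII)

iterations = 100_000
results = {}
{ "UTF-8" => utf8_line, "US-ASCII" => ascii_line }.each do |label, line|
  # Time repeated String#split calls on the same line.
  results[label] = Benchmark.realtime { iterations.times { line.split(",") } }
  puts format("String#split %-8s %.3f s for %d calls", label, results[label], iterations)
end
```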
  50. Conclusion • The String and Encoding checks are a bit heavy • must_encindex • mustnot_broken • Omitting unnecessary checks leads to faster code • https://github.com/ruby/ruby/pull/6072#issuecomment-1191371088