
Hacking and Profiling Ruby for Performance - RubyKaigi 2023

A RubyKaigi 2023 talk

osyoyu

May 12, 2023

Transcript

  1. pp @osyoyu
     • Daisuke Aritomo (osyoyu, おしょーゆ, おしょうゆ)
     • "osyoyu" is pronounced as "oh show you"
     • #rubykaigiNOC (Venue Wi-Fi Team)
     • Software Engineer at Cookpad Inc.
  2. Get a free drink from Cookpad's fridge!
     • Cookpad provides RUBY-POWERED FRIDGES to RubyKaigi 2023!
     • Let's make everyday cooking fun together! https://cookpad.careers/
  3. Cookpad is holding a look-back RubyKaigi event! 5/18 @ Tokyo
     • Look back on RubyKaigi with hands-on sessions!
     • Let's make the next RubyKaigi together
     • Come to the Cookpad booth for details: https://cookpad.connpass.com/event/282436/
  4. About this Talk
     • How to profile and tune a Ruby webapp
     • ... you know nothing about,
     • ... within 8 hours,
     • ... as part of a performance tuning competition,
     • for fun!
  5. ISUCON: the performance challenge
     • Contestants are given a VM and a super-slow webapp
     • Contestants may request a 60-second benchmark during the contest
     • Scores and standings are decided based on benchmark results
     • The goal is to get the best score
     (diagram: the benchmarker/scorer sends requests to the contestant VM, which runs a Ruby webapp, at an intensive rate for 60 seconds)
  6. The Initial State
     • 3 VMs (computing instances)
     • Ruby • Sinatra • Puma • Nginx • MySQL
     • Implementations in other languages: Python, Go, Node.js, ...
     • Initial performance is designed to not differ much between languages
  7. Will talks / Won't talks
     Wills:
     • Why Ruby is a great language to compete in ISUCON
     • How to track down and profile slow code at the CRuby level
     • What future Rubies need to shine in ISUCON
     Won'ts:
     • Configuring Nginx, Linux, etc.
     • Monitoring on the system level
     • Itamae recipes we made for ISUCON
  8. (Almost) Everything is permitted
     • Add effective RDBMS indexes
     • Kill N+1 SQL queries
       → user_ids.each {|id| query("select * from users where id = ?", id) }
       → query("select * from users where id in (?)", user_ids)
     • Replace suboptimal algorithms
     • Utilize server resources (CPU, memory) to the last drop
     • Adding Puma threads/processes
     • Caching
     • Upgrading to ruby/ruby master
     • (Adding VMs and scaling VMs up are prohibited)
  9. (Almost) Everything is permitted
     • Add effective RDBMS indexes
     • Kill N+1 SQL queries (see the sketch below)
       → user_ids.each {|id| query("select * from users where id = ?", id) }
       → query("select * from users where id in (?)", user_ids)
     • Replace suboptimal algorithms
     • Utilize server resources (CPU, memory) to the last drop
     • Adding Puma threads/processes
     • Caching
     • Upgrading to ruby/ruby master
     • (Adding VMs and scaling VMs up are prohibited)
     But where should we start from?
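     A minimal sketch of the N+1 fix above, assuming the mysql2 gem; the connection settings, table, and ids are illustrative and not from the talk:

     require "mysql2"

     # Hypothetical connection for illustration
     db = Mysql2::Client.new(host: "127.0.0.1", username: "isucon", database: "isucon")
     user_ids = [1, 2, 3]

     # N+1: one round trip to MySQL per user id
     users = user_ids.map do |id|
       db.prepare("SELECT * FROM users WHERE id = ?").execute(id).first
     end

     # Batched: a single IN (...) query fetches every row in one round trip
     placeholders = (["?"] * user_ids.size).join(", ")
     users = db.prepare("SELECT * FROM users WHERE id IN (#{placeholders})").execute(*user_ids).to_a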
  10. The tuning loop: hack around with Ruby code → run a benchmark, get your score → see profiling results, think about what to do next → repeat
  11. Dancing with Ruby
     • You'll be given 500+ lines of Sinatra code
     • and need to make it really fast within 8 hours
     • This is where Ruby really shines
  12. The mighty Array, Hash, Enumerable
     • #map • #each_with_object • #sort_by!
     • Almost anything is possible
     • Comes in really handy when tackling N+1 queries (without ActiveRecord); see the sketch below
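     A small pure-Ruby sketch of that pattern; the data shapes are hypothetical, but it shows #each_with_object, #map and #sort_by! resolving an N+1 by indexing related rows fetched in one query:

     # Hypothetical rows, standing in for two query results
     users    = [{ id: 1, name: "alice" }, { id: 2, name: "bob" }]
     comments = [{ id: 10, user_id: 1, body: "hi" }, { id: 11, user_id: 2, body: "yo" }]

     # Index users by id once, instead of one lookup query per comment
     users_by_id = users.each_with_object({}) { |u, h| h[u[:id]] = u }

     # Attach each comment's user without any extra queries
     enriched = comments.map { |c| c.merge(user: users_by_id[c[:user_id]]) }

     # Sort in place, newest comment first
     enriched.sort_by! { |c| -c[:id] }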
  13. Monkey patching / binding.irb/pry in production
     • Need something in the Standard Library? Build it on site!
     • Debugging workflow:
       • Stop real benchmark requests
       • Write code in binding.irb/pry using real requests
       • Confirm it works
       • Copy to editor and save
     • (Note: I don't do this at work; just a competition technique 😁)
     (a sketch follows below)
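     A hedged sketch of both ideas: a tiny helper monkey-patched onto Hash (hypothetical, not from the talk) plus a binding.irb breakpoint dropped into a Sinatra handler to experiment on real request data:

     require "sinatra"
     require "json"

     # Monkey patch: a small helper "built on site"
     class Hash
       def deep_symbolize
         each_with_object({}) do |(k, v), h|
           h[k.to_sym] = v.is_a?(Hash) ? v.deep_symbolize : v
         end
       end
     end

     get "/api/condition/:id" do
       condition = { "id" => params[:id], "level" => "info" } # stand-in for a DB lookup
       # binding.irb  # uncomment to pause on a live request and poke at `condition`
       condition.deep_symbolize.to_json
     end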
  14. • This... happens
     • It's fixable, no worries
     • Let's hope RBS, TypeProf and other projects solve this problem
  15. Performance Challenges in Ruby
     • Profiling the accurate bottlenecks
     • Utilizing 100% CPU in Ruby
     • Achieving high concurrency in Ruby
  16. Don't guess, measure!
     • Random improvements will take you nowhere
     • Spending time to fix non-real problems is the last thing you want to do in an 8-hour timeframe
     • Accurate profiling is the key
  17. Profiler choices
     • Tracing profilers
       • Track everything, but huge performance impact
       • ruby-prof, based on TracePoint
     • Sampling profilers
       • Collect samples every 10-100ms, small performance impact
       • Stackprof (cpu, wall, memory), based on the rb_profile_frames() API
       • rbspy (wall), runs as a separate process and reads Ruby memory (process_vm_readv(2))
     (a Stackprof usage sketch follows below)
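     For reference, a minimal Stackprof invocation in wall mode; the workload method is a made-up stand-in, and the dump can be inspected with the stackprof CLI afterwards:

     require "stackprof"

     def expensive_handler_work
       10_000.times { |i| Math.sqrt(i) }
     end

     # Sample the wall clock every 1000µs while the block runs, then write a dump
     StackProf.run(mode: :wall, interval: 1000, raw: true, out: "stackprof-wall.dump") do
       1_000.times { expensive_handler_work }
     end

     # Inspect from the shell:
     #   stackprof stackprof-wall.dump --text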
  18. Let's profile!
     (flamegraph over the 60-second benchmark window: time spent in a particular handler, POST /api/condition/..., is space for optimization?)
  19. Flamegraph source: rb_profile_frames()
     • Stackprof utilizes the rb_profile_frames() API
     • It returns the call stack that was running when rb_profile_frames() was called
     • Stackprof calls rb_profile_frames() on SIGPROF (cpu mode) or SIGALRM (wall mode) timers
     (diagram: along the time axis, a() calls b(), which calls c(); a 📸 rb_profile_frames() snapshot records the call stack [a(), b()])
     (a plain-Ruby illustration follows below)
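     To illustrate the timer-plus-snapshot idea in plain Ruby (this is not how Stackprof is implemented; it does this in C via rb_profile_frames() on signal timers), a crude wall-clock sampler built on Thread#backtrace:

     samples = Hash.new(0)
     target  = Thread.current

     sampler = Thread.new do
       loop do
         sleep 0.01                          # take ~100 snapshots per second
         stack = target.backtrace            # 📸 snapshot the target thread's call stack
         samples[stack.first] += 1 if stack  # count the innermost frame
       end
     end

     def busy_work
       100_000.times { |i| Math.sqrt(i) }
     end

     50.times { busy_work }
     sampler.kill

     samples.sort_by { |_, count| -count }.first(5).each { |frame, count| puts "#{count}\t#{frame}" }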
  20. • rb_profile_frames() (Stackprof) is inaccurate when I/O comes into action
     (flamegraph shows burn_cpu and Thread#join, but no I/O!?)
  21. Multithreading in Ruby / GVL
     • Only 1 Thread can use the CPU at the same time, due to the Global VM Lock (GVL)
     • I/O (file reads/writes, network access, ...) can be performed in the background
     (diagram: Threads 1-3 take turns acquiring the GVL to use the CPU, release it while doing I/O, and wait otherwise)
     (a small demonstration follows below)
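     A small demonstration of the effect, using only the standard library: CPU-bound threads serialize on the GVL, while sleeping (I/O-like) threads overlap:

     require "benchmark"

     def cpu_work
       500_000.times { |i| Math.sqrt(i) }
     end

     def io_work
       sleep 0.2 # stands in for file/network I/O, which releases the GVL
     end

     # CPU-bound: 4 threads take roughly as long as running the work serially
     puts Benchmark.realtime { 4.times.map { Thread.new { cpu_work } }.each(&:join) }

     # I/O-bound: 4 threads finish in about 0.2s total, not 0.8s
     puts Benchmark.realtime { 4.times.map { Thread.new { io_work } }.each(&:join) }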
  22. Issues in rb_profile_frames()
     • The current implementation of rb_profile_frames() returns information about the last active Thread (the one which had the GVL)
     • Threads doing I/O have a low chance of being targeted
     • Statistics (flamegraphs) built from continuous rb_profile_frames() calls may be inaccurate, especially when many Threads are doing I/O
     • (even in wall mode!)
     (diagram: 💥 = stack frame peeked by rb_profile_frames(); samples land on whichever Thread holds the GVL)
  23. Issues in rb_profile_frames()
     • Proposal: add an rb_thread_profile_frames() API, a per-thread version of rb_profile_frames()
     • Accepts VALUE thread as an argument
     • https://github.com/ruby/ruby/pull/7784
     (diagram: 💥 = stack frame peeked by rb_profile_frames())
  24. Performance Challenges in Ruby
     • Profiling the accurate bottlenecks
     • Utilizing 100% CPU in Ruby
     • Achieving high concurrency in Ruby
  25. Utilizing 100% CPU in Ruby
     • Only 1 Thread can be active due to the GVL
     • CPU tasks and I/O can run simultaneously, but conditions are not always ideal
     • Resources are very limited in ISUCON; we want to squeeze everything out!
     (CPU usage graphs: 4 Ruby processes with 32 threads aren't enough to burn all the CPU; we want to see what Go achieves with 1 process and 8 threads)
  26. Reducing GVL wait
     • Fewer Threads = fewer races for the GVL
     • OK, let's reduce Threads!
     • In the ISUCON webapp's case, Puma is creating the Threads
     • We can create more processes in place of Threads to keep the number of workers
  27. Tuning Puma for 100% CPU
     • Adding server (Puma) processes is effective, but consumes more memory
     • Memory is precious in ISUCON, as MySQL lives in the same VM
     • More processes: better CPU utilization, but more memory
     • More threads: less memory consumption, but suboptimal CPU utilization
     (a sample Puma config sketch follows below)
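     A hedged puma.rb sketch of the trade-off; the numbers are placeholders that depend on the app, the VM's core count, and how much memory MySQL needs:

     # config/puma.rb
     workers 4        # one worker process per core: better CPU utilization, more memory
     threads 4, 4     # a small fixed thread pool per worker: less GVL contention
     preload_app!     # share memory between workers via copy-on-write

     port 8080
     environment "production"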
  28. wip: Finding the process/thread balance using GVL stats
     • GVL event hooks were added in Ruby 3.2 (RUBY_INTERNAL_THREAD_EVENT_*)
     • Shopify/gvltools, ivoanjo/gvl-tracing
     • I integrated this into profiler results
     • Usable for tuning the number of Puma threads?
     (flowchart: try adding Puma threads → check GVL wait → low enough: perfect! / somewhat high: reduce threads, add processes)
     (a measurement sketch follows below)
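     A sketch of that measurement using Shopify/gvltools; the GVLTools::GlobalTimer API here follows that gem's README and should be treated as an assumption:

     require "gvltools"

     GVLTools::GlobalTimer.enable # accumulate time threads spend waiting for the GVL

     8.times.map { Thread.new { 500_000.times { |i| Math.sqrt(i) } } }.each(&:join)

     wait_ms = GVLTools::GlobalTimer.monotonic_time / 1_000_000.0
     puts "time spent waiting on the GVL: #{wait_ms.round(1)} ms"
     # High wait: too many threads fighting for the GVL; try fewer threads, more processes.
     # Low wait: the current thread count is probably fine.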
  29. Performance Challenges in Ruby
     • Profiling the accurate bottlenecks
     • Utilizing 100% CPU in Ruby
     • Achieving high concurrency in Ruby
  30. Higher concurrency?
     • Falcon (socketry/async series)
       • Event-loop based async I/O for Ruby!
       • Sadly, we couldn't rewrite everything in 8 hours
     • TruffleRuby
       • ISUCON VMs didn't have enough memory for TruffleRuby 😔
     • Arming Ractors for lesser GVL waits
       • Maybe my next challenge! (a minimal Ractor sketch follows below)
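     For completeness, a minimal Ractor sketch of that idea (not something used in the contest; Ractors are still experimental): each Ractor runs CPU-bound work in parallel instead of contending for one GVL:

     ractors = 4.times.map do |n|
       Ractor.new(n) do |i|
         sum = 0
         1_000_000.times { |j| sum += Math.sqrt(j + i) }
         sum
       end
     end

     puts ractors.map(&:take).sum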
  31. Ruby vs. Go
     • Go: Goroutines
       • A kind of lightweight thread; this system makes high concurrency very easy
     • Go: Concurrency is deeply embedded in the ecosystem
       • Contexts: the ability to cancel MySQL queries that are no longer needed
       • Timeouts: same
  32. Wrapping up
     • It's fun to write Ruby 😉
     • Profiling is important!
     • But it's more important to check whether those profiles are accurate
     • Let's do ISUCON!