Profile and benchmark every change - RubyKaigi 2025

Profile and benchmark every change
Daisuke Aritomo @osyoyu

pp @osyoyu Daisuke Aritomo github.com/osyoyu Works at SmartBank, Inc.

SmartBank, Inc. is the Hack Space Sponsor

Also a profiler author (osyoyu/pf2)
Recent Pf2 updates:
    Changed sample format and reduced memory by 90%
    Expanded features for comparing profiles
    Rewritten core in C (from Rust)
I also realized that profilers don't make programs fast
Just like how debuggers don't find bugs

Outline
    "Benchmark-Driven Development (BenchDD)"
    A walkthrough: Building Xinatra, a 100x faster Sinatra
    The toolset for doing BenchDD
(The slides carry Japanese subtitles as a mini-translation. They never contain more information than the English, so you won't miss a thing by skipping them.)

Xinatra: A faster Sinatra
Pronounced "Zee-na-tra"
A drop-in replacement for Sinatra apps, 100x faster
Why not "Cinatra" (C = 100 in Roman numerals)? It was hard to distinguish from "Sinatra" in verbal comms, so "Xinatra"
    class MyApp < Xinatra::Base
      before do
        ...
      end

      get '/' do
        return "Hello, world!"
      end
    end
I figured I'd try building a 100x faster Sinatra.

Is it really 100x faster?
It depends on the definition of "fast".
The routing/handling logic is 100x faster = "Hello world" apps are 100x faster
In real-world-ish benchmarks, it's about 1.02x faster
The "100x" claim refers to routing and handling being 100x faster.

How to make it 100x faster?
That's the main part!
I'm going to introduce a technique called "Benchmark-Driven Development" and a toolset to practice BenchDD.

Benchmark-Driven Development (BenchDD)

Benchmarking
Run code and measure its time
    require 'benchmark/ips'

    Benchmark.ips do |x|
      x.report("sumup") { (1..100).inject(&:+) }
    end

    Warming up --------------------------------------
                   sumup    39.181k i/100ms
    Calculating -------------------------------------
                   sumup    392.296k (± 0.2%) i/s (2.55 μs/i) 1.998M in 5.093700s

Benchmark-Driven Development
Set a measurable perf goal → Write code → Measure & improve

TDD: Write failing tests → Make it pass → Refactor
BenchDD: Set a measurable perf goal → Write code → Measure & improve

It's hard to make slow code fast. Instead, write fast code from the beginning.
(Figure labels: Working / Broken, Slow / Fast)

Performance 101: Focus on bottlenecks! (?)
If there's something significantly slow, you should work on that!
Work on the algorithm/architecture!
... very correct. People say "squash the bottleneck", and I agree.
(Figure: a profile with a clear bottleneck)

Reality: There's not always something significant
There's not always a "bottleneck"
And still, your program needs to be faster
(Figure: a profile with no significant bottleneck)

It's hard to do performance work afterwards
Lots of slight slowdowns impact performance as a whole
However, they are hard to find precisely because they are slight
Even though they may be easy to fix
"Not slow" != "Fast"
Small costs pile up, and the program gets slow.

Know your performance
    numbers = (1..100).to_a

    # slow?
    numbers.select { it.even? }.first

    # 10x faster! Take this impl
    numbers.find { it.even? }
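The comparison on this slide can be checked with a small script. This is a minimal sketch using only the standard library's `benchmark` module instead of benchmark-ips; the helper names (`first_even_with_select`, `first_even_with_find`, `ns_per_call`) are made up for illustration, and it uses explicit block parameters rather than `it` so that it also runs on Rubies older than 3.4. The measured ratio will depend on the machine and Ruby version.

```ruby
require 'benchmark'

NUMBERS = (1..100).to_a

# Builds an array of all even numbers, then takes the first one.
def first_even_with_select(numbers)
  numbers.select { |n| n.even? }.first
end

# Stops scanning as soon as the first even number is found.
def first_even_with_find(numbers)
  numbers.find { |n| n.even? }
end

# Times `iterations` calls of the given block and returns nanoseconds per call.
def ns_per_call(iterations = 50_000)
  seconds = Benchmark.realtime { iterations.times { yield } }
  seconds * 1_000_000_000 / iterations
end

puts format("select.first: %.1f ns/call", ns_per_call { first_even_with_select(NUMBERS) })
puts format("find:         %.1f ns/call", ns_per_call { first_even_with_find(NUMBERS) })
```

Both methods return the same value; only the amount of work differs, which is why the result has to be measured rather than guessed.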

Benchmarking "every single" change

Benchmarking "every single" change
Run benchmarks as often as possible to catch slow code
Maybe on every pull request? Or on every commit?
Even more, every time you type?

benchmarkkit
I have created tools and frameworks that make benchmarking easier and keep it "in the loop"
Combined, they form the BenchDD toolset.

A BenchDD walkthrough: Xinatra, a 100x faster Sinatra

Building Xinatra with BenchDD
Set a measurable perf goal → Write code → Measure & improve
Everything starts from setting a measurable performance goal = Define what needs to be 100x faster
Write a benchmark first!

Our first benchmark
    Benchmark.ips do |x|
      x.report("e2e") {
        # call GET /hello
        App.call({...})
      }
    end
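To make the end-to-end measurement concrete, here is a sketch that runs without any gems. `HelloApp`, `HELLO_ENV`, and `measure_ns_per_req` are hypothetical stand-ins: the slide elides the real env hash, so the two keys below are only the minimum a Rack-style router would look at, and the timing loop is a simplification of what benchmark-ips does.

```ruby
# Hypothetical stand-in for the app under test.
class HelloApp
  def call(env)
    if env["REQUEST_METHOD"] == "GET" && env["PATH_INFO"] == "/hello"
      [200, { "content-type" => "text/plain" }, ["Hello, world!"]]
    else
      [404, {}, ["Not found"]]
    end
  end
end

# Minimal Rack-style env for GET /hello (a real env carries many more keys).
HELLO_ENV = { "REQUEST_METHOD" => "GET", "PATH_INFO" => "/hello" }.freeze

# Returns the average wall-clock nanoseconds per request.
def measure_ns_per_req(app, env, requests: 100_000)
  requests.times { app.call(env) } # warm-up pass, not measured
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  requests.times { app.call(env) }
  finished = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
  (finished - started).fdiv(requests)
end

puts format("e2e: %.1f ns/req", measure_ns_per_req(HelloApp.new, HELLO_ENV))
```

The loop overhead is included in the result, so the number is only comparable to other numbers taken the same way.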

How many nanoseconds do we have?
ns/req (lower is better)
    Empty Rack App       215 ns/req
    100x Sinatra         350 ns/req
    Roda                 912 ns/req
    Sinatra           35,000 ns/req

    class EmptyRackApp
      def call(env)
        # Rack responses are [status, headers, body]
        return [200, {}, ["Hello world!"]]
      end
    end

How many nanoseconds do we have?
ns/req (lower is better): Empty Rack App 215, 100x Sinatra 350, Roda 912, Sinatra 35,000
We need to fit Sinatra features in 135 ns/req
An empty Rack app already takes 215 ns, so everything equivalent to Sinatra's features has to run in the remaining 135 ns.
Feature headroom = 135 ns
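The budget follows directly from the numbers on the chart. Spelled out as a small calculation (the constant names are made up for illustration):

```ruby
SINATRA_NS_PER_REQ    = 35_000 # Sinatra baseline from the chart
EMPTY_RACK_NS_PER_REQ = 215    # cost of an empty Rack app
SPEEDUP_GOAL          = 100

# Time budget per request if the 100x goal is to be met.
TARGET_NS_PER_REQ = SINATRA_NS_PER_REQ / SPEEDUP_GOAL
# What is left for routing, params, before/after actions, and so on.
FEATURE_HEADROOM_NS = TARGET_NS_PER_REQ - EMPTY_RACK_NS_PER_REQ

puts "target: #{TARGET_NS_PER_REQ} ns/req, headroom: #{FEATURE_HEADROOM_NS} ns"
```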

Our starting point
    class Xinatra::Base
      def call(env)
        handler = do_routing(env)
        response = handler.call(env)
        return response
      end
    end
How much time can we spend here?

Know and benchmark your target
ns/req (lower is better)
    Hono (Bun)          51 ns/req
    Empty Rack app     215 ns/req
I initially wanted to overtake Hono, a JavaScript web framework
Full-featured Hono was faster than an empty Rack app...

BenchDD main loop
Anyway, we now know our goal
Now it's time to write real code!
vim → bench → vim → bench → ...
Set a measurable perf goal → Write code → Measure & improve

Starting implementation
    class Xinatra::Base
      def call(env)
        handler = do_routing(env)
        response = handler.call(env)
        return response
      end

      private

      def do_routing(env)
        if env['RACK_...']
          return handler
        end
      end
    end

Problem 1: not fun.
(Figure: Edit code → Benchmark → Edit code → Benchmark → ..., repeated over and over)

Benchmarking framework
    describe "routing" do
      setup do
        @router = TrieRouter.new
        @router.define("GET", "/foo", -> () { 'hello' })
      end

      dataset("small") { ["/hello", ...] }
      dataset("large") { ... }

      scenario "trie" do
        data.each do |d|
          @router.match("GET", d)
        end
      end
    end

Benchmark suite DSL + Editor integration
I have created a benchmark suite framework, somewhat like RSpec
With editor integration for easy running

Designing the workload
Workloads should be (1) realistic/representative and (2) compact
For Xinatra, I prepared multiple tiers of workloads
    Small: A generated set of requests (10k reqs)
    Large: Logs collected from real Sinatra apps (100k reqs)
It's fine to prepare several benchmark datasets.

Problem 2: not informative.

Benchmarking tells us too little.
What benchmarking tells us: time per iteration of the current code
What we need:
    How did the performance change?
    Why did the performance change?

Benchmarking ❤ Profiling
Explaining performance is exactly what profilers do
I've added a new view to show the performance diff between two revisions
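The idea of diffing two profiles can be illustrated with a toy example. This is only a sketch of the concept: the profile shape (frame name mapped to sample count), the frame names, and `diff_profiles` are all hypothetical and are not Pf2's actual data format or API.

```ruby
# Hypothetical profile shape: frame name => number of samples.
# Returns only the frames whose sample count changed, largest change first.
def diff_profiles(before, after)
  frames = before.keys | after.keys
  deltas = frames.to_h { |frame| [frame, after.fetch(frame, 0) - before.fetch(frame, 0)] }
  deltas.reject { |_frame, delta| delta.zero? }
        .sort_by { |_frame, delta| -delta.abs }
        .to_h
end

BEFORE_PROFILE = { "Router#match" => 120, "Base#call" => 40, "Hash#[]" => 30 }
AFTER_PROFILE  = { "Router#match" => 70, "Base#call" => 40, "String#split" => 25, "Hash#[]" => 30 }

diff_profiles(BEFORE_PROFILE, AFTER_PROFILE).each do |frame, delta|
  label = delta.negative? ? "improved" : "degraded"
  puts format("%-14s %+4d samples (%s)", frame, delta, label)
end
```

A benchmark alone would only report that the total changed; the per-frame deltas point at which code is responsible.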

Differential flamegraphs
Improved / Degraded from last run
Benchmark 1 (before), Benchmark 2 (after)
Image from https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html

Brendan Gregg Sensei
I had the bookstore stock 『詳解システム・パフォーマンス』 (the Japanese edition of "Systems Performance"), so please buy it.
In the bookstore!!!!!

Integrating the culprit viewer into editors
Time spent is shown in ghost text

    subject do
      @app = ZeroFeatureRackApp.new
    end

    scenario("1000 requests") do
      @app.call({})
    end
Automatically engages Pf2 profiling

Diffing flamegraphs
Profiling is automatically initiated for each benchmark
Results are recorded in tmp/
The diff engine in Pf2 (profiler) generates differential flamegraphs

Nurturing your bench suite
Write a benchmark for each feature
    Routing
    Handling
    Before/after actions
    E2E
Set a measurable perf goal → Write code → Measure & improve

Making Xinatra 100x faster

Reminder: We need to fit Sinatra features in 135 ns/req
ns/req (lower is better)
    Empty Rack App       215 ns/req
    100x Sinatra         350 ns/req
    Sinatra           35,000 ns/req
Feature headroom = 135 ns
All the features have to be implemented within 135 ns.

Optimizing the significant: Routing
Routing is the largest part of Xinatra = What routes.rb does in Rails
Some algorithms come to mind...
    Trie-based routing
    Linear routing
Routing is the heaviest part, so getting its algorithm right is a given.

GET /admin/users/1 → handler.call()
Trie routing: O(logN), and fast when there are many routes
    handler = routes[:get].dig(*parts(req))
Linear routing: O(N), but faster when there are few routes
    handler = routes.find {|rt| rt.is_for?(req) }

Choosing the routing strategy
It's important to know the line where a simple O(N) loses to a complex O(logN)
For 10-20 routes, linear routing was faster (found by benchmarking!)
While the number of routes is small, O(N) can beat O(logN).
270 ns/req (144x Sinatra)
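To show what is being compared, here is a minimal sketch of both strategies. It is not Xinatra's implementation: the class bodies, the `HANDLER` marker key, and the restriction to static paths (no `:id`-style segments) are simplifying assumptions for illustration.

```ruby
# Linear routing: scan routes in definition order, O(N) in the number of routes.
class LinearRouter
  def initialize
    @routes = []
  end

  def define(verb, path, handler)
    @routes << [verb, path, handler]
  end

  def match(verb, path)
    route = @routes.find { |route_verb, route_path, _handler| route_verb == verb && route_path == path }
    route && route[2]
  end
end

# Trie routing: walk nested hashes keyed by path segment, one step per segment.
class TrieRouter
  HANDLER = :__handler__ # hypothetical key marking a node that ends a route

  def initialize
    @root = {}
  end

  def define(verb, path, handler)
    node = (@root[verb] ||= {})
    segments(path).each { |segment| node = (node[segment] ||= {}) }
    node[HANDLER] = handler
  end

  def match(verb, path)
    node = @root[verb]
    segments(path).each do |segment|
      return nil unless node
      node = node[segment]
    end
    node && node[HANDLER]
  end

  private

  def segments(path)
    path.split("/").reject(&:empty?)
  end
end
```

The trie does more work per lookup (splitting the path, several hash accesses), which is one plausible reason a plain scan can win for a handful of routes. Where the crossover sits has to be measured with a workload like the ones described earlier.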

Implementing other features & nitpicking
We have implemented routing: 215 → 266 ns/req (84 ns of headroom left)
Many features to implement: params access, before/after actions, Cookies, ...
Not bottlenecks, though challenging to fit in 84 ns
All the remaining features have to fit in 84 ns/req.

Feature 1: params
    get '/search' do
      params #=> { "q" => "ruby" }
    end

    get '/search' do
      @params #=> { "q" => "ruby" }
    end
Problem: Method calls are expensive
@ivar access is much faster!
    method  20-50 ns / call
    ivar    10 ns / access
    * incl. benchmarking overhead

params() → @params
@params is faster
... but it is mutable, and can do less work
Xinatra supports both
Users can gradually switch from params to @params and gain perf
Lesson learned: Performance can influence API design. params() can't be faster than @params
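Supporting both access styles can be as simple as exposing a reader over the same instance variable. The class below is a hypothetical illustration, not Xinatra's code; it only shows that the two forms read the same data while taking different paths to it.

```ruby
class ParamsDemo
  attr_reader :params # method-style access, as in Sinatra

  def initialize(params)
    @params = params
  end

  def via_method
    params["q"]  # goes through a method call first
  end

  def via_ivar
    @params["q"] # reads the instance variable directly
  end
end

demo = ParamsDemo.new({ "q" => "ruby" })
puts demo.via_method
puts demo.via_ivar
```

Because the reader is just a thin wrapper, existing handlers keep working while hot paths move to `@params` one at a time.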

Feature 2: The request object
    get '/search' do
      request     #=> #<Sinatra::Request>
      request.env #=> {"RACK_*" => ... }
    end
Can request be changed to gain performance?

To implement Request#params ...
Option 1: class
    class Request
      def env; ...; end
      def params; ...; end
    end
Option 2: Data
    Request = Data.define(:env, :params, ...)
Option 3: Struct
    Request = Struct.new(:env, :params, ...)
Results:
    Data#params    87 ns / access
    Struct#path    89 ns / access
    Class#path     94 ns / access
    * incl. benchmarking overhead / w/YJIT
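For readers who have not used all three, the options look like this when written out in full. The names (`ClassRequest`, `DataRequest`, `StructRequest`, `build_requests`) are invented for this sketch, only the two fields from the slide are included, and `Data` requires Ruby 3.2 or later.

```ruby
# Option 1: plain class with reader methods.
class ClassRequest
  attr_reader :env, :params

  def initialize(env:, params:)
    @env = env
    @params = params
  end
end

# Option 2: Data (Ruby 3.2+), an immutable value object with no setters.
DataRequest = Data.define(:env, :params)

# Option 3: Struct, mutable, with keyword construction enabled.
StructRequest = Struct.new(:env, :params, keyword_init: true)

# Builds one request of each kind from the same inputs.
def build_requests(env, params)
  [ClassRequest, DataRequest, StructRequest].map { |klass| klass.new(env: env, params: params) }
end

build_requests({}, { "q" => "ruby" }).each do |request|
  puts "#{request.class}: #{request.params["q"]}"
end
```

All three expose the same reader interface, so they can be swapped behind `request` and compared with the same benchmark.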

Feature 3: before/after actions
    before do
      do_good_auth(params)
    end

    get '/' do
      # `before` implicitly called
      ...
    end
Block is saved on app startup
The block gets "called" on every request
Can be used for authentication and other checks

Calling Procs (blocks) in Ruby
    Block#call      310 → 319 ns   Fastest, but unusable since context changes
    instance_exec   310 → 404 ns
    instance_eval   310 → 412 ns   Feature rich and faster version of _exec (...?)
412 ns (95x Sinatra)
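The "context changes" remark is about what `self` is while the block runs. A small demonstration, with `MiniApp` and `READ_USER` as made-up names:

```ruby
class MiniApp
  def initialize
    @user = "alice"
  end

  # Runs the block with `self` left as wherever the block was defined.
  def run_with_call(block)
    block.call
  end

  # Runs the block with `self` rebound to this app instance.
  def run_with_instance_exec(block)
    instance_exec(&block)
  end
end

# Defined at the top level, like a `before do ... end` block in an app file.
READ_USER = proc { @user }

app = MiniApp.new
p app.run_with_call(READ_USER)          # sees the top-level object's @user
p app.run_with_instance_exec(READ_USER) # sees the app's @user
```

With plain `call`, the block cannot reach the app's instance variables or helper methods, which is why the faster option does not fit a Sinatra-style `before` block.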

Or just make it an actual method
    class Xinatra::Base
      @@befores_count = 0

      def self.before(&block)
        @@befores_count += 1
        define_method("before_#{@@befores_count}", &block)
      end

      def call(env)
        ...
        # methods are named before_1..before_N
        1.upto(@@befores_count) do |i|
          self.send(:"before_#{i}")
        end
      end
    end
Calling methods is faster than calling blocks + Allows YJIT!
342 ns (118x Sinatra)

Or make it static
    class Xinatra::Base
      def self.before(&block)
        # overrides the no-op stub below
        define_method("__before", &block)
      end

      def call(env)
        # eliminated #send
        __before
      end

      def __before; end # no-op stub
    end
⚠ Fast, but multiple befores cannot be defined in this version (breaking!)
270 ns/req (144x Sinatra)

Feature 4: Rack::Session
Session handling wasn't in the original benchmark set, so no numbers here, but Rack::Session usually consumes quite a lot of CPU
Implement an equivalent in Rust

And more, and more...
Reducing Hash access
Reducing object allocation
Mutability is god!
Reducing more and more
    # slow
    hash = {}
    hash[key] ||= [] # init
    hash[key] << something

    # faster
    hash = Hash.new { |h, k| h[k] = [] }
    hash[key] << something
And so on, shaving things off bit by bit.

Wrapping up
Building Xinatra was a matter of removing a ton of small debris
Doing high performance was not "making it fast", but "not making it slow"
Ten 10% slowdowns add up to roughly 150% slower code
Sweeping up the dust was the job.
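The compounding effect is easy to check. Assuming each of the ten changes multiplies the running time by 1.10 (the function name is made up for this sketch), the total lands in the same ballpark as the figure on the slide, slightly above it:

```ruby
# Each small regression multiplies the running time by (1 + slowdown).
def compounded_slowdown(per_change_slowdown, changes)
  (1 + per_change_slowdown)**changes
end

factor = compounded_slowdown(0.10, 10)
puts format("%.2fx the original time (%.0f%% slower)", factor, (factor - 1) * 100)
```

No single 10% step would stand out in a profile, yet together they more than double the running time.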

Tips for better benchmarking

Does this matter to me?
Yes! In Ruby/Rails, CPU time is very precious
☹ "Databases are the bottleneck, Ruby code won't matter!"
Rails isn't IO-bound
That 1 ms could go far, especially when you scale

Don't gacha! (Don't keep re-rolling until you win)
It's tempting to repeat benchmark commands until you get good results
I did that a lot
Instead of wasting time, do a statistical hypothesis test (p=0.05)
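One dependency-free way to run such a test is a permutation test on the difference of means. This is a sketch of one possible test, not necessarily the one the speaker's tooling uses; `permutation_p_value` and the sample numbers are invented for illustration, and the random generator is seeded so the result is repeatable.

```ruby
# Two-sided permutation test on the difference of means.
# Returns the share of random relabelings whose mean difference is at
# least as extreme as the observed one (an estimated p-value).
def permutation_p_value(before, after, rounds: 10_000, rng: Random.new(42))
  mean = ->(xs) { xs.sum.fdiv(xs.size) }
  observed = (mean.(before) - mean.(after)).abs
  pooled = before + after

  extreme = rounds.times.count do
    shuffled = pooled.shuffle(random: rng)
    group_a = shuffled.first(before.size)
    group_b = shuffled.drop(before.size)
    (mean.(group_a) - mean.(group_b)).abs >= observed
  end
  extreme.fdiv(rounds)
end

# Hypothetical ns/req samples from repeated runs of two revisions.
BEFORE_RUNS = [351.2, 349.8, 352.1, 350.5, 351.7, 350.9, 352.4, 351.0]
AFTER_RUNS  = [342.3, 343.9, 341.8, 344.1, 342.7, 343.2, 341.5, 343.6]

p_value = permutation_p_value(BEFORE_RUNS, AFTER_RUNS)
puts p_value < 0.05 ? "significant (p=#{p_value})" : "not significant (p=#{p_value})"
```

Collecting a fixed number of runs up front and testing them once avoids the temptation to stop as soon as a lucky run appears.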

YJIT
Enabling YJIT during benchmarking is important
Keep the environment close to prod!
YJIT starts JIT-compiling a method once it has been called 30 times
A short warmup period should suffice
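A benchmark script can guard against accidentally measuring without YJIT and can warm up explicitly. The helpers below (`yjit_enabled?`, `warm_up`, `sumup`) are hypothetical names for this sketch; the margin of 100 warm-up calls is an arbitrary choice above the 30 calls mentioned on the slide.

```ruby
# True only when this Ruby build has YJIT and it is currently enabled
# (for example when started with `ruby --yjit`).
def yjit_enabled?
  defined?(RubyVM::YJIT) && RubyVM::YJIT.enabled? ? true : false
end

# Calls the block enough times to pass the JIT call threshold
# before any measurement starts.
def warm_up(calls: 100)
  calls.times { yield }
end

def sumup
  (1..100).inject(:+)
end

warn "YJIT is disabled; numbers will differ from production" unless yjit_enabled?
warm_up { sumup }
puts "warmed up, ready to measure"
```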

Won't profiling affect benchmarking?
Yes. You will see lower scores with the profiler enabled.
That's okay as long as the overhead is consistent.

Benchmarking in CI?
Benchmarking locally is tedious! Why not run them in CI?
Because CI envs are very unstable.
    Hyper-Threading
    Neighbors
    Library updates
    Unstable base CPU

Wrapping up

Do you now feel like benchmarking?
Some optimizations I covered today won't be easy to do after the code is written
Always run benchmarks while writing code and find slowdowns before git commit!

Thank you!!!
