Reactive ❤️ Loom: A Forbidden Love Story

For years, the Java community has been told that Project Loom would kill reactive programming — that blocking and async were destined to be enemies. But what if that story was wrong?

In this talk, we’ll explore what happens when these two worlds actually fall in love.

Drawing from real-world work inside the Quarkus, Vert.x, Netty, and HotSpot teams, we’ll see how a custom Loom scheduler built on top of Netty brings together the performance of event-driven I/O and the simplicity of virtual-thread-friendly blocking APIs.

This isn’t a theoretical “what if”: it’s a data-driven exploration born from experiments and collaborations between IBM, Oracle Labs, Oracle and Apple engineering teams.

You’ll see how this approach reshapes how we think about async, concurrency, and scheduling — and why some of the long-held assumptions about “reactive vs blocking” simply don’t hold up when measured scientifically.

Along the way, we’ll dissect:
- How the Loom scheduler and virtual threads work under the hood
- What happens when you run them over a Netty core
- Performance implications and trade-offs, measured empirically

This talk is a technical love story, but also a call to reason: Measure, Don’t Guess.
Because sometimes, the forbidden relationships are the ones that can move the platform forward.

Francesco Nigro

March 25, 2026

Transcript

  1. Who I am
     - Java Champion
     - Performance-obsessed engineer
     - Working hard on Quarkus performance, in the App Services Performance Team at IBM
     - @forked_franz on Twitter/X/whatever
  2. Benchmark configuration
     • 100 concurrent clients
     • CPU intensive, i.e. ~100% CPU utilization
     • Application server with 4 cores
     • Relatively fast DBMS, i.e. <= 1 ms RTT
     • “All out” throughput workload
     • We compare peak throughput at steady state
  3. Loom: just not quite right here
     181,689.60   task-clock/op              # 3.452 CPUs utilized
     4.80         context-switches/op        # 26.417 K/sec
     0.35         cpu-migrations/op          # 1.931 K/sec
     0.56         page-faults/op             # 3.082 K/sec
     683,149.20   instructions/op            # 0.99 insn per cycle, 0.49 stalled cycles per insn
     688,266.36   cycles/op                  # 3.788 GHz
     331,978.85   stalled-cycles-frontend/op # 48.23% frontend cycles idle
     146,473.19   branches/op                # 806.221 M/sec
     7,039.46     branch-misses/op           # 4.81% of all branches
     281,374.32   L1-dcache-loads/op         # 1.549 G/sec
     23,588.94    L1-dcache-load-misses/op   # 8.38% of all L1-dcache accesses
     154,021.43   L1-icache-loads/op         # 847.770 M/sec
     1,215.95     L1-icache-load-misses/op   # 0.79% of all L1-icache accesses
     5,210.09     dTLB-loads/op              # 28.676 M/sec
     158.77       dTLB-load-misses/op        # 3.05% of all dTLB cache accesses
     1,975.17     iTLB-loads/op              # 10.872 M/sec
     685.00       iTLB-load-misses/op        # 34.68% of all iTLB cache accesses
     10.003351327 seconds time elapsed
     Note: running on a CPU @ 4.3 GHz
  4. Too many context switches! ~4.8 context switches/request!* (*YMMV)
     (Same perf output as the previous slide, highlighting:)
     4.80 context-switches/op # 26.417 K/sec
  5. Back in time: like an “old”(er) CPU. This CPU can achieve 4.3 GHz, but the cycles actually used run at 3.788 GHz! It’s like using an older CPU* :”(
     (Same perf output as slide 3, highlighting:)
     688,266.36 cycles/op # 3.788 GHz
  6. Sharing is… scaring!
     • 4 available cores
     • 4 I/O threads performing some light HTTP work
     • 4 threads in the Loom FJ pool
     The OS scheduler interleaves them via non-voluntary context switches: 8 hungry puppies!
  7. How Loom handles VirtualThread(s)
     • VirtualThread(s) run via Continuation Runnable(s)
     • The Loom “built-in” scheduler is backed by a ForkJoin thread pool
     • The HotSpot runtime manages to do its magic to:
       ◦ Materialize the VirtualThread stack as a continuation
       ◦ Yield and resume them efficiently
     • VirtualThread(s) are non-blocking Runnable(s):
       ◦ The scheduler decides on which platform thread they run
       ◦ Synchronous socket operations are handled non-blockingly under the hood (see the sketch below)
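
To make the last point concrete, here is a minimal sketch using only the public JDK 21+ API (the target host and request are illustrative): the “blocking” read inside the virtual thread parks it and frees its carrier thread, instead of blocking an OS thread.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class VirtualThreadBlockingDemo {
    public static void main(String[] args) throws Exception {
        // The built-in scheduler mounts this virtual thread on a carrier
        // (platform) thread taken from its ForkJoin pool.
        Thread vt = Thread.ofVirtual().name("fetcher").start(() -> {
            try (Socket socket = new Socket("example.com", 80)) {
                OutputStream out = socket.getOutputStream();
                out.write("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
                        .getBytes(StandardCharsets.US_ASCII));
                out.flush();
                InputStream in = socket.getInputStream();
                byte[] buf = new byte[4096];
                int n;
                // The synchronous read parks the virtual thread and releases
                // the carrier: under the hood the socket is polled
                // non-blockingly and the continuation is resumed when data
                // arrives.
                while ((n = in.read(buf)) != -1) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        vt.join();
    }
}
```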
  8. The wish-list
     • The scheduler should interleave Netty I/O with other VirtualThread(s) on the same platform thread
     • Users should be able to submit virtual threads local to specific Netty I/O threads, i.e. no hand-off (a sketch of this idea follows)
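
As a hedged sketch of what that wish-list could look like: an Executor that keeps work local to one Netty event loop. This class is purely illustrative; as the comments note, actually plugging it in as a virtual-thread scheduler relies on Loom’s custom-scheduler hook, which is still JDK-internal and not a public API.

```java
import java.util.concurrent.Executor;
import io.netty.channel.EventLoop;

// Illustrative sketch only. A custom virtual-thread scheduler would hand each
// resumed continuation to the Netty event loop that owns the connection, so
// application work and I/O interleave on the same platform thread with no
// cross-pool hand-off. NOTE: wiring this into Thread.ofVirtual() requires
// Loom's custom-scheduler hook, which is JDK-internal, not public API.
final class EventLoopScheduler implements Executor {

    private final EventLoop eventLoop;

    EventLoopScheduler(EventLoop eventLoop) {
        this.eventLoop = eventLoop;
    }

    @Override
    public void execute(Runnable continuation) {
        // Queue the continuation on its owning event loop; it runs between
        // selector wake-ups on the very same OS thread that performs the
        // Netty I/O, instead of waking up a thread in a separate FJ pool.
        eventLoop.execute(continuation);
    }
}
```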
  9. Get back to the lab!
     • Using a DBMS is not ideal: we need MORE control!
     • Let’s create a setup that resembles the original benchmark
     • Just Netty + Jackson + a blocking HTTP client (a minimal sketch follows)
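
A minimal sketch of such a lab setup, under stated assumptions: the class name, port, and mock-server call are illustrative, Jackson is omitted for brevity, and the handler uses the public Thread.startVirtualThread hand-off rather than the custom scheduler.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.DefaultFullHttpResponse;
import io.netty.handler.codec.http.FullHttpRequest;
import io.netty.handler.codec.http.FullHttpResponse;
import io.netty.handler.codec.http.HttpHeaderNames;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpResponseStatus;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.handler.codec.http.HttpVersion;
import java.nio.charset.StandardCharsets;

public class HandoffServer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup(2); // 2 Netty event loops
        try {
            new ServerBootstrap()
                .group(group)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline()
                          .addLast(new HttpServerCodec())
                          .addLast(new HttpObjectAggregator(64 * 1024))
                          .addLast(new SimpleChannelInboundHandler<FullHttpRequest>() {
                              @Override
                              protected void channelRead0(ChannelHandlerContext ctx,
                                                          FullHttpRequest req) {
                                  // The hand-off: blocking work leaves the event
                                  // loop and runs on a virtual thread instead.
                                  Thread.startVirtualThread(() -> {
                                      // Placeholder for the blocking HTTP client
                                      // call to the mock server.
                                      String body = callMockServer();
                                      FullHttpResponse resp = new DefaultFullHttpResponse(
                                              HttpVersion.HTTP_1_1, HttpResponseStatus.OK,
                                              ctx.alloc().buffer().writeBytes(
                                                      body.getBytes(StandardCharsets.UTF_8)));
                                      resp.headers().set(HttpHeaderNames.CONTENT_LENGTH,
                                              resp.content().readableBytes());
                                      // Safe from another thread: Netty marshals
                                      // the write back onto the owning event loop.
                                      ctx.writeAndFlush(resp);
                                  });
                              }
                          });
                    }
                })
                .bind(8080).sync().channel().closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }

    private static String callMockServer() {
        return "{\"status\":\"ok\"}"; // illustrative canned response
    }
}
```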
  10. Test configuration
      • 100 concurrent clients
      • 1 handoff server with 2 cores:
        ◦ Built-in Loom scheduler: 2 FJ OS threads + 2 Netty event loops = 4 OS threads
        ◦ Custom Netty scheduler: 2 custom-scheduler OS threads = 2 OS threads
      • 1 mock server with 1 ms “think time”, i.e. 1,000 tps per HTTP connection
      • We measure the peak throughput
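
(An aside on reproducing the “2 FJ OS threads” topology: the built-in scheduler’s pool size can be pinned with the documented JDK system properties jdk.virtualThreadScheduler.parallelism and jdk.virtualThreadScheduler.maxPoolSize, presumably both set to 2 here.)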
  11. Throughput results: +18.97%, not bad!

      Metric                         FJ 2 + 2 I/O   Custom Scheduler   Improvement
      Requests/sec                   60,459.04      71,926.67          18.97% increase
      Context-switches               0.29/op        0.005/op           58.0x less
      CPU migrations                 0.0038/op      0.00034/op         11.18x less
      Cycles                         4.100 GHz      4.271 GHz          4.17% increase
      Instructions per cycle (IPC)   1.56           1.58               1.28% increase
  12. …what if we have an I/O-bound workload?
      • 100 concurrent clients -> 10K concurrent clients
      • Handoff server with 2 cores -> 16 cores
      • Mock server with 1 ms “think time” -> 30 ms
      • Fixed* throughput: 50K tps
  13. OOTB FJ + Netty

      Metric                         FJ 16 + 16 I/O   Custom Scheduler   Improvement
      CPUs utilized                  5.269            3.378              35.89% less
      Context-switches               4.16/op          1.69/op            2.46x less
      CPU migrations                 1.04/op          0.08/op            12.41x less
      Cycles                         3.596 GHz        3.874 GHz          7.73% increase
      Instructions per cycle (IPC)   0.75             0.85               13.33% increase
  14. “Tuned” FJ + Netty, i.e. puppies === bowls

      Metric                         FJ 8 + 8 I/O   Custom Scheduler   Improvement
      CPUs utilized                  4.494          3.378              24.83% less
      Context-switches               3.58/op        1.69/op            2.12x less
      CPU migrations                 0.14/op        0.08/op            1.67x less
      Cycles                         3.631 GHz      3.874 GHz          6.69% increase
      Instructions per cycle (IPC)   0.77           0.85               10.39% increase

      Why is the custom scheduler still more efficient?
  15. Thread pools: “You snooze, you lose”

      Total voluntary ctx-switches:
      • 8 FJ + 8 I/O:      149,896 / sec
      • Custom scheduler:   69,663 / sec

      Voluntary context switches happen once threads park: a separate thread-pool topology has more chances for parking and wake-ups under any non-saturating load! (A tiny illustration follows.)
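
A tiny, self-contained illustration of that parking cost; the sleep and the single thread are artificial stand-ins for an idle pool thread and a cross-pool wake-up.

```java
import java.util.concurrent.locks.LockSupport;

// A platform thread with no local work parks (a voluntary context switch);
// the wake-up from another thread costs more scheduler work. With two
// separate pools (FJ + Netty I/O) handing requests back and forth, these
// park/unpark pairs multiply at any non-saturating load.
public class ParkUnparkDemo {
    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(() -> {
            System.out.println("no local work: parking (voluntary context switch)");
            LockSupport.park();
            System.out.println("unparked: work handed over from the other pool");
        });
        worker.start();
        Thread.sleep(100);           // stand-in for the gap before work arrives
        LockSupport.unpark(worker);  // the wake-up "you snooze, you lose" pays for
        worker.join();
    }
}
```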
  16. Economic implications
      • c6a.2xlarge (8 vCPUs)
      • 1.33 capacity ratio in favour of the custom scheduler
      • $2,940 / year per instance (see c6a.2xlarge pricing on AWS EC2, at ~730 hours / month)
      • The same load requires 1/1.33 = 0.7519 of the fleet, i.e. ~24.81% fewer instances
      • With a fleet of 10K instances, we can remove 2,481 instances
      Savings ≈ 2,481 × $2,940 ≈ $7.29M / year
  17. Summary: is it worth it?
      • Squeezing more capacity out of the hardware reduces the number of instances
      • Fewer instances means $$$ savings
      • Amortizing wake-ups improves CPU utilization…
      • …with less carbon emitted
      • And it took just 447 LOC(!)
  18. Engineering delivers within constraints
      • Replacing a networking framework is not feasible:
        ◦ Security
        ◦ Performance characteristics
        ◦ High-level protocols
        ◦ Custom native transports
        ◦ …
      • A custom scheduler enables existing frameworks with a “Reactive” core to embrace Loom efficiently
  19. The unsung heroes
      • IBM OpenJDK Team
      • IBM Quarkus Team
      • IBM App Services Performance Team
      • Oracle Loom Team
      • Micronaut OracleLabs Team
      • Netflix
      • Apple