Improving Rust Performance Through Profiling and Benchmarking

Understanding Rust Performance Steve Jenson @stevej

BIOGRAPHY • Engineer at Buoyant • Work on Rust and
Scala code

LINKERD-TCP • TCP proxy written in Rust • Designed to
work in cloud-native environments • Protocols coming! • Apache licensed • github.com/linkerd/linkerd-tcp

TACHO • Stats library for instrumenting applications • Counter - how many times something happened • Timing - how long something takes (w/ histogram) • Gauge - value at a point in time • Apache licensed • github.com/linkerd/tacho

TACHO example/multithread.rs
    thread::spawn(move || {
        let mut prior = None;
        for i in 0..10_000_000 {
            let t0 = Timing::start();
            current_iter.set(i);
            loop_counter.incr(1);
            if let Some(p) = prior {
                loop_iter_us.add(p);
            }
            prior = Some(t0.elapsed_us());
        }
        if let Some(p) = prior {
            loop_iter_us.add(p);
        }
        work_done_tx.send(()).expect("could not send");
    });

OUTLINE • Causes of Slowness • Rust-specific pitfalls • Tools
• IPC

CAUSES OF SLOWNESS

LOCK CONTENTION CPU UTILIZATION MEMORY STALLS

MEMORY STALLS

MEMORY HIERARCHY LATENCY GUIDELINES • Register: 0.5 nanoseconds • Last-level cache: 10 nanoseconds • RAM: 100 nanoseconds • Source: "Numbers every programmer should know" by Jeff Dean
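The gap between cache and RAM latency can be observed with a small experiment. The sketch below is an illustration, not part of the talk: the function names, the array size, and the stride are assumptions. Both functions compute the same sum, but the strided version jumps across cache lines on nearly every read, so on most machines it tends to run noticeably slower.

```rust
use std::time::Instant;

/// Sum every element, visiting memory in order (cache-friendly).
fn sum_sequential(data: &[u64]) -> u64 {
    data.iter().sum()
}

/// Sum every element, but jump `stride` elements between reads so that
/// consecutive reads usually land on different cache lines.
/// A `stride` of 0 visits nothing and returns 0.
fn sum_strided(data: &[u64], stride: usize) -> u64 {
    let mut total = 0u64;
    for start in 0..stride {
        let mut i = start;
        while i < data.len() {
            total += data[i];
            i += stride;
        }
    }
    total
}

fn main() {
    // 8M u64 values = 64 MB, assumed to be larger than the last-level cache.
    let data: Vec<u64> = (0..8_000_000).collect();

    let t0 = Instant::now();
    let a = sum_sequential(&data);
    let sequential = t0.elapsed();

    let t1 = Instant::now();
    let b = sum_strided(&data, 1024);
    let strided = t1.elapsed();

    assert_eq!(a, b);
    println!("sequential: {:?}, strided: {:?}", sequential, strided);
}
```

Build with `--release` before comparing the two timings; the exact ratio depends on the CPU and its cache sizes.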

LOCK CONTENTION

LOCK CONTENTION • spin loops • blocking wait
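To make the two waiting styles concrete, here is a minimal sketch using only the standard library. The function names are made up for this illustration. The spinning waiter keeps a core busy while it waits, which shows up as CPU utilization; the blocking waiter sleeps until it is notified.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Spin loop: the waiter burns CPU cycles until the flag flips.
fn wait_spinning() -> bool {
    let ready = Arc::new(AtomicBool::new(false));
    let setter = Arc::clone(&ready);
    let worker = thread::spawn(move || setter.store(true, Ordering::Release));
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop(); // hint to the CPU that we are busy-waiting
    }
    worker.join().expect("worker panicked");
    ready.load(Ordering::Acquire)
}

/// Blocking wait: the waiter sleeps until the worker notifies it.
fn wait_blocking() -> bool {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let setter = Arc::clone(&pair);
    let worker = thread::spawn(move || {
        let (lock, cvar) = &*setter;
        *lock.lock().expect("mutex poisoned") = true;
        cvar.notify_one();
    });
    let (lock, cvar) = &*pair;
    let mut ready = lock.lock().expect("mutex poisoned");
    // Loop to guard against spurious wakeups.
    while !*ready {
        ready = cvar.wait(ready).expect("mutex poisoned");
    }
    let result = *ready;
    drop(ready);
    worker.join().expect("worker panicked");
    result
}

fn main() {
    println!("spinning waiter saw ready = {}", wait_spinning());
    println!("blocking waiter saw ready = {}", wait_blocking());
}
```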

CPU UTILIZATION

CPU UTILIZATION • Can hide memory latency (slow instructions) •
Can hide lock contention (spin loops) • Idleness is often counted as useful work • 90% utilized can also mean 80% waiting for RAM or disk

RUST-SPECIFIC PITFALLS

#[derive(Copy)] on large structs • Copy semantics can be a life-saver • Overuse can kill memory bandwidth • Most common reason: "It was small when I first derived it!"

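#[derive(Copy)] on large structs
    #[derive(Clone, Copy)] // Copy also requires Clone
    struct Person<'a> {
        user_id: u64,
        name: &'a str,
        // Whoops! 800MB! Should be a reference
        // (an inline array here, because a Vec cannot be Copy)
        dna: [u8; 800_000_000],
    }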
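One quick way to check whether a Copy type has quietly grown is to print its size. The sketch below uses a hypothetical `BigRecord` with a 4 KiB inline payload as a scaled-down stand-in for the 800MB field on the slide. Passing it by value may copy every byte on each call (the optimizer can sometimes elide this), while passing a reference only moves a pointer.

```rust
use std::mem::size_of;

/// Hypothetical record with a large inline payload.
#[derive(Clone, Copy)]
struct BigRecord {
    user_id: u64,
    payload: [u8; 4096],
}

/// By value: because the type is Copy, each call may copy the whole struct.
fn checksum_by_value(r: BigRecord) -> u64 {
    r.user_id + r.payload.iter().map(|&b| u64::from(b)).sum::<u64>()
}

/// By reference: only a pointer-sized value is passed.
fn checksum_by_ref(r: &BigRecord) -> u64 {
    r.user_id + r.payload.iter().map(|&b| u64::from(b)).sum::<u64>()
}

fn main() {
    let r = BigRecord { user_id: 7, payload: [1; 4096] };
    println!("size_of::<BigRecord>()  = {} bytes", size_of::<BigRecord>());
    println!("size_of::<&BigRecord>() = {} bytes", size_of::<&BigRecord>());
    // `r` is still usable after the by-value call because it was copied.
    assert_eq!(checksum_by_value(r), checksum_by_ref(&r));
}
```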

clone() in a loop • Can saturate memory bandwidth • clone() can be an easy way to satisfy the borrow checker • Thankfully, easy to spot in a profile

clone() in a loop
    for person in &people {
        friends.push(person.clone());
    }
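When the collected values do not need to outlive the source, borrowing is one way to avoid the per-iteration clone. The sketch below is an assumption about how such a loop might look; the `Person` type and both function names are hypothetical. It contrasts a cloning version with one that stores references.

```rust
/// Hypothetical type; the heap-allocated String makes each clone allocate.
#[derive(Clone, Debug, PartialEq)]
struct Person {
    name: String,
}

/// Clones every element: one allocation and copy per person.
fn friends_cloned(people: &[Person]) -> Vec<Person> {
    let mut friends = Vec::with_capacity(people.len());
    for person in people {
        friends.push(person.clone());
    }
    friends
}

/// Borrows every element: only pointers are stored, nothing is copied.
fn friends_borrowed(people: &[Person]) -> Vec<&Person> {
    people.iter().collect()
}

fn main() {
    let people = vec![
        Person { name: "Ada".to_string() },
        Person { name: "Grace".to_string() },
    ];
    let cloned = friends_cloned(&people);
    let borrowed = friends_borrowed(&people);
    println!("cloned {} people, borrowed {} people", cloned.len(), borrowed.len());
}
```

The borrowed version ties the result's lifetime to `people`. If that is too restrictive, shared ownership (for example `Rc` or `Arc`) is another option, at the cost of reference counting.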

DEFAULT HASHER IN THE STANDARD HASHMAP • Cryptographically strong for
DoS protection • Well-known trade-off for Rustaceans • Surprises programmers new to Rust • Lots of great alternatives!

DEFAULT HASHER IN THE STANDARD HASHMAP
    // FnvHasher comes from the fnv crate; BuildHasherDefault is in std::hash
    let map: HashMap<Vec<u8>, u32, BuildHasherDefault<FnvHasher>> = Default::default();
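To show how a replacement hasher plugs into `HashMap` without pulling in an external crate, the sketch below hand-writes a minimal 64-bit FNV-1a hasher. The `Fnv1a` type and the `FnvMap` alias are names invented for this example; in real code a maintained crate is likely the better choice. Note the trade-off from the slide: a hasher like this gives up the DoS protection of the default.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Minimal FNV-1a (64-bit) hasher, written out so the example is self-contained.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A HashMap that builds a fresh `Fnv1a` for every hash computation.
type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut counts: FnvMap<Vec<u8>, u32> = Default::default();
    for word in ["perf", "ipc", "perf"].iter() {
        *counts.entry(word.as_bytes().to_vec()).or_insert(0) += 1;
    }
    println!("perf seen {:?} times", counts.get(&b"perf"[..]));
}
```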

EXPENSIVE ARGUMENTS TO expect() • Don’t use expensive expansions as arguments to expect() • Not specific to expect(): any function argument is evaluated eagerly, whether or not it ends up being used

EXPENSIVE ARGUMENTS TO expect()
    // format! runs on every call, even when the index is valid
    let index = self.to_byte_index(index)
        .expect(&format!("invalid index! {:?} in {:?}", index, s));
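One common way to defer the formatting cost is to build the message inside a closure, so it only runs on the failure path. In the sketch below, the free function `to_byte_index` is a hypothetical stand-in for the method on the slide, and `eager` and `lazy` are illustrative names.

```rust
/// Hypothetical stand-in for the slide's `to_byte_index`:
/// maps a character index to its byte offset in `s`.
fn to_byte_index(s: &str, char_index: usize) -> Option<usize> {
    s.char_indices().nth(char_index).map(|(byte, _)| byte)
}

/// Eager: `format!` allocates and formats on every call, even on success.
fn eager(s: &str, index: usize) -> usize {
    to_byte_index(s, index).expect(&format!("invalid index! {:?} in {:?}", index, s))
}

/// Lazy: the closure, and the formatting inside it, only runs on failure.
fn lazy(s: &str, index: usize) -> usize {
    to_byte_index(s, index)
        .unwrap_or_else(|| panic!("invalid index! {:?} in {:?}", index, s))
}

fn main() {
    let s = "héllo";
    println!("eager: {}, lazy: {}", eager(s, 2), lazy(s, 2));
}
```

Clippy's `expect_fun_call` lint flags the eager form.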

PREALLOCATE Vec WHEN POSSIBLE • If you have a sense
of how many items you’ll need, use that as your initial capacity

PREALLOCATE Vec WHEN POSSIBLE
    let buf = {
        let sz = self.buffer_size.unwrap_or(DEFAULT_BUFFER_SIZE);
        vec![0u8; sz]
    };
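When the vector is filled by pushing rather than created at full length, `Vec::with_capacity` plays the same role. The sketch below uses a hypothetical helper, `count_reallocations`, that counts how often the capacity changes while pushing `n` items, as a rough proxy for reallocations.

```rust
/// Push `n` items into a Vec created with `initial` capacity and count
/// how many times the capacity changed along the way.
fn count_reallocations(n: usize, initial: usize) -> usize {
    let mut v: Vec<u64> = Vec::with_capacity(initial);
    let mut cap = v.capacity();
    let mut grows = 0;
    for i in 0..n as u64 {
        v.push(i);
        if v.capacity() != cap {
            grows += 1;
            cap = v.capacity();
        }
    }
    grows
}

fn main() {
    println!("no preallocation: {} reallocations", count_reallocations(1_000_000, 0));
    println!("preallocated:     {} reallocations", count_reallocations(1_000_000, 1_000_000));
}
```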

MAC TOOLS • Instruments • cargo bench • cargo benchcmp
LINUX TOOLS • perf • FlameGraphs • VTune • cargo bench • cargo benchcmp

CARGO BENCH AND BENCHCMP

CARGO BENCH • Microbenchmarking tool • Part of the standard
tooling • Great for important parts of your API

CARGO BENCH
    #[bench]
    fn bench_counter_create(b: &mut Bencher) {
        let (metrics, _) = super::new();
        b.iter(move || {
            let _ = metrics.counter("counter_name");
        });
    }
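`#[bench]` and `test::Bencher` currently require a nightly toolchain. To illustrate what an ns/iter figure measures, here is a rough hand-rolled timing loop using only the stable standard library. `ns_per_iter` is a hypothetical helper, not part of cargo bench, and it does none of the warm-up or statistical analysis a real harness performs.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Run `f` `iters` times and return the mean wall-clock nanoseconds per call.
/// `iters` should be non-zero, otherwise the result is NaN.
fn ns_per_iter<F: FnMut()>(iters: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iters)
}

fn main() {
    // black_box keeps the optimizer from deleting the work being measured.
    let ns = ns_per_iter(1_000_000, || {
        black_box(black_box(2u64).pow(10));
    });
    println!("pow: {:.1} ns/iter", ns);
}
```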

CARGO BENCH
    test tests::bench_counter_create       ... bench:         143 ns/iter (+/- 71)
    test tests::bench_counter_create_x1000 ... bench:     429,854 ns/iter (+/- 277,291)
    test tests::bench_counter_update       ... bench:          23 ns/iter (+/- 10)
    test tests::bench_counter_update_x1000 ... bench:         955 ns/iter (+/- 141)
    test tests::bench_gauge_create         ... bench:         136 ns/iter (+/- 19)
    test tests::bench_gauge_create_x1000   ... bench:     415,618 ns/iter (+/- 301,114)
    test tests::bench_gauge_update         ... bench:          17 ns/iter (+/- 7)
    test tests::bench_gauge_update_x1000   ... bench:       3,327 ns/iter (+/- 519)
    test tests::bench_scope_clone          ... bench:          64 ns/iter (+/- 11)
    test tests::bench_scope_clone_x1000    ... bench:     177,623 ns/iter (+/- 103,595)
    test tests::bench_scope_label          ... bench:         164 ns/iter (+/- 91)
    test tests::bench_scope_label_x1000    ... bench:     269,845 ns/iter (+/- 54,753)
    test tests::bench_stat_add_x1000       ... bench:       2,575 ns/iter (+/- 425)
    test tests::bench_stat_create          ... bench:         131 ns/iter (+/- 45)
    test tests::bench_stat_create_x1000    ... bench:     412,913 ns/iter (+/- 121,406)
    test tests::bench_stat_update          ... bench:          47 ns/iter (+/- 4)
    test tests::bench_stat_update_x1000    ... bench:       2,694 ns/iter (+/- 1,243)

CARGO BENCHCMP • Compare two cargo bench runs • Great
for avoiding performance regressions • github.com/BurntSushi/cargo-benchcmp

HOW TO CONSTRUCT A MACROBENCHMARK

HOW TO CONSTRUCT A MACROBENCHMARK • Microbenchmarks are limited in
utility • Measure your code running in context • Exercise a reasonable subset of your API • In one of our macrobenchmarks, we loop 10,000,000 times and do work each loop

HOW TO CONSTRUCT A MACROBENCHMARK • cargo build --release --example multithread • Always use release builds for profiling • For symbols, add this to your Cargo.toml:
    [profile.release]
    debug = true

HOW TO CONSTRUCT A MACROBENCHMARK • tacho has two macrobenchmarks
• single-threaded (simple.rs) • multi-threaded (multithread.rs)

TACHO example/multithread.rs
    thread::spawn(move || {
        let mut prior = None;
        for i in 0..10_000_000 {
            let t0 = Timing::start();
            current_iter.set(i);
            loop_counter.incr(1);
            if let Some(p) = prior {
                loop_iter_us.add(p);
            }
            prior = Some(t0.elapsed_us());
        }
        if let Some(p) = prior {
            loop_iter_us.add(p);
        }
        work_done_tx.send(()).expect("could not send");
    });

INSTRUCTIONS PER CYCLE • IPC is a useful empirical metric
• How many instructions are completed every clock cycle

BASIC CPU ARCHITECTURE • Executes instructions serially • How we
learn CPU architecture in school • Not how it works on modern Intel CPUs • Deep pipelines, dependent on other work

HOW DO WE KNOW IF WE’RE HITTING PEAK PERFORMANCE? • Instructions can depend on other instructions • Performance is dictated by keeping the pipeline full • So how do we know we’re doing well?

INTEL PERFORMANCE COUNTERS

INTEL PERFORMANCE COUNTERS • Intel engineers had the same question
• Added Performance Monitor Counters • How often certain events happen • Allow you to calculate ratios

INTEL PERFORMANCE COUNTERS • Number of counters is daunting • Hundreds of counters • 400+ pages of documentation • Allow you to calculate derived metrics

INSTRUCTIONS PER CYCLE

INSTRUCTIONS PER CYCLE • How many instructions the core can “retire” per cycle • < 1.0 often means memory stalled • > 1.0 often means instruction stalled • You can learn this empirically! • On a 3-wide core, the theoretical max IPC is 3.0

INSTRUMENTS (MAC)

INTEL PMCS • Available directly in Instruments • Counter •
Recording Options • Events • Can create formula from PMCs

INSTRUMENTS (MAC) TAKEAWAY • Lots of easy-to-use performance tools • Unfortunately, many are specific to Cocoa programming • A simple way to access performance counters

PERF (LINUX)

PERF • Linux kernel and user space • Sampling profiler
with configurable sampling rate • Constantly being improved

PERF STAT: IPC
    $ sudo perf stat target/release/examples/multithread

     Performance counter stats for 'target/release/examples/multithread':

          12268.515738  task-clock (msec)  #    1.601 CPUs utilized
               374,342  context-switches   #    0.031 M/sec
                     3  cpu-migrations     #    0.000 K/sec
                   505  page-faults        #    0.041 K/sec
        26,206,859,982  cycles             #    2.136 GHz
        13,711,393,152  instructions       #    0.52  instructions per cycle
         2,838,706,433  branches           #  231.381 M/sec
             7,730,077  branch-misses      #    0.27% of all branches

           7.663850635 sec time elapsed
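The derived figures in the perf stat output are simple ratios of the raw counters. As a small worked example, the sketch below recomputes the IPC and the branch-miss percentage from the counts reported above. The function names are made up for this illustration, and the 3-wide core is the hypothetical one from the IPC slide, not a statement about the machine that produced these numbers.

```rust
/// Instructions retired per clock cycle.
fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

/// Branch misses as a percentage of all branches.
fn branch_miss_percent(misses: u64, branches: u64) -> f64 {
    100.0 * misses as f64 / branches as f64
}

fn main() {
    // Raw counts taken from the perf stat run above.
    let cycles = 26_206_859_982u64;
    let instructions = 13_711_393_152u64;
    let branches = 2_838_706_433u64;
    let branch_misses = 7_730_077u64;

    let value = ipc(instructions, cycles);
    println!("IPC: {:.2}", value); // prints "IPC: 0.52"
    println!(
        "branch misses: {:.2}%",
        branch_miss_percent(branch_misses, branches)
    ); // prints "branch misses: 0.27%"

    // Against a hypothetical 3-wide core, this is the share of peak IPC reached.
    println!("share of a 3.0 peak: {:.0}%", 100.0 * value / 3.0);
}
```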

PERF CACHE MISSES/HITS

PERF TAKEAWAY • Deep tooling • Low overhead • Kernel and user space • Linux-specific • Scheduler analysis • IO and network subsystems

FLAMEGRAPHS

FLAMEGRAPHS • Sample what’s on the CPU • Aggregate the
call stacks • Gives you a sense of the shape of your program • The color change has no semantic value • Mouse-over for extra info • Can drill into stacks • Peak is what’s on the CPU at sample time

FLAMEGRAPHS TAKEAWAY • Really useful for looking at a long-running program • Netflix pioneered this technique for measuring the health of their online services • Needs symbols!

VTUNE • Made by Intel • Helps make sense of
the many Performance Counters • Tooltips! • GUI (works with ssh X forwarding on macOS) • CLI with CSV • Free for open source developers

VTUNE TAKEAWAY • Intel knows their CPUs better than anyone • VTune is detailed and powerful • Overwhelming at first • Helpful tooltips!

LESSON LEARNED • While preparing this talk, I learned something!
• VTune highlighted a ‘Remote Cache’ issue • Oh no! One of my threads was running on a different socket! • Cache hit rate improved with taskset

BEFORE TASKSET

AFTER TASKSET

TAKEAWAYS • Performance is hard to understand • Need an
empirical measurement • IPC is one empirical measurement • The best tool is the one you use

THANKS! A special thank you to Eliza Weisman for the
Instruments walk-through, screenshots, and feedback

@linkerd github.com/linkerd linkerd.io @buoyantIO buoyant.io
