Performance Engineering for Everyone

Should GitHub talk about performance?

The Problem with Guessing
Most of us treat performance like a guessing game. We know something is slow, but we rely on suspicions rather than evidence. "This is fine," we say, as latency spikes and users complain.

Our Relationship with Performance
The Sensation: Metrics show latency going up. A page feels heavy. Everyone can feel the drag, but the mechanics remain hidden behind high-level dashboards.
The Suspicion: Our first instinct is to look at a spike and start guessing. We blame the database, the network, or the last deploy without verifying the actual path.

The Instinctive (and Suboptimal) Fixes
Add an Index: Assuming the database is the bottleneck is common. We add indexes and hope they hit.
Enable Caching: Papering over the cracks with a cache often hides the inefficient logic beneath.
Scale Hardware: Throwing larger hardware at the problem is expensive and avoids the root software cause.

Why Intuition is a Trap
Don't trust your gut. Performance optimization is counter-intuitive. The method you've been suspicious of for years? It's probably fine. The innocuous single line of code that nobody looks at? That might be your main culprit.

Intuition vs. Measurement
Scenario | Guessing Approach | Flamegraph Approach
Slow Page Load | Add Redis cache layer | Profile reveals N+1 query
CPU Spike | Upgrade to beefier machines | Finds regex backtracking bug
Memory Leak | Restart pods regularly | Identifies garbage collection events

The Profiling Paradigm Shift
Moving from "I think" to "I know". Let's look at how we visualize the stack.

The Power of the Flamegraph
Chaos becomes a map.

Stack Chart
It's an upside-down flamegraph.

Meet Vernier
A modern Ruby profiler. Created by John Hawthorn at GitHub.
What It Captures
› All threads simultaneously
› SQL queries performed
› Feature flag checks
› Garbage collector events and memory allocation
› Idle time, RPC calls, cache calls, etc.

Installing the Vernier gem in Rails
# 1. Add to your Gemfile:
gem "vernier", group: :development
# 2. Add to a controller:
around_action :vernier_profile, if: -> { params[:flamegraph] }
# 3. Then visit GET /posts?flamegraph=1
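The around_action callback names a vernier_profile method that the slide does not show. A minimal sketch of what it could look like, assuming Vernier's block API Vernier.profile(out:). The module name and the tmp/ output path are hypothetical, and the sketch relies on Bundler loading the gem from the Gemfile, so it has no explicit require.

```ruby
# Hypothetical concern; the module name and output path are assumptions,
# not something the talk shows.
module VernierProfiling
  private

  # Wraps the controller action in a Vernier profile. Rails passes the action
  # to an around_action callback as a block, so `yield` runs the action.
  def vernier_profile
    out = File.join("tmp", "#{Time.now.to_i}.vernier.json")
    Vernier.profile(out: out) { yield }
  end
end
```

Including this module in a controller next to the around_action line would write one profile file per request that carries the flamegraph param.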

Where can I see the flamegraph?
› https://vernier.prof
› https://gh.io/flameviewer

Viewing Your Profile
https://gh.io/flameviewer
Safe for production data: it processes files entirely in your browser. Nothing is uploaded.

How does GitHub generate flamegraphs?
Web Requests: add a query param to any GitHub URL (?flamegraph=1).
API Requests: use gh api or curl with the same param (?flamegraph=1).
Use the Flamegraph Copilot Skill: an in-house flamegraph skill that lets you ask Copilot to get the flamegraph for you.

Ask Copilot to get the flamegraph
› "Capture a flamegraph of https://github.com/github/github"
› "Profile the repos API endpoint https://github.com/repos/github/github"
› "Get a Vernier profile of this GraphQL query"
› "Why is this page slow? Capture a flamegraph"

The Shapes You're Looking For
You don't need a PhD to read a flamegraph. You just need to spot the shapes.

Pattern 1: The Comb Teeth (N+1)
The same SQL frame stacked many times horizontally. Each tooth is a separate database call that could have been batched or preloaded.
The simple approach: look for repeating vertical bars, like teeth on a comb.
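To make the comb concrete, here is a small sketch that uses an in-memory stand-in for a database and counts queries. The FakeDB class and the sample posts are hypothetical. It only illustrates why one lookup per row produces many teeth while a batched lookup produces a single bar.

```ruby
# In-memory stand-in for a database that counts how many queries it receives.
class FakeDB
  attr_reader :query_count

  def initialize(authors)
    @authors = authors # { id => name }
    @query_count = 0
  end

  # One query per id: each call is one "tooth" of the comb.
  def find_author(id)
    @query_count += 1
    @authors[id]
  end

  # One query for all ids: the batched or preloaded form.
  def find_authors(ids)
    @query_count += 1
    @authors.slice(*ids)
  end
end

posts = [
  { title: "A", author_id: 1 },
  { title: "B", author_id: 2 },
  { title: "C", author_id: 1 },
]
authors = { 1 => "Ada", 2 => "Grace" }

# N+1: one lookup per post.
n_plus_one_db = FakeDB.new(authors)
n_plus_one = posts.map { |post| n_plus_one_db.find_author(post[:author_id]) }

# Batched: collect the ids first, then look them all up at once.
batched_db = FakeDB.new(authors)
by_id = batched_db.find_authors(posts.map { |post| post[:author_id] }.uniq)
batched = posts.map { |post| by_id[post[:author_id]] }

puts "N+1 queries: #{n_plus_one_db.query_count}, batched queries: #{batched_db.query_count}"
```

In a Rails app the batched form usually corresponds to loading the association up front with includes or preload rather than touching it inside a loop.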

Pattern 2: A suspiciously large SQL query
At the bottom of the flamegraph viewer we can see SQL queries. An extra-long bar is likely a sign of an inefficient SQL query. In our example we spend 700ms loading labels for the Watch button.
The simple approach: look for big SQL blocks.

Pattern 2: Hover over the query to see it
If we hover over the query we can see what it was.

Pattern 2.1: A suspiciously wide bar
The simple approach: find the fattest rectangle. That might be your problem.

The Watch Button

Pattern 3: A lot of garbage collection events
The simple approach: what is causing a lot of GC pauses? Try to avoid rampant creation of objects.
In some cases we see the comb pattern but it doesn’t make any SQL calls. However, it does trigger “Garbage collection” events. This usually means we’re creating too many objects in Ruby, causing the garbage collector to pause the request to recover memory. This is especially common with GraphQL.
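One way to check whether a piece of code creates too many objects is to count allocations around it. The sketch below uses Ruby's built-in GC.stat counter. The allocations helper and the sample rows are hypothetical, and the exact numbers vary between Ruby versions, so only the relative difference matters.

```ruby
# Counts how many objects a block allocates, using Ruby's built-in GC counters.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

rows = Array.new(10_000) { |i| { id: i, name: "user#{i}" } }

# Builds a new hash for every row plus two intermediate arrays.
heavy = allocations do
  rows.map { |row| row.merge(seen: true) }.map { |row| row[:id] }.sum
end

# Reads the same values without creating per-row objects.
lean = allocations { rows.sum { |row| row[:id] } }

puts "heavy: #{heavy} objects, lean: #{lean} objects"
```

Both blocks compute the same sum; the heavy version simply gives the garbage collector far more to clean up.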

Three shapes

Case Study: GitHub Scale

Profiling The Repo Page

The Packages sidebar

But that check was repeated in an XHR request

What We Found
A method called might_have_packages? was consuming 199ms on every page load. Most of that time was spent building an Elasticsearch query on the fly to answer a yes-or-no question: "does the repo have any packages?" Maybe. On one of the most visited pages on the internet.
Nobody would have guessed that a packages sidebar check was the most expensive thing on this page.

Result
› -23%: CPU reduction
› -46%: Elasticsearch CPU cores

The Three Optimisation Strategies
1. Don't Do It: Delete the code entirely. The fastest code is code that never runs. This was our fix. 60 lines deleted.
2. Do It Cheaper: Batch, short-circuit, cache it, use a better algorithm, reduce object allocations.
3. Do It Later: Background job, lazy evaluation, defer to a non-critical path.
The biggest wins almost always come from #1 and #2.
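As an illustration of strategy 2 only, here is a sketch of a yes-or-no check that short-circuits and memoizes instead of loading everything. The Repo class and its methods are hypothetical and are not the code from the case study, where the actual fix was strategy 1.

```ruby
# Hypothetical repository whose package list is expensive to enumerate.
class Repo
  attr_reader :packages_examined

  def initialize(package_names)
    @package_names = package_names
    @packages_examined = 0
  end

  # Expensive form: loads every package just to answer yes or no.
  def packages_slow?
    each_package.to_a.size > 0
  end

  # Cheaper form: stops at the first package and memoizes the answer.
  def packages?
    return @has_packages if defined?(@has_packages)

    @has_packages = each_package.any?
  end

  # Yields packages one by one, counting how many were actually touched.
  def each_package
    return enum_for(:each_package) unless block_given?

    @package_names.each do |name|
      @packages_examined += 1
      yield name
    end
  end
end

slow = Repo.new(%w[a b c])
slow.packages_slow?
fast = Repo.new(%w[a b c])
2.times { fast.packages? }

puts "slow examined #{slow.packages_examined}, fast examined #{fast.packages_examined}"
```

The memoization uses defined? rather than ||= so that a false answer is cached too.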

AI-Assisted Analysis
Using Copilot to keep up with the pace of change.

The Velocity Problem
GitHub ships fast. Everyone is shipping faster. We can't manually investigate every latency spike.
› 65 million requests a day on just one page
› Hundreds of deploys per week
› Hundreds of feature flag checks per page load
› Humans make performance mistakes, and the codebase keeps growing
We need a way to keep up.
We’re not saying we should replace engineers with AI. We’re saying we can equip engineers with a faster way to find the signal in the noise.

Copilot + Flamegraphs
1. Generate a profile with Vernier
2. Convert to AI-readable format: vernier view --output=markdown WEB_REQUEST.vernier.json or use the “Copy findings” button at the top right of the flamegraph viewer.
3. Feed the summary to Copilot (not the raw JSON, it'll blow up the context). Have the profile stored locally so Copilot can explore it.
4. Ask it to identify the top bottlenecks
5. Verify everything it says against the actual flamegraph
Pro tip: Never feed raw Vernier JSON directly. Use the markdown output. The AI needs a summary, not a 50MB JSON blob.
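To show why a summary is so much smaller than the raw profile, here is a rough sketch that collapses stack samples into a short markdown table of the hottest frames. The input shape (arrays of frame names, root first) is an assumption made for illustration and is not Vernier's file format. The summarize helper is hypothetical, and in practice the vernier view output or the viewer's copy button does this job.

```ruby
# Summarises stack samples into a short markdown table of the hottest frames.
# The input shape is an assumption for illustration, not Vernier's file format:
# each sample is an array of frame names, root first, leaf last.
def summarize(samples, top: 3)
  self_counts = samples.map(&:last).tally
  lines = ["| Frame | Self samples | Share |", "| --- | --- | --- |"]
  ranked = self_counts.sort_by { |frame, count| [-count, frame] }.first(top)
  ranked.each do |frame, count|
    share = (100.0 * count / samples.size).round(1)
    lines << "| #{frame} | #{count} | #{share}% |"
  end
  lines.join("\n")
end

samples = [
  %w[request render sql_query],
  %w[request render sql_query],
  %w[request render gc],
  %w[request auth],
]

puts summarize(samples)
```

A few lines like these fit comfortably in a prompt, while the full list of samples would not.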

An Example Prompt
"You are a performance engineer. Given this flamegraph profile, identify the top 3 bottlenecks. Look for:
- Repeated frames (N+1 queries)
- Large SQL queries or methods that take a long time
- Garbage collection overhead
For each bottleneck, suggest which optimisation strategy applies: don't do it, do it cheaper, or do it later."

Or try the Analysis tab
The Flamegraph Viewer (elenatanasoiu.com/flamegraph-viewer) has an Analysis tab, which auto-detects common issues.

The Shape of the Job Has Changed, But Your Responsibility Has Not
Before
› Manually scan profiles line by line
› Hunt for patterns through intuition
› Investigate one spike at a time
› Performance work is a specialist skill
Now
› AI summarizes the chaos
› You decide what's real and what's noise
› Reviewing AI output is a core engineering skill
› Performance work is accessible to everyone

The Human with the Face
› Don't let the tool make the decision.
› If it hits production, it's on you, not the bot.

16 pull requests. All green. All ready to ship. None merged.

Who is feeling this pain? What were they trying to do? What did it cost them?

Developer Tools feedback mostly lives on...
› X
› Hacker News
› Reddit

X
› Latest, not Top
› min_faves:500 filter:verified
› DMs make follow-up easy.

Hacker News
› hn.algolia.com (not the default search)
› Search inside comments
› No DMs: look for email in bios.

Reddit
› site:reddit.com (on Google, not Reddit)
› Default search is bad
› DMs exist, but pseudonymity makes them awkward.

This is grunt work. So automate it.
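As one possible starting point for that automation, here is a sketch that searches Hacker News comments through the public Algolia API behind hn.algolia.com. The helper names and any search phrase are hypothetical. The endpoint and field names reflect my understanding of that API and should be checked against its documentation before relying on them.

```ruby
require "json"
require "net/http"
require "uri"

# Builds a Hacker News Algolia search URL restricted to comments, newest first.
def hn_comment_search_url(query)
  uri = URI("https://hn.algolia.com/api/v1/search_by_date")
  uri.query = URI.encode_www_form(query: query, tags: "comment")
  uri
end

# Extracts author, date and a link from the JSON body of a search response.
def parse_hits(body)
  JSON.parse(body).fetch("hits", []).map do |hit|
    {
      author: hit["author"],
      created_at: hit["created_at"],
      url: "https://news.ycombinator.com/item?id=#{hit["objectID"]}",
    }
  end
end

# Performs the network call; run it by hand or from a scheduled job.
def search_hn_comments(query)
  parse_hits(Net::HTTP.get(hn_comment_search_url(query)))
end
```

Calling search_hn_comments with a phrase users might type when complaining returns a list of authors and links to read through, which keeps the human part of the work focused on the follow-up conversation.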

What (not) to say.

"Sorry to hear about your experience! Please DM us with more details."

"We're aware of this and our team is working on improvements."

"Could you file an issue with reproduction steps? That'll help us prioritise."

What were they trying to do?

Their life, not your product. Don't ask if a feature would help. Ask them to walk you through the last time it happened.

The past, not the future. People are honest about what they did. They're optimistic about what they imagine they'd do.

Dig for cost. The first answer is rarely the real answer. Three follow-ups in, you get to the workaround.

He stopped. He reviews locally now.

Start with David. Replicate his experience.

Averages hide everything. Use CrUXVis if you don't have good internal data.

If we slice, David appears!
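A small sketch of the arithmetic behind that point: an overall mean blends everyone together, while a high percentile or a per-segment breakdown exposes the slow group. The timings and segment names below are made up for illustration and are not data from the talk.

```ruby
# Returns the nearest-rank percentile of a list of numbers.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank - 1, 0].max]
end

# Hypothetical page-load timings in milliseconds, tagged by user segment.
timings = [
  { segment: "typical", ms: 300 }, { segment: "typical", ms: 320 },
  { segment: "typical", ms: 310 }, { segment: "typical", ms: 290 },
  { segment: "typical", ms: 305 }, { segment: "typical", ms: 315 },
  { segment: "typical", ms: 295 }, { segment: "typical", ms: 300 },
  { segment: "power user", ms: 4200 }, { segment: "power user", ms: 3900 },
]

all_ms = timings.map { |t| t[:ms] }
overall_mean = all_ms.sum / all_ms.size
p95 = percentile(all_ms, 95)
puts "overall mean: #{overall_mean} ms, p95: #{p95} ms"

# Slicing by segment shows who is actually having the slow experience.
means_by_segment = timings.group_by { |t| t[:segment] }.transform_values do |rows|
  rows.sum { |t| t[:ms] } / rows.size
end
means_by_segment.each { |segment, mean| puts "#{segment}: mean #{mean} ms" }
```

The overall mean describes nobody in this data set: typical users are much faster than it and power users are much slower.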

Who: David. Principal engineer at an infrastructure company. Power user. And the data tells us he represents thousands.
What he was doing: Reviewing pull requests in the browser. Core workflow, not incidental.
What it cost him: He stopped. He reviews locally now.

Performance has a face. Find it.

Thank you Slides

More Decks by Elena Tanasoiu
