Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Engineering Large Systems When You're Not Googl...
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Charity Majors
April 30, 2018
Technology
20
5.6k
Engineering Large Systems When You're Not Google Or Facebook (test in prod)
lightning talk at Clever, 4/30/18
Charity Majors
April 30, 2018
Tweet
Share
More Decks by Charity Majors
See All by Charity Majors
The Twin Mandate of Observability
charity
4
2.1k
In Praise of "Normal" Engineers (LDX3)
charity
4
2.8k
In Praise of "Normal" Engineers (with full speaker notes)
charity
1
240
AIOps: Prove It! (An Open Letter to Vendors Selling AI for SREs)
charity
1
52
SRECon 2024 Keynote: Is It Already Time To Version Observability? (Signs Point To Yes)
charity
3
500
CTO Craft Con Keynote: Observability is due for a version change: are you ready for it?
charity
4
1.4k
Case Studies: Modern Development Practices In Highly Regulated Environments
charity
6
4.3k
Compliance & Regulatory Standards Are NOT Incompatible With Modern Development Best Practices
charity
7
6.2k
Perils, Pitfalls and Pratfalls of Platform Engineering (QCon NYC, 2023)
charity
1
420
Other Decks in Technology
See All in Technology
サイボウズ 開発本部採用ピッチ / Cybozu Engineer Recruit
cybozuinsideout
PRO
10
73k
変化するコーディングエージェントとの現実的な付き合い方 〜Cursor安定択説と、ツールに依存しない「資産」〜
empitsu
4
1.3k
MCPでつなぐElasticsearchとLLM - 深夜の障害対応を楽にしたい / Bridging Elasticsearch and LLMs with MCP
sashimimochi
0
130
30万人の同時アクセスに耐えたい!新サービスの盤石なリリースを支える負荷試験 / SRE Kaigi 2026
genda
1
150
Agile Leadership Summit Keynote 2026
m_seki
1
260
GitLab Duo Agent Platform × AGENTS.md で実現するSpec-Driven Development / GitLab Duo Agent Platform × AGENTS.md
n11sh1
0
110
Sansan Engineering Unit 紹介資料
sansan33
PRO
1
3.8k
SREのプラクティスを用いた3領域同時 マネジメントへの挑戦 〜SRE・情シス・セキュリティを統合した チーム運営術〜
coconala_engineer
2
550
日本語テキストと音楽の対照学習の技術とその応用
lycorptech_jp
PRO
1
410
All About Sansan – for New Global Engineers
sansan33
PRO
1
1.3k
顧客との商談議事録をみんなで読んで顧客解像度を上げよう
shibayu36
0
130
FinTech SREのAWSサービス活用/Leveraging AWS Services in FinTech SRE
maaaato
0
120
Featured
See All Featured
The Curse of the Amulet
leimatthew05
1
8.2k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
196
71k
A better future with KSS
kneath
240
18k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Evolving SEO for Evolving Search Engines
ryanjones
0
120
Testing 201, or: Great Expectations
jmmastey
46
8k
Measuring Dark Social's Impact On Conversion and Attribution
stephenakadiri
1
120
How STYLIGHT went responsive
nonsquared
100
6k
Paper Plane
katiecoart
PRO
0
46k
Why Mistakes Are the Best Teachers: Turning Failure into a Pathway for Growth
auna
0
50
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
Navigating the Design Leadership Dip - Product Design Week Design Leaders+ Conference 2024
apolaine
0
170
Transcript
Engineering Large Systems When You’re Not Google Or Facebook Some
Advice By Charity Majors
None
I blame this guy: Testing in production has gotten a
bad rap.
None
how they think we are how we really are
but *why*?
monitoring => observability known unknowns => unknown unknowns LAMP stack
=> distributed systems
“Complexity is increasing” - Science
Many catastrophic states exist at any given time. Your system
is never entirely ‘up’
We are all distributed systems engineers now the unknowns outstrip
the knowns why does this matter more and more?
Distributed systems are particularly hostile to being cloned or imitated
(or monitored). (clients, concurrency, chaotic traffic patterns, edge cases …)
Distributed systems have an infinitely long list of almost-impossible failure
scenarios that make staging environments particularly worthless. this is a black hole for engineering time
unit tests integration tests functional tests basic failover test before
prod: … the basics. the simple stuff. known-unknowns
behavioral tests experiments load tests (!!) edge cases canaries rolling
deploys multi-region test in prod: unknown-unknowns
test in staging? meh
unit tests integration tests functional tests “What happens when …”
(you know the answer) “What happens when …” (you don’t) behavioral tests experiments load tests (!!) edge cases canaries rolling deploys multi-region test before prod: test in prod:
Only production is production. You can ONLY verify the deploy
for any env by deploying to that env
1. Every deploy is a *unique* exercise of your process+
code+system 2. Deploy scripts are production code. If you’re using fabric or capistrano, this means you have fab/cap in production.
Staging is not production.
Why do people sink so much time into staging, when
they can’t even tell if their own production environment is healthy or not?
That energy is better used elsewhere: Production. You can catch
80% of the bugs with 20% of the effort. And you should. @caitie’s PWL talk: https://youtu.be/-3tw2MYYT0Q
feature flags (launch darkly) high cardinality tooling (honeycomb) canary canary
canaries, shadow systems (goturbine, linkerd) capture/replay for databases (apiary, percona) also build or use: plz dont build your own ffs
Failure is not rare Practice shipping and fixing lots of
small problems And practice on your users!!
Failure: it’s “when”, not “if” (lots and lots and lots
of “when’s”)
Does everyone … know what normal looks like? know how
to deploy? know how to roll back? know how to canary? know how to debug in production? Practice!!~
None
None
None
• Charity Majors @mipsytipsy