Disaster Recovery: A Process, Not a Tool
Richard Yen
June 09, 2026
As presented at PGDay Boston 2026
Transcript
Disaster Recovery: A Process, Not a Tool
June 9, 2026
Richard Yen
The Changed Landscape
• 99.99% uptime
• p95, p99 metrics
• Status pages
• Social media affects reputation
Agenda
1. Where We Are
2. Where We Need to Be
3. How We’ll Get There
4. Some Stories Along the Way
Where We Are
A disaster is any sustained event that compromises the system’s availability, correctness, or business trust
How DR is Usually Done
1. Prepare
2. Prevent
“An ounce of prevention is worth a pound of cure”
Disaster Recovery is the act of restoring business operations
Where We Need to Be
Postgres Makes Recovery Easy
• pg_dump/pg_restore
• pg_basebackup
• pg_stat_replication
• pg_stat_activity
• Point-in-Time Recovery
• repmgr/efm
• Third-party backup tools
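One way to put pg_stat_replication to work is to turn two WAL positions into a lag figure. The sketch below is an illustration, not something from the talk: it assumes you have already read the primary's current LSN (for example from pg_current_wal_lsn()) and a standby's replay_lsn from pg_stat_replication, and it only does the arithmetic. The function names and the sample LSN values are hypothetical.

```python
"""Sketch: estimate replication lag in bytes from two textual WAL LSNs."""


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' into an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(primary_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the standby has not replayed yet (never negative)."""
    return max(0, lsn_to_int(primary_lsn) - lsn_to_int(replay_lsn))


# Hypothetical sample values, as they might be read from the primary.
print(replication_lag_bytes("0/3000060", "0/3000000"))  # 96
```

Inside the database, pg_wal_lsn_diff() does the same subtraction; the Python version is only useful when the two values are collected by an external monitoring script.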
RPO & RTO
It’s going to cost you
Talk to your leadership, and you’ll discover how much it’s really worth to them
RPO
1. 24-hour RPO: $
2. 15-minute RPO: $$
3. Near-zero RPO: $$$
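To make the RPO tiers concrete, here is a small worked example. It assumes RPO (Recovery Point Objective) is read as the maximum tolerable window of lost data, and it compares that window against the gap between the last recoverable point and the moment of failure. The tier table, function names, and timestamps are illustrative assumptions, not figures from the talk.

```python
"""Sketch: check a recovery point against an RPO target."""
from datetime import datetime, timedelta

# The two tiers from the slide that have a concrete duration.
RPO_TIERS = {
    "24-hour": timedelta(hours=24),
    "15-minute": timedelta(minutes=15),
}


def data_loss_window(last_recoverable: datetime, failure_time: datetime) -> timedelta:
    """Time span of changes that cannot be recovered after the failure."""
    return max(failure_time - last_recoverable, timedelta(0))


def meets_rpo(last_recoverable: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True when the unrecoverable window fits inside the RPO target."""
    return data_loss_window(last_recoverable, failure_time) <= rpo


# Hypothetical scenario: a nightly dump at 02:00 and a failure at 14:30.
last_dump = datetime(2026, 6, 9, 2, 0)
failure = datetime(2026, 6, 9, 14, 30)
print(data_loss_window(last_dump, failure))                  # 12:30:00
print(meets_rpo(last_dump, failure, RPO_TIERS["24-hour"]))   # True
print(meets_rpo(last_dump, failure, RPO_TIERS["15-minute"])) # False
```

A nightly dump alone satisfies the cheapest tier only. Moving down the list typically means shipping WAL continuously or replicating synchronously, which is where the extra cost on the slide comes from.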
RTO is your team’s ability to execute the DR plan
How We’ll Get There
3 Layers of DR Planning
1. Infrastructure failure
2. Procedural failure
3. Human failure
Recovery is not always about failing over
Runbook Engineering Should Assume
1. Stress
2. Chaos
3. Confusion
4. Exhaustion
5. Ambiguity
Runbook Engineering: Anti-patterns
1. Wiki pages
2. Stale documents
3. Unclear owner
4. Vague instructions
Runbook Engineering: Non-Technical Essentials
1. Incident Commander
2. Communications Owner
3. Notification Cadence
4. Escalation Chain
5. Risk Authorization
Runbook Validation
1. Can a new engineer follow it?
2. Does it assume access?
3. Are commands and names current?
4. Does it get regular playtime?
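The question "Are commands and names current?" can be partly automated. The sketch below is one possible approach, not something the talk prescribes: it assumes a hypothetical runbook convention in which every command line starts with "$ ", extracts the program name from each such line, and reports any that cannot be found on the PATH of the machine where the check runs.

```python
"""Sketch: flag runbook commands that are not installed on this host."""
import shutil
from typing import Callable, Iterable, List


def extract_commands(runbook_lines: Iterable[str]) -> List[str]:
    """Return the program name from every line that starts with '$ '.

    The '$ ' prefix is an assumed runbook convention; duplicates are
    dropped and first-seen order is kept.
    """
    commands: List[str] = []
    for line in runbook_lines:
        stripped = line.strip()
        if stripped.startswith("$ "):
            parts = stripped[2:].split()
            if parts and parts[0] not in commands:
                commands.append(parts[0])
    return commands


def on_path(command: str) -> bool:
    """True when the command resolves to an executable on PATH."""
    return shutil.which(command) is not None


def missing_commands(
    runbook_lines: Iterable[str],
    exists: Callable[[str], bool] = on_path,
) -> List[str]:
    """Commands named in the runbook that the `exists` check rejects."""
    return [cmd for cmd in extract_commands(runbook_lines) if not exists(cmd)]


# Hypothetical runbook excerpt.
runbook = [
    "Step 1: take a fresh base backup",
    "$ pg_basebackup -D /path/to/backup",
    "Step 2: confirm the standby is streaming",
    "$ psql -c 'SELECT * FROM pg_stat_replication'",
]
print(missing_commands(runbook))
```

This only catches missing binaries. Renamed hosts, rotated credentials, and changed paths still need a human walking through the runbook, which is the point of giving it regular playtime.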
Runbook Validation: Level Up Your Ability
1. Prove that your runbook works
2. Reduce the time it takes to complete
3. Simulate failure
4. Test with unavailable human resources
This is how you reduce RTO
Validation Metrics
1. Did recovery succeed?
2. How long did each section take?
3. What vagueness needs to be clarified?
4. Identify documentation gaps
5. Be encouraging! Go out for dinner!
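The second metric, how long each section took, is easy to capture during a drill. The following is a minimal sketch under stated assumptions: the DrillTimer class and the section names are hypothetical, and the clock is injectable so the timer can be exercised without waiting for a real recovery.

```python
"""Sketch: record how long each runbook section takes during a DR drill."""
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class DrillTimer:
    """Accumulates elapsed seconds per named runbook section."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def section(self, name: str) -> Iterator[None]:
        """Time the enclosed block, even if it raises."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self) -> float:
        """Total recorded time across all sections, in seconds."""
        return sum(self.durations.values())

    def slowest(self) -> str:
        """Name of the section that consumed the most time."""
        return max(self.durations, key=lambda name: self.durations[name])


# Usage with the real clock; the section bodies are placeholders.
timer = DrillTimer()
with timer.section("declare incident"):
    pass
with timer.section("restore backup"):
    pass
print(sorted(timer.durations))  # ['declare incident', 'restore backup']
```

Comparing the per-section numbers from one drill to the next shows where the runbook is slow or vague, which ties the metrics back to reducing RTO.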
Don’t Blame, or You’ll Feel Lame
1. Communication is key
2. People hide when they feel shame
3. When people don’t feel safe to ask, they guess
4. Guessing hurts your RTO
Make your RPO worth it by investing in your RTO
© Copyright Microsoft Corporation. All rights reserved.