Disaster Recovery: A Process, Not a Tool
Richard Yen
June 09, 2026
As presented at PGDay Boston 2026
Transcript
Disaster Recovery: A Process, Not a Tool
June 9, 2026
Richard Yen
The Changed Landscape
• 99.99% Uptime
• p95, p99 metrics
• Status pages
• Social Media affects reputation
Agenda
1. Where We Are
2. Where We Need to Be
3. How We’ll Get There
4. Some Stories Along the Way
Where We Are
A disaster is any sustained event that compromises the system’s availability, correctness, or business trust.
How DR is Usually Done
1. Prepare
2. Prevent
“An ounce of prevention is worth a pound of cure”
Disaster Recovery is the act of restoring business operations
Where We Need to Be
Postgres Makes Recovery Easy
• pg_dump/pg_restore
• pg_basebackup
• pg_stat_replication
• pg_stat_activity
• Point-In-Time Recovery
• repmgr/efm
• Third-party backup tools
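As an illustration of how pg_stat_replication can feed a recovery decision, here is a minimal sketch that turns two WAL positions into a byte lag. It assumes only the documented pg_lsn text format (two hexadecimal numbers separated by a slash, high 32 bits first) and the view's sent_lsn and replay_lsn columns. The sample row and the helper names are made up for the example and are not from the talk.

```python
# Sketch: how far a standby is behind, using values shaped like a
# pg_stat_replication row. The row below is invented sample data.

def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '0/3000148' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replay_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent to the standby but not yet replayed there."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


# Hypothetical row; a real one would come from SELECT * FROM pg_stat_replication.
sample_row = {
    "application_name": "standby1",
    "sent_lsn": "0/3000148",
    "replay_lsn": "0/3000060",
}

lag = replay_lag_bytes(sample_row["sent_lsn"], sample_row["replay_lsn"])
print(f"{sample_row['application_name']} is {lag} bytes behind")
```

On the server itself, the built-in pg_wal_lsn_diff function does the same subtraction; the Python version is only useful when the numbers are already in a monitoring script.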
RPO & RTO
It’s going to cost you.
Talk to your leadership, and you’ll discover how much it’s really worth to them.
RPO
1. 24-hour RPO -- $
2. 15-minute RPO -- $$
3. Near-zero RPO -- $$$
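One way to make the tiers above concrete is to compare the age of the last recoverable point with each RPO. The sketch below is an assumption-laden illustration, not something shown in the talk: the tier names mirror the slide, the 5-second value for "near-zero" is an arbitrary stand-in, and the timestamps are invented.

```python
from datetime import datetime, timedelta

# Hypothetical tiers mirroring the slide. The "near-zero" duration is an
# assumption chosen only so the comparison has a number to work with.
RPO_TIERS = {
    "24-hour": timedelta(hours=24),
    "15-minute": timedelta(minutes=15),
    "near-zero": timedelta(seconds=5),
}


def meets_rpo(last_recoverable_point: datetime,
              failure_time: datetime,
              rpo: timedelta) -> bool:
    """True if everything written after the last recoverable point fits inside the RPO."""
    return (failure_time - last_recoverable_point) <= rpo


# Invented scenario: a nightly dump finished ten hours before the failure.
failure = datetime(2026, 6, 9, 12, 0, 0)
last_backup = datetime(2026, 6, 9, 2, 0, 0)

results = {name: meets_rpo(last_backup, failure, rpo) for name, rpo in RPO_TIERS.items()}
for name, ok in results.items():
    print(f"{name} RPO met: {ok}")
```

In this scenario a nightly dump satisfies only the 24-hour tier. The tighter tiers need a more recent recoverable point, which is where the cost on the slide comes from.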
RTO is your team’s ability to execute the DR plan
How We’ll Get There
3 Layers of DR Planning
1. Infrastructure failure
2. Procedural failure
3. Human failure
Recovery is not always about failing over
Runbook Engineering Should Assume
1. Stress
2. Chaos
3. Confusion
4. Exhaustion
5. Ambiguity
Runbook Engineering: Anti-patterns
1. Wiki Pages
2. Stale documents
3. Unclear owner
4. Vague instructions
Runbook Engineering: Non-Technical Essentials
1. Incident Commander
2. Communications Owner
3. Notification Cadence
4. Escalation Chain
5. Risk Authorization
Runbook Validation
1. Can a new engineer follow it?
2. Does it assume access?
3. Are commands and names current?
4. Does it get regular playtime?
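Some of these checks can be automated as a first pass. The sketch below assumes a runbook carries a little metadata (an owner and the date of its last drill) and flags the gaps. The Runbook fields, the 90-day threshold, and the example entries are all hypothetical; the talk does not prescribe a format.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Runbook:
    # Hypothetical metadata; adapt to however your runbooks are stored.
    title: str
    owner: Optional[str]
    last_drill: Optional[date]


def validation_problems(rb: Runbook, today: date,
                        max_age: timedelta = timedelta(days=90)) -> List[str]:
    """Return the reasons a runbook should not be trusted as-is."""
    problems = []
    if not rb.owner:
        problems.append("no owner")
    if rb.last_drill is None:
        problems.append("never drilled")
    elif today - rb.last_drill > max_age:
        problems.append("drill is stale")
    return problems


today = date(2026, 6, 9)
runbooks = [
    Runbook("Promote standby", "dba-oncall", date(2026, 5, 1)),
    Runbook("Point-in-time restore", None, date(2025, 1, 15)),
    Runbook("Rebuild replica", "platform-team", None),
]
for rb in runbooks:
    print(rb.title, "->", validation_problems(rb, today) or "ok")
```

A check like this only covers ownership and playtime. Whether a new engineer can follow the steps still has to be tested by having one try.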
Runbook Validation: Level Up Your Ability
1. Prove that your Runbook works
2. Reduce the time it takes to complete
3. Simulate failure
4. Test with unavailable human resources
This is how you reduce RTO
Validation Metrics
1. Did recovery succeed?
2. How long did each section take?
3. What vagueness needs to be clarified?
4. Identify documentation gaps
5. Be Encouraging! Go out for dinner!
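The "how long did each section take?" metric is easy to capture during a drill. What follows is a minimal sketch of a per-section timer, assuming the drill is driven from a script. The DrillLog class and the section names are hypothetical, and the demo uses a fake clock so the numbers are repeatable.

```python
import time
from contextlib import contextmanager


class DrillLog:
    """Records how long each runbook section took during a drill."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.sections = []  # list of (section name, seconds elapsed)

    @contextmanager
    def section(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an error.
            self.sections.append((name, self.clock() - start))

    def total_seconds(self):
        return sum(seconds for _, seconds in self.sections)

    def slowest(self):
        return max(self.sections, key=lambda item: item[1])


# Demo with a fake clock; in a real drill, omit the clock argument.
ticks = iter([0.0, 120.0, 120.0, 900.0])
log = DrillLog(clock=lambda: next(ticks))

with log.section("restore base backup"):
    pass  # the actual restore commands would run here
with log.section("replay WAL"):
    pass

print("measured recovery time:", log.total_seconds(), "seconds")
print("slowest section:", log.slowest())
```

The total is a measured recovery time for that drill, and the slowest section shows where the next round of runbook work is likely to pay off.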
Don’t Blame, or You’ll Feel Lame
1. Communication is Key
2. People hide when they feel shame
3. When people don’t feel safe to ask, they guess
4. Guessing hurts your RTO
Make your RPO worth it by investing in your RTO
© Copyright Microsoft Corporation. All rights reserved.