YOW2016

Architectural Patterns of Resilient Distributed Systems YOW 2016

Ines Sombra @Randommood

Globally distributed & highly available

Today’s Journey
1. Motivation: why Ines cares & you should too
2. Resilience in literature: foundational knowledge
3. Resilience in industry: what are others doing?
4. Conclusions: tie it all together

Resilience is the ability of a system to adapt or
keep working when challenges occur

Defining Resilience
Fault-tolerance
Evolvability
Scalability
Failure isolation
Complexity management

How can we construct more resilient systems?

It’s what really matters

The Team

3000 × 2000 px, 361 KB

Trim all edges by 25%: http://www.fastly.io/image.jpg?trim=0.25 (1000 × 667 px, 92 KB)
Crop the image square and resize the width to 200px: http://www.fastly.io/image.jpg?crop=1:1&width=200 (200 × 200 px, 9 KB)
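The transformation URLs above are plain query strings appended to the image URL. As a minimal sketch, assuming nothing beyond the three parameters shown on the slide (trim, crop, width), they could be assembled like this in Python. The helper name image_url is hypothetical and is not part of any Fastly API.

```python
from urllib.parse import urlencode


def image_url(base, **params):
    """Append image-transformation parameters to a base image URL.

    Hypothetical helper for illustration; only trim, crop, and width
    appear in the slide, and no validation is attempted here.
    """
    if not params:
        return base
    # Keep ':' unescaped so a ratio such as crop=1:1 stays readable.
    return f"{base}?{urlencode(params, safe=':')}"


BASE = "http://www.fastly.io/image.jpg"
trimmed = image_url(BASE, trim=0.25)
square_thumb = image_url(BASE, crop="1:1", width=200)
print(trimmed)
print(square_thumb)
```

Keeping the pristine image URL separate from its parameters also makes it easy to treat every variant as a derivative of one origin object.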

ImageOpto 101
[Diagram: Origin, CDN, and several Image Opto nodes]

Resilience in Literature

Harvest & Yield Model

Yield
Fraction of successfully answered queries
Focus on yield rather than uptime (think Amazon during Xmas)

Harvest
Fraction of the complete result
[Diagram, from Coda Hale’s “You can’t sacrifice partition tolerance”: Server A, Server B, Server C; Cute, Baby, Animals]

All servers up: 100% harvest

One server down (X): 66% harvest

Two servers down (X X): 33% harvest
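The two measures can be made concrete with a small sketch. It assumes three servers that each hold the results for one search term; the server names and the term-to-server mapping are illustrative assumptions, not taken from the slide. Harvest is computed per query from the shards that are reachable, and yield across queries from how many were answered at all.

```python
# Hypothetical mapping of search terms to servers (an assumption for illustration).
SHARDS = {"server_a": "cute", "server_b": "baby", "server_c": "animals"}


def search(down=frozenset()):
    """Answer from whichever shards are reachable.

    Returns (partial result, harvest), where harvest is the fraction of
    the complete result that could be served.
    """
    found = [term for server, term in SHARDS.items() if server not in down]
    return found, len(found) / len(SHARDS)


def yield_fraction(answered, offered):
    """Yield: fraction of offered queries that received an answer."""
    return answered / offered if offered else 1.0


for down in (set(), {"server_a"}, {"server_a", "server_b"}):
    found, harvest = search(down)
    print(f"down={sorted(down)} result={found} harvest={harvest:.2f}")
```

Note that every query in the loop still gets an answer, so yield stays at 100% while harvest drops: the system trades completeness for availability.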

#1: Probabilistic Availability
Randomness to make the worst-case & average-case the same
Replication of high-priority data for greater harvest control
Degrading results based on client capability
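A rough sketch of the first two points, under stated assumptions: ordinary records are placed on a randomly chosen node, so losing any one node costs a random slice of the data, while records flagged as high priority are copied to every node. Replicating to every node is a simplification chosen for this example, and the function and record names are hypothetical.

```python
import random


def place(records, nodes, rng):
    """Assign records to nodes.

    records maps a key to True when it is high priority. High-priority
    keys are copied to every node (a simplifying assumption); all other
    keys land on one randomly chosen node.
    """
    placement = {node: set() for node in nodes}
    for key, high_priority in records.items():
        if high_priority:
            for node in nodes:
                placement[node].add(key)
        else:
            placement[rng.choice(nodes)].add(key)
    return placement


def surviving(placement, failed):
    """Keys still reachable when the nodes in `failed` are down."""
    keys = set()
    for node, held in placement.items():
        if node not in failed:
            keys |= held
    return keys


rng = random.Random(7)  # seeded only to make the example repeatable
nodes = ["n1", "n2", "n3"]
records = {"homepage": True, "img-1": False, "img-2": False, "img-3": False}
placement = place(records, nodes, rng)
print(sorted(surviving(placement, failed={"n2"})))
```

Whichever node fails, the high-priority key remains in the harvest; only the randomly placed keys are at risk.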

#2: Decomposition & Orthogonality
Break into subsystems
Only provide strong consistency for the subsystems that need it
Use orthogonal mechanisms

“If your system favors yield or harvest is an outcome of its design” ~ Fox & Brewer

Harvest & Yield applied
ImageOpto favors harvest
Consistent hashing based on pristine image
Replication to secondary nodes
Orthogonality on the CDN side
[Diagram: Origin, CDN, IO marked X]
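As an illustration of consistent hashing keyed on the pristine image with a secondary replica, here is a minimal sketch. It is not ImageOpto's implementation: the ring layout, the virtual-node count, the node names, and the choice to derive the key by stripping the query string are all assumptions made for the example.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def pristine_key(url):
    """Drop the transformation parameters so all variants share one key."""
    return urlsplit(url)._replace(query="").geturl()


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def nodes_for(self, key, count=2):
        """Return up to `count` distinct nodes: primary first, then secondaries."""
        start = bisect.bisect(self._points, self._hash(key))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == count:
                    break
        return chosen


ring = HashRing(["io-1", "io-2", "io-3"])
a = "http://www.fastly.io/image.jpg?trim=0.25"
b = "http://www.fastly.io/image.jpg?crop=1:1&width=200"
print(ring.nodes_for(pristine_key(a)))
print(ring.nodes_for(pristine_key(b)))
```

Because both URLs reduce to the same pristine key, they map to the same primary and secondary node, so a node that already holds the original image can serve every variant, and the secondary can take over if the primary is lost.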

Cook & Rasmussen model

Cook & Rasmussen
[Diagram: the operating point sits inside three boundaries: the economic failure boundary, the unacceptable workload boundary, and the accident boundary. Other labels: pressure towards efficiency, reduction of effort, marginal boundary, error margin, safety campaign, Incident!]

Flirting with the margin
[Diagram, R.I. Cook, 2004: original marginal boundary, new marginal boundary!, error margin, acceptable operating point, accident boundary]

Insights from Cook’s model
Engineering resilience requires a model of safety based on: mentoring, responding, adapting, and learning
System safety is about what can happen, where the operating point actually is, and what we do under pressure

Engineering system resilience
Build support for continuous maintenance
Resilience is operator community focused
Know it’s going to get moved, replaced, and used in ways you did not intend

Cook & Rasmussen applied
Unexpected use-cases
Acceptable workload boundary influenced a redesign
Use response to incidents as educational opportunities
[Diagram: Origin, CDN, IO]

Borrill's model

A system’s complexity
[Chart: probability of failure vs. rank, split into three areas: classical engineering, reactive ops, unk-unk]
Unk-unk: cascading or catastrophic failures & you don’t know where they will come from! Same area as the other 2 combined

Failure areas need different strategies
[Chart: probability of failure vs. rank, split into three areas: classical engineering, reactive ops, unk-unk]

“Thinking about building system resilience using a single discipline is insufficient. We need different strategies” ~ Borrill

Strategies to build resilience
Classical engineering: code standards, programming patterns, full system testing, metrics & monitoring, convergence to good state, hazard inventories
Reactive operations: redundancies, feature flags, dark deploys, runbooks & docs, canaries
Unknown-unknown: system verification, formal methods, fault injection

Resilience in Industry

Key insights from Chubby
Library vs service? Service and client library control + storage of small data files with restricted operations
Engineers don’t plan for: availability, consensus, primary elections, failures, their own bugs, operability, or the future. They also don’t understand distributed systems

Key insights from Chubby
Centralized services are hard to construct, but you can dedicate effort to architecting them well and making them failure-tolerant
Restricting user behavior increased resilience
Consumers of your service are part of your unk-unk scenarios

ImageOpto insights
Dependencies are hard: customer setup, customer inputs, caching layer, & libraries. We have to be resilient to all of them
Unk-unks also lie in hidden dependencies (reduce as many of them as possible)

“Ship something out earlier with a limited API. Continuously invest in design of functionality and operability” ~ Me today

In design
What compromises does your system make as things go bad?
Resilient systems are designed for high yield & variable harvest

Operations matter
Unawareness of proximity to error boundary means we are always guessing
Complex operations make systems less resilient & more incident-prone
You design operability too!

Not all complexity is bad
Adding resilience may come at the cost of other desired goals (e.g. time, performance, simplicity, cost, etc.)
Redundancies help

tl;dr
IN DESIGN: Are we favoring harvest or yield? Are we resilient to our dependencies? Use orthogonality & decomposition. Theory matters!
OPERABILITY: Am I providing enough control to my operators? Operators impact resilience. Narrowing your API helps.
UNK-UNK: The existence of this stresses diligence on the other two areas. The goal is to build failure domain independence.

github.com/Randommood/YOW2016 ~ THANK YOU ~
