
Managing Millions of Data Services At Heroku

Over the years, Heroku Data's offerings have continued to grow and meet ever-higher demands with Postgres, Kafka and Redis. Performing repairs, maintenances, patches and audits across a fleet of millions creates some serious time constraints. We'll walk through the evolution of fleet orchestration, immutable infrastructure, security auditing and more to see how the data services for many Salesforce customers, start-ups and hobby developers alike are managed with as little human interaction as possible.

Gabe Enslein

June 28, 2017

Transcript

  1. February 28 from 17:37 UTC to March 1 00:18 UTC

    • AWS S3 service impact officially ended at 21:54 UTC • Other residual effects lasted an undisclosed amount of time • The EBS service fulfilling backlogged requests slowed resolution • AMIs were unavailable due to being stored in S3 • It took 5 additional hours to recover It could have been so much worse
  2. How can we avoid disasters? • Orchestration for recovering existing

    services • Immutable infrastructure when failure is not automatically recoverable ◦ CAVEAT: Failover strategies must be in place • Removing manual or script surgery as an option at scale
  3. Who is Gabe Enslein? Joined Heroku Data late 2016, Careerbuilder

    before that Ruby backend services, microservices architecture and DevOps I was on call during the S3 Incident Big xkcd fan
  4. Ephemeral services, real hardware Things to take note of •

    Layers of abstraction help simplify development • Simplifying the integration pipeline • Enabling robust deployment strategies • Separating concerns between features and operations
  5. Ephemeral services, real hardware Be wary of the truth Ultimately

    all software runs on hardware Abstractions can hide the true problems Mapping symptoms to root causes can take longer Reproducing failures can be difficult
  6. I’ll just™ do this operation... • How often does someone “just”™ do

    this operation? • How likely are they to make a mistake? • Is this going to wake someone up at night? • Is there a way to stop “just”™ doing the operation? • Will we need the operation again in the future?
  7. Automate yourself out of a job...but how? We can generate

    one-off queries We make scripts, reusable templates Configuration Management tools, schedulers, etc. What about real-time remediation?
  8. Stateful Services, State Machines Model the management after the objects

    • Finite State Machines ◦ Deterministic Finite State Machines (DFSM) ◦ Non-deterministic Finite State Machines (NDFSM)
  9. Why use Finite State Machines? Programmatic control of machines Easier

    to model operations for real Services Repeatable methods of modeling stateful components Integrated view of relationships
  10. Deterministic Finite State Machines Some Pros • Single direction of

    state change ◦ A given input can only return one target state • Can only change states after receiving input • Otherwise the state stays locked at its current value
  11. Deterministic Finite State Machines Some Cons State locks can cause

    a stale view of the state the object is in Single-direction transitions can make for long chains Repeated state definitions Multiple reasons the real service can be in a given state
  12. Nondeterministic Finite State Machines Upsides Can have multiple transitions from

    a single input Can transition without input (loops for days) Easier to implement retry logic due to bidirectional transitions
  13. Nondeterministic Finite State Machines Downsides The lack of assurance of

    state locks on input States can transition in less predictable ways State machines can feed input into each other
  14. Applying State Machines: Choosing NDFSM • Flexibility is key when

    dealing with rapidly changing infrastructure • Multiple ways to get into the same problems in the ecosystem • We can implement “optimistic” state locking ◦ More predictability in when transitions occur • We can control how states transition to each other
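The “optimistic” state locking and controlled transitions described above can be sketched as a small Ruby class. This is illustrative only, not Heroku's actual orchestration code; all class, state and input names here are made up:

```ruby
# Sketch of an NDFSM with optimistic locking: a version counter rejects
# stale writers, and transitions (including a retry loop back to an
# earlier state) are confined to an explicit table.
class ServiceStateMachine
  TRANSITIONS = {
    provisioning: { install_ok: :running, install_failed: :provisioning },
    running:      { health_check_failed: :degraded, retire: :retiring },
    degraded:     { health_check_ok: :running } # bidirectional: retry heals
  }.freeze

  attr_reader :state, :version

  def initialize(state: :provisioning)
    @state   = state
    @version = 0 # optimistic lock: bumped on every successful transition
  end

  # Apply an input observed at a given version; a stale observer is rejected
  # instead of clobbering a transition that already happened.
  def tick(input, expected_version:)
    raise "stale state: at v#{@version}, got v#{expected_version}" unless
      expected_version == @version

    target = TRANSITIONS.fetch(@state, {})[input]
    return false unless target # unrecognized input: stay in current state

    @state    = target
    @version += 1
    true
  end
end
```

A tick carrying an outdated version raises instead of silently overwriting, which gives the predictability the slide mentions without holding a hard pessimistic lock.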
  15. An Application of NDFSM: Data Services • Triggering installation of

    the service and monitoring the install ◦ Can include userdata, scripts, upstart, systemd, cron, etc. • Monitor Service health and availability • Check Service-controlled processes and resources on the Server • Transitions are triggered by inputs -> State “ticks” ◦ Ticks queued regularly across each SM to check changes in input (or lack of input)
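The point about ticks firing on a *lack* of input can be illustrated with a hypothetical heartbeat check; the timeout value and argument names are invented for the example:

```ruby
# Hypothetical: silence is itself an input. A tick that finds no recent
# heartbeat produces a :no_heartbeat input for the state machine to act on.
HEARTBEAT_TIMEOUT = 60 # seconds; illustrative threshold

def input_for_tick(last_heartbeat_at, now: Time.now)
  return :no_heartbeat if last_heartbeat_at.nil?

  (now - last_heartbeat_at) > HEARTBEAT_TIMEOUT ? :no_heartbeat : :heartbeat_ok
end
```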
  16. An Application of NDFSM: A Data Service • All data

    services are containerized • Assign each Service to a corresponding Server • The Server state machine represents the system-level state of the underlying OS • The Server can trigger state changes up to the Service and vice-versa
  17. An Application of NDFSM: Servers • The Server State machine

    represents system-level state of the underlying VM • Constantly monitors health of the base VM • Runs remediations against the system resources ◦ Disk space ◦ RAM usage ◦ etc.
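A remediation against system resources, like the disk-space case above, boils down to mapping a reading to an action for the Server state machine. A minimal sketch, with an invented threshold and action names:

```ruby
# Hypothetical remediation decision: compare reported disk usage against a
# threshold and pick the action the Server state machine should run.
DISK_USAGE_THRESHOLD = 0.90 # fraction full; illustrative value

def disk_remediation(used_bytes, total_bytes)
  usage = used_bytes.to_f / total_bytes
  usage >= DISK_USAGE_THRESHOLD ? :resize_disk : :no_action
end
```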
  18. An Application of NDFSM: Operational consistency • Running backup processes

    • High-Availability replication • Security Credential management • Service performance metric emissions • Many more individual service-type-specific operations
  19. An Application of NDFSM: Routine credential rotation • Average runtime

    of API credential rotation: ~2 minutes • Recall Feb. 28th: ~1.55M services (1.5M + 50K + 1K) • Rotations happen every 4 hours (6 times a day) • 2 minutes * 6 (per day) * ~1.55M services = 18,612,000 minutes = 310,200 hours = 12,925 days ≈ 35.4 YEARS saved
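The slide's back-of-the-envelope figure can be reproduced directly (12,925 days works out to roughly 35.4 years):

```ruby
# Reproducing the slide's arithmetic: operator time avoided by automating
# API credential rotation across the fleet.
services           = 1_500_000 + 50_000 + 1_000 # ~1.55M services
rotations_per_day  = 24 / 4                     # every 4 hours => 6 per day
minutes_per_rotate = 2

minutes_saved_per_day = minutes_per_rotate * rotations_per_day * services
hours = minutes_saved_per_day / 60 # => 310_200
days  = hours / 24                 # => 12_925
years = days / 365.0               # => ~35.4
```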
  20. An Application of NDFSM: Tools to make it possible Postgres

    to persist the NDFSMs and their states Redis for Sidekiq queues holding transition messages Ruby and Sinatra to serve the orchestration logic AWS EC2, S3 and EBS (which is also S3)
  21. An Application of NDFSM: Tools to make it possible PostgreSQL:

    Maintains active snapshots History of messages (“Ticks”) Metadata for each FSM History of FSM relations Redis/SK: Constant queuing for all FSMs Partitioned queues for FSM-specific “ticks” State locks for contentious operations
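The “state locks for contentious operations” kept in Redis can be sketched as a SETNX-style first-claimant-wins lock. A plain Hash stands in for Redis here, and the key/holder names are illustrative:

```ruby
# Sketch of a state lock: the first holder to claim a key wins, later
# claimants back off, and only the current holder may release. The Hash
# mimics Redis SETNX/DEL semantics.
class StateLock
  def initialize(store = {})
    @store = store
  end

  # True if acquired, false if another operation already holds the lock.
  def acquire(key, holder)
    return false if @store.key?(key)

    @store[key] = holder
    true
  end

  # Release only succeeds for the current holder.
  def release(key, holder)
    @store.delete(key) if @store[key] == holder
  end
end
```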
  22. An Application of NDFSM: More urgent Ops Servers handle maintaining

    storage disks Disks need resizing as part of normal customer usage Maintenances occur that require underlying VMs to be sunset Hardware failures trigger failovers
  23. Applying NDFSM to S3pocalypse: What went wrong Backup failures to

    us-east-1 S3 caused servers to fill disks faster than expected Some services experienced downtime from failed state changes Inability to acquire new disks kept new services from being provisioned
  24. Tested in the wild: Needing manual fixes Are you sure?

    Photo: “Fixing Problems” by Randall Munroe, https://xkcd.com/1739/, licensed under CC BY-NC 2.5
  25. Immutable Infrastructure: Stay your hands • Enforces knowledge of the

    application created at that time • Standardizes mechanisms for maintenance • Discourages just™ doing manual operations • Favors consistent configurations
  26. Immutable Infrastructure: Stay your hands • Favor consistency ◦ instance

    replacement instead of manual mitigation • Failover strategies for all infrastructure • Encourage seeing Infrastructure as Code • Tests: Unit, Integration and Performance
  27. S3pocalypse resolutions: Missed edge cases Some services and servers did

    not recover cleanly Some gotchas occurred that needed engineers live Some scripted fixes were needed Dependency loops were identified in S3 usage
  28. NDFSM to S3pocalypse: Recovering from the disaster Most services recovered

    without any interaction from the operators State machines similar to the Rotate Credentials example Services with automated remediation healed once S3 was available Confirmation that no data loss occurred And we were able to go to sleep
  29. Immutable Infrastructure: Lessons learned Need to keep “Break Glass” measures

    for such occasions More automation, including emergency remedies Increased testing of reliability cases
  30. March 15, 2017 2:39 PM UTC The system could be

    made to crash or run programs as an administrator.
  31. USN-3234-1 (CVE-2016-10229, CVE-2017-5551) Linux Kernel Vulnerability DoS and Admin escalation

    vulnerability Which images are running the vulnerable kernel? March 15, 2017 2:39 PM UTC
  32. Immutable Infrastructure: Security vulnerabilities • CVE-2016-10229, CVE-2017-5551, CVE-2017-2636, CVE-2017-7308...

    As fast as attackers can find and exploit them, how can we find and remove them in our fleet?
  33. Our case here Fleet contains many versions of Containers Servers

    have many iterations of AMIs Features may not be uniformly enabled for certain versions Live patching kernel vulnerabilities: large risk, small reward
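Auditing a fleet of many image versions for a vulnerability can be sketched as a filter over server metadata. The kernel versions and field names below are made up for illustration:

```ruby
# Hypothetical audit: flag servers whose kernel is not in the patched set,
# so their state machines can be driven toward replacement rather than
# live-patched in place.
PATCHED_KERNELS = ["4.4.0-70-generic"].freeze # illustrative version

def vulnerable?(server)
  !PATCHED_KERNELS.include?(server[:kernel])
end

def servers_to_replace(fleet)
  fleet.select { |server| vulnerable?(server) }
end
```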
  34. Immutable Infrastructure, as a NDFSM Container Images and Root Machine

    Images ◦ Services installed ◦ Security vulnerabilities that are patched ◦ New features available ◦ Bug fixes rolled out ◦ Reliability test results
  35. Great Success: Patching security holes Service State machine retirements Vulnerable

    infrastructure removed Bad images transitioned to the decommissioned state No services were interrupted
  36. Key Takeaways Automate yourself out of regular operations Have emergency

    automation in place (scripts, jobs, etc.) Make routine failover strategies Treat infrastructure as full units Abstractions have their limits