
Managing Millions of Data Services At Heroku

Over the years, Heroku Data's offerings have continued to grow and meet ever-higher demands with Postgres, Kafka and Redis. Performing repairs, maintenances, patches and audits across a fleet of millions creates some serious time constraints. We'll walk through the evolution of fleet orchestration, immutable infrastructure, security auditing and more to see how the data services for many Salesforce customers, start-ups and hobby developers alike are managed with as little human interaction as possible.

Gabe Enslein

June 28, 2017

Transcript

  1. February 28 from 17:37 UTC to March 1 00:18 UTC

    • AWS S3 service impact officially ended at 21:54 UTC • Other residual effects lasted an undisclosed amount of time • The EBS service fulfilling backlogged requests slowed resolution • AMIs were unavailable due to being stored in S3 • It took 5 additional hours to recover It could have been so much worse
  2. How can we avoid disasters? • Orchestration for recovering existing

    services • Immutable infrastructure when failure is not automatically recoverable ◦ CAVEAT: Failover strategies must be in place • Removing manual or script surgery as an option at scale
  3. Who is Gabe Enslein? Joined Heroku Data late 2016, Careerbuilder

    before that Ruby backend services, microservices architecture and DevOps I was on call during the S3 Incident Big xkcd fan
  4. Ephemeral services, real hardware Things to take note of •

    Layers of abstraction help simplify development • Simplifying the integration pipeline • Enabling robust deployment strategies • Separating concerns between features and operations
  5. Ephemeral services, real hardware Be wary of the truth Ultimately

    all software runs on hardware Abstractions can hide the true problems Mapping symptoms to root causes can take longer Reproducing failures can be difficult
  6. I’ll just™ do this operation... • How often does someone “just”™ do

    this operation? • How likely are they to make a mistake? • Is this going to wake someone up at night? • Is there a way to stop “just”™ doing the operation? • Will we need the operation again in the future?
  7. Automate yourself out of a job...but how? We can generate

    one-off queries We make scripts, reusable templates Configuration Management tools, schedulers, etc. What about real-time remediation?
  8. Stateful Services, State Machines Model the management after the objects

    • Finite State Machines ◦ Deterministic Finite State Machines (DFSM) ◦ Non-deterministic Finite State Machines (NDFSM)
  9. Why use Finite State Machines? Programmatic control of machines Easier

    to model operations for real Services Repeatable methods of modeling stateful components Integrated view of relationships
  10. Deterministic Finite State Machines Some Pros • Single direction of

    state change ◦ A given input can only return one target state • Can only change states after receiving input • Otherwise the state stays locked at its current value
  11. Deterministic Finite State Machines Some Cons State locks can cause

    a stale view of the state the object is in Single-direction transitions can make for long chains Repeated state definitions Multiple reasons the real service can be in a given state
  12. Nondeterministic Finite State Machines Upsides Can have multiple transitions from

    a single input Can transition without input (loops for days) Easier to implement retry logic due to bidirectional transitions
  13. Nondeterministic Finite State Machines Downsides The lack of assurance of

    state locks on input States can transition in less predictable ways State machines can feed input into each other
  14. Applying State Machines: Choosing NDFSM • Flexibility is key when

    dealing with rapidly changing infrastructure • Multiple ways to get into the same problems in the ecosystem • We can implement “optimistic” state locking ◦ More predictability in when transitions occur • We can control how states transition to each other
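The “optimistic” state locking and controlled transitions described above can be sketched as a small Ruby class. This is illustrative only, not Heroku's actual orchestration code; all class, state and input names here are made up:

```ruby
# Sketch of an NDFSM with optimistic locking: a version counter rejects
# stale writers, and transitions (including a retry loop back to an
# earlier state) are confined to an explicit table.
class ServiceStateMachine
  TRANSITIONS = {
    provisioning: { install_ok: :running, install_failed: :provisioning },
    running:      { health_check_failed: :degraded, retire: :retiring },
    degraded:     { health_check_ok: :running } # bidirectional: retry heals
  }.freeze

  attr_reader :state, :version

  def initialize(state: :provisioning)
    @state   = state
    @version = 0 # optimistic lock: bumped on every successful transition
  end

  # Apply an input observed at a given version; a stale observer is rejected
  # instead of clobbering a transition that already happened.
  def tick(input, expected_version:)
    raise "stale state: at v#{@version}, got v#{expected_version}" unless
      expected_version == @version

    target = TRANSITIONS.fetch(@state, {})[input]
    return false unless target # unrecognized input: stay in current state

    @state    = target
    @version += 1
    true
  end
end
```

A tick carrying an outdated version raises instead of silently overwriting, which gives the predictability the slide mentions without holding a hard pessimistic lock.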
  15. An Application of NDFSM: Data Services • Triggering installation of

    the service and monitoring the install ◦ Can include userdata, scripts, upstart, systemd, cron, etc. • Monitor Service health and availability • Check Service-controlled processes and resources on the Server • Transitions are triggered by inputs -> State “ticks” ◦ Ticks queued regularly across each SM to check changes in input (or lack of input)
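The point about ticks firing on a *lack* of input can be illustrated with a hypothetical heartbeat check; the timeout value and argument names are invented for the example:

```ruby
# Hypothetical: silence is itself an input. A tick that finds no recent
# heartbeat produces a :no_heartbeat input for the state machine to act on.
HEARTBEAT_TIMEOUT = 60 # seconds; illustrative threshold

def input_for_tick(last_heartbeat_at, now: Time.now)
  return :no_heartbeat if last_heartbeat_at.nil?

  (now - last_heartbeat_at) > HEARTBEAT_TIMEOUT ? :no_heartbeat : :heartbeat_ok
end
```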
  16. An Application of NDFSM: A Data Service • All data

    services are containerized • Assign each Service to a corresponding Server • The Server state machine represents the system-level state of the underlying OS • The Server can trigger state changes up to the Service and vice-versa
  17. An Application of NDFSM: Servers • The Server State machine

    represents system-level state of the underlying VM • Constantly monitors health of the base VM • Runs remediations against the system resources ◦ Disk space ◦ RAM usage ◦ etc.
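A remediation against system resources, like the disk-space case above, boils down to mapping a reading to an action for the Server state machine. A minimal sketch, with an invented threshold and action names:

```ruby
# Hypothetical remediation decision: compare reported disk usage against a
# threshold and pick the action the Server state machine should run.
DISK_USAGE_THRESHOLD = 0.90 # fraction full; illustrative value

def disk_remediation(used_bytes, total_bytes)
  usage = used_bytes.to_f / total_bytes
  usage >= DISK_USAGE_THRESHOLD ? :resize_disk : :no_action
end
```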
  18. An Application of NDFSM: Operational consistency • Running backup processes

    • High-Availability replication • Security Credential management • Service performance metric emissions • Many more individual service-type-specific operations
  19. An Application of NDFSM: Routine credential rotation • Average runtime

    of API credential rotation: ~2 minutes • Recall Feb. 28th: ~1.55M services (1.5M + 50K + 1K) • Rotations happen every 4 hours (6 times a day) • 2 minutes * 6 (per day) * ~1.55M services = 18,612,000 minutes = 310,200 hours = 12,925 days ≈ 35.4 YEARS saved
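The slide's back-of-the-envelope figure can be reproduced directly (12,925 days works out to roughly 35.4 years):

```ruby
# Reproducing the slide's arithmetic: operator time avoided by automating
# API credential rotation across the fleet.
services           = 1_500_000 + 50_000 + 1_000 # ~1.55M services
rotations_per_day  = 24 / 4                     # every 4 hours => 6 per day
minutes_per_rotate = 2

minutes_saved_per_day = minutes_per_rotate * rotations_per_day * services
hours = minutes_saved_per_day / 60 # => 310_200
days  = hours / 24                 # => 12_925
years = days / 365.0               # => ~35.4
```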
  20. An Application of NDFSM: Tools to make it possible Postgres

    to persist the NDFSMs and their states Redis for Sidekiq queues holding transition messages Ruby and Sinatra to serve the orchestration logic AWS EC2, S3 and EBS (which is also S3)
  21. An Application of NDFSM: Tools to make it possible PostgreSQL:

    Maintains active snapshots History of messages (“Ticks”) Metadata for each FSM History of FSM relations Redis/SK: Constant queuing for all FSMs Partitioned queues for FSM-specific “ticks” State locks for contentious operations
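The “state locks for contentious operations” kept in Redis can be sketched as a SETNX-style first-claimant-wins lock. A plain Hash stands in for Redis here, and the key/holder names are illustrative:

```ruby
# Sketch of a state lock: the first holder to claim a key wins, later
# claimants back off, and only the current holder may release. The Hash
# mimics Redis SETNX/DEL semantics.
class StateLock
  def initialize(store = {})
    @store = store
  end

  # True if acquired, false if another operation already holds the lock.
  def acquire(key, holder)
    return false if @store.key?(key)

    @store[key] = holder
    true
  end

  # Release only succeeds for the current holder.
  def release(key, holder)
    @store.delete(key) if @store[key] == holder
  end
end
```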
  22. An Application of NDFSM: More urgent Ops Servers handle maintaining

    storage disks Disks need resizing as part of normal customer usage Maintenances occur that require underlying VMs to be sunset Hardware failures trigger failovers
  23. Applying NDFSM to S3pocalypse: What went wrong Backup failures to

    us-east-1 S3 caused servers to fill disks faster than expected Some services experienced downtime from failed state changes Inability to acquire new disks kept new services from being provisioned
  24. Tested in the wild: Needing manual fixes Are you sure?

    Photo: “Fixing Problems” by Randall Munroe, https://xkcd.com/1739/, licensed under CC BY-NC 2.5
  25. Immutable Infrastructure: Stay your hands • Enforces knowledge of the

    application created at that time • Standardizes mechanisms for maintenance • Discourages just™ doing manual operations • Favors consistent configurations
  26. Immutable Infrastructure: Stay your hands • Favor consistency ◦ instance

    replacement instead of manual mitigation • Failover strategies for all infrastructure • Encourage seeing Infrastructure as Code • Tests: Unit, Integration and Performance
  27. S3pocalypse resolutions: Missed edge cases Some services and servers did

    not recover cleanly Some gotchas occurred that needed engineers live Some scripted fixes were needed Dependency loops were identified in S3 usage
  28. NDFSM to S3pocalypse: Recovering from the disaster Most services recovered

    without any interaction from the operators State machines similar to the Rotate Credentials example Services with automated remediation healed once S3 was available Confirmation that no data loss occurred And we were able to go to sleep
  29. Immutable Infrastructure: Lessons learned Need to keep “Break Glass” measures

    for such occasions More automation, including emergency remedies Increased testing of reliability cases
  30. March 15, 2017 2:39 PM UTC The system could be

    made to crash or run programs as an administrator.
  31. USN-3234-1 (CVE-2016-10229, CVE-2017-5551) Linux Kernel Vulnerability DoS and Admin escalation

    vulnerability Which images are running the vulnerable kernel? March 15, 2017 2:39 PM UTC
  32. Immutable Infrastructure: Security vulnerabilities • CVE-2016-10229, CVE-2017-5551, CVE-2017-2636, CVE-2017-7308...

    As fast as attackers can find and exploit them, how can we find and remove them in our fleet?
  33. Our case here Fleet contains many versions of Containers Servers

    have many iterations of AMIs Features may not be uniformly enabled for certain versions Live patching kernel vulnerabilities: large risk, small reward
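Auditing a fleet of many image versions for a vulnerability can be sketched as a filter over server metadata. The kernel versions and field names below are made up for illustration:

```ruby
# Hypothetical audit: flag servers whose kernel is not in the patched set,
# so their state machines can be driven toward replacement rather than
# live-patched in place.
PATCHED_KERNELS = ["4.4.0-70-generic"].freeze # illustrative version

def vulnerable?(server)
  !PATCHED_KERNELS.include?(server[:kernel])
end

def servers_to_replace(fleet)
  fleet.select { |server| vulnerable?(server) }
end
```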
  34. Immutable Infrastructure, as a NDFSM Container Images and Root Machine

    Images ◦ Services installed ◦ Security vulnerabilities that are patched ◦ New features available ◦ Bug fixes rolled out ◦ Reliability test results
  35. Great Success: Patching security holes Service State machine retirements Vulnerable

    infrastructure removed Bad images transitioned to the decommissioned state No services were interrupted
  36. Key Takeaways Automate yourself out of regular operations Have emergency

    automation in place (scripts, jobs, etc.) Make routine failover strategies Treat infrastructure as full units Abstractions have their limits