[2020.01 Meetup] [TALK 2] SRE is pretty much what you make of it- Luís Rodrigues
A personal overview of a journey through SRE principles, ideas and best practices and how this is actually working across several teams. Presented by an old linux sysadmin guy.
3 years ▸ Former freelance jack of all trades ▸ Failed punk, ex-geologist wannabe ▸ Opinions are my own! But based on a lot of other people ones ▸ You can find me in at @luisvegeta 3
again. Sysadmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Everyone is busy, but it doesn’t get any better. Everything takes too long, cost too much and break too often!
SRE Overloaded. Constant firefighting. Waiting in ticket queues for everything. Everyone is busy, but it doesn’t get any better. Everything takes too long, cost too much and break too often!
for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.” 9 Ben Treynor VP of Engineering at Google
principles Service Level Objectives Service indicators from objectives and agreements Toil Eliminate repetitive work that scales with service growth Monitoring Reliability comes from a good ability to observe Release Engineering Changes provoke most of the outages Automation Automate as much as possible Simplicity As simple as possible, no simpler
range for a service level measured by an SLI. - Your service is up enough. - Your HTTP server responds with success often and fast enough. Service-Level Objective (SLO) Represents the amount of failure we expect to actually have. Error Budget A quantitative measure of some aspect of the level of service that is provided. “A quantifiable measure of service reliability” Examples: request latency, throughput, availability, error rate Service-Level Indicator (SLI)
teams need to be able to both run your systems and make them better. They can’t be buried in operational work. 14 Toil Engineering work Reduce toil improve the business E.W. No capacity to improve the business Toil No capacity to reduce toil
giving the most mission-critical systems to your SREs. Teams need space to flourish and grow. Share the responsibility of running services with the rest of the dev team. 15
team #1 Cross-functional team #2 Cross-functional team #3 Cross-functional team #4 Development team #1 Development team #2 Development team #3 Development team #4 SRE team Squad #1 Squad #2 Squad #3 Squad #4 Clear handoff requirements Error budget consequences
percent of the principles in the book work out of the box Fifty percent of the principles are good advice There’s a small number — 4% — that you should not execute. 19 Forrester research blog post
principles at OLX Everything is ephemeral Servers will die on you, network will fail. Maybe even DNS. Infrastructure as code Everything must be declared, pull requests approvals, etc. Monitoring Everything is monitored. RED metrics for all services Incident management SREs and Devs oncall for all services. All user impacting incidents require a postmortem and action points Alerting Smart alerts trough Pagerduty and Slack
RED Method Requests, Errors, Duration - Four Golden Signals Latency, traffic, errors, and saturation RED Method Rate the number of requests, per second, you services are serving. Errors the number of failed requests per second. Duration distributions of the amount of time each request takes.
to managed K8s ▹ Removed and simplified several layers ▸ Application level ▹ Adoption of managed services ▸ Monitoring level ▹ RED method ▹ Unification of tools 26