[2020.01 Meetup] [TALK 2] SRE is pretty much what you make of it- Luís Rodrigues

SRE is pretty much what you make of it
DevOps meetup Lisbon

Agenda
▸ What in the world is SRE?
▸ SRE and DevOps
▸ SRE work at OLX Group

Who is this guy
Luís Rodrigues
▸ OLX SRE for 3 years
▸ Former freelance jack of all trades
▸ Failed punk, ex-geologist wannabe
▸ Opinions are my own! But based on a lot of other people’s
▸ You can find me at @luisvegeta

Sysadmin life before SRE
Things break. Break again. And again.
Sysadmins overloaded. Constant firefighting. Waiting in ticket queues for everything.
Everyone is busy, but it doesn’t get any better.
Everything takes too long, costs too much and breaks too often!

Then the company decides to implement SRE

And everything changes!
Things break. Break again. And again.
SREs overloaded. Constant firefighting. Waiting in ticket queues for everything.
Everyone is busy, but it doesn’t get any better.
Everything takes too long, costs too much and breaks too often!

“Changing job titles or adding individual skills doesn’t make systems administrators SREs.”
Damon Edwards, Co-Founder of Rundeck Inc

1. What in the world is SRE?
Principles, ideas and sources of information

Google created SRE
“In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”
Ben Treynor, VP of Engineering at Google

And its principles
- Embracing Risk: assessment, management and error budgets
- Service Level Objectives: service indicators from objectives and agreements
- Toil: eliminate repetitive work that scales with service growth
- Monitoring: reliability comes from a good ability to observe
- Release Engineering: changes provoke most of the outages
- Automation: automate as much as possible
- Simplicity: as simple as possible, no simpler

And Google engineers (literally) wrote the book on it
SLIs, SLOs, SLAs, What?! https://sre.xyz

Site Reliability Principles
- SRE needs Service Level Objectives, with consequences
- SREs have time to make tomorrow better than today
- SRE teams have the ability to regulate their workload

SRE needs SLOs, with consequences
Service-Level Indicator (SLI): A quantitative measure of some aspect of the level of service that is provided. “A quantifiable measure of service reliability”. Examples: request latency, throughput, availability, error rate.
Service-Level Objective (SLO): A target value or range for a service level measured by an SLI.
- Your service is up enough.
- Your HTTP server responds with success often and fast enough.
Error Budget: Represents the amount of failure we expect to actually have.
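
To make the relationship between SLI, SLO and error budget concrete, here is a minimal sketch in Python. It is not taken from the talk: the names `SLO`, `availability_sli` and `error_budget_remaining`, the 99.9% target and the request counts are all assumptions chosen for illustration. The budget is treated as the number of failures the SLO tolerates over a window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for an SLI, e.g. 99.9% of requests succeed (hypothetical type)."""
    name: str
    target: float  # required fraction of good events, between 0 and 1


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were good (1.0 when there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left; negative once the budget is overspent."""
    allowed_failures = (1.0 - slo.target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # A 100% target (or an empty window) tolerates no failures at all.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Illustrative numbers only: 1,000,000 requests, 500 of them failed.
slo = SLO(name="http-success", target=0.999)
sli = availability_sli(good_events=999_500, total_events=1_000_000)
remaining = error_budget_remaining(slo, good_events=999_500, total_events=1_000_000)
print(f"SLI={sli:.4%} budget remaining={remaining:.0%}")
```

A negative `remaining` value is one possible trigger for the "consequences" in the slide title, for example pausing feature releases until reliability recovers. What the consequence should be is a team decision that this sketch does not prescribe.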

SREs have time to make tomorrow better than today
SRE teams need to be able to both run your systems and make them better. They can’t be buried in operational work.
- More engineering work than toil: reduce toil, improve the business
- More toil than engineering work (E.W.): no capacity to improve the business, no capacity to reduce toil

SRE teams have the ability to regulate their workload
- Prioritize giving the most mission-critical systems to your SREs.
- Teams need space to flourish and grow.
- Share the responsibility of running services with the rest of the dev team.

2. SRE and DevOps
class SRE implements DevOps

SRE vs DevOps?
SRE (Reliability)
- Operations
- Incident management
- Post Mortems
- Monitoring/Alerting
- Capacity planning
DevOps (Delivery Speed)
- Delivery
- Release automation
- Environment builds
- Config management
- Infrastructure as code

SRE and DevOps: teams org model
- SRE team; Cross-functional team #1, #2, #3, #4
- Development team #1, #2, #3, #4; SRE team
- Squad #1, #2, #3, #4
- Clear handoff requirements
- Error budget consequences

We already do DevOps! Can we start doing SRE?
“Forty-six percent of the principles in the book work out of the box. Fifty percent of the principles are good advice. There’s a small number (4%) that you should not execute.”
Forrester research blog post

3. SRE work at OLX Group
What we do, did and what we are planning to do

SRE team and development packs
Head of Infrastructure
- HUB #1: SRE leads, Pack #1, Pack #2, Pack #3, Pack #N
- HUB #2: SRE leads, Pack #1, Pack #2, Pack #3, Pack #N
- HUB #3: SRE leads, Pack #1, Pack #2, Pack #3, Pack #N

SRE principles at OLX
- Automation: Automate as much as possible. Automate everything!
- Everything is ephemeral: Servers will die on you, the network will fail. Maybe even DNS.
- Infrastructure as code: Everything must be declared, with pull request approvals, etc.
- Monitoring: Everything is monitored. RED metrics for all services.
- Incident management: SREs and devs are on call for all services. All user-impacting incidents require a postmortem and action points.
- Alerting: Smart alerts through PagerDuty and Slack.

Automation: Atlantis
Terraform Pull Request Automation
- Make Terraform changes visible to your team.
- Enable all engineers to collaborate on Terraform.
- Standardize your Terraform workflows.
https://www.runatlantis.io

Monitoring: guidelines
- USE Method: Utilization, Saturation, Errors
- RED Method: Requests, Errors, Duration
- Four Golden Signals: Latency, traffic, errors, and saturation
RED Method
- Rate: the number of requests per second your services are serving.
- Errors: the number of failed requests per second.
- Duration: distributions of the amount of time each request takes.
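
The three RED measurements can be sketched from raw request records as follows. This is an illustration only, not how OLX computes its metrics: the `Request` type, the `red_metrics` function and the choice of p50/p99 to summarize the duration distribution are assumptions, and in practice a metrics system would usually aggregate these from counters and histograms.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape)."""
    duration_s: float  # time taken to serve the request, in seconds
    failed: bool       # True if the request counts as an error


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize one observation window as Rate, Errors and Duration."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    durations = sorted(r.duration_s for r in requests)
    failures = sum(1 for r in requests if r.failed)
    # quantiles() needs at least two data points; otherwise fall back
    # to the single observed value, or 0.0 for an empty window.
    if len(durations) >= 2:
        cuts = quantiles(durations, n=100, method="inclusive")
        p50, p99 = cuts[49], cuts[98]
    else:
        p50 = p99 = durations[0] if durations else 0.0
    return {
        "rate_per_s": len(requests) / window_s,   # Rate
        "errors_per_s": failures / window_s,      # Errors
        "duration_p50_s": p50,                    # Duration (median)
        "duration_p99_s": p99,                    # Duration (tail)
    }


# Illustrative sample: four requests observed over a two-second window.
sample = [
    Request(0.1, False),
    Request(0.2, False),
    Request(0.3, True),
    Request(0.4, False),
]
metrics = red_metrics(sample, window_s=2.0)
print(metrics)
```

Reporting percentiles rather than an average keeps slow outliers visible, which matches the slide's wording of Duration as a distribution.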

Incident Management
▸ Classification: P1, P2 or bug
▸ Incident triggering
  ▹ Monitoring
  ▹ Slack bot
▸ Incident handling
  ▹ On-call or dev teams
  ▹ First Responders Team (FRT)
  ▹ War rooms
▸ Blameless Postmortems

Simplicity: less is more
▸ Stack level
  ▹ From datacenter to managed K8s
  ▹ Removed and simplified several layers
▸ Application level
  ▹ Adoption of managed services
▸ Monitoring level
  ▹ RED method
  ▹ Unification of tools

THANKS! Any questions?
You can find me at:
▸ @luisvegeta
▸ [email protected]
