SRE NEXT 2020 [C6] Designing fault-tolerant microservices with SRE and circuit breaker centric architecture

The deck for the talk at SRE NEXT 2020 (https://sre-next.dev/schedule#c6)

Transcript

  1. SRE NEXT 2020: Designing fault-tolerant microservices with SRE and circuit breaker centric architecture
    Takayuki Watanabe, Cookpad Inc.
    SRE NEXT 2020 (Jan 25, 2020) / Takayuki Watanabe
  2. Who?
    Name: Takayuki Watanabe
    Affiliation: Cookpad Inc.
    Job: Site Reliability Engineering Chapter Lead
    SNS: Blog: blog.takanabe.tokyo, GitHub: takanabe, Twitter: @takanabe_w
    Interests: Chaos Engineering, Distributed Systems, Resilience Engineering
  3. Menu
    • About Cookpad Global
    • Search-v2 and ML APIs
    • Gaps: ideal and reality
    • Designing fault-tolerant microservices with SRE and circuit breaker centric architecture
  4. Out of scope
    • Monolith vs SOA vs Microservices
    • Software design and development in Cloud Native Era
    • Container orchestrators: Why ECS? Why EKS (k8s)?
    • Explanation of fundamental SRE words (e.g. SLO, SLI, error budget)
  5. Cookpad Global by numbers
    • 42,700,000 monthly users
    • 3,160,000 recipes
    • 74 countries
    • 32 languages
  6. Cookpad Global by numbers
    • 1 monolith + 7 microservices in production
    • 300+ spot instances for ECS clusters
    • 400+ deployments per ECS task definition per day
    • 20 deployments to production per day
  7. See more details on Speaker Deck [1, 2]
    [1] Cookpad TechConf 2018, Challenges for Global Service from a Perspective of SRE
    [2] Cookpad TechConf 2019, Challenges for Global Service from a Perspective of SRE ~ 2nd season ~
  8. Search is essential [3]
    [3] Go Global - #CookpadTechconf 2017
  9. Can users reach the best recipes out of 3,160,000 recipes?
  10. Search-v2 and ML APIs
    • Search-v2: people can find their favorite recipes for cooking
      • e.g. personalized search, visual search, recommendations
    • ML APIs: other APIs can provide machine learning integrated features
      • e.g. image enhancement, image to recipe
  11. 19

  12. 20

  13. Machine learning researcher ≠ SWE in machine learning
  14. 22

  15. 24

  16. 29

  17. Microservice architecture = Each team can use any technology we want
  19. Can we transfer internal resources and knowledge to other teams?
  20. Need more effort to gain benefits from microservice architecture
  21. Is it possible to develop search-v2/ML APIs with those tech stacks?
  22. As-Is: Developers use restricted technology stack
    To-Be: Search/ML team can use mainstream technology stack for their fields
  23. 43

  24. 44

  25. This service is experimental
    This service is beta
    This service is prototype
    This service is [ANY EXPRESSIONS]
  26. Low service level APIs potentially cause cascading outages
  27. As-Is: Production is down due to outages of new microservices
    To-Be: No production outages due to low service level microservices
  28. 50

  29. Does the team have enough capacity for on-call?
    "Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month." [4]
    "For production on-call responsibilities, I’ve found that two-tier 24/7 support requires eight engineers. As teams holding their own pagers have become increasingly mainstream, this has become an important sizing constraint, and I try to ensure that every engineering team’s steady state is eight people" [5]
    [4] Google - Site Reliability Engineering, Chapter 11 - Being On-Call
    [5] Larson, Will. An Elegant Puzzle: Systems of Engineering Management, 2.1 Sizing teams (p.33)
  30. As-Is: People have to be responsible for on-call rotations for new microservices
    To-Be: New search/ML team must be free from on-call pressures for their new microservices
  31. 54

  32. As-Is: Many teams need tough negotiations to release ML related features
    To-Be: ML team can release experimental features with light process in production
  33. Goals for the SRE team (Approach column not yet filled in)
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
  34. 58

  35. Design Docs
    • Reach consensus on scopes and expectations [6]
    • In Cookpad, only the SRE team knows the entire system design [7]
    [6] Google, Site Reliability Engineering, Chapter 31 - Communication and Collaboration in SRE
    [7] Google, The Site Reliability Workbook, Chapter 7 - Simplicity
  36. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document
  37. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document + ?
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document
  38. 67

  39. Implementation pattern
    • IAM (delegation level: low)
    • IAM Permissions Boundary (delegation level: medium)
    • Dedicated AWS account (delegation level: high)
  40. Dedicated AWS account
    • Use AWS Organizations to issue a new AWS account
    • Network is designed by SRE
    • Build VPC peering between new and old VPCs
  41. 70

  42. Transparent security and audit support
    • Enforce managed audit and security services on AWS
      • VPC Flow Logs
      • CloudTrail
      • GuardDuty
      • AWS Config
  43. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document, Delegation and resource isolation
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document
  44. Don't accept exceptions
    • We only have 3 SREs (in 2019)
    • Follow the boundary we define in the design document
    • Don't share servers managed by SRE team
    • Use SaaS to accelerate minimum product development cycles [8]
      • e.g. CI
      • e.g. Observability
    [8] Practical Monitoring: Effective Strategies for the Real World, Chapter 2.3 Pattern #3: Buy, Not Build
  45. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document, Delegation and resource isolation
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document + ?
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document
  46. 76

  47. Why Circuit Breaker?
    • Fail fast strategy to prevent cascading failures
    • Limits external service and network impacts
    • Don’t waste capacity calling a broken service
    • External service is slow
    • External service is down
    • Network is unstable
  48. • Closed
      • Traffic flows normally
      • Health is assessed every 100ms based on a 10s rolling average
    • Open / Tripped
      • Fail fast - return 503 error
      • Stays in this state for 10s
    • Recovering / Half Open
      • Ramp up traffic over 10s
      • Check health every 100ms -> if fail go back to Open state
      • Return to Closed if health is OK after 10s
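The three states above can be sketched as a small state machine. This is a minimal illustration, not the proxy's actual implementation: the `CircuitBreaker` class, its field names, and the 0.3 error-ratio trip condition are hypothetical. The 10s durations come from the slide. The rolling average and the gradual traffic ramp-up during recovery are left out to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CircuitBreaker:
    """Toy three-state breaker; the 10s durations mirror the slide."""
    error_ratio_threshold: float = 0.3  # hypothetical trip condition
    open_seconds: float = 10.0          # time spent failing fast
    recovery_seconds: float = 10.0      # time spent in half-open
    state: State = State.CLOSED
    changed_at: float = 0.0             # time of the last state transition

    def _move(self, state: State, now: float) -> None:
        self.state = state
        self.changed_at = now

    def check(self, now: float, error_ratio: float) -> State:
        """Run one health check (the slide does this every 100ms)."""
        unhealthy = error_ratio > self.error_ratio_threshold
        elapsed = now - self.changed_at
        if self.state is State.CLOSED and unhealthy:
            self._move(State.OPEN, now)
        elif self.state is State.OPEN and elapsed >= self.open_seconds:
            self._move(State.HALF_OPEN, now)
        elif self.state is State.HALF_OPEN:
            if unhealthy:
                self._move(State.OPEN, now)      # failed during recovery
            elif elapsed >= self.recovery_seconds:
                self._move(State.CLOSED, now)    # healthy for the full period
        return self.state

    def status_code(self, upstream_status: int) -> int:
        """Fail fast with 503 while open; otherwise pass the upstream status."""
        return 503 if self.state is State.OPEN else upstream_status
```

Passing the clock in as `now` keeps the sketch deterministic, which makes the transitions easy to exercise without sleeping.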
  49. Circuit Breaker - Implications
    • We can introduce experimental and new services with less risk to other parts of the application
    • Slow responses ~= Outage!
    • Fallback strategies become more important
    • Adds value to using SLOs as a communication tool about service levels
  50. Implementation pattern
    • Application library (e.g. cookpad/expeditor, Netflix/Hystrix)
    • Proxy (e.g. Envoy Proxy, Traefik)
    • Service Mesh (e.g. Istio, Maesh)
  51. Circuit breaker proxy sidecar container
    • Use an L7 reverse proxy with circuit breaking middleware
    • Each microservice has its own independently configured circuit breaker
    • Run as a sidecar container
  52. Traefik as circuit breaker proxy
    • NetworkErrorRatio
      • Covers networking errors connecting to the service
      • Shedding load can help some errors to recover!
    • ResponseCodeRatio
      • Don’t bother calling a broken service
    • LatencyAtQuantileMS
      • Isolate slow services
  53. Traefik configuration example
    {
      service1: {
        backend: 'http://service1_endpoint',
        circuit_breaker: "LatencyAtQuantileMS(50.0) > 1000 || ResponseCodeRatio(500, 600, 0, 600) > 0.30 || NetworkErrorRatio() > 0.10",
      },
      service2: {
        backend: 'http://service2_endpoint',
        circuit_breaker: "LatencyAtQuantileMS(50.0) > 3000 || ResponseCodeRatio(500, 600, 0, 600) > 0.10 || NetworkErrorRatio() > 0.10",
      },
    }
  54. 90

  55. Availability class
    • We customize the production readiness check as availability classes (a.k.a. production readiness review [9])
    [9] Google - Site Reliability Engineering, Chapter 32 - The Evolving SRE Engagement Model
  56. Availability class presets
    • Baseline
    • Medium
    • High
    • No SLO
  57. Baseline availability class
    Availability Target: > 95%
    Period     Downtime Budget
    Daily      1h 12m
    Weekly     8h 24m
    Monthly    36h 31m
  58. Medium availability class
    Availability Target: > 99%
    Period     Downtime Budget
    Daily      14m 24s
    Weekly     1h 41m
    Monthly    7h 18m
  59. High availability class
    Availability Target: > 99.9%
    Period     Downtime Budget
    Daily      1m 26s
    Weekly     10m 4s
    Monthly    43m 49s
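The downtime budgets in the three tables above follow from budget = period × (1 − availability target). The sketch below reproduces them. The function names are illustrative, and reading "Monthly" as an average month of 365.25 / 12 days is an assumption that happens to match the slide figures; the talk does not state it.

```python
from datetime import timedelta

# Period lengths. The average-month definition is an assumption.
PERIODS = {
    "Daily": timedelta(days=1),
    "Weekly": timedelta(days=7),
    "Monthly": timedelta(days=365.25 / 12),
}


def downtime_budget(availability_target: float, period: timedelta) -> timedelta:
    """Allowed downtime in a period for a target such as 0.999."""
    return period * (1.0 - availability_target)


def format_budget(budget: timedelta) -> str:
    """Render as hours, minutes, seconds, dropping fractions of a second."""
    total_seconds = int(budget.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


if __name__ == "__main__":
    for target in (0.95, 0.99, 0.999):
        for name, period in PERIODS.items():
            print(target, name, format_budget(downtime_budget(target, period)))
```

The slides round to the nearest displayed unit, so the weekly budget at 99% (1h 40m 48s) appears there as 1h 41m.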
  60. 101

  61. 102

  62. How do we know the service level?
  63. Implementing alerts on SLO
    There are several strategies to implement alerts on SLO [10]
    • Target Error Rate ≥ SLO Threshold
    • Increased Alert Window
    • Incrementing Alert Duration
    • Alert on Burn Rate
    • Multiple Burn Rate Alerts
    • Multiwindow, Multi-Burn-Rate Alerts
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  65. Burn rate
    Burn rate is how fast, relative to the SLO, a service consumes the error budget
  66. Burn rates and time to complete budget exhaustion [10]
    Burn rate    Error rate for 99.9% SLO    Time to exhaustion
    1            0.1%                        30 days
    2            0.2%                        15 days
    10           1%                          3 days
    1000         100%                        43 minutes
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
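Each row of the burn-rate table can be derived from two relations: error rate = burn rate × (1 − SLO), and time to exhaustion = SLO window ÷ burn rate. A minimal sketch, assuming a 30-day SLO window as the table implies; the function names are illustrative.

```python
from datetime import timedelta

SLO_WINDOW = timedelta(days=30)  # window implied by the "30 days" row


def error_rate(burn_rate: float, slo: float) -> float:
    """Error rate that produces the given burn rate under the given SLO."""
    return burn_rate * (1.0 - slo)


def time_to_exhaustion(burn_rate: float, window: timedelta = SLO_WINDOW) -> timedelta:
    """How long the error budget lasts if the burn rate stays constant."""
    return window / burn_rate


if __name__ == "__main__":
    for rate in (1, 2, 10, 1000):
        print(rate, f"{error_rate(rate, 0.999):.1%}", time_to_exhaustion(rate))
```

At a burn rate of 1000 the budget lasts 43 minutes and 12 seconds, which the slide shows as 43 minutes.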
  67. Burn rates and time to complete budget exhaustion [10]
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  68. Multiwindow, Multi-Burn-Rate Alerts [10]
    • This approach provides good-precision alerts and reduces the number of false positives
    • Make the short window 1/12 the duration of the long window as the starting point
    Severity    Notification    Long window    Short window    Burn rate    Error budget consumed
    Critical    Pager           1 hour         5 minutes       14.4         2%
    Critical    Pager           6 hours        30 minutes      6            5%
    Warning     Chat, ticket    3 days         6 hours         1            10%
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
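The burn-rate column in the table above can be recovered from the other columns: burn rate = budget consumed × SLO period ÷ long window. An alert fires only when both the long and the short window exceed the resulting error-rate threshold. The sketch below assumes a 30-day SLO period and a 99.9% SLO; the function names are illustrative, not taken from the talk.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate_for(budget_fraction: float, long_window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the long window."""
    return budget_fraction * SLO_PERIOD_HOURS / long_window_hours


def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 burn_rate: float,
                 slo: float = 0.999) -> bool:
    """Fire only if both windows are burning faster than the threshold."""
    threshold = burn_rate * (1.0 - slo)
    return (long_window_error_rate > threshold
            and short_window_error_rate > threshold)


if __name__ == "__main__":
    print(burn_rate_for(0.02, 1))    # pager row with a 1 hour window
    print(burn_rate_for(0.05, 6))    # pager row with a 6 hour window
    print(burn_rate_for(0.10, 72))   # ticket row with a 3 day window
```

Requiring the short window as well means the alert stops firing soon after the errors stop, instead of waiting for the long window to drain.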
  69. Multiwindow, Multi-Burn-Rate Alerts [10]
    [10] Google - The Site Reliability Workbook, Chapter 5: Alerting on SLOs
  71. 113

  72. Implementing Prometheus configs in Jsonnet
    • Jsonnet [11] is a data templating language
    • Simple extension of JSON
    • Eliminate duplication with object-orientation
    [11] google/jsonnet: Jsonnet - The data templating language
  73. Prometheus config structure in Jsonnet
    $ tree prometheus-config
    prometheus-config
    ├── alertmanager.jsonnet
    ├── alertmanager_templates.jsonnet
    ├── lib
    │   ├── alert.libsonnet
    │   ├── alertmanager.libsonnet
    │   [...snip...]
    │   ├── traefik.libsonnet
    │   └── utils.libsonnet
    ├── platform.libsonnet
    ├── prometheus_rules.jsonnet
    ├── runbooks
    │   ├── alertmanager-down.md
    │   ├── blackbox-exporter-down.md
    │   [...snip...]
    │   └── ssh-probe-failed.md
    ├── services
    │   ├── service1.libsonnet
    │   └── service2.libsonnet
    └── services.libsonnet
  74. Alerting rule library for Traefik
    $ cat lib/traefik.libsonnet
    {
      [...snip...]
      traefik_backend_high_error_budget_burn_rate_alert: self.alert {
        name: 'TraefikBackendHighErrorBudgetBurnRate',
        summary: '[{{ $labels.backend }} in {{ $labels.environment }}] Traefik backend error budget burn rate is high',
        description: '[{{ $labels.backend }} in {{ $labels.environment }}] Immediate intervention is required to defend the Uptime SLO',
        expr: |||
          (
            environment_backend:traefik_backend_errors_per_request:ratio_rate1h{%(matchers)s} > (14.4*0.001)
            and
            environment_backend:traefik_backend_errors_per_request:ratio_rate5m{%(matchers)s} > (14.4*0.001)
          )
          or
          (
            environment_backend:traefik_backend_errors_per_request:ratio_rate6h{%(matchers)s} > (6*0.001)
            and
            environment_backend:traefik_backend_errors_per_request:ratio_rate30m{%(matchers)s} > (6*0.001)
          )
        ||| % self,
      },
      [...snip...]
    }
  75. Alerting config for service1
    $ cat services/service1.libsonnet
    local resque = import '../lib/resque.libsonnet';
    local service = import '../lib/service.libsonnet';
    local traefik = import '../lib/traefik.libsonnet';
    service {
      name: 'service1',
      slack_channel: 'service1-alerts',
      dashboard: 'https://grafana.example.com./d/service1',
      components+: [
        [...snip...]
        self.component('traefik') {
          alerts+: [
            self.traefik_backend_high_error_budget_burn_rate_alert {
              matchers: 'backend="service1", environment="production"',
            },
            self.traefik_backend_high_error_budget_burn_rate_warning_alert {
              matchers: 'backend="service1", environment="production"',
            },
          ],
        } + traefik,
        [...snip...]
      ],
    }
  76. Caution!
    • Jsonnet is a very powerful language for eliminating redundancy
    • Overly DRY configurations are difficult to maintain
    • We have to control the power and keep configurations simple
  77. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document, Delegation and resource isolation
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document, SLO, Circuit breaker
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document
  78. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document, Delegation and resource isolation
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document, SLO, Circuit breaker
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document + ?
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document
  79. 125

  80. Strategy to make the new team free from on-call pressure
  81. Fallback to search-v1 when circuit breaker is open
    • Proxy partial requests to search-v2 in feature toggle
    • Strict circuit breaking threshold (No SLO or extremely low SLO) and fail fast when upstream is unstable
    • Rescue all errors in feature toggle
    • Fallback all requests to search-v1 when circuit breaker returns 503s
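The fallback flow in the bullets above can be sketched as a wrapper around two search functions. This is an illustration under assumptions, not Cookpad's code: `make_search`, `UpstreamUnavailable`, `v2_ratio`, and the injectable `rng` are hypothetical names, and the real clients would make HTTP calls through the circuit breaker proxy.

```python
import random
from typing import Callable, List

Search = Callable[[str], List[str]]


class UpstreamUnavailable(Exception):
    """Hypothetical error a search-v2 client raises on a 503 from the breaker."""


def make_search(search_v1: Search,
                search_v2: Search,
                v2_ratio: float,
                rng: Callable[[], float] = random.random) -> Search:
    """Build a search function that sends a share of traffic to search-v2."""

    def search(query: str) -> List[str]:
        # Feature toggle: only part of the traffic goes to search-v2.
        if rng() >= v2_ratio:
            return search_v1(query)
        try:
            return search_v2(query)
        except Exception:
            # Rescue every error from search-v2, including the breaker's 503,
            # and serve the request from search-v1 instead.
            return search_v1(query)

    return search
```

Because every search-v2 failure is absorbed inside the toggle, an outage of the new service degrades results to search-v1 quality rather than paging anyone.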
  82. 128

  83. 129

  84. 130

  85. 131

  86. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document, Delegation and resource isolation
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document, SLO, Circuit breaker
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document, SLO, Circuit breaker + Fallback
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document
  87. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document, Delegation and resource isolation
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document, SLO, Circuit breaker
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document, SLO, Circuit breaker + Fallback
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document + ?
  88. Implementation pattern
    • API Gateway (BFF) for mobile apps with JWT
    • Feature toggle + path-based routing
  89. BFF pattern for mobile clients in Cookpad [12]
    [12] Cookpad Developers' Blog, "Rebuilding an existing API server with a modern BFF" (original Japanese title: モダンBFFを活用した既存APIサーバーの再構築)
  90. Prod endpoint + feature toggle + path-based routing
    • Specify a single shared ML API endpoint in feature toggle
    • Strict circuit breaking threshold (No SLO) and fail fast when upstream is unstable
    • Rescue all errors in feature toggle and dismiss them
    • Change destination for each ML API based on request path
  91. Goals for the SRE team
    • As-Is: Developers use restricted technology stack
      To-Be: Search/ML team can use mainstream technology stack for their fields
      Approach: Design document, Delegation and resource isolation
    • As-Is: Production outages due to new microservices
      To-Be: No production outages due to low service level microservices
      Approach: Design document, SLO, Circuit breaker
    • As-Is: People have to be responsible for on-call rotations for new microservices
      To-Be: New search/ML team must be free from on-call pressures for their new microservices
      Approach: Design document, SLO, Circuit breaker + Fallback
    • As-Is: Many teams need tough negotiations to release ML related features
      To-Be: ML team can release experimental features with light process in production
      Approach: Design document, SLO (No SLO), Circuit breaker, Feature toggle + Path-based routing
  93. SRE expertise and circuit breaker
    • Protect microservices from unreliable microservices
    • Enforce contracts (alignment) among teams
    • Provide an on-call free environment for the new team
    • Enable developers to release experimental features
    • Reduce unproductive communication among teams
  94. What is the best on-call rotation?
    • It really depends on your team members
      • Some people love weekly rotation
      • Some people love daily rotation
      • Some people love on-call on weekends
    • Don't create an organization-wide rotation rule [13]
    [13] A well-designed policy about on-call compensation is necessary to achieve this
  95. On-call rotation strategy in Cookpad
    • Don't page on events which don't damage our SLO
    • Use the advantages of time-zone differences and a distributed team [14]
    • SREs and developers collaborate closely to fix problems
    [14] Strategy for two-tier on-call rotation, https://blog.takanabe.tokyo
  96. On-call rotation in SRE team
    • Hybrid strategy to use the advantages of time-zone differences
      • JP (UTC+9) & UK (UTC+0) business hour shift
      • Daily off-hours rotation
  97. How can we introduce SRE in an organization?
    If you try to introduce the SRE methodology and culture with bottom-up approaches,
    • Start from a small thing
    • Find a buddy in the product development teams who is happy to support your ideas
    • Provide incentives to your product developers
      • SREs are responsible for primary on-call if your services achieve your SLO standard (e.g. 99.99% availability) for a month
    • Find a win-win strategy for developers and SREs
    • Don't throw an SRE sales pitch
      • Don't play the "SRE is one of the Google best practices" card
      • We should seriously provide benefits to the organization with SRE methodologies (Why do we need SLO? What benefits do we have?)
  98. Achievements
    • Improvement of production stability
    • Apply SRE techniques to a real service
    • Release of machine learning integrated search in production [15]
    • Release of machine learning oriented infrastructure
    [15] Vector scoring for term embeddings in Elasticsearch - Speaker Deck
  99. What's next?
    • Promote SRE culture with battle-tested methodologies
    • Providing JWT auth endpoint for ML and other microservices
      • Machine learning researchers want to provide services that will be consumed by beta builds of mobile applications
      • Monolith doesn't need frequent code changes for ML experiences
      • Monolith doesn't have to proxy anything (this sounds worry