

ITT 2019 - Constance Caramanolis - High severity incident response leveraging Envoy

Incident management is inherently stressful, and it gets worse when diagnostic and observability data is missing or heterogeneous. Lyft runs Envoy at every hop of the network, providing best-in-class observability across the entirety of Lyft’s network topology. Homogeneous data reduces the time it takes to identify production issues. This talk introduces Envoy, shows how Lyft configures it, and simulates a production incident at Lyft. Attendees are guided from the dreaded notification of an issue in production to resolution, seeing how engineers use Envoy’s extensive observability to identify and root-cause the incident and remedy the situation.

Istanbul Tech Talks

April 02, 2019


Transcript

  1. WHY BUILD ENVOY? Service Oriented Architecture gets complicated quickly.

    • Languages and frameworks
    • Protocols
    • Distributed Systems best practices
    • Libraries for service calls
    • Observability outputs
    • Load Balancers
  2. WHAT IS ENVOY? The network should be transparent to applications.

    When network and application problems do occur it should be easy to determine the source of the problem.
  3. ENVOY IS PRETTY COOL…

    • Performance
    • Reliability
    • Modern Codebase
    • Configuration API
    • Observability
    • Community
  4. CONFIGURING EDGE ENVOY

    • Ordered list of domains and routes.
    • Each route can associate a cluster for proxying the request to.
    • Request is matched to the first route that satisfies the constraints.

        virtual_hosts:
        - name: www
          domains:
          - www.yourcompany.com
          routes:
          - match:
              prefix: "/foo/bar"
            route:
              cluster: "service2"
          - match:
              prefix: "/"
            route:
              cluster: "service1"
        - name: api
          domains:
          - api.yourcompany.com
          routes:
          - match:
              prefix: "/"
            route:
              cluster: "service3"
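The first-match rule on this slide can be illustrated with a small Python sketch. It is a toy model, not Envoy's matcher: the `VIRTUAL_HOSTS` table and the `pick_cluster` helper are hypothetical names, domains are compared by exact string match only, and routes are tested by path prefix in the order they are listed.

```python
from typing import Optional

# Hypothetical in-memory copy of the route table shown on the slide.
VIRTUAL_HOSTS = [
    {
        "name": "www",
        "domains": ["www.yourcompany.com"],
        "routes": [
            {"prefix": "/foo/bar", "cluster": "service2"},
            {"prefix": "/", "cluster": "service1"},
        ],
    },
    {
        "name": "api",
        "domains": ["api.yourcompany.com"],
        "routes": [{"prefix": "/", "cluster": "service3"}],
    },
]


def pick_cluster(host: str, path: str) -> Optional[str]:
    """Return the cluster of the first route whose prefix matches, or None."""
    for vhost in VIRTUAL_HOSTS:
        if host not in vhost["domains"]:
            continue  # assumption: exact domain match only
        for route in vhost["routes"]:
            # Routes are checked in listed order; the first match wins.
            if path.startswith(route["prefix"]):
                return route["cluster"]
        return None  # the domain matched but no route did
    return None  # no virtual host serves this domain


if __name__ == "__main__":
    print(pick_cluster("www.yourcompany.com", "/foo/bar/baz"))
    print(pick_cluster("www.yourcompany.com", "/other"))
```

Because the catch-all `/` prefix matches every path, it has to come last; listing it before `/foo/bar` would send every `www` request to `service1`.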
  5. CONFIGURING INTERNAL SERVICES

    • Requests are made to Envoy through localhost:<port number>
      ◦ Application no longer needs to maintain connections, handle errors or emit metrics for requests!
      ◦ Envoy will handle service discovery for you!
    • General rule of thumb
      ◦ One port for ingress traffic to a service
      ◦ One port for egress traffic to other internal services
      ◦ One port per external dependency (database, 3rd party API, etc)

        port: 9001
        virtual_hosts:
        - name: driver
          domains: driver
          routes:
          - match:
              prefix: "/"
            route:
              cluster: "driver"
        - name: locations
          domains: locations
          routes:
          - match:
              prefix: "/"
            route:
              cluster: "locations"
  6. UPSTREAM AND DOWNSTREAM

    Downstream: the direction of where the water flows
    Upstream: the direction against the flow of water.
    Response ~ Water

    [Diagram: Request and Response between Service A and Service B]
  7. CRASH COURSE ENVOY METRICS

    • HTTP Status Codes metrics
      ◦ upstream_rq_200, upstream_rq_404, upstream_rq_503 per upstream cluster
    • Request and Connection errors
      ◦ upstream_rq_retries, upstream_rq_maintenance_mode, upstream_cx_connect_fail …
    • HTTP errors per listener
      ◦ http.listener.downstream_rq_2xx, http.listener.downstream_rq_4xx, …

    Success Rate is the ratio of successful requests (2xx) over total requests sent.
    Remember - Envoy is used at every hop!
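The success-rate definition above can be sketched in Python as follows. The `success_rate` helper and the counter values are made up for illustration; only the `upstream_rq_<code>` naming comes from the slide. Treating a cluster with no traffic as 100% successful is an assumption of this sketch, not something the talk states.

```python
from typing import Dict

PREFIX = "upstream_rq_"


def success_rate(counters: Dict[str, int]) -> float:
    """Ratio of 2xx responses to all status-coded responses for one cluster."""
    total = 0
    ok = 0
    for name, value in counters.items():
        if not name.startswith(PREFIX):
            continue
        code = name[len(PREFIX):]
        if not (code.isdigit() and len(code) == 3):
            continue  # skip non-status counters such as upstream_rq_retries
        total += value
        if code.startswith("2"):
            ok += value
    # Assumption: a cluster that received no requests counts as healthy.
    return ok / total if total else 1.0


if __name__ == "__main__":
    # Illustrative numbers only, not data from the incident.
    sample = {
        "upstream_rq_200": 950,
        "upstream_rq_404": 20,
        "upstream_rq_503": 30,
        "upstream_rq_retries": 12,
    }
    print(f"success rate: {success_rate(sample):.2%}")
```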
  8. P0 EDGE ENVOY DEGRADATION

    Incident Report Email
    Time: 3:44 am
    Edge Envoy is paging due to success rate dip. Investigation ongoing in #operations. Limited understanding of impact.
    Update: When more is known.
  9. EDGE ENVOY SUCCESS RATE DIP

    • P0 page at Lyft.
    • Requests coming into Lyft experiencing degraded state
    • Usually indicates customer impact.
  10. • Ordered list of domains and routes.

    • Identifying route can be done a few ways:
      ◦ Visual inspection of routes
      ◦ Virtual Cluster metrics
      ◦ Access Logs

    [2019-03-26] "GET /show_maps HTTP/1.1" 503 UH "1-2-3-4" "api.lyft.net" "maprender"
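As a rough illustration of the access-log approach, the sketch below splits the sample line from the slide into fields. The slide does not label the fields, so the names used here (`request_id`, `authority`, `upstream`) are guesses based on their position, and `parse_line` is a hypothetical helper. In Envoy's access logs, the `UH` response flag indicates that the upstream cluster had no healthy hosts.

```python
import shlex
from typing import Dict

# Sample line from the slide, with straight quotes.
SAMPLE = '[2019-03-26] "GET /show_maps HTTP/1.1" 503 UH "1-2-3-4" "api.lyft.net" "maprender"'


def parse_line(line: str) -> Dict[str, str]:
    """Split one access-log line into named fields (field names are assumed)."""
    # shlex.split keeps each double-quoted section together as one token.
    timestamp, request, status, flags, request_id, authority, upstream = shlex.split(line)
    method, path, protocol = request.split(" ")
    return {
        "timestamp": timestamp.strip("[]"),
        "method": method,
        "path": path,
        "protocol": protocol,
        "status": status,
        "flags": flags,
        "request_id": request_id,
        "authority": authority,
        "upstream": upstream,
    }


if __name__ == "__main__":
    entry = parse_line(SAMPLE)
    if entry["status"].startswith("5"):
        print(entry["path"], "failed via", entry["upstream"], "flags:", entry["flags"])
```

Grouping failing entries by path and upstream is then enough to see which route is behind the success rate dip.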
  11. RE: P0 EDGE ENVOY DEGRADATION

    Incident Report Email Update
    Time: 3:55 am
    Root cause has been identified. Photo Filters API is returning errors. Remediation is being discussed in #operations.
    Impact: Maps aren’t rendering in clients.
  12. MANAGING PHOTO FILTER ERRORS

    1. Make changes to the Driver code at 4 am.
    2. Reduce the load on Photo Filter API.

    Maintenance mode is a runtime key that sheds a percentage of traffic to an upstream host.

    * LATE NIGHT CODING PHOTO PROVIDED BY HTTPS://WWW.FLICKR.COM/PHOTOS/JJACKOWSKI/15659707052 IN ITS ORIGINAL FORMAT. PLEASE REFER TO CREATIVE COMMON FOR MORE INFO.
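Shedding a percentage of traffic, as maintenance mode is described on this slide, can be modelled with a few lines of Python. This is a conceptual sketch rather than Envoy's implementation: `should_shed` and `handle` are hypothetical names, and returning 503 for shed requests and 200 for proxied ones is a stand-in for real proxy behaviour.

```python
import random


def should_shed(shed_percent: float, rng: random.Random) -> bool:
    """Return True when this request should be rejected instead of proxied."""
    # rng.random() is in [0, 1), so 0 never sheds and 100 always sheds.
    return rng.random() * 100.0 < shed_percent


def handle(shed_percent: float, rng: random.Random) -> int:
    """Return an HTTP status: 503 if shed, 200 as a stand-in for proxying."""
    if should_shed(shed_percent, rng):
        return 503  # fail fast without touching the struggling upstream
    return 200


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the run is repeatable
    statuses = [handle(25.0, rng) for _ in range(10_000)]
    print("shed fraction:", statuses.count(503) / len(statuses))
```

Because the percentage is read per request, raising or lowering it takes effect immediately, which is what makes a runtime key a useful lever in the middle of an incident.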
  13. RE: P0 EDGE ENVOY DEGRADATION

    Incident Report Email Update
    Time: 4:00 am
    Maintenance mode applied for Photo Filters API. Waiting to see results.
    Update in 15 minutes or with new information.
  14. IMPACT TO DRIVER SERVICE

    • Increase the number of Driver instances.
      ◦ Scaling can be SLOW.
    • Reduce the number of requests that can be made to Driver services by configuring Circuit Breakers.
  15. CIRCUIT BREAKERS

    • Quickly fails and allows for backpressure to be applied throughout the system.
    • Configure maximum number of connections, pending requests, requests, active retries and concurrent connection pools.
    • Different levers for HTTP 1 and HTTP 2!
    • Different levers for request priorities!
    • Metrics emitted!
      ◦ upstream_rq_pending_overflow, upstream_cx_overflow, …
    • Runtime configurable
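The fail-fast behaviour described on this slide can be sketched with a toy limiter. It is not Envoy's implementation: the `CircuitBreaker` class, its two limits and its queueing rule are simplifications chosen for illustration, and only the `upstream_rq_pending_overflow` counter name is taken from the slide.

```python
class CircuitBreaker:
    """Toy limiter: cap active and pending requests, reject the rest."""

    def __init__(self, max_requests: int, max_pending: int) -> None:
        self.max_requests = max_requests
        self.max_pending = max_pending
        self.active = 0
        self.pending = 0
        self.stats = {"upstream_rq_pending_overflow": 0}

    def try_acquire(self) -> str:
        """Admit a request as 'active' or 'pending', or return 'rejected'."""
        if self.active < self.max_requests:
            self.active += 1
            return "active"
        if self.pending < self.max_pending:
            self.pending += 1
            return "pending"
        # Both limits are full: fail fast and count the overflow.
        self.stats["upstream_rq_pending_overflow"] += 1
        return "rejected"

    def release(self) -> None:
        """Finish one active request and promote a pending one, if any."""
        if self.active == 0:
            return
        self.active -= 1
        if self.pending > 0:
            self.pending -= 1
            self.active += 1


if __name__ == "__main__":
    breaker = CircuitBreaker(max_requests=2, max_pending=1)
    print([breaker.try_acquire() for _ in range(4)])
    print(breaker.stats)
```

Rejecting the fourth request immediately, instead of queueing it without bound, is what lets backpressure reach the caller while the overflow counter records that the limit was hit.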
  16. RE: P0 EDGE ENVOY DEGRADATION

    Incident Report Email Update
    Time: 4:10 am
    Maintenance mode to Photo Filter API was insufficient. In addition, Driver service CPU was running hot due to misconfigured circuit breaker settings. Reducing the number of requests per host and scaling up.
    Update in 10 mins
  17. EDGE TRAFFIC TO MAPRENDER

    • Routes are matched in the order specified in the configuration.
    • All request paths prefixed with ‘/show_map’ are proxied to MapRender.
    • All other requests are proxied to Render.
  18. RE: P0 EDGE ENVOY DEGRADATION

    Incident Report Email Update
    Time: 4:20 am
    Decision was made to disable calls to MapRender for API /show_map (which invoked the Driver service and Photo Filters API). Defaulting to legacy Render service. This is expected to return Lyft to normal state.
    Update: 10 mins.
  19. RESOLVED: P0 EDGE ENVOY DEGRADATION

    Incident Report Email Update
    Time: 4:25 am
    Confirmed map is correctly rendering in all clients. Application success rate is back at 100%. Post mortem is scheduled for Wednesday at 1pm. No further updates. Good night!
  20. FASTER ROOT CAUSING

    Instead of following every hop to the failing service
    • Edge to Map Render
    • Map Render to Driver
    • Driver to Photo Filter API

    Look at all upstream failures at once
  21. JUST THE BEGINNING!

    • Circuit breaker settings
    • Outlier detection
    • Access logging
    • Tracing
    • Request Mirroring
    • Dynamic Envoy configurations
    • HTTP header options
    • Traffic shifting
    • Maintenance mode

    https://www.envoyproxy.io/