Lynn Root - Tracing, Fast and Slow: Digging into and improving your web service’s performance

Do you maintain a [Rube Goldberg](https://s-media-cache-ak0.pinimg.com/564x/92/27/a6/9227a66f6028bd19d418c4fb3a55b379.jpg)-like service? Perhaps it’s highly distributed? Or you recently walked onto a team with an unfamiliar codebase? Have you noticed your service responds slower than molasses? This talk walks through how to pinpoint bottlenecks, which approaches and tools can help you make improvements, and how to come out looking like the hero. All in a day’s work.

The talk describes several ways to trace a web service, including black-box and white-box tracing and tracing distributed systems, along with the tools and external services available to measure performance. I’ll also present a few rabbit holes to dive into when trying to improve your service’s performance.

https://us.pycon.org/2017/schedule/presentation/565/

PyCon 2017

May 21, 2017

Transcript

  1. Lynn Root | SRE | @roguelynn
    Tracing: Fast & Slow
    Digging into and improving your web service’s performance
  2. agenda
    • Overview and problem space
    • Approaches to tracing
    • Tracing at scale
    • Diagnosing performance issues
  3. agenda
    • Overview and problem space
    • Approaches to tracing
    • Tracing at scale
    • Diagnosing performance issues
    • Tracing services & systems
  4. machine-centric
    • Focus on a single machine
    • No view into a service’s dependencies
  5. why trace?
    • Performance analysis
    • Anomaly detection
    • Profiling
    • Resource attribution
    • Workload modeling
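  6. request ID decorator

          import uuid
          from functools import wraps

          # Assumed: `request` and `app` on this slide are Flask's.
          from flask import request

          def request_id(f):
              @wraps(f)
              def decorated(*args, **kwargs):
                  # Reuse the caller's ID if present; otherwise generate one.
                  req_id = request.headers.get(
                      "X-Request-Id", str(uuid.uuid4()))
                  return f(req_id, *args, **kwargs)
              return decorated

          @app.route("/")
          @request_id
          def list_services(req_id):
              # log w/ ID for wherever you want to trace
              # app logic
              ...
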
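
The decorator on slide 6 only reads or creates the ID. For the ID to be useful across services it also has to travel on outbound calls. A minimal, framework-free sketch of both halves follows; the helper names `get_or_create_request_id` and `outbound_headers` are hypothetical and not from the talk.

```python
import uuid

HEADER = "X-Request-Id"


def get_or_create_request_id(headers):
    """Return the caller-supplied request ID, or mint a new one."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get(HEADER.lower()) or str(uuid.uuid4())


def outbound_headers(request_id, extra=None):
    """Build headers for a downstream call that carry the same ID."""
    headers = dict(extra or {})
    headers[HEADER] = request_id
    return headers


# An incoming request that already carries an ID keeps it.
incoming = {"x-request-id": "abc123"}
req_id = get_or_create_request_id(incoming)
downstream = outbound_headers(req_id, {"Accept": "application/json"})

# A request without the header gets a freshly generated UUID string.
fresh_id = get_or_create_request_id({})
```

Passing `downstream` as the headers of every outbound HTTP call keeps one ID attached to the whole request path.
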
  7. nginx: return the request ID to the client and pass it to the app server

          upstream appserver {
              server 10.0.0.0:80;
          }

          server {
              listen 80;

              # Return to client
              add_header X-Request-ID $request_id;

              location / {
                  proxy_pass http://appserver;
                  # Pass to app server
                  proxy_set_header X-Request-ID $request_id;
              }
          }

  8. nginx: log the request ID

          log_format trace '$remote_addr … $request_id';

          server {
              listen 80;
              add_header X-Request-ID $request_id;

              location / {
                  proxy_pass http://app_server;
                  proxy_set_header X-Request-ID $request_id;
                  # Log $request_id
                  access_log /var/log/nginx/access_trace.log trace;
              }
          }

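
Once `$request_id` is in the access log, lines from different hops can be correlated by grouping on it. The sketch below assumes the ID is the last whitespace-separated field on each line, as in the `log_format` on slide 8; the sample lines and IDs are made up for illustration.

```python
from collections import defaultdict


def group_by_request_id(lines):
    """Map each request ID to the log entries that carry it."""
    grouped = defaultdict(list)
    for line in lines:
        # Split off the final field, which is assumed to be the request ID.
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        rest, request_id = parts
        grouped[request_id].append(rest)
    return dict(grouped)


# Hypothetical log lines; real nginx IDs are 32 hex characters.
sample = [
    '10.0.0.5 "GET / HTTP/1.1" 200 aaaa1111',
    '10.0.0.5 "GET /feed HTTP/1.1" 200 bbbb2222',
    '10.0.0.9 "GET /profile HTTP/1.1" 500 aaaa1111',
]
by_id = group_by_request_id(sample)
```

Here `by_id["aaaa1111"]` holds both entries that belong to the same request, which is the starting point for following it through the system.
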
  9. four things to think about
    • What relationships to track
    • How to track them
    • Which sampling approach to take
  10. four things to think about
    • What relationships to track
    • How to track them
    • Which sampling approach to take
    • Which visualization to employ
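
One possible sampling approach is to decide up front, from the trace ID alone, so that every service reaches the same keep-or-drop decision without coordinating. This is a sketch of that idea only, not a method named in the talk; `should_sample` and the choice of SHA-256 are assumptions.

```python
import hashlib


def should_sample(trace_id, sample_rate):
    """Deterministically keep roughly `sample_rate` (0.0 to 1.0) of traces."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # Hash the ID so the decision is uniform even if IDs are not random.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


# The same trace ID always yields the same decision.
decision_a = should_sample("de4db33f", 0.25)
decision_b = should_sample("de4db33f", 0.25)

# Across many IDs, the kept fraction approaches the configured rate.
kept = sum(should_sample(f"trace-{i}", 0.1) for i in range(10000))
```

Because the decision depends only on the ID, a downstream service can recompute it instead of trusting a flag from the caller.
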
  11. gantt chart (figure), Trace ID: de4db33f
    GET /home, GET /feed, GET /profile, GET /messages, GET /friends
  12. request flow graph (figure)
    Labels: A call, B call, C call, C call, D call, E call, E reply, D reply, B reply, C reply, C reply, A reply
    Latencies: 2200µs, 1500µs, 500µs, 300µs, 400µs, 600µs, 800µs, 500µs, 500µs, 700µs, 500µs, 400µs, 600µs, 100µs
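
A Gantt view like the one on slide 11 can be approximated in plain text from a list of spans. The sketch below reuses the endpoint names from the slide, but the start times and durations are invented for illustration, and `render_gantt` is a hypothetical helper.

```python
def render_gantt(spans, width=40):
    """Render (name, start_us, duration_us) spans as rows of text bars."""
    if not spans:
        return []
    t0 = min(start for _, start, _ in spans)
    t1 = max(start + dur for _, start, dur in spans)
    total = max(t1 - t0, 1)  # avoid dividing by zero for empty traces
    label_width = max(len(name) for name, _, _ in spans)
    rows = []
    for name, start, dur in spans:
        # Scale offsets and lengths to the chart width, clamped to fit.
        offset = min((start - t0) * width // total, width - 1)
        length = max(1, min(dur * width // total, width - offset))
        bar = " " * offset + "#" * length
        rows.append(f"{name.ljust(label_width)} |{bar.ljust(width)}| {dur}us")
    return rows


# Hypothetical timings in microseconds for one trace.
spans = [
    ("GET /home", 0, 1000),
    ("GET /feed", 100, 300),
    ("GET /profile", 150, 200),
    ("GET /messages", 450, 400),
    ("GET /friends", 500, 300),
]
rows = render_gantt(spans)
print("\n".join(rows))
```

Overlapping bars show calls that already run concurrently; bars that line up end to start are candidates for parallelization.
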
  13. keep in mind
    • What do I want to know?
    • How much can I instrument?
  14. keep in mind
    • What do I want to know?
    • How much can I instrument?
    • How much do I want to know?
  15. questions to ask
    • Batch requests?
    • Any parallelization opportunities?
    • Useful to add/fix caching?
    • Frontend resource loading?
  16. questions to ask
    • Batch requests?
    • Any parallelization opportunities?
    • Useful to add/fix caching?
    • Frontend resource loading?
    • Chunked or JIT responses?
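
To illustrate the parallelization question: when a trace shows independent downstream calls running one after another, a thread pool is one way to overlap them. This is a sketch under assumptions; `fetch` is a stand-in that sleeps instead of making a real call, and the resource names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RESOURCES = ["feed", "profile", "messages", "friends"]


def fetch(resource):
    """Stand-in for a blocking downstream call."""
    time.sleep(0.05)
    return f"{resource}: ok"


def fetch_sequential(resources):
    return [fetch(r) for r in resources]


def fetch_parallel(resources):
    # One worker per call; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max(1, len(resources))) as pool:
        return list(pool.map(fetch, resources))


start = time.perf_counter()
sequential = fetch_sequential(RESOURCES)
sequential_s = time.perf_counter() - start

start = time.perf_counter()
parallel = fetch_parallel(RESOURCES)
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, parallel: {parallel_s:.2f}s")
```

The results are identical either way; only the wall-clock time changes, which is exactly what a Gantt view of the trace would show.
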
  17. Zipkin (Twitter)
    • Out-of-band reporting to remote collector
    • Report via HTTP, Kafka, and Scribe
    • Python libs only support propagation via HTTP
  18. Zipkin (Twitter)
    • Out-of-band reporting to remote collector
    • Report via HTTP, Kafka, and Scribe
    • Python libs only support propagation via HTTP
    • Limited web UI
  19. Zipkin span with a custom HTTP transport

          import requests
          # Assumed source of zipkin_span: the py_zipkin library.
          from py_zipkin.zipkin import zipkin_span

          def http_transport(span_data):
              requests.post(
                  "http://zipkinserver:9411/api/v1/spans",
                  data=span_data,
                  headers={"Content-type": "application/x-thrift"})

          @app.route("/")
          def index():
              with zipkin_span(
                      service_name="myawesomeapp",
                      span_name="index",
                      # need to write own transport func
                      transport_handler=http_transport,
                      port=app_port,
                      # 0-100 percent
                      sample_rate=100):
                  # do something
                  ...

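
Propagation over HTTP in Zipkin is done with the B3 headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`). The sketch below shows roughly what a service would attach to an outbound call; `child_headers` is a hypothetical helper, not part of any Zipkin library, and it assumes the incoming header names use exactly this casing.

```python
import secrets


def new_id():
    """Generate a 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def child_headers(incoming):
    """Build B3 headers for an outbound call made while handling `incoming`."""
    headers = {
        # Keep the caller's trace ID, or start a new trace.
        "X-B3-TraceId": incoming.get("X-B3-TraceId") or new_id(),
        # Every outbound call gets its own span ID.
        "X-B3-SpanId": new_id(),
        # Carry the sampling decision along unchanged (default: sampled).
        "X-B3-Sampled": incoming.get("X-B3-Sampled", "1"),
    }
    parent = incoming.get("X-B3-SpanId")
    if parent:
        # The span being handled becomes the parent of the new span.
        headers["X-B3-ParentSpanId"] = parent
    return headers


incoming = {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
}
outbound = child_headers(incoming)
root = child_headers({})
```

The trace ID stays constant across hops while span IDs change, which is what lets a collector rebuild the call tree.
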
  20. Jaeger (Uber)
    • Local daemon to collect & report
    • Storage support for only Cassandra
  21. Jaeger (Uber)
    • Local daemon to collect & report
    • Storage support for only Cassandra
    • Lacking in documentation
  22. Jaeger (Uber)
    • Local daemon to collect & report
    • Storage support for only Cassandra
    • Lacking in documentation
    • Cringe-worthy client library
  23. Jaeger via the OpenTracing API

          import time

          import opentracing as ot
          # Assumed source of Config: the jaeger_client library.
          from jaeger_client import Config

          config = Config(…)
          tracer = config.initialize_tracer()

          @app.route("/")
          def index():
              with ot.tracer.start_span("ASpan") as span:
                  span.log_event("test message", payload={"life": 42})
                  with ot.tracer.start_span("AChildSpan", child_of=span) as cspan:
                      span.log_event("another test message")
              # wat
              time.sleep(2)   # yield to IOLoop to flush the spans
              tracer.close()  # flush any buffered spans

  24. Stackdriver Trace (Google)
    • No Python client libraries; no gRPC client support
    • Forward traces from Zipkin
  25. Stackdriver Trace (Google)
    • No Python client libraries; no gRPC client support
    • Forward traces from Zipkin
    • Storage limitation of 30 days
  26. X-Ray (AWS)
    • No first class Python support; Boto available
    • Configurable sampling, but not for Boto
  27. X-Ray (AWS)
    • No first class Python support; Boto available
    • Configurable sampling, but not for Boto
    • Flow graphs with latency, response %, sample %
  28. tl;dr
    • You need this
    • Docs are lacking
    • Language support lacking
    • One size fits all approaches
  29. tl;dr
    • You need this
    • Docs are lacking
    • Language support lacking
    • One size fits all approaches
    • But there’s an open spec!