Distributed Tracing: Understanding how your components work together - BuildStuffLT 2018

Understanding failures or latencies in monoliths or small systems usually starts with looking at a single component in isolation. Microservices architecture invalidates this assumption because end user requests now traverse dozens of components and a single component simply does not give you enough information: each part is just one side of a bigger story.

In this talk we’ll look at distributed tracing, which summarizes all sides of the story into a shared timeline, and at distributed tracing tools like Zipkin, which highlight the relationships between components, from the very top of the stack to the deepest aspects of the system.

José Carlos Chávez

November 14, 2018

Transcript

  1. About me José Carlos Chávez • Software Engineer at Typeform

    focused on the aggregate of responses services. • Zipkin core team and open source contributor for Observability projects. @jcchavezs / #BuildStuffLT
  2. Distributed systems A collection of independent components that appears to its

    users as a single coherent system. Characteristics: • Concurrency • No global clock • Independent failures
  3. Water heater Gas supplier Cold water storage tank Shutoff valve

    First floor branch Tank valve 爆$❄#☭ Distributed systems
  4. Auth service Images service Videos service DB2 DB3 DB4 Error

    1152 ER_ABORTING_CONNECTION 500 Internal Error 500 Internal Error GET /media/e5k2 API Proxy Distributed systems: Understanding failures DB1 Media API
  5. Water heater Gas supplier Cold water storage tank Shutoff valve

    First floor branch Tank valve 爆$❄#☭ I AM HERE! First floor distributor is clogged! Distributed systems: Understanding failures
  6. API Proxy Auth service Media API Images service Videos service

    DB2 DB3 DB4 500 Internal Error 500 Internal Error GET /media/e5k2 Logs & Concurrency DB1 Error 1152 ER_ABORTING_CONNECTION
  7. Logs & Concurrency

    [24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/13548”
    [24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/23948”
    [24/Oct/2017 13:50:08 +0000] “GET /media HTTP/1.1” 200 … **0/12396”
    [24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/23748”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/23248”
    [24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 200 … **0/26548”
    [24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/13148”
    [24/Oct/2017 13:50:07 +0000] “GET /media HTTP/1.1” 200 … **0/2588”
    [24/Oct/2017 13:50:07 +0000] “GET /auth HTTP/1.1” 500 … **0/3248”
    [24/Oct/2017 13:50:07 +0000] “POST /media HTTP/1.1” 200 … **0/23548”
    [24/Oct/2017 13:50:07 +0000] “GET /images HTTP/1.1” 200 … **0/22598”
    [24/Oct/2017 13:50:07 +0000] “GET /videos HTTP/1.1” 200 … **0/13948”
    ... ? ?
  8. Water heater Gas supplier Cold water storage tank Shutoff valve

    First floor branch Tank valve 爆$❄#☭ I AM HERE! First floor distributor is clogged! Distributed systems: Understanding failures
  9. API Proxy Media API Auth Videos Images Time error Distributed

    tracing [1508410442] no cache for resource, retrieving from DBc TraceID d52d38b69b0fb15efa I AM HERE! Aborted connection
  10. Distributed Tracing: What answers do I get? • What services did

    a request pass through? • What occurred in each service for a given request? • Where did the error happen? • Where are the bottlenecks? • What is the critical path for a request? • Who should I page?
  11. Benefits of Distributed Tracing • (almost) Immediate feedback • System

    insight, clarifies non-trivial interactions • Visibility into critical paths and dependencies • Understand latencies • Request scoped, not request’s lifecycle scoped. op 1 op 2
  12. Trace’s Anatomy • A trace shows an execution path through

    a distributed system • A span in the trace represents a logical unit of work (with a start and end) • A context includes information that should be propagated across services • Tags and logs (optional) add complementary information to spans. /things auth.Auth Time GET /videos mysql.Get TRACE
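The anatomy on slide 12 can be sketched as a small data model. This is a minimal illustration, not Zipkin's actual model: the `Span` class, its field names, and the `new_trace` helper are hypothetical, and the span names are borrowed from the slide's diagram.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """One logical unit of work with a start and an end (hypothetical model)."""
    trace_id: str                      # shared by every span in the same trace
    span_id: str                       # unique per span
    name: str
    parent_id: Optional[str] = None    # None marks the root span
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    tags: Dict[str, str] = field(default_factory=dict)             # optional key/value details
    logs: List[Tuple[float, str]] = field(default_factory=list)    # optional timestamped events

    def child(self, name: str) -> "Span":
        # A child keeps the trace ID and points back at its parent.
        return Span(self.trace_id, secrets.token_hex(8), name, parent_id=self.span_id)

    def finish(self) -> None:
        self.end = time.time()


def new_trace(name: str) -> Span:
    # The root span has no parent and starts a new trace.
    return Span(secrets.token_hex(8), secrets.token_hex(8), name)


root = new_trace("/things")
auth = root.child("auth.Auth")
videos = root.child("GET /videos")
query = videos.child("mysql.Get")
query.tags["error"] = "Aborted connection"
for span in (query, videos, auth, root):
    span.finish()
```

The trace is simply the set of spans sharing one `trace_id`; the `parent_id` links are what let a tracing UI draw the execution path as a tree on a shared timeline.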
  13. Elements of distributed tracing Credits: Nic Munroe Leg 1: inbound

    propagation Leg 2: outbound propagation Leg 3: in-process propagation Distributed Tracing
  14. Leg 1: Inbound propagation When your service processes a request

    or consumes a message. API Proxy Media API GET /media TraceID: fAf3oXL6DS SpanID: dZ0xHIBa1A ...
  15. Leg 2: Outbound propagation When your service makes an outbound

    call to another service Media API Video service GET /videos TraceID: fAf3oXL6DS ParentID: dZ0xHIBa1A SpanID: y74fr5udj http/get
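Legs 1 and 2 can be illustrated with the B3 HTTP headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`) that Zipkin instrumentation uses. The sketch below is an assumption-laden illustration rather than any library's API: `extract` and `inject` are hypothetical helper names, and the IDs are copied from the slides as opaque strings even though real B3 IDs are lower-hex.

```python
import secrets
from typing import Dict, Optional

Context = Dict[str, Optional[str]]


def extract(headers: Dict[str, str]) -> Context:
    """Leg 1: read the trace context from an inbound request, or start a new trace."""
    # HTTP header names are case-insensitive, so normalize before looking them up.
    lowered = {key.lower(): value for key, value in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if trace_id is None or span_id is None:
        # No upstream context: this service is the first hop of a new trace.
        return {"trace_id": secrets.token_hex(8), "span_id": secrets.token_hex(8), "parent_id": None}
    return {"trace_id": trace_id, "span_id": span_id, "parent_id": lowered.get("x-b3-parentspanid")}


def inject(context: Context, headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Leg 2: create a child context and write it into the outbound headers."""
    outbound = dict(headers or {})
    outbound["X-B3-TraceId"] = context["trace_id"]       # same trace
    outbound["X-B3-ParentSpanId"] = context["span_id"]   # the current span becomes the parent
    outbound["X-B3-SpanId"] = secrets.token_hex(8)       # fresh ID for the outbound call
    return outbound


# Media API receives GET /media from the API Proxy, then calls GET /videos.
inbound = {"X-B3-TraceId": "fAf3oXL6DS", "X-B3-SpanId": "dZ0xHIBa1A"}
context = extract(inbound)
outbound = inject(context, {"Accept": "application/json"})
```

As on slide 15, the outbound call keeps the trace ID, carries the inbound span ID as its parent ID, and gets a span ID of its own.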
  16. mysql.Query redis.Get Leg 3: In-process propagation When performing an

    operation inside the service Media API Cache service Images service GET /images
  17. API Proxy Media API Auth Videos Images Time error Distributed

    tracing [1508410442] no cache for resource, retrieving from DBc TraceID d52d38b69b0fb15efa I AM HERE! Aborted connection
  18. Any overhead? For users: • Observability tools are meant to

    be unintrusive • Sampling reduces overhead • (Don’t) trace every single operation For developers: • Not all libraries are ready to be instrumented • Instrumentation can be delegated to common frameworks
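How sampling reduces overhead can be sketched as follows: the first service decides whether to record a trace, and downstream services honor that decision instead of deciding again. This is a minimal sketch under assumptions: `ProbabilitySampler` and `should_trace` are hypothetical names, and the decision is carried in the B3 `X-B3-Sampled` header.

```python
import random
from typing import Dict, Optional


class ProbabilitySampler:
    """Records roughly `rate` of all new traces (hypothetical helper)."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._rng = rng or random.Random()

    def is_sampled(self) -> bool:
        # random() returns a float in [0.0, 1.0), so rate 0.0 never samples
        # and rate 1.0 always does.
        return self._rng.random() < self.rate


def should_trace(headers: Dict[str, str], sampler: ProbabilitySampler) -> bool:
    # An upstream decision wins, so a trace is either recorded end to end or not at all.
    upstream = headers.get("X-B3-Sampled")
    if upstream is not None:
        return upstream == "1"
    return sampler.is_sampled()


sampler = ProbabilitySampler(rate=0.1, rng=random.Random(42))
decisions = [should_trace({}, sampler) for _ in range(10_000)]
kept = sum(decisions)  # roughly 10% of the requests
```

Keeping the decision consistent across services avoids partial traces, which are hard to read and still cost storage.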
  19. Apache Zipkin Based on B3 and inspired by Google Dapper

    (2010). It was open sourced by Twitter (2012) and joined the Apache Incubator in September 2018. • Mature tracing model that emerged from users’ needs. • Used by large companies like Netflix, SoundCloud and Yelp but also small ones. • Strong community: ◦ @zipkinproject ◦ gitter.im/openzipkin
  20. Service (instrumented) Transport Collect spans Collector API UI Storage DB

    Visualize Retrieve data Store spans http/kafka/grpc Receive spans Deserialize and schedule for storage Cassandra/MySQL/ElasticSearch Zipkin: architecture
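The first hop of this architecture, an instrumented service sending spans to the collector over HTTP, might look like the sketch below. It builds a span in Zipkin's v2 JSON format and posts it to the `/api/v2/spans` endpoint. The helper names, the service name, the IDs, and the `localhost:9411` address are assumptions for illustration; `report` is defined but not called, so the example runs without a Zipkin server.

```python
import json
import urllib.request
from typing import Dict, List, Optional


def to_zipkin_v2(trace_id: str, span_id: str, name: str, service: str,
                 start_s: float, duration_s: float,
                 parent_id: Optional[str] = None,
                 tags: Optional[Dict[str, str]] = None) -> dict:
    """Serialize one span into the Zipkin v2 JSON shape."""
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": name,
        "timestamp": int(start_s * 1_000_000),            # epoch microseconds
        "duration": max(1, int(duration_s * 1_000_000)),  # microseconds
        "localEndpoint": {"serviceName": service},
        "tags": dict(tags or {}),
    }
    if parent_id is not None:
        span["parentId"] = parent_id
    return span


def report(spans: List[dict], url: str = "http://localhost:9411/api/v2/spans") -> int:
    """POST a batch of spans to the collector (assumed local Zipkin address)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(spans).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=2) as response:
        return response.status


# Hypothetical IDs and service name, for illustration only.
payload = [to_zipkin_v2("d52d38b69b0fb15e", "fa1b2c3d4e5f6071", "get /media",
                        "media-api", 1508410442.0, 0.25,
                        tags={"error": "Aborted connection"})]
```

Real instrumentation libraries usually batch spans and send them asynchronously so that reporting stays off the request's critical path.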