
PWL-SF#1 => Ryan Kennedy on Dapper, a Distributed Systems Tracing Infrastructure

Ryan Kennedy and Anjali Shenoy from Yammer Engineering kicked off our group by presenting the Dapper, a Large-Scale Distributed Systems Tracing Infrastructure paper by Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan and Chandan Shanbhag.

Papers_We_Love

March 26, 2014

Transcript

  1. About Us
     • Ryan Kennedy @rckenned - runs infrastructure for @yammer.
     • Anjali Shenoy @anjshenoy - infrastructure engineer @yammer.
  2. “We built Dapper to provide Google’s developers with more information about the behavior of complex distributed systems.”
  3. Benefits of a distributed system…
     • A collection of software services
     • Developed by different teams
     • Across different platforms
     • Using different programming languages
  4. Downsides of a distributed system…
     • A collection of software services
     • Developed by different teams
     • Across different platforms
     • Using different programming languages
  5. Engineering Context
     • Multiple services
     • Each service manned by a separate team
     • Continuous deployment of services
  6. On-call troubles
     • Investigate overall health of the system
     • Guess which service is at fault
     • Guess why that service is at fault*
  7. Dapper’s Problem Space
     • End user - the on-call engineer
     • Bird’s-eye view into overall system health
     • Ability to drill down into a service and see why it’s holding up the train
     • Long-term pattern recognition
  8. Low overhead. But… how?
     • Sampling - 0.01% of requests for high-throughput systems (see the sketch after this slide).
     • Adaptive sampling was being deployed at publish time.
     • Out-of-band trace data collection
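To make the 0.01% figure concrete, here is a minimal Python sketch of uniform sampling at that rate. The should_sample helper and the propagation comment are illustrative assumptions, not Dapper's actual implementation.

    import random

    # Uniform sampling at the 0.01% rate quoted above (1 in 10,000 requests).
    # Dapper makes this decision once at the root of a trace and propagates the
    # sampling bit, so every span in a trace is either kept or dropped together.
    SAMPLE_RATE = 1.0 / 10_000

    def should_sample() -> bool:
        return random.random() < SAMPLE_RATE

    if __name__ == "__main__":
        requests = 1_000_000
        sampled = sum(should_sample() for _ in range(requests))
        print(f"sampled {sampled} of {requests} requests")  # roughly 100 expected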
  9. What is a trace?
     • A Dapper trace is a tree of spans - 1 for every RPC
     • Each span has its own set of annotations, including the parent span
     • Annotation - application-specific data you want to send along with the trace
  10. What about annotations?
      • An annotation is some application-specific information you pass along with your span.
      • A span can have 0-many annotations.
      • Each annotation has a timestamp and either a textual value or key-value pairs.
  11. {
        "trace_id": "7021185255097625687",
        "spans": [ {
          "span_id": "2186499883",
          "parent_span_id": "",
          "name": "groups.create",
          "start_time": 1395364621946662144,
          "duration": 1359471104,
          "annotations": [ {
            "path_to_sql": {
              "sql": "INSERT into messages (…)",
              "start_node": "app/controllers/a_controller.rb:create",
              "path": "app/models/b.rb:create_message, app/models/c.rb:create_group_message"
            },
            "logged_at": 1395364623306275840
          } ]
        } ]
      }
  12. Ok. Now what?
      • How to effectively coalesce data in downstream systems? (see the sketch after this slide)
      • Data for immediate perusal
      • Data for long-term pattern recognition
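A toy Python sketch of the coalescing step: span records logged independently by each service are grouped into whole traces by trace_id. The in-memory grouping and the extra span names are assumptions for illustration; in the paper this happens out of band, with collection daemons writing into a central Bigtable repository.

    from collections import defaultdict

    def coalesce(span_records):
        """Group individually logged span records into whole traces by trace_id."""
        traces = defaultdict(list)
        for record in span_records:
            traces[record["trace_id"]].append(record)
        return traces

    # Hypothetical records; only the first trace_id/span_id pair comes from the JSON above.
    records = [
        {"trace_id": "7021185255097625687", "span_id": "2186499883", "name": "groups.create"},
        {"trace_id": "7021185255097625687", "span_id": "2186499884", "name": "messages.insert"},
        {"trace_id": "1234567890123456789", "span_id": "42", "name": "feed.render"},
    ]
    for trace_id, spans in coalesce(records).items():
        print(trace_id, [s["name"] for s in spans])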
  13. Moar sampling…
      • Google’s prod clusters generate >1TB of data/day.
      • Dapper end users want to query trace data ~2 weeks old.
  14. Dapper API
      • By Trace Id: load on demand
      • Bulk Access: leverage MapReduce jobs to provide access to billions of traces in parallel.
      • Indexed access: composite index => lookup by service name, host name, timestamp (a toy sketch follows this slide).
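A toy in-memory sketch of the three access patterns listed above. The TraceRepo class and its method names are hypothetical, not the real Dapper API; the composite index keyed by (service name, host name, timestamp) follows the slide, and plain iteration stands in for the MapReduce bulk path.

    class TraceRepo:
        def __init__(self):
            self.by_trace_id = {}   # trace_id -> trace
            self.index = {}         # (service, host) -> [(timestamp, trace_id), ...]

        def put(self, trace_id, service, host, timestamp, trace):
            self.by_trace_id[trace_id] = trace
            self.index.setdefault((service, host), []).append((timestamp, trace_id))

        def get(self, trace_id):
            """By trace id: load a single trace on demand."""
            return self.by_trace_id.get(trace_id)

        def lookup(self, service, host, start, end):
            """Indexed access: trace ids for a service/host within a time window."""
            return [tid for ts, tid in self.index.get((service, host), [])
                    if start <= ts <= end]

        def scan(self):
            """Bulk access: iterate all traces (MapReduce plays this role at scale)."""
            yield from self.by_trace_id.items()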
  15. Dapper API Usage
      • Online web applications
      • Command line - on demand
      • One-off analytical tools
  16. Blind spots
      • Coalescing effects
      • Finding a root cause within a service
      • Tying kernel events to a trace
  17. In conclusion
      • Best use case for dev/ops teams.
      • Practical - negligible performance impact
      • Keep the trace repo API open
  18. Thank you, authors
      • Benjamin H. Sigelman
      • Luiz André Barroso
      • Mike Burrows
      • Pat Stephenson
      • Manoj Plakal
      • Donald Beaver
      • Saul Jaspan
      • Chandan Shanbhag