Observability of Distributed Systems

Operating distributed systems is hard, not only because of the inherent complexity that comes from the number of components and their distribution, but also because of the unpredictability of their failure modes: they are full of unknown unknowns. We are left with an imperative to build systems that can be debugged, armed with evidence instead of conjecture.

Observability is the practice of understanding the internal state of a system via knowledge of its external outputs. In this talk, we will discuss observability practices, benefits, and opportunities. We’ll also explore observability as a part of the development process.

José Carlos Chávez

November 07, 2019

Transcript

  1. About me

    - Software Engineer at Expedia Group
    - Zipkin core team member and open source contributor for observability projects

    Expedia Group Proprietary and Confidential. @jcchavezs - #oredev2019
  2. Distributed systems

    A collection of independent components that appears to its users as a single coherent system. Image source: https://link.medium.com/jey42ga7p1
  3. Complexity (noun)

    1. the state of having many parts and being difficult to understand or find an answer to. Source: Cambridge Dictionary
  4. The three body problem (1687)

    Given the initial positions and velocities of three masses, find their subsequent paths of motion according to the laws of motion and universal gravitation.

    TL;DR:
    - Known initial conditions
    - Unpredictable state of the system at a given time
  5. Distributed systems are complex

    System complexity can be described as a measure of how understandable a system is and how difficult it is to understand an operation in the system.

    Sources of complexity in systems:
    - Task-Structure Complexity
    - Unpredictability
    - Size Complexity
    - Chaotic Complexity
    - Algorithmic Complexity
  6. Why is it hard to operate a Distributed System?

    - Systems change all the time
    - Things fail in unexpected ways
    - Unknown unknowns
    - Most problems are the convergence of many different things failing at once
    - Everyone in the team is supposed to respond with the same level of confidence and tools, regardless of experience or expertise, and the more components there are, the less each individual knows about them
  7. Distributed systems are never "up"; they exist in a constant state of partially degraded service.

    Source: https://opensource.com/article/17/7/state-systems-administration
  8. What is Observability?

    [...] is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals... one can determine the behavior of the entire system from the system’s outputs. If a system is not observable, this means that the current values of some of its state variables cannot be determined through output sensors. This implies that their value is unknown to the controller (although they can be estimated by various means). Source: Wikipedia
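
The control-theory definition quoted above can be made concrete for one common special case. The following is a sketch that assumes a linear time-invariant system, which the slide does not state; the symbols A, B, C, D, x, u, y and n are introduced here only for illustration. Under that assumption, observability reduces to a rank condition on the matrices A and C, and the duality with controllability amounts to transposing them.

```latex
% Assumed model (not on the slide): linear time-invariant system with n state variables
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t), \qquad x(t) \in \mathbb{R}^{n}

% Observability matrix and rank condition
\mathcal{O} =
\begin{bmatrix}
C \\ CA \\ \vdots \\ CA^{n-1}
\end{bmatrix},
\qquad
\operatorname{rank}(\mathcal{O}) = n \iff (A, C) \text{ is observable}

% Duality with controllability
(A, C) \text{ is observable} \iff (A^{\top}, C^{\top}) \text{ is controllable}
```

When the rank is lower than n, some combination of state variables never shows up in the outputs, which is the situation the quote describes as state that "cannot be determined through output sensors".
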
  9. What is Observability?

    Observability is the property of a system that allows us to understand its internal states from its input and output signals, in such a way that actions can be distilled from that understanding.

    That means:
    - Observability is not tooling
    - It is fundamentally tied to control
    - Signals are not data but measurements connected to something we need to know
  10. What is Observability?

    Source: https://twitter.com/popsysdig/status/1139505998299877377
  11. Three pillars of observability

    Image source: https://twitter.com/autoletics/status/1163345131128401920
  12. Why should we invest in observability?

    - Gives real-time feedback from signals
    - Helps to understand unknown unknowns
    - Eases debugging by providing context and scope for signals
    - Improves the resilience of systems by giving visibility into baseline failure modes during the development cycle
  13. Observability as part of the software lifecycle

    - During development, make sure your system can emit meaningful signals.
    - When testing, make sure actionable failure modes can be surfaced.
    - At deploy time, use observability signals to understand the impact of the changes being released.

    Image source: https://link.medium.com/zvm1AfYvy0
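
As a rough illustration of what a "meaningful signal" could look like in code, the sketch below builds one measurement together with the context needed to act on it and emits it as a JSON line. Everything named here is hypothetical and not taken from the talk: the `make_signal` and `emit` helpers, the field names, the `checkout.latency` metric and the trace identifier. A real system would hand the signal to its logging, metrics or tracing library rather than print it.

```python
import json
import time


def make_signal(name, value, unit, **context):
    """Build one measurement plus the context needed to interpret it."""
    return {
        "name": name,
        "value": value,
        "unit": unit,
        "timestamp": time.time(),
        "context": context,  # e.g. service, version, trace id
    }


def emit(signal):
    """Serialize the signal as a single JSON line and write it to stdout."""
    line = json.dumps(signal, sort_keys=True)
    print(line)
    return line


# Hypothetical example values, chosen only for illustration.
signal = make_signal(
    "checkout.latency", 182.4, "ms",
    service="checkout", version="1.4.2", trace_id="hypothetical-trace-id",
)
line = emit(signal)
```

Attaching the service, version and trace identifier is what turns a bare number into something that can be scoped to a release or a single request during triage.
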
  14. Observability as part of the software lifecycle

    - When operating a system, use signals to:
      - understand health
      - detect anomalies
      - triage problems
      - evolve the system
    - When in support, you can re-scope the issues based on the signal context

    Image source: https://link.medium.com/zvm1AfYvy0
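
One simple way to "detect anomalies" from a signal is to compare a new measurement against a baseline window. The sketch below is only one possible approach, not the method used in the talk: it flags a value that lies more than a chosen number of standard deviations from the baseline mean. The function name `is_anomalous`, the threshold and the sample latencies are assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, threshold=3.0):
    """Return True if `value` deviates from the baseline mean by more than
    `threshold` sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical latency samples in milliseconds.
baseline_ms = [101, 99, 103, 98, 102, 100, 97, 104, 100]
spike_flagged = is_anomalous(baseline_ms, 950)
normal_flagged = is_anomalous(baseline_ms, 102)
```

Keeping the new value out of the baseline matters: with only a handful of samples, a single large outlier inflates the standard deviation enough to hide itself.
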
  15. Building an observability culture: Ownership

    Landing observability in an engineering department needs champions who:
    - Raise awareness about the problems that can be solved by introducing observability
    - Understand teams’ pains when it comes to operating and triaging the system, and decide on the right tools for those pains
    - Set practices, evolve them and help to replicate them among teams
  16. Building an observability culture: Tooling

    Observability is not tooling, but tooling is key to achieving good observability. What is needed:
    - Suitable observability platforms and instrumentation in place
    - Tools and dashboards that connect the dots among stakeholders
    - Automated checks that make sure signal outputs make sense after a deploy
    - The right processes to make sure Personally Identifiable Information (PII) is safe
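
An automated post-deploy check of the kind listed above could be as small as comparing the error rate before and after a release. The sketch below assumes request and error counts are already available from some metrics source; the function names, the dictionary shape and the `max_increase` tolerance are hypothetical choices, not part of the talk.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def deploy_looks_healthy(before, after, max_increase=0.01):
    """Compare error rates around a deploy.

    `before` and `after` are dicts with 'errors' and 'requests' counts.
    The deploy is considered healthy only if the service still emits
    traffic signals and the error rate rose by at most `max_increase`.
    """
    if after["requests"] == 0:
        # No signal at all after a deploy is itself a warning sign.
        return False
    return error_rate(**after) - error_rate(**before) <= max_increase


# Hypothetical counts collected in equal windows around the deploy.
before = {"errors": 12, "requests": 10_000}
good_after = {"errors": 15, "requests": 9_800}
bad_after = {"errors": 600, "requests": 10_000}
silent_after = {"errors": 0, "requests": 0}
```

Treating a silent service as unhealthy reflects the idea that the check should verify the signals still make sense, not merely that no errors were reported.
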
  17. Building an observability culture: Business value

    Observability can also be beneficial for other stakeholders of the system:
    - Helping to achieve SLOs by improving the triage experience.
    - Giving support teams and engineers a common context to understand and fix problems in production.
    - Improving support teams’ awareness by foreseeing trends in failures.
  18. Summary

    - Systems are complex and will remain so; observability helps us to better understand failure modes.
    - Observability is not a goal in itself; it is only important if we close the cycle with the actions we take from the observations.
    - Observability will benefit not only developers and operators but all stakeholders of the system.
    - Like everything else in the software industry, building the culture is more important than the code, infrastructure and tooling.
  19. See also

    - Does software understand complexity? - Michael Feathers
    - What is the Complexity of a Distributed System? - Anand Ranganathan, Roy H. Campbell
    - Observability: The significant parts - William Louth
    - Observations on observability - Colin Breck
    - Observability 3 ways: Logging, Metrics & Tracing - Adrian Cole