

SRECon 2024 Keynote: Is It Already Time To Version Observability? (Signs Point To Yes)

Recording: https://www.usenix.org/conference/srecon24americas/presentation/majors-plenary

Pillars, cardinality, metrics, dashboards ... the definition of observability has been debated to death, and I'm done with it. Let's just say that observability is a property of complex systems, just like reliability or performance. This definition feels both useful and true, and I am 100% behind it.

However, there has recently been a generational sea change in data types, usability, workflows, and cost models, along with what users report is a massive, discontinuous leap in value. In the parlance of semantic versioning, it is a breaking, backwards-incompatible change. Which means it’s time to bump the major version number. Observability 1.0, meet Observability 2.0.

In this presentation, we will outline the technical and sociotechnical characteristics of each generation of tooling and describe concrete steps you can take to advance or improve. These changes are being driven by the relentless increase in complexity of our systems, and none of us can afford to ignore them.

Charity Majors

May 27, 2024

Transcript

  1. What does “observability” mean?

    “In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” — Wikipedia
    “Observability has three pillars: metrics, logs and traces.” — Peter Bourgon
    “Monitoring is about known-unknowns, observability is about unknown-unknowns.” — me
    “Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.” — Hazel Weakly
    “Observability demands high cardinality, high dimensionality, and explorability.” — me
    “Monitoring is the monitor telling me the baby is crying but observability is telling me why.” — Austin Parker
  2. A chronological history of observability in software

    2016–2018: “What would the control theory definition mean, applied to software?” 🤔
    2019–2024: the “well, actually…” years: “Observability has three pillars: metrics, logs and traces”; observability becomes a generic synonym for telemetry; “ugh, who cares” — everybody; the laundry list; Gartner adds a category for “Observability”
  3. Observability 1.0 ➡ 2.0: from “three pillars” (metrics, logs, traces) to a single source of truth: wide structured logs. (A breaking, backwards-incompatible change.)
  4. Observability 1.0

    Data: metrics, logs and traces, captured separately
    Source of truth: many — APM, RUM, logging, tracing, metrics, analytics…
    Interface: static dashboards
    Debugging: based on intuition, scar tissue from past outages, and guesswork
    Alerts: page on symptoms appearing in metrics
    Cost: pay to store your data many times
  5. Observability 2.0

    Data: wide, rich structured logs (aka events or spans), with high cardinality and high dimensionality
    Source of truth: one
    Interface: exploratory, interactive; no dead ends
    Debugging: follow the trail of breadcrumbs. It’s in the data.
    Alerts: page on customer pain via SLOs
    Cost: pay to store your data once
  6. You have observability if you have…

    1. Arbitrarily-wide structured raw events
    2. Context persisted through the execution path
    3. Without indexes or schemas
    4. High cardinality, high dimensionality
    5. Ordered dimensions for traceability
    6. Client-side dynamic sampling
    7. An exploratory visual interface that lets you slice and dice and combine dimensions
    8. In close to real time

    If you have three pillars, and many tools: Observability 1.0. If you have a single source of truth: Observability 2.0.
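The first two items on that checklist can be sketched in a few lines. This is a minimal illustration, not a real SDK; every field name below is a made-up example, not something from the talk.

```python
import json
import time
import uuid

# One arbitrarily-wide structured raw event per unit of work.
# Trace/span IDs carry context through the execution path;
# high-cardinality fields (user IDs, build IDs) are welcome.
event = {
    "timestamp": time.time(),
    "trace.trace_id": uuid.uuid4().hex,
    "trace.span_id": uuid.uuid4().hex,
    "request.endpoint": "/checkout",
    "request.user_id": "user_81729",
    "app.build_id": "a41f9c2",
    "db.query_count": 14,
    "duration_ms": 187.4,
}

# Emitted as a single structured log line:
print(json.dumps(event))
```

Because the event is just a flat map, widening it later (adding a feature flag, a cache-hit field, a customer tier) costs nothing and requires no schema change.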
  7. How the data gets stored

    OBSERVABILITY 1.0: • Metrics • Logs • Traces • APM • RUM • … • Tracing is an entirely different tool • Siloed tools, with no connective tissue or only a few, predefined connective bits • Write-time aggregation
    OBSERVABILITY 2.0: • Arbitrarily-wide structured data blobs • Single source of truth • Tracing is just visualizing over time • It’s just data. Treat your data like data. • Read-time aggregation; raw events
  8. Who uses it, and how?

    OBSERVABILITY 1.0: • About MTTR, MTTD, and reliability • Usually a checklist item before shipping code to production — “how will we monitor this?” • An “ops concern” • No support for structured data • Static dashboards
    OBSERVABILITY 2.0: • Underpins the entire software development lifecycle • Part of the development process • High cardinality • High dimensionality • Exploratory, open-ended interface
  9. Observability 1.0 is about how you ✨operate✨ software. It is traditionally focused more on bugs, errors, MTTR, MTTD, reliability, monitoring, and performance. Observability 2.0 is about how you ✨develop✨ software. It is what underpins the entire software development lifecycle, allowing you to hook up tight feedback loops and move swiftly, with confidence.
  10. How you interact with production

    OBSERVABILITY 1.0: • You deploy your code and wait to get paged. 🤞 • Your job is done when you commit your code and tests pass • Your world is broken up into two very different universes, Dev & Prod
    OBSERVABILITY 2.0: • You practice Observability-Driven Development • Your job isn’t done until you’ve verified it works in production • These worlds are porous and overlapping • You are in constant conversation with your code. 💜
  11. How you debug

    OBSERVABILITY 1.0: • You flip from dashboard to dashboard, pattern-matching with your eyeballs • You lean heavily on intuition, past experience, and a rich mental model of the system • The best debuggers are always the engineers who have been there the longest and seen the most • Search-first
    OBSERVABILITY 2.0: • You form a hypothesis, ask a question, consider the results, and ask another based on the answer • You don’t have to guess. You follow the trail of breadcrumbs to the answers, every time • The best debuggers are the people who are the most curious • Analysis-first
  12. The cost model

    OBSERVABILITY 1.0: • You pay to store your data again and again and again and again, multiplied by the number of tools • Cost goes up (at best) linearly, driven by the number of custom metrics you define • Cost for individual metrics can spike massively and unpredictably • Keeping costs under control requires ongoing investment from engineering
    OBSERVABILITY 2.0: • You pay to store your data ✨once✨ • You can store infinite “custom metrics”, appended to your events • Powerful, surgical options for controlling costs via head-based or tail-based dynamic sampling
  13. Why does observability 1.0 cost so much?

    Because you have to pay for so many different tools / pillars, your costs rise as a multiple of your traffic (5x? 7x?). Because so many of those tools are built on metrics. Because of the high overhead of ongoing engineering labor to manage costs and billing data. Because of the dark matter of lost engineering cycles.

    https://www.honeycomb.io/blog/cost-crisis-observability-tooling
  14. Envelope math: cost of a custom metric (request.Latency)

    • 5 hosts, 4 endpoints, 2 status codes, as a count metric: 40 custom metrics
    • 1000 hosts, 5 methods, 20 handlers, 63 status codes, as a count metric: 6.3M custom metrics
    • The same, as a histogram using defaults (max, median, avg, 95pct, count): 31.5M custom metrics
    • The same, as a histogram using defaults plus distribution (99pct, 99.5pct, 99.9pct, 99.99pct): 63M custom metrics

    A DataDog account comes with 100–200 free custom metrics, and costs 10 cents for every 100 over. 63M custom metrics costs you $63,000/month for request.Latency.

    https://www.honeycomb.io/blog/cost-crisis-observability-tooling
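The envelope math above is just multiplication of tag cardinalities; a quick sketch makes it reproducible. The pricing figure is the slide's own example, not a vendor quote, and the final doubling follows the slide's rounding to 63M.

```python
# Reproducing the slide's envelope math for the request.Latency metric.
# Every distinct tag combination counts as one custom metric.
hosts, methods, handlers, status_codes = 1000, 5, 20, 63

as_count = hosts * methods * handlers * status_codes  # 6,300,000 custom metrics
as_histogram = as_count * 5                           # defaults: max, median, avg, 95pct, count
with_distribution = as_histogram * 2                  # slide's figure with the extra percentiles: 63,000,000

# Slide's example pricing: 10 cents per 100 custom metrics over the free tier.
monthly_cost_usd = with_distribution / 100 * 0.10
print(as_count, as_histogram, with_distribution, monthly_cost_usd)
```

The point of the exercise: cardinality that would be unremarkable as a field on a wide event (1000 hosts, 63 status codes) explodes combinatorially when encoded as metric tags.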
  15. The cost model

    OBSERVABILITY 1.0: • Ballooning costs are baked in to the 1.0 model. ☹ • As your bill goes up, the value you get out of your tools actually goes down. • Metrics and unstructured logs both suffer from opaque, bursty billing and degrade in punishing ways
    OBSERVABILITY 2.0: • Your costs go up as your traffic goes up and as you add more spans for finer-grained inspection • As your bill goes up, the value you get out of your tools goes up too. • Costs effectively nothing to widen structured data & add more context
  16. There are only three types of data:

    1. The metric
    2. Unstructured logs (strings)
    3. Structured logs

    RUM tools are built on top of metrics to understand browser user sessions. APM tools are built on top of metrics to understand application performance.
  17. Metrics: tiny, fast, and cheap. Each metric is a single number, with some tags appended. Stored in TSDBs. NO context. NO high cardinality. NO data structures. NO ability to correlate or dig deeper. Only basic static dashboards.
  18. Unstructured logs. To understand our systems, we turn to logs. Even unstructured logs are more powerful than metrics, because they preserve SOME context and connective dimensions. However, you have to know what you’re looking for in order to find it. And the only thing you can do is string search, which is slloooooowwwww.
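The gap between the second and third data types is easy to show concretely. Both examples below are invented for illustration; they carry the same facts, but only one of them can be queried like data.

```python
# The same request, recorded two ways.
unstructured = "2024-05-27T12:01:03Z ERROR checkout failed for user_81729 after 187ms"

structured = {
    "level": "error",
    "endpoint": "/checkout",
    "user_id": "user_81729",
    "duration_ms": 187,
}

# Unstructured: you must already know the substring you're looking for.
found = "user_81729" in unstructured

# Structured: fields can be filtered, grouped, and aggregated directly.
is_slow_error = structured["level"] == "error" and structured["duration_ms"] > 100
print(found, is_slow_error)
```

With the string, a question like "show me all errors slower than 100ms, broken down by endpoint" means regexes and luck; with the structured event, it is a trivial filter and group-by.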
  19. We have learned to be insanely clever when it comes to wringing every last bit of utility out of metrics and unstructured logs. What if it was all just … data? What if we didn’t have to work that hard?
  20. Metrics are a bridge to the past. Structured logs are the bridge to the future. Metrics aren’t completely useless; they still have their place! (In infrastructure 😛.) ❤
  21. What you can do ✨NOW✨ to start moving towards o11y 2.0:

    1. Instrument your code using the principles of canonical logs. It is difficult to overstate the value of doing this. Make them wide.
    2. Add trace IDs and span IDs, so you can trace your code using the same events instead of having to hop between tools.
    3. Feed your data into a columnar store, to move away from predefined schemas or indexes.
    4. Use a storage engine that supports high cardinality.
    5. Adopt tools with explorable interfaces, or at least dynamic dashboards.
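Steps 1 and 2 of that list can be sketched without any vendor tooling. This is a toy illustration under assumed names (the `CanonicalLog` class and its fields are hypothetical, not a real library API): accumulate context across the request, then emit exactly one wide event carrying trace and span IDs.

```python
import json
import time
import uuid

class CanonicalLog:
    """One wide canonical log line per unit of work (illustrative sketch)."""

    def __init__(self, trace_id=None):
        self.start = time.monotonic()
        self.fields = {
            # Trace/span IDs let the same events double as traces (step 2).
            "trace.trace_id": trace_id or uuid.uuid4().hex,
            "trace.span_id": uuid.uuid4().hex,
        }

    def add(self, **kv):
        # Widen the event from anywhere in the request path (step 1).
        self.fields.update(kv)

    def emit(self):
        # Called once, when the unit of work completes.
        self.fields["duration_ms"] = round((time.monotonic() - self.start) * 1000, 2)
        print(json.dumps(self.fields))
        return self.fields

# Usage: handlers keep appending context, then emit exactly one wide event.
log = CanonicalLog()
log.add(endpoint="/checkout", user_id="user_81729", cart_items=3)
log.add(db_query_count=14, cache_hit=False)
event = log.emit()
```

In a real service the `CanonicalLog` instance would live in request-scoped context (middleware, contextvars) and the emitted lines would feed the columnar store from step 3.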
  22. Observability 2.0 is much faster, cheaper, and simpler to use. The way you are doing it NOW is the hard way.
  23. Complexity is exploding, but our tools were designed for predictable worlds. We used to be able to reason about our architecture. Not anymore. Now we HAVE to instrument for observability — get it out of our heads and into our tools — or we are screwed.
  24. Observability for software engineers: Can you understand what is happening inside your systems, just by interrogating them from the outside? Can you debug your code and understand its behavior by observing its outputs? Can you ask (and answer) new questions without shipping new code?
  25. You build better systems by building software this way. You become a better engineer by building software this way.
  26. Here’s the dirty little secret: it can’t be done. The next generation of systems won’t be built and run by burned-out, exhausted people, or command-and-control teams just following orders. Our systems have become too complicated…too hard. The shit that can be done on autopilot will be automated out of existence.
  27. Those who try will lose. We can no longer hold a model of these systems in our heads and reason about them, or intuit the solution. Our systems are emergent and unpredictable. Runbooks and canned playbooks won’t work; it takes your full creative self.
  28. Observability 2.0 advances the craft of software engineering. We are trying to make it faster and safer to bring change to the world. We are trying to make this a humane profession.
  29. The biggest obstacle between us and a better world is that we don’t believe one is actually possible. Demand more from your tools. Demand more from your vendors. Everyone writes code. Everyone owns their code in production, and everybody deserves the tools to do it efficiently and well.