The Twin Mandate of Observability

The Twin Mandate: What leaders (still) don’t get about observability

© 2025 Hound Technology, Inc. All rights reserved.

pre-2015: Monitoring. An operational tool for operational outcomes: is it up or down, is it slow, are there errors?
2015: Observability. High cardinality, high dimensionality, explorability; debugging unknown-unknowns and novel system states.
2025: Observability. The sense-making apparatus of complex sociotechnical systems.

Most companies now spend 20-25% of their infrastructure budget on observability tools.

Observability costs more.
Observability Engineering
2009: Monitoring, $50k per year
• Operational tool
• Operational outcomes
2024: Observability, $24M per year
• The sense-making functionality of complex sociotechnical systems

Observability does more. Or it should.
2009 (Linux, Apache, MySQL, PHP/Perl/Python) vs. 2024

Finance metaphor

Yes, observability matters to leaders because it costs a lot…
But poor observability is likely the obstacle standing between you and what your board/CEO/etc ACTUALLY care about…

DORA 2025 report
“AI’s primary role in software development is that of an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.”
“AI adoption now improves software delivery throughput, a key shift from last year. However, it still increases delivery instability. This suggests that while teams are adapting for speed, their underlying systems have not yet evolved to safely manage AI-accelerated development.”
“The greatest returns on AI investment come not from the tools themselves, but from a strategic focus on the underlying organizational system: the quality of the internal platform, the clarity of workflows, and the alignment of teams. Without this foundation, AI creates localized pockets of productivity that are often lost to downstream chaos.”

“A systems view directs AI’s potential”
“The greatest returns will come from investing in the foundational systems that amplify AI’s benefits”
“Successful AI adoption is a systems problem, not a tools problem”
“We concluded that when AI dramatically accelerates software development, our control systems—that’s us—must also speed up”

The job of any engineering leader is to craft sociotechnical systems that efficiently convert engineering labor into business outcomes.
How? Fast feedback loops.

X causes Y
Y also causes X
== feedback loop
• Amplifying: nonlinear returns, exponential downstream effects
• Balancing: stabilizing or goal-seeking
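To make the two loop types concrete, here is a minimal sketch in Python. The function name `simulate`, the gain of 0.5, and the goal of 100 are illustrative assumptions, not figures from the talk. An amplifying loop feeds its output back into its own growth, while a balancing loop closes part of the gap to a goal on every step.

```python
from typing import List, Optional


def simulate(x0: float, gain: float, steps: int,
             goal: Optional[float] = None) -> List[float]:
    """Iterate a one-variable feedback loop (hypothetical toy model).

    goal=None  -> amplifying loop: each step adds gain * x to x.
    goal=value -> balancing loop: each step closes a fraction `gain`
                  of the remaining gap between x and the goal.
    """
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        if goal is None:
            x = x + gain * x           # output reinforces itself
        else:
            x = x + gain * (goal - x)  # output is pulled toward the goal
        xs.append(x)
    return xs


amplifying = simulate(x0=1.0, gain=0.5, steps=10)
balancing = simulate(x0=1.0, gain=0.5, steps=10, goal=100.0)
```

With these assumed inputs the amplifying series grows by a factor of 1.5 per step, while the balancing series settles just under its goal of 100.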

Observability is the feedback loop of feedback loops
The highest leverage place to intervene in a dysfunctional system.

It’s very hard to “see” poor observability.
Here are some of the ways it manifests *within* engineering.
It can be humbling to think of ourselves as normal people, but most of us are in fact pretty normal people, with many years of highly specialized experience.

“Ownership” Test
“After you commit a piece of code, how do you know whether or not it is working (or how well it’s working) in production?”
In organizations with weak observability, understanding your own code requires tremendous tribal knowledge and unreasonable acts of heroism.
In organizations with strong observability, understanding your own changes is fast, effortless and automatic. You would have to try NOT to get feedback on how your changes went.

“Two-or-Three People” Test
If you ask an engineer in your org a question about your systems, from simple (“how long does it take to ship a line of code here?”) to specific (“has this particular customer’s experience actually degraded since March, or are they imagining things?”), do they pull up their tools or turn and ask one of the 2-3 people who know how to use the tools?
In organizations with weak observability, understanding your own code requires tremendous tribal knowledge and unreasonable acts of heroism.

“Customer Discovery” Test
How many of your bugs are discovered by engineering vs by customers?
You should actually track this. For every bug you fix, who found it first, your engineers or your customers?
The cost of finding and fixing bugs goes up exponentially from the moment they are created.
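One way to track this is a small tally over your bug tracker export. The sketch below is only an illustration: the record shape, the `found_by` field, and the sample bugs are hypothetical, not something the talk prescribes.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Hypothetical bug records; in practice these would come from your tracker.
bugs = [
    {"id": "BUG-1", "found_by": "customer"},
    {"id": "BUG-2", "found_by": "engineering"},
    {"id": "BUG-3", "found_by": "customer"},
    {"id": "BUG-4", "found_by": "customer"},
    {"id": "BUG-5", "found_by": "engineering"},
]


def discovery_split(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the share of bugs first found by each group."""
    counts = Counter(r["found_by"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {who: n / total for who, n in counts.items()}


split = discovery_split(bugs)
```

Watching the customer share over time is the point; a single snapshot says little.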

“Mystery” Test
What kind of mysteries do your engineers accept as the normal cost of doing business?
• “Latency is higher on Tuesday for some reason”
• “That job just fails sometimes”
• “Some users say it’s broken, but we can’t reproduce it”
Bad observability will slowly smother and kill off an engineer’s natural curiosity and drive. This is one of the most toxic consequences of bad tools. 😔

“Do Your SLOs Have Teeth” Test
SLOs are slowly becoming the default in the industry (yay!), but too often, they are just for show and do not drive behavior.
• Proliferation: do you have 600 SLOs, instead of 5-15?
• Visibility: does anyone actually look at them?
• Customer Pain: are customers actually in pain when you burn down?
• Product Negotiation: did product agree to a number right away without arguing and investigating the consequences? They won’t stick to it when it matters most.
• Follow-through: when an SLO is burned down, does engineering clear their roadmap of product feature work and start fixing the backlog of reliability and tech debt?
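For readers less familiar with "burning down" an SLO, here is a minimal sketch of the arithmetic, assuming a simple request-based SLO. The function names, the 99.9% target, and the request counts are illustrative assumptions.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window.

    slo_target is the required success ratio (0.999 means 99.9%).
    A result of 0 or less means the budget is fully burned down.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def should_pause_feature_work(remaining: float) -> bool:
    """The 'teeth': a burned-down budget redirects the roadmap."""
    return remaining <= 0.0


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=250)
```

In this assumed window roughly 1,000 failures are allowed, so 250 failures leaves about three quarters of the budget.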

Doom Loop: self-reinforcing dysfunction
Weak observability makes debugging hard and incidents painful
So… Hard debugging means incidents take longer to resolve
So… Long, painful incidents make engineers afraid to deploy
So… Fear of deploying slows feature delivery
So… Slow delivery frustrates product and business teams
So… Frustrated stakeholders pressure engineering to move faster
So… Pressure to move faster leads to cutting corners
So… Cutting corners creates more incidents and quality issues
So… More incidents overwhelm the few people who can debug
So… Overwhelmed engineers burn out and leave
So… Engineer departures mean more tribal knowledge lost
So… Lost knowledge makes debugging even harder
And… we’re right back where we started, except worse on every measure. 😔

High performing orgs are characterized by tight feedback loops everywhere:
• Write code → immediate feedback from tests
• Deploy code → immediate feedback from production metrics
• Make architectural changes → immediate feedback on performance and cost impact
• Experience incidents → immediate feedback on contributing factors
• Ship features → immediate feedback on user adoption and satisfaction
This makes excellence achievable through systematic refinement rather than acts of heroism and unsustainable levels of effort.

Organizations with weak observability often don’t realize what their problem is.
What to do about it???

A systems approach to observability
“Give me a lever long enough and a fulcrum on which to place it, and I shall move the world” (Archimedes)

First: Not like infrastructure
Run observability like a platform engineering team

Platform principles
a. Build like a product, own as little code as possible
b. Focus on developer experience and fast feedback loops
c. Apply design principles and user research
d. Make it simple and easy to do the right thing on autopilot
e. Oriented around the needs of the business
f. Staffed by engineers who understand devex
Most of all: Limit the cognitive bandwidth demands on developers

Second: Not a cost center
Manage your observability like an investment

You can’t make more money by spending more on infrastructure. Observability is different.
a. Spending more on your observability team & tools can yield real, sizable returns on investment. It CAN be one of the most efficient, high performing investments you can possibly make.
b. The key here is understanding the dual mandate of observability, and mapping it back to your company’s unique goals, products and objectives

The dual mandate of observability
• External: maps to customer happiness
• Internal: maps to developer experience and your ability to move swiftly, with confidence

Customer happiness (External: Revenue)
a. Observability plays a critical role in every single customer experience.
b. Responsive UX, fast queries, end to end performance tuning
c. Find problems before your users do
• 53%: Site visits abandoned >3s load
• 80%: Users find slow sites more annoying than down sites
• 100ms = 1%: Page speed increase boosts revenue by 1%
All data pulled from https://www.webuters.com/the-high-cost-of-slow-load-times-in-e-commerce

Customer happiness (External: Revenue)
a. Run more experiments
b. Build better products
c. Try more things
d. How are your users actually using what you built?
e. Be an opportunist

Developer experience (Internal: Business value)
It’s about your ability to move swiftly, with confidence, as a business.

Developer experience (Internal: Business value)
1. How fast are your deploys?
2. How long to run tests?
3. How long to ship a single line of code?
4. How long to understand its impact on a single random user?
a. Shipping is your company’s heartbeat
b. Too many are in a “software engineering death spiral”

Developer experience (Internal: Business value)
a. The ceiling has gone WAY UP over the past decade in terms of how fast you should be able to develop and ship code
b. In 2015, you didn’t have access to a high quality developer tool chain unless you were behind the walled gardens of big tech. Now you do.
c. But these tools are more effective as a one-two punch:
   1. Modern observability AND feature flags
   2. Modern observability AND progressive deployments
   3. Modern observability AND canarying
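As a rough illustration of the "one-two punch", the sketch below records a feature flag's state on every event and then compares error rates per flag value. The attribute name `feature.new_checkout`, the event shape, and the sample data are hypothetical; real flag and telemetry tooling will look different.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping

# Hypothetical request events, each annotated with the flag state it ran under.
events = [
    {"feature.new_checkout": True, "error": True, "duration_ms": 900},
    {"feature.new_checkout": True, "error": False, "duration_ms": 140},
    {"feature.new_checkout": False, "error": False, "duration_ms": 120},
    {"feature.new_checkout": False, "error": False, "duration_ms": 110},
]


def error_rate_by_flag(records: Iterable[Mapping[str, Any]],
                       flag: str) -> Dict[Any, float]:
    """Group events by the value of one flag attribute and compute error rates."""
    totals: Dict[Any, int] = defaultdict(int)
    errors: Dict[Any, int] = defaultdict(int)
    for event in records:
        value = event.get(flag)
        totals[value] += 1
        if event["error"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}


rates = error_rate_by_flag(events, "feature.new_checkout")
```

Because the flag state travels with each event, the comparison needs no separate dataset; the same grouping would work for a canary or deployment identifier.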

Honestly, that’s basically it
a. Roll up to software engineering, not infra or IT
b. Mind the dual mandate
c. Build for fast feedback loops and minimize cognitive overhead
• Run your observability teams like platform engineering
• Manage your observability tools like an investment

Change is hard. But all our AI investments are riding on this…
a. “I think developer experience is not an outcome of using AI tools really well. It's a prerequisite to use them well. What's good for an individual human developer is also good for an agent. There's extra stuff that you might want to do for agents, but the physics of what makes software or codebases easy to contribute to, easy to change...that's important for AI agents and for humans.”
b. “AI is an amplifier. If you have parts of your system that are a bit garbage, they're going to be amplified garbage now. And if you have really solid engineering practices that have stayed ahead of the industry curve, then that's going to be amplified and you're going to get even better results.”
Laura Tacho, https://www.heavybit.com/library/article/ai-productivity-for-engineering-teams

Multiple Pillars Doom Engineering Orgs
Multiple pillars model (aka “three pillars”, aka “observability 1.0”)
a. Metrics -> time series db
b. Logs -> log aggregator
c. Traces -> tracing tool
d. Exceptions, errors -> exception tracker
e. (etc)
Unified storage model (aka “observability 2.0”)
a. Takes structured data
b. Stores it once
c. No dead ends
(Diagram labels: Metrics, Logs, Traces, Errors, Etc, Data)

Three Pillars Cognitive Overload 🤯
Each metric is its own cardinality factory
   http.status_code.200.count
   http.status_code.200.max
   http.status_code.200.min
   http.status_code.200.p50
   http.status_code.200.p95
   http.status_code.500.count
   http.status_code.500.max
   ...
• "What is the type definition?"
• "Is it a gauge? A rate? A count? Histogram?"
• "When does it reset?"
• "How do you relate them together?"
• "How much are these going to cost us?"
Logs = gobs of semi-related data that may be structured
   2025-08-09T12:45 INFO Customer Created
   2025-08-09T12:50 TRACE { "tx_id": 123, "op": "debit", "status", "success" }
   2025-08-09T12:55 TRACE { "trans_id": 2343, "operation": "credit" }
   2025-08-09T12:50 TRACE { "tx_id": 123, "op": "debit", "status", "success" }
   2025-08-09T12:55 TRACE { "trans_id": 2343, "operation": "credit" }
   2025-08-09-12:56 ERROR Transaction rollback, Stack Trace...
• "Did everyone structure the logs the same way?"
• "How do changes to log records impact analysis?"
• "Do the attributes match across log records?"
• "How would I correlate things together in a systematic way?"
... and how do I relate these pillars of data?
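The question "Do the attributes match across log records?" can be checked mechanically. The sketch below is an assumption-laden illustration: it uses cleaned-up, valid-JSON variants of the log payloads above (the third record is invented for the example) and counts how many distinct key sets appear.

```python
import json
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical payloads modeled on the log lines above, rewritten as valid JSON.
raw_logs = [
    '{"tx_id": 123, "op": "debit", "status": "success"}',
    '{"trans_id": 2343, "operation": "credit"}',
    '{"tx_id": 124, "op": "credit", "status": "success"}',
]


def key_shapes(lines: Iterable[str]) -> "Counter[Tuple[str, ...]]":
    """Count how often each distinct set of attribute names occurs."""
    shapes: "Counter[Tuple[str, ...]]" = Counter()
    for line in lines:
        record = json.loads(line)
        shapes[tuple(sorted(record))] += 1
    return shapes


shapes = key_shapes(raw_logs)
```

More than one shape for what is logically the same operation (`tx_id` vs `trans_id`, `op` vs `operation`) is the kind of drift that makes systematic correlation hard.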

Three Pillars Cognitive Overload 🤯🤯🤯
Each metric requires its own set of discrete keys to represent the measurements for each individual represented value.
Lots of metrics = Cardinality Explosion!
   http.status_code.200.count
   http.status_code.200.max
   http.status_code.200.min
   http.status_code.200.p50
   http.status_code.200.p95
   http.status_code.500.count
   http.status_code.500.max
   ... for each status code...
   customer.order_amount.min
   customer.order_amount.max
   customer.order_amount.p25
   customer.order_amount.p50
   customer.order_amount.p75
   customer.order_amount.p95
   ... for each aspect of the order...
   http.endpoint."/customers".call.count
   http.endpoint."/customers".call.duration.min
   http.endpoint."/customers".call.duration.max
   http.endpoint."/customers".call.duration.p50
   http.endpoint."/customers".call.duration.p75
   http.endpoint."/customers".call.duration.p95
   ... repeat for each endpoint you monitor...
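A back-of-the-envelope sketch of why this explodes: if every combination of dimension values gets its own key for every aggregate, the number of time series is the product of the dimension sizes times the number of aggregates. The function name and all the counts below are illustrative assumptions.

```python
import math
from typing import Mapping


def series_count(dimension_sizes: Mapping[str, int], aggregates: int) -> int:
    """Distinct time series when each combination of dimension values
    gets its own key per aggregate (count, max, min, p50, p95, ...)."""
    return aggregates * math.prod(dimension_sizes.values())


# Hypothetical service: 5 status codes, 50 endpoints, 5 aggregates each.
baseline = series_count({"status_code": 5, "endpoint": 50}, aggregates=5)

# Adding one high-cardinality dimension, e.g. 10,000 customer ids.
with_customers = series_count(
    {"status_code": 5, "endpoint": 50, "customer_id": 10_000}, aggregates=5
)
```

Under these assumptions the baseline is 1,250 series, and adding a customer dimension multiplies it to 12,500,000.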

This could be one structured data blob
   {
     "name": "customer-order",
     "http.endpoint": "/customers/234/order",
     "http.method": "POST",
     "http.status_code": 200,
     "duration_ms": 3000,
     "customer.order_amount": 3040.20,
     "db.transaction.id": 2342345234,
     "db.operation": "credit",
     "db.query": "INSERT INTO ORDERS(customer_id, cart_id...) VALUES (?, ?, ?)"
   }
• More attributes? Add them!
• This could be a trace span or just a plain old log, it's just an EVENT!
• See the actual transactions!
• Aggregate over the events to see metric-like trends (avg, p95, etc)!
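To show what "aggregate over the events" can look like, here is a minimal sketch that derives metric-like numbers from raw events at query time. The sample events and the nearest-rank `percentile` helper are assumptions for illustration; a real observability backend would do this at scale with its own query engine.

```python
import math
from statistics import mean
from typing import List, Sequence

# Hypothetical events shaped like the blob above (attributes trimmed).
events = [
    {"name": "customer-order", "http.status_code": 200, "duration_ms": 120},
    {"name": "customer-order", "http.status_code": 200, "duration_ms": 80},
    {"name": "customer-order", "http.status_code": 200, "duration_ms": 3000},
    {"name": "customer-order", "http.status_code": 500, "duration_ms": 95},
    {"name": "customer-order", "http.status_code": 200, "duration_ms": 110},
]


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


durations: List[float] = [e["duration_ms"] for e in events]
avg_ms = mean(durations)
p50_ms = percentile(durations, 50)
p95_ms = percentile(durations, 95)
error_count = sum(1 for e in events if e["http.status_code"] >= 500)
```

The raw events stay available, so the slow 3000 ms order behind the p95 can be inspected directly instead of being lost inside a pre-aggregated metric.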

Multiple Pillars Doom Engineering Orgs
Multiple pillars model (aka “three pillars”, aka “observability 1.0”)
a. Metrics -> time series db
b. Logs -> log aggregator
c. Traces -> tracing tool
d. Exceptions, errors -> exception tracker
• Gartner says most customers use 10-20 tools
• Therefore, cost multiplier is 10-20x at baseline
• Cardinality a constant battle
• Takes deep expertise and intuition to leap between datasets, since the relationships are not stored
• Cognitive overhead is massive
• 🔥🔥🔥 Not a feedback loop 🔥🔥🔥

Multiple Pillars Doom Engineering Orgs
Unified storage model (aka “observability 2.0”)
a. Takes structured data
b. Stores it once
c. No dead ends
• Cost multiplier is 1x
• High cardinality baked in (translation: “infinite custom metrics for free”)
• Every software engineer knows how to handle structured data

Multiple Pillars are Doomed
a. The last observability company to be founded on the “three pillars” model was Chronosphere, in 2019
b. Every observability startup founded since then has used the singular storage (“o11y 2.0”) model we sketched out in 2016
c. I believe that on a ~5 year timeline, the industry will be moving towards a data lakehouse model (so do analysts, it seems)
Main blocker is chicken/egg: everyone’s using the same terms to describe their product. If you haven’t seen the difference, it’s hard to see. We have to show people.

If all you need is an ops tool for operational outcomes, do monitoring.
In 2015: Everyone was paying monitoring prices for monitoring results
In 2025: People are paying observability prices for monitoring results

Success in observability?
• Investment: Treat observability like an investment rather than a cost center
• Platform team: Build your observability practice like a product team
• Tools + enablement: Give your teams what they need to achieve their goals and…

work you put into making your systems resilient, discoverable, and humane will be used over and over.


More Decks by Charity Majors
