AIOps: Prove It! (An Open Letter to Vendors Selling AI for SREs)

SREs are not known for being eager, optimistic early adopters of shiny new technologies. We are much more likely to subject you to lengthy monologuing about all of the ways said technologies are overhyped, under-delivered, and prone to spectacular, catastrophic systems failures. Which brings us to the topic of AI.

It’s easy to be cynical when there’s this much hype and easy money flying around, but generative AI is not a fad; it’s here to stay. Which means that even operators and cynics — no, especially operators and cynics — need to get off the sidelines and engage with it. How should responsible, forward-looking SREs evaluate the truth claims being made in the market without being reflexively antagonistic? How can we help our orgs steer into change, leveraging AI technologies to help our teams ship better software, faster? And for the vendors out there using AI to try and help solve traditional SRE domain problems, how should they demonstrate that they are engaging with these problems in good faith, that they are more than just hype and snake oil?

By Charity Majors and Fred Hebert, @ SRECon Americas 2025

Charity Majors

March 27, 2025

Transcript

  1. V1-25 © 2025 Hound Technology, Inc. All Rights Reserved.

    Charity Majors, Co-Founder, CTO, https://charity.wtf
    Fred Hebert, Staff SRE, https://ferd.ca
  2. Who wants to hear another talk on AI??!?!?
  3. “I’ve read a steadily increasing stream of articles about using AI in SRE, and I have yet to find one that inspires my trust. Each article makes impressive claims about the capabilities of AI and the way it can be applied to SRE tasks, but the vast majority are light on details.”

    Lex Neva, https://www.honeycomb.io/blog/aiops-prove-it
  4. What are we being sold?

    • “‘Your new SRE’ or SRE companion: answers questions, reviews PRs, pro-actively monitors and alerts about infra”
    • “Adaptive learning and zero maintenance mechanism that reduces firefighting”
    • “Adaptive learning and quick automatic root cause identification”
    • “Root cause as a service”
    • “Full understanding of the app, finds root causes, suggests fixes, generates post-incident analysis”
    • “AI SREs going from supervised to unsupervised work, eventually doing most of the work autonomously”

    Big if true.
  5. John Allspaw, 2015, An Open Letter To Monitoring / Metrics / Alerting Companies

    “Anomaly detection in software is, and always will be, an unsolved problem. Your company will not solve it. Your software will not solve it. Our people will improvise around it and adapt their work to cope with the fact that we will not always know what and how something is wrong at the exact time we need to know.”

    https://www.kitchensoap.com/2015/05/01/openlettertomonitoringproducts
  6. John Allspaw, 2015, An Open Letter To Monitoring / Metrics / Alerting Companies

    “Stop thinking you’re trying to solve a troubleshooting problem; you’re not. [...] Instead of telling me about how your software will solve problems, show me you’re trying to build a product that is going to join my team as an awesome team member”

    https://www.kitchensoap.com/2015/05/01/openlettertomonitoringproducts
  7. Nine times out of ten, you can replace “AI” with “automation” and it will still hold true. (But not ten times out of ten)
  8. How do you integrate it?

    Employee
    • Longer interview process
    • Team member
    • Has skin in the game
    • Is expected to grow
    • Gains and needs local expertise
    • May show initiative
    • Highly context sensitive

    Outsourced Expertise
    • Not a core competency
    • Still needs specialization
    • Limited skin in the game, transactional
    • Cost/benefit analysis
    • Ongoing commodification

    Automation
    • Part of a technical ensemble; a component
    • Enhances capacities of people or machines
    • Commodified
    • Limited context sensitivity
  9. Important questions to ask

    The way agents are designed and where humans are in the loop matter
  10. Question 1: Are you supervising, or being augmented?

    Supervision is the default

    The idea of “humans are better at” and “machines are better at” is inadequate.
    • This is a joint system: it’s teamwork
    • Supervision brings complacency
    • People need involvement to stay current
    • Automation augments people, people remove constraints on automation
    • Where you are in the loop matters
  11. Question 2: What perspectives does it bake in?

    What is within its universe?

    The agent can only act upon things within its reach:
    • What data sources are accessible?
    • What operations can it do?
    • What parts of the socio-technical system are within its context?
    • Is it exposing the right type of interface?
  12. Question 3: How good is it at delivering what it promises?

    Things generally fall short

    What you are sold and what you get tend to differ at least a bit.
    • Do you need it to be perfect?
    • Is it going to become a “hero”?
    • What type of work does it displace?
    • What type of work does it create?
    • What happens when it has outages or becomes too slow?

    Expectation vs. reality
  13. Question 4: Who is ultimately accountable for it?

    Is it walking you down a specific path?

    The flexibility it provides can limit what you can do, and force you down specific paths:
    • Does it let you explore better?
    • Does it tell you where to look?
    • Does it show only what’s “important”?
    • Does it suggest what to do?
    • Does it force or take specific action?

    Who does the adapting?

    Accountability can sometimes be tracked down by asking who does the following when something goes wrong:
    • Who learns something?
    • Who does the fixing?
    • Who decides an incident is ongoing?
    • Who can change that definition?
    • Who adapts when new information comes up?
  14. TL;DR for Vendors

    Join: Build a team player, or otherwise be predictable
    Extend: Give more powers to existing agents, improve their abilities
    Ground: Understand existing, known patterns to serve your users best
  15. “AI-SRE” is not a threat to your job

    • No, you can’t “buy” SRE, any more than you can “buy” observability or reliability
    • But keep in mind that not everyone has world class SREs
    • Something can be better than nothing
  16. We don’t actually expect vendors to listen or care about this. We *do* expect SREs to care.
  17. We have long been the guardians of risk for complex software systems. But there are many different ways for an organization to fail.
  18. This is a season of great change.

    Change brings opportunity. It also brings new risks.
    Even if you don’t believe it… even if you don’t want it to be true…
    People with money think it is.
  19. Execs are facing a completely different set of pressures. Transparency can be a huge risk mitigator.
  20. “Perhaps the most valuable skill in this new landscape isn’t prompt engineering or systems architecture, but adaptability - the willingness to evolve, to learn new skills, and to find your unique place in a rapidly changing field.”

    Annie Vella, https://annievella.com/posts/the-software-engineering-identity-crisis/
  21. To be an SRE is to expect the worst
  22. Get your hands dirty

    Get curious
    Try vibe coding 🙃
  23. You can’t help solve collective action problems by opting out of them. 💜💙💚💛🧡❤
  24. To be an SRE is to expect the worst

    …but keep working for the best.
  25. Can I buy the SRE skill then?

    • Getting you above water is probably worth it
      ◦ Nobody would just shut down their own company on a principled basis for better design of human-machine teamwork
    • However, know that incidents are important
      ◦ They are a built-in feedback mechanism on tradeoffs made in the past
    • Does it make the path for growing juniors harder?
      ◦ What compensating mechanisms do you need?
      ◦ Are there AIOps solutions that improve your people’s skills rather than aiming to free them to do something else?
  26. Buyer beware

    • Ground it in reality
      ◦ Check old incidents and challenges, and see how it would cope with them.
      ◦ Know and define failures or exit criteria beforehand
    • Find the right mindset, and experiment
      ◦ Do you need to take an adversarial approach to the agents’ outputs?
      ◦ Ask your people to list their work-arounds and tricks, share them.
    • Chaos Engineering
      ◦ Turn it off for everyone for a week: do you still find your system readable and manageable?
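
    One way to act on "check old incidents" and "define exit criteria beforehand" is a small replay harness. The following is a minimal sketch, not anything shown in the talk: the `PastIncident` schema, the `ExitCriteria` threshold, and `fake_vendor_tool` are all hypothetical stand-ins, and the substring match is a deliberately crude proxy for a human judging whether the tool's answer was useful.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PastIncident:
    """A resolved incident with a known contributing factor (hypothetical schema)."""
    incident_id: str
    symptoms: str
    known_cause: str


@dataclass(frozen=True)
class ExitCriteria:
    """Thresholds agreed on before the trial starts, not after seeing results."""
    min_hit_rate: float


def replay(incidents: List[PastIncident],
           diagnose: Callable[[str], str]) -> Dict[str, bool]:
    """Feed each old incident's symptoms to the tool and record hit or miss."""
    results = {}
    for incident in incidents:
        answer = diagnose(incident.symptoms)
        # Crude check: did the tool's answer mention the known cause at all?
        results[incident.incident_id] = incident.known_cause.lower() in answer.lower()
    return results


def passes(results: Dict[str, bool], criteria: ExitCriteria) -> bool:
    """Compare the observed hit rate with the threshold fixed in advance."""
    if not results:
        return False  # no evidence is not a pass
    hit_rate = sum(results.values()) / len(results)
    return hit_rate >= criteria.min_hit_rate


def fake_vendor_tool(symptoms: str) -> str:
    """Stand-in for a real vendor call; replace with the tool under evaluation."""
    if "connection pool" in symptoms:
        return "Likely cause: database connection pool exhaustion"
    return "Likely cause: recent deploy"


incidents = [
    PastIncident("INC-1", "p99 latency spike, connection pool saturated",
                 "connection pool exhaustion"),
    PastIncident("INC-2", "certificate errors on ingress",
                 "expired TLS certificate"),
]
criteria = ExitCriteria(min_hit_rate=0.75)  # decided before running the replay
results = replay(incidents, fake_vendor_tool)
print(results, "keep evaluating" if passes(results, criteria) else "exit criteria hit")
```

    Fixing the threshold before running the replay is the point of the exercise: it keeps the evaluation from being adjusted to fit whatever the tool happens to produce.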
  27. Transformation happens regardless

    • “It’s all platform Engineering?” 🌎🧑‍🚀🔫🧑‍🚀 “Always has been”
      ◦ Testing, QA, DBAs, sysadmins, all get folded in; SREs fit the pattern
      ◦ It becomes a “profile” some have, or a consultative role for a few experts
    • The better designs we advocate for get ignored, and things ship anyway
      ◦ This is also a pattern we have to consider
      ◦ Stopping a train just by standing in front of it is kind of unpleasant
    • Don’t attach yourself to your labels, carry your expertise regardless
    • Position yourself to become a force multiplier for your organization
      ◦ Know how to best use (or not to use) the new building blocks