Practical Zombie Hunting for Kubernetes Users

Holly Cummins
March 27, 2025

Zombies? Yup, zombies. Zombies are servers which aren’t doing useful work. They’re everywhere, costing money, eating electricity, and belching carbon. And they’re useless! Sadly, the cloud has *not* helped our zombie problem, and even Kubernetes hasn't helped.

One of the reasons zombies don’t get switched off is that no one knows they’re there. So how do we get rid of our pesky zombies? In this talk, Holly will explain the underlying technical and organisational factors that lead to zombies, and introduce a range of real-world zombie-hunting strategies. These include getting to grips with elasticity and utilisation, LightSwitchOps, FinOps, and the eco-monkey (it’s like the chaos monkey, but greener). Technologies covered include absurdly simple scripts, DailyClean, Kruize Autotune, and Backstage.

Transcript

  1. Holly Cummins, Red Hat. KubeCon | CloudNativeCon, April 3, 2025. Practical Zombie Hunting for Kubernetes Users
  2. #RedHat @hollycummins.com “Hey boss, I created a Kubernetes cluster. I forgot it for 2 months. … and it’s €1000 a month.” (2018)
  3. “what do these servers do?” “one is a backup for the other.” (@therealmarkw1, Twitter)
  4. “what do these servers do?” “one is a backup for the other.” “yes, but what do they do?” (@therealmarkw1, Twitter)
  5. “what do these servers do?” “one is a backup for the other.” “yes, but what do they do?” “no one has known for a couple of decades” (@therealmarkw1, Twitter)
  6. “we run this as a batch job on weekends, but the servers stay up all week”
  8. “we only use this system in UK working hours, but we leave it running 24/7”
  9. 2014 survey: 29% of 4,000 servers active less than 5% of the time. https://www.anthesisgroup.com/wp-content/uploads/2019/11/Comatose-Servers-Redux-2017.pdf
  10. the average server: 12-18% of capacity, 30-60% of maximum power. https://www.nrdc.org/sites/default/files/data-center-efficiency-assessment-IB.pdf
  11. green software foundation principles: carbon awareness, hardware efficiency, electricity efficiency. (Diagram labels: algorithms, stack, where, when, quarkus!)
  12. IT Department, UK Bank: “let’s figure out what all these cloud workloads are, since I’m paying for them” … long meetings
  13. IT Department, UK Bank: “let’s figure out what all these cloud workloads are, since I’m paying for them” … long meetings … zzzz
  14. @holly_cummins #RedHat the scream is real: “this internal server doesn’t seem to have a purpose” “let’s turn it off!” “uh … why did the backbone of a client’s network just vanish?”
  15. the scream is real: “this internal server doesn’t seem to have a purpose” “let’s turn it off!” “uh … why did the backbone of a client’s network just vanish?” oops.
  16. “we don’t switch the server off because we’re not sure if it will come back on” (happens all the time)
  17. “we don’t switch the server off because it would be too much work to recreate it” (happens all the time)
  18. turning it off and on again must • be fast • actually work • idempotency
  19. turning it off and on again must • be fast • actually work • idempotency • resiliency
  20. large UK bank, 2013: 50% reduction in CPUs with a lease system (self-destructing instances)
  22. timed shutoff: “we used to leave our applications running all the time” (@darkandnerdy, Chicago DevOpsDays)
  23. timed shutoff: “we used to leave our applications running all the time. when we scripted turning them off at night, we reduced our cloud bill by 30%” (@darkandnerdy, Chicago DevOpsDays)
  24. absurdly simple timed shutoff: “my shell script to power down machines overnight saved my school €12,000”
  25. Open Source utilization optimisation: Kruize Autotune; PEAKS (Power Efficiency Aware Kubernetes Scheduler); OpenShift Cost Management
  26. things that (maybe) don’t help: virtualisation. 2019 survey: 30% of virtual servers doing no useful work; 50% of virtual servers active less than 5% of the time
  27. “we solve the cold-start problem by … … keeping an instance running but not billing you”
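
The timed shutoff described in slides 22 to 24 could be sketched roughly as follows. This is not the script from the talk; it is a minimal illustration that assumes a Monday-to-Friday, 08:00 to 18:00 working window and scales a Kubernetes Deployment to zero outside it. The schedule, the function names, and the `demo-app` and `team-sandbox` names are all hypothetical. The code only builds a `kubectl scale` command line and prints it, so nothing is changed on a cluster.

```python
from datetime import datetime, time

# Assumed schedule (not from the talk): workloads are needed
# Monday to Friday, 08:00 to 18:00 local time.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def should_be_running(now: datetime) -> bool:
    """Return True when `now` falls inside the assumed working window."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and WORK_START <= now.time() < WORK_END


def desired_replicas(now: datetime, daytime_replicas: int) -> int:
    """Scale to zero outside working hours, otherwise keep the daytime count."""
    return daytime_replicas if should_be_running(now) else 0


def scale_command(deployment: str, namespace: str, replicas: int) -> list[str]:
    """Build (but do not run) a `kubectl scale` invocation."""
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", f"--namespace={namespace}",
    ]


if __name__ == "__main__":
    # Hypothetical deployment and namespace names, for illustration only.
    replicas = desired_replicas(datetime.now(), daytime_replicas=3)
    print(" ".join(scale_command("demo-app", "team-sandbox", replicas)))
```

Run from a scheduler such as cron, a script like this is idempotent in the sense of slides 18 and 19: it computes the desired state from the clock each time, so running it twice in a row asks for the same replica count.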