fewer mothers (and other marginalized folks) in engineering.” “On call duty is what burned me out of tech.” “My time is too valuable to be on call. You want me writing features and delivering user value, not firefighting.”
users with the lifeblood of your engineers.” “I sacrificed MY health and sleep for 10 years of on call duty; now it’s YOUR turn.” “You aren’t a REAL engineer until you’ve debugged this live at 3 am.”
There are loads of toxic patterns around on call (surface fixes, evading responsibility, flappy alerts, over-alerting, lack of training or support, snobbery…). But it doesn’t have to be this way. We can do so much better. 🥰
On call can be:
• Compatible with full adult lives & responsibilities
• Rarely sleep-disturbing or life-impacting
• The sharpest tool in your toolbox for creating alignment
• Something engineers actually enjoy
• Even … ✨volunteer-only✨
On call is a lot of things:
• …of software ownership
• A proxy metric for how well your team is performing, how functional your system is, and how happy your users are
• A set of expert practices and techniques in its own right
• A miserable, torturous hazing ritual inflicted on those too junior to opt out? 😬 😬 😬
Any engineers who have code in production. Is this payback?? 🤔 No!! Yes, ops has always had a streak of masochism. But this isn’t about making software engineers miserable too. Software ownership is the only way to make things better. For everyone.
easy on call rotation, or a healthy system with a rough on call rotation. What you WANT is to align engineering pain with user pain, and then you want to track that pain and pay it down.
of the job definition. We can’t get rid of all the outages and false alarms, and that isn’t the right goal. It’s way more stressful to get a false alarm for a component you have no idea how to operate, than to comfortably handle a real alarm for something you’re skilled at. Our targets should be about things we can do because they improve our operational health. @mononcqc — https://www.honeycomb.io/blog/tracking-on-call-health/
willing to be woken up a few times/year for their code. If you’re on an 8-person rotation, that’s one week every two months. If you get woken up one time every other on call shift, that’s 3x/year, and only once every 4 months. This is achievable. By just about everyone. It’s not even that hard. You just have to care, and do the work.
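To make the arithmetic concrete, here is a back-of-the-envelope sketch. The rotation size and wake-up rate are knobs to tune for your own team, not prescriptions:

```python
# Back-of-the-envelope math for the claim above. The knobs (rotation size,
# wake-ups per shift) are assumptions to tune for your own team.
WEEKS_PER_YEAR = 52

def on_call_load(rotation_size: int, wakeups_per_shift: float) -> None:
    shifts_per_year = WEEKS_PER_YEAR / rotation_size        # one-week shifts
    wakeups_per_year = shifts_per_year * wakeups_per_shift
    print(f"{rotation_size}-person rotation: ~{shifts_per_year:.1f} shifts/year, "
          f"i.e. one week roughly every {rotation_size / 4.33:.1f} months")
    if wakeups_per_year:
        print(f"~{wakeups_per_year:.1f} wake-ups/year, "
              f"about one every {12 / wakeups_per_year:.1f} months")

# The numbers from the text: 8-person rotation, woken once every other shift.
on_call_load(rotation_size=8, wakeups_per_shift=0.5)
```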
“…to navigate production” → Learn! (It’s good for you!)
“My time is too valuable…” → Whose isn’t? (It will take you the least time.) Not-so-thinly veiled engineering classism 🙄
“I have a new baby” → Ok fine. Nobody should have two alarms.
“We have a follow-the-sun rotation” → Lucky you! (Ownership still matters.)
“I need my sleep / it’s stressful.” → Yeah, it is. (This is how we fix it.)
“I just don’t want to.” → There are lots of other kinds of software. Go work on one.
alerting.” Alert only when users are in pain. Code should fail fast and hard; architecture should support partial, graceful degradation. Delete any paging alerts for symptoms (like “high CPU” or “disk fail”). Replace them with SLOs, which correlate directly to real user pain. Better to spend down an SLO budget than suffer a full outage. Moving from symptom-based alerting to SLOs often drops the number of alerts by over 90%.
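To make “alert on burn rate, not symptoms” concrete, here is a rough sketch of the underlying math. The function names are mine, and the 14.4x fast-burn threshold is a commonly cited multi-window default rather than gospel; tune it for your own SLOs:

```python
# Minimal sketch of an SLO burn-rate page decision. The 14.4x threshold is a
# commonly cited fast-burn default; treat it (and the window sizes) as knobs.

def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How many times faster than 'allowed' we are burning error budget."""
    if total_events == 0:
        return 0.0
    error_rate = bad_events / total_events
    budget = 1.0 - slo_target        # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def should_page(short_window_burn: float, long_window_burn: float) -> bool:
    # Fast burn: page a human. Both windows must agree, which avoids flappy alerts.
    return short_window_burn > 14.4 and long_window_burn > 14.4

# Example: 99.9% SLO, 1h window with 2% errors, 5m window with 3% errors.
slo = 0.999
long_burn = burn_rate(bad_events=200, total_events=10_000, slo_target=slo)   # 20x
short_burn = burn_rate(bad_events=30, total_events=1_000, slo_target=slo)    # 30x
print(should_page(short_burn, long_burn))  # True: users are in pain right now
```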
alert queue, for on call to sweep through and resolve first thing in the am, last thing at night. Stuff that needs attention, but not at 2 am. Prune these too! If it’s not actionable, axe it. No more than two lanes. There can be only two. You may need to spend some months investing in moving alerts from Lane 1 to Lane 2 by adding resiliency and chaos experiments.
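A tiny sketch of the two-lane rule, with hypothetical alert fields: if users aren’t in pain right now it goes to the sweep-twice-a-day queue, and if it’s not actionable at all it gets deleted:

```python
# Sketch of two-lane alert routing. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    users_in_pain: bool   # SLO burn or a failing end-to-end check
    actionable: bool      # a human can actually do something about it

def route(alert: Alert) -> str:
    if not alert.actionable:
        return "DELETE"          # if it's not actionable, axe it
    if alert.users_in_pain:
        return "LANE_1_PAGE"     # wake someone up
    return "LANE_2_QUEUE"        # swept first thing in the a.m., last thing at night

print(route(Alert("checkout SLO fast burn", users_in_pain=True, actionable=True)))        # LANE_1_PAGE
print(route(Alert("disk 80% full on one replica", users_in_pain=False, actionable=True))) # LANE_2_QUEUE
print(route(Alert("high CPU (no user impact)", users_in_pain=False, actionable=False)))   # DELETE
```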
what you are going to wake people up for. Actively curate a small list of rock solid e2e checks and SLO burn alerts. Take every alert as seriously as a heart attack. Track them, graph them, FIX THEM.
is messy. Each paging alert should have a link to documentation describing the check, how it works, and some starter links for debugging it. (And there should only be a few!) Pages should be tracked and graphed, especially out-of-hours ones.
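One lightweight way to keep the runbook links and the out-of-hours tracking in the same place; the record shapes below are assumptions, not a standard, so use whatever your pager actually exports:

```python
# Sketch: a small registry of paging alerts plus out-of-hours page tracking.
# The record shapes are assumptions; adapt to whatever your pager exports.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PagingAlert:
    name: str
    runbook_url: str      # what the check is, how it works, starter debugging links

@dataclass
class Page:
    alert_name: str
    fired_at: datetime

def out_of_hours(page: Page, start_hour: int = 9, end_hour: int = 18) -> bool:
    """True if the page fired on a weekend or outside working hours."""
    t = page.fired_at
    return t.weekday() >= 5 or not (start_hour <= t.hour < end_hour)

pages = [
    Page("checkout SLO fast burn", datetime(2023, 5, 2, 3, 12)),
    Page("signup e2e check failing", datetime(2023, 5, 3, 14, 40)),
]
print(sum(out_of_hours(p) for p in pages), "out-of-hours pages")  # 1 out-of-hours pages
```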
people. (We have each new person draw the infra diagram for the next new person ☺) Let them shadow someone experienced before going it alone. Give them a buddy. It’s way more stressful to get paged for something you don’t know than for something you do. Encourage escalation.
a retro. Can this be auto-remediated? Can it be diverted to Lane 2? What needs to be done to fix it for good? Teach everyone how to hold safe retros, and have them sit in on good ones — safety is learned by absorption and imitation, not lectures. Consider using something like jeli.io to get better over time.
Nobody should have to ask permission before sleeping in after a rough on call night. Nobody should ever have to be on call the night after a bad on call night. If the rate of change exceeds the team’s human SLOs, calm the fuck down. Link: https://www.honeycomb.io/blog/kafka-migration-lessons-learned/
very good for them to stay in the technical path. On call is great for this. The ideal solution is for managers to pinch hit and substitute generously.
and explorability. Invest in observability. It’s not the same thing as monitoring, and you probably don’t have it.
Links:
https://www.honeycomb.io/blog/observability-5-year-retrospective/
https://www.honeycomb.io/blog/observability-101-terminology-and-concepts/
https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/
(and evade vendor lock-in) by embracing OpenTelemetry now. Most of us are better at writing debuggable code than observable code, but in a cloud-native world, observability *is* debuggability.
Links:
https://thenewstack.io/opentelemetry-otel-is-key-to-avoiding-vendor-lock-in/
https://www.honeycomb.io/observability-precarious-grasp-topic/
blobs (or “canonical log lines”) and spans. Metrics and logs cannot give you observability.
Links:
https://charity.wtf/2019/02/05/logs-vs-structured-events/
https://stripe.com/blog/canonical-log-lines
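For illustration only (the field names and the print-to-stdout transport are placeholders, not any particular vendor’s API), one wide structured event per request can look like this:

```python
# Sketch: one wide structured event ("canonical log line") per request.
# Field names and the stdout transport are placeholders for illustration.
import json, time, uuid

def handle_request(user_id: str, endpoint: str) -> None:
    event = {
        "request_id": str(uuid.uuid4()),
        "endpoint": endpoint,
        "user_id": user_id,
        "build_id": "2024-06-01.3",     # which deploy served this request
    }
    start = time.monotonic()
    try:
        # ... do the actual work, appending anything interesting along the way ...
        event["cart_items"] = 3
        event["cache_hit"] = False
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event))        # emit exactly one wide event per request

handle_request(user_id="u_123", endpoint="/checkout")
```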
Practice not just TDD, but ODD — Observability-Driven Development. Check your instrumentation after every deploy, inspecting your changes through the lens of your instrumentation and asking: “Is it doing what I expect? Does anything else look weird?” Make it muscle memory.
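A toy sketch of that muscle memory, assuming you can pull your own events for the window before and after a deploy marker (same hypothetical event shape as the sketch above):

```python
# Toy sketch of an after-deploy sanity check: compare error rate and latency
# for events before vs. after the deploy marker. The event dicts are hypothetical.
def error_rate(events: list[dict]) -> float:
    return sum(e["status"] >= 500 for e in events) / max(len(events), 1)

def p95_latency(events: list[dict]) -> float:
    durations = sorted(e["duration_ms"] for e in events)
    return durations[int(0.95 * (len(durations) - 1))] if durations else 0.0

def looks_weird(before: list[dict], after: list[dict]) -> bool:
    return (error_rate(after) > error_rate(before) * 2 + 0.01
            or p95_latency(after) > p95_latency(before) * 1.5)

before = [{"status": 200, "duration_ms": 80}] * 95 + [{"status": 500, "duration_ms": 80}] * 5
after  = [{"status": 200, "duration_ms": 85}] * 80 + [{"status": 500, "duration_ms": 300}] * 20
print("is it doing what I expect?", not looks_weird(before, after))  # False: go look!
```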
pit of despair, you’ll get paged less and less. In order to stay that way, replace firefighting with active engagement: actively inspect and explore production every day. Instrument your code. Look for outliers. Find the bugs before your customers can report them. No more drowning in a sea of useless metrics.
impact of work that pays down tech debt, increases resiliency, improves dev speed — in retention, velocity, and user & employee happiness alike. If it’s too big to be fixed by on call, get it on the product roadmap. Make sure engineers have enough time to finish retro action items.
features. Treat them just like product work — scope and plan the projects, don’t dig it out of the couch cushions. Use SLOs (and human SLOs!) to assert the time you need to build a better system. Being on call gives you the moral authority to demand change.
within a few minutes of investigating, you need better observability. This is not normal or acceptable. Run chaos experiments (at 3pm, not 3am) to make sure you’ve fixed it, and consider running them continuously, forever. Stop over-investing in staging and under-investing in prod. Most bugs will only ever be found in prod.
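A deliberately simple sketch of the 3 pm version; every helper here (broken, checkout_slo_ok) is a hypothetical stand-in for your own fault-injection and SLO-checking tooling:

```python
# Sketch of a daytime chaos experiment. All helpers are hypothetical stand-ins
# for your own fault-injection and SLO-checking tooling.
import contextlib

@contextlib.contextmanager
def broken(dependency: str):
    print(f"[3 pm] breaking {dependency} on purpose")
    try:
        yield
    finally:
        print(f"[3 pm] restoring {dependency}")

def checkout_slo_ok() -> bool:
    # Hypothetical: query your SLO / e2e checks and return whether users are fine.
    return True

def run_experiment() -> None:
    with broken("recommendations-service"):
        if checkout_slo_ok():
            print("graceful degradation works: checkout survives without recommendations")
        else:
            print("users would be in pain -> schedule the resiliency work, don't wait for 3 am")

run_experiment()
```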
or have their time abused, but success is not about “not having incidents”. It’s about how confident people feel being on call, whether we react in a useful manner, and increasing operational quality and awareness. Track things you can do, not things you hope don’t happen. Link: https://www.honeycomb.io/blog/tracking-on-call-health/
performing, and (more importantly) what trajectory they are on. Be careful with incentives here, but some data is necessary. Managers who run their teams into the ground should never be promoted.
novel problems that move the business materially forward. Lower-performing teams spend their time firefighting, waiting on code review, waiting on each other, resolving merge conflicts, reproducing tricky bugs, solving problems they thought were fixed, responding to customer complaints, fixing flaky tests, running deploys by hand, fighting with their infrastructure, fighting with their tools, fighting with each other…endless yak shaving and toil.
an operational pit of doom than it is to dig your way out of one. Dedicated ops teams may be going the way of the dodo bird, but operational skills are in more demand than ever. Don’t under-invest in them — or underpay for them.
your “run tests and deploy” time down to 15 min or less. Invest in autodeploys after every merge. Deploy one engineer’s changeset at a time. Invest in progressive deployment.
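Mechanically, progressive deployment boils down to a loop like the sketch below; the step sizes, the bake time, and the error_rate helper are assumptions standing in for your deploy tooling:

```python
# Sketch of a progressive rollout loop. Step sizes, bake time, and the
# error_rate() helper are assumptions standing in for real deploy tooling.
import time

def error_rate(build_id: str) -> float:
    # Hypothetical: read this from your instrumentation / SLO data.
    return 0.002

def progressive_deploy(build_id: str, baseline_error_rate: float) -> bool:
    for percent in (1, 10, 50, 100):
        print(f"routing {percent}% of traffic to {build_id}")
        time.sleep(1)                      # bake time; minutes in real life
        if error_rate(build_id) > baseline_error_rate * 2:
            print("new build looks worse -> rolling back")
            return False
    print("rolled out to 100%")
    return True

progressive_deploy("build-2024-06-01.3", baseline_error_rate=0.002)
```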
How high-performing is YOUR team?
🔥2 — How long does it take for code to go live?
🔥3 — How many of your deploys fail?
🔥4 — How long does it take to recover from an outage?
🔥5 — How often are you paged outside work hours?
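These questions map onto numbers you can pull straight from your deploy and incident records. A minimal sketch of that bookkeeping, with made-up record shapes and data, might look like this (out-of-hours pages can be counted the same way as in the earlier sketch):

```python
# Sketch of computing the questions above from deploy/incident records.
# Record shapes and the example data are made up for illustration.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deploy:
    merged_at: datetime
    live_at: datetime
    failed: bool

@dataclass
class Outage:
    started_at: datetime
    resolved_at: datetime

deploys = [
    Deploy(datetime(2023, 5, 1, 10), datetime(2023, 5, 1, 10, 12), failed=False),
    Deploy(datetime(2023, 5, 1, 15), datetime(2023, 5, 1, 15, 20), failed=True),
    Deploy(datetime(2023, 5, 2, 9),  datetime(2023, 5, 2, 9, 14),  failed=False),
]
outages = [Outage(datetime(2023, 5, 1, 15, 20), datetime(2023, 5, 1, 15, 50))]

lead_time = median(d.live_at - d.merged_at for d in deploys)
change_failure_rate = sum(d.failed for d in deploys) / len(deploys)
mttr = sum((o.resolved_at - o.started_at for o in outages), timedelta()) / len(outages)

print(f"lead time (median): {lead_time}")                 # 0:14:00
print(f"change failure rate: {change_failure_rate:.0%}")  # 33%
print(f"time to recover (mean): {mttr}")                  # 0:30:00
```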
to do with your knowledge of algorithms and data structures, and much more to do with the sociotechnical system you participate in. sociotechnical (n): “Technology is the sum of ways in which social groups construct the material objects of their civilizations. The things made are socially constructed just as much as technically constructed. The merging of these two things, construction and insight, is sociotechnology” — Wikipedia. Technical leadership should focus intensely on constructing and tightening the feedback loops at the heart of their system.
to operational health is rarely technical knowledge; it’s usually poor prioritization due to lack of hope. Occasionally, it’s shitty management. It’s way easier to work on a high-performing team with auto-deployments and observability than it is to work on systems without these things. If your team can write decent tests, you can do this.
SLOs… Now that we have dramatically fewer unknown-unknowns… Now that we have the instrumentation to swiftly pinpoint any cause… Now that we auto-deploy our changes to production within minutes… Now that night alerts are vanishingly rare, and the team is confident… So, now that we’ve done all that… “I’m still not happy. You said I’d be HAPPY to be on call.”
The goodies: you don’t work on features or the roadmap that week. You work on the system. Whatever has been bothering you, whatever you think is broken… you work on that. Use your judgment. Have fun. This is your 20% time!! And if you were on call this week, you get next Friday off. Automatically. Always.
Meaning. On call can help with these: it helps to clarify and align incentives, it makes users truly, directly happy, and it increases bonding and teaminess. I don’t believe you can truly be a senior engineer unless you’re good at on call. The one that we need to work on adding is autonomy… and not abusing people.
call. I mostly think engineers are like doctors: it’s part of the job. With one big exception. If you are struggling to get your engineers the time they need for systems work instead of just cranking features, you should start paying people a premium every time they are alerted out of hours. Pay them a LOT. Pay enough for finance to complain. Pay them until management is begging you to work on reliability to save them money. If execs don’t care about your people’s time and lives, convert it into something they do care about. ✨Money✨