buy or fork, or at least it’s not just that. I think all of this work was really important to give the movement credibility. But I think we failed to communicate clearly what it was like to live within the system, and what it was like to keep it working. I wanted to address that with this talk.
were growing as an organization. We believed that deploying as often as possible would lead to safety and development velocity, and I think we were vindicated in that. But ramping that up wasn’t immediate. It went like this.
Pretty much the opposite actually. So when we decided we were going to aim for deploying many times a day, there was a lot of pre-existing process that had to be destroyed. I’ll talk about some of that.
from a few dozen folks up to well over a hundred. And it’s nontrivial to go from a handful of people deploying 40 times a day up to a hundred people deploying 40 times per day.
there were challenges or the process broke down entirely. With this talk I wanted to put together a narrative of how we fought through some of the key problems we had.
engineering founders parted ways with the company. And those folks meant well, but they had been gatekeeping production pretty hard. We had not really been able to touch the site, and now we were expected to. There was also no production monitoring to speak of. We suddenly found ourselves out to sea.
tests. Using Selenium, that web driver toolkit Zenefits used to break the law. That seemed sensible as a way to have at least a modicum of safety as we started changing things. It was also nice that this was a proactive thing I could do with literally nobody helping me.
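To make that concrete, here’s roughly the shape of such a test, sketched with Selenium’s Python bindings. The URL, field name, and assertion are placeholders of my own, not anything from the real suite.

```python
# A bare-bones browser test: load a page in a real browser and check
# that one critical flow still works end to end.
from selenium import webdriver
from selenium.webdriver.common.by import By

def test_search_flow():
    driver = webdriver.Firefox()                 # drives an actual browser
    try:
        driver.get("https://www.example.com/")   # placeholder URL
        box = driver.find_element(By.NAME, "q")  # hypothetical search field
        box.send_keys("wooden spoons")
        box.submit()
        assert "results" in driver.page_source.lower()
    finally:
        driver.quit()
```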
is that doing this well is at least as hard as writing a multithreaded program. And that’s a domain where human competence isn’t estimated to be high. The other problem with testing across process boundaries is that failure is a thing that can’t be entirely avoided. So what you tend to wind up with there is a lot of tests that work most of the time. Even if you’re really good and prevent most race conditions, you’re still going to have some defects. Given enough tests, that means that at least one is always failing.
that the whole point of tests is to gain confidence before changing things. And to the extent that there are false negatives in the tests, or the tests gum up the works, they’re doing the opposite of that. Tests are one way to gain some confidence that a change is safe. But that’s all they are. Just one way.
In the abstract, every single line of code you deploy has some probability of breaking the site. So if you deploy a lot of lines of code at once, you’re just going to break the site. And you stand a better chance of inspecting code for correctness the less of it there is.
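Here’s the back-of-the-envelope version of that argument, with a completely made-up per-line failure probability, just to show the shape of the curve:

```python
# If every deployed line independently has a small chance p of breaking
# the site, the odds of a clean deploy fall off quickly with deploy size.
p = 0.001                                 # made-up per-line breakage probability
for lines in (10, 100, 1000, 5000):
    clean = (1 - p) ** lines
    print(f"{lines:>5} lines -> {clean:.1%} chance of a clean deploy")
# 10 lines: ~99%, 100: ~90%, 1000: ~37%, 5000: under 1%
```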
can write an infinite number of tests and only asymptotically approach zero problems in production. So at least when building a website, your priority should be knowing when things are broken and being able to fix them as quickly as possible. Preventing problems is a distant second.
created to support software that’s nothing at all like web software. The Linux kernel has many concurrent supported versions. A website, on the other hand, doesn’t really have a version at all. It has a current state.
may be great for that, but there’s no reason to suspect that things you do there should work as well for building an online application. But the tendency among engineers is to start with Github, and all of the workflow and cultural baggage that surrounds it. Then they bang on stuff from there until something’s in production.
point to the rituals we were performing with revision control. It’s better to conceive of a website as a living organism where all of the versions of it are jumbled together.
development code in the production codebase, inside of an if block that turns it off. And you ship that development code to production as a matter of routine.
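A minimal sketch of the pattern, assuming a flag store that’s just a dict and hypothetical checkout functions; a real setup reads flags from config so they can be flipped without a code deploy:

```python
# Development code lives in production, guarded by a flag that's off.
FEATURE_FLAGS = {
    "new_checkout_flow": False,           # hypothetical flag, off in production
}

def feature_enabled(name):
    return FEATURE_FLAGS.get(name, False)

def old_checkout(cart):
    ...                                   # the battle-tested path production runs

def new_checkout(cart):
    ...                                   # in-progress code, deployed but dormant

def checkout(cart):
    if feature_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return old_checkout(cart)
```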
and ceremony around deploys. But those were both cases of deciding to destroy process as a team. It’s another thing entirely for individuals to yolo their own destruction of process. One major reason that happens is because the deploy process can be too slow.
faster and more dangerous way to do things and people will do that. They’ll replace running docker containers by hand. They’ll hand-edit files on the hosts. I want to stipulate that this doesn’t happen because people are evil, it happens because they’re people and they follow the path of least resistance.
hurry. Maybe you’re experiencing a SQL injection attack, or what have you. You don’t want to be trying to use a different set of fast deployment methods in a crisis.
Uber yolo’d a self-driving car trial in downtown San Francisco. It ended abruptly right after a video surfaced of one of the cars rolling right through a pedestrian crossing on a red light.
in the driver’s seat, but that person didn’t intervene. Uber blamed that person for the incident. But that’s the wrong way to look at it. The automation was capable enough that the human’s attention very understandably lapsed, but not capable enough to replace the human. The human and the car are, together, the system. Things you do to automate the car affect the human.
99% of your deploys are routine. Because in the rare case that something goes wrong, the human part of the system will be ready to react. Automation that only mostly works is often worse than no automation at all.
the moment you’re deploying. You need to know what you’re pushing, if anyone else is trying to push at the same time, whether or not the site is currently on fire, and so on.
in a digestible way, and they can lay out a workflow that comprises multiple steps. They’re also more accessible to people outside of engineering who might also want to deploy things. Which is great, because then you can build a culture where support folks can ship knowledge base changes, or designers can ship CSS changes.
live in the world where shipping code isn’t special, that’s not just a problem for the people doing the deployments. It’s a problem for anyone who has to react to deploys.
every respect. You’d show up on Saturday morning, and spend several days on it. It sucked. But one criticism you couldn’t make of this is that nobody knew it was happening. Brutality isn’t a great ethos, but it is at least an ethos.
a deploy by one team triggers alerts in the channel of some other team, who isn’t aware that a deploy just happened. You lose the obvious connection between the deploy notification and the alert if you do this.
deploying safely and often. But eventually, overall developer velocity hits an upper bound. You can’t deploy faster, but your organization continues to grow.
to make the day longer. Tech employees rise late in my experience, so you can get up at the crack of dawn and ship a bunch of stuff. That’s a bad solution though.
great deal of operational debt. And it comes with a lot of other baggage, which I’ll avoid ranting about. Suffice it to say I think we should try to exhaust other options before we do that.
the changes being made look like this. Just changing one config setting, turning things on or off or doing rampups. These are safe, or at least quickly reversible. We can take these and make a faster deploy lane for them that runs in parallel and skips the tests.
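A sketch of the kind of change I mean, with a hypothetical rampup setting; the whole deploy is the edit to that one number, which is why it can go out through a faster lane:

```python
# A config-only rampup: hash the user id into a stable bucket and
# compare it to a percentage that lives in config.
import hashlib

CONFIG = {"new_search.rampup_percent": 5}     # hypothetical setting; the deploy edits this number

def in_rampup(setting, user_id):
    percent = CONFIG.get(setting, 0)
    # Hashing keeps each user consistently in or out as the percentage ramps.
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    return bucket < percent
```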
are branching in code is code that isn’t executed in production. Conceptually, or even literally, you’re pushing code like this. These are really safe deploys, because the code is just dead weight that isn’t executed at all.
safely. One deploy, with two people making live changes at the same time. This is the sort of thing that will often go fine with no coordination. But when something does break, these people need to be in communication.
write a decent amount of code to solve a problem. We wrote a chatbot to help us coordinate. Someone was nice enough to port it to hubot. You can find it on npm now.
an arbitrarily chosen size. I think the size changed over time but for the purposes of demonstration let’s say we pick three people at a time to deploy together.
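Roughly what the coordination boils down to, as a toy sketch in Python rather than the actual hubot plugin: people queue up, and the next few of them get grouped into one shared deploy.

```python
from collections import deque

TRAIN_SIZE = 3                    # arbitrary; ours changed over time
queue = deque()                   # people waiting to deploy

def join(user):
    queue.append(user)

def next_train():
    riders = [queue.popleft() for _ in range(min(TRAIN_SIZE, len(queue)))]
    if riders:
        print("Now deploying together:", ", ".join(riders))
    return riders

join("asha"); join("ben"); join("carol"); join("dev")
next_train()                      # asha, ben, and carol deploy as one unit; dev waits
```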
like to evolve alongside a continuous delivery effort. I don’t think you should take this as a set of instructions per se. It’s a toolbox you can use, but the details of your situation will differ.
dealt with were social problems. Some of these solutions involved writing code, but the hard part was the human organization: maintaining a sense of community ownership over the state of the whole system.
it first, we tend to start with things that make software easiest to write. Or the specific load of development baggage that we’re most comfortable with. Then we hope that we can bang on some tooling and get working software in the end. There’s no reason to expect that this approach should work, or result in anything good.
like the problem is interesting technically, you might want to stop and ask yourself if you’re in the weeds. The tendency once you’ve programmed yourself into a serious hole is to keep programming. Maybe you should stop trying to program your way out of difficult situations.