Who do you trust? What do you control? What are your dependencies? Reliability on the Internet is an adrenaline-fueled adventure, but we all want a good night’s sleep and a working service sometimes. Adam Surák takes a closer look at some reliability nightmares and explains how to deal with them, sharing the design lessons he has learned from running servers in almost 40 data centers across 15 regions, achieving close to 100% availability globally and 100% in the vast majority of the regions.
To show how our design and operations decisions affect us, Adam quickly reviews the basics before exploring in detail the SLAs that we commit to every day yet understand only vaguely. Adam then offers an overview of blackbox monitoring tools, from very simple, low-precision tools that send test traffic to very sophisticated, high-precision tools that measure real-user traffic. Although some see cloud solutions as a silver bullet for everything, Adam explains that this is not the case: the cloud has its own issues. Adam concludes with an overview of commonly underestimated dependencies in our software, infrastructure, and people.