The days of trying to prevent failures are gone. In today’s high-volume, cloud-based systems, anything that can go wrong will eventually go wrong. It is far better to spend our time engineering fault tolerance than pursuing the impossible goal of fault prevention. Not only is reliability engineering one of the highest-paying jobs in software engineering today, it is also a job full of unique challenges that demand creative thinking and problem solving.
This talk covers the aspects of reliability engineering that have become critically important as our world has grown increasingly dependent on software systems.