Quora released new versions of the site 46 times. This was a normal day for us.” - Quora engineering.quora.com/Continuous-Deployment-at-Quora
“Deployment every 11.6s, 1,079 max in one hour. 10,000 mean number of hosts per deployment, with 30,000 maximum” - Amazon.com youtube.com/watch?v=PW1lhU8n5So
“On the Google Consumer Surveys team, 8 minutes after you commit code it's live in production.” - Google developers.google.com/live/shows/772717729
“10+ deploys per day.” - John Allspaw, 2009 youtube.com/watch?v=LdOe18KhtT4
disconnected from Xbox LIVE and found themselves unable to log back in. … The root cause of this outage was human error.” - Microsoft blogs.msdn.com/b/xblops/archive/2011/10/03/issues-with-xbox-live-earlier-today.aspx
one of our backup systems last Thursday night (03/17). Maintenance was scheduled to resolve the issue over the weekend. On working to resolve the issue, an administrator accidentally deleted the production database.” support.gliffy.com/entries/98911057--Gliffy-Online-System-Outage
Benefits of Toyota One Piece Flow, as conceived by Taiichi Ohno
higher productivity
◦ Frees up floor space
◦ Improves safety
◦ Improves morale
◦ Reduces cost of inventory
too large. That limits a task force to five to seven people, depending on their appetites” - Jeff Bezos medium.com/@benorama/the-evolution-of-software-architecture-bd6ea674c477
Web, who probably don't know what ADO or UML or JPA even stand for, deploy better systems at less cost in less time at lower risk than we see in the Enterprise. - Tim Bray
change fail rate
◦ How long is the delay between a request for a change and a production system operating with that change implemented?
◦ How long does it take to restore the system from abnormal behavior to its normal, agreed way of operation?
◦ How many changes and features are released to production in a fixed period of time?
◦ How often does the system fail or the service get disrupted?
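The four questions above can be turned into numbers once each deployment is recorded with a few timestamps. The following is a minimal sketch, not a standard tool: the `Deployment` record, its field names, and the `delivery_metrics` function are hypothetical, and it assumes a failure begins at the moment the failing change is deployed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one change reaching production."""
    requested_at: datetime                  # when the change was requested
    deployed_at: datetime                   # when it started running in production
    failed: bool = False                    # did this change disrupt the service?
    restored_at: Optional[datetime] = None  # when normal operation resumed


def delivery_metrics(deployments: List[Deployment], period_days: float) -> dict:
    """Answer the four questions for a set of deployments in a fixed period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    # Delay between the request for a change and the change running in production.
    lead_hours = [
        (d.deployed_at - d.requested_at).total_seconds() / 3600 for d in deployments
    ]

    # Time from a failing deployment until normal operation was restored.
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "mean_lead_time_hours": mean(lead_hours),
        "mean_time_to_restore_hours": mean(restore_hours) if restore_hours else None,
        "deploys_per_day": len(deployments) / period_days,
        "change_fail_rate": len(failures) / len(deployments),
    }


# Made-up sample data covering a two-day period.
t0 = datetime(2024, 1, 1, 9, 0)
day = timedelta(days=1)
hour = timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0, t0 + 4 * hour, failed=True, restored_at=t0 + 5 * hour),
    Deployment(t0 + day, t0 + day + 6 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour),
]
metrics = delivery_metrics(sample, period_days=2)
print(metrics)
```

Averages are used here for brevity; a team with a few very slow changes may prefer medians or percentiles so that outliers do not dominate the picture.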
noise ratio. e.g. Google SRE notes & StackExchange Alerts
◦ standard procedures and checklists. e.g. AWS Operational Checklist
◦ practice recovery from system failures. e.g. Netflix downtime & Xbox downtime
◦ practice backup with restore to dev/test. e.g. Netflix Priam
◦ infrastructure as code & auto-healing. e.g. Antifragile Systems
◦ simplicity is prerequisite for reliability. e.g. Forrester Devops & Simple Made Easy