Data Driven Post Mortems at Datadog - LinuxCon 2016

Ilan Rabinovitch

August 24, 2016

Transcript

  1. $ finger ilan@datadog [datadoghq.com]
     Name: Ilan Rabinovitch
     Role: Director, Technical Community
     Interests:
     * Monitoring and Metrics
     * Large scale web operations
     * FL/OSS Community Events
  2. Datadog Overview
     • SaaS based infrastructure and app monitoring
     • Open Source Agent
     • Time series data (metrics and events)
     • Processing nearly a trillion data points per day
     • Intelligent Alerting
     • We’re hiring! (www.datadoghq.com/careers/)
  3. “THE PROBLEMS WE WORK ON AT DATADOG ARE HARD AND OFTEN DON'T HAVE OBVIOUS, CLEAN-CUT SOLUTIONS, SO IT'S USEFUL TO CULTIVATE YOUR TROUBLESHOOTING SKILLS, NO MATTER WHAT ROLE YOU WORK IN.”
     Internal Datadog Developer Guide
  4. “THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.” - Henry Ford
  5. POSTMORTEM
     “AN ANALYSIS OR DISCUSSION OF AN EVENT HELD SOON AFTER IT HAS OCCURRED, ESPECIALLY IN ORDER TO DETERMINE WHY IT WAS A FAILURE.”
     Oxford English Dictionary
  6. WHAT IS DEVOPS?
     DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
     ▸ Culture
     ▸ Automation
     ▸ Metrics
     ▸ Sharing
  7. OUR FOCUS AREA
     DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
     ▸ Culture
     ▸ Sharing
  8. CULTURE & SHARING RESOURCES
     BLAMELESS POSTMORTEMS
     ▸ Blameless Postmortems by John Allspaw http://bit.ly/etsy-blameless
     ▸ The Human Side of Postmortems by Dave Zwieback http://bit.ly/human-postmortem
  9. COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE
     SO INSTRUMENT ALL THE THINGS!
  10. HUMAN DATA
      DATA COLLECTION: WHAT?
      ▸ Their perspective
      ▸ What they did
      ▸ What they thought
      ▸ Why they thought/did it
  11. JOYENT US-EAST-1 POST-MORTEM 2014
      … we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously …
      Joyent Postmortem http://bit.ly/joyent-post
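
The Joyent quote describes stricter input validation that refuses to reboot every server at once. A minimal Python sketch of that idea follows. It is not Joyent's tooling: the function name, the host names, and the 25% limit are all hypothetical, chosen only to illustrate the kind of guard the quote talks about.

```python
from typing import Iterable, Set


class UnsafeRebootError(ValueError):
    """Raised when a reboot request targets too much of the fleet at once."""


def validate_reboot_request(
    targets: Iterable[str],
    fleet: Set[str],
    control_plane: Set[str],
    max_fraction: float = 0.25,  # hypothetical limit, not from the talk
) -> Set[str]:
    """Return the validated target set, or raise UnsafeRebootError."""
    requested = set(targets)
    # An empty request must never silently mean "everything".
    if not requested:
        raise UnsafeRebootError("no targets given; refusing to default to the whole fleet")
    unknown = requested - fleet
    if unknown:
        raise UnsafeRebootError(f"unknown hosts: {sorted(unknown)}")
    # Never take down the entire control plane in one request.
    if control_plane and control_plane <= requested:
        raise UnsafeRebootError("request would reboot every control plane server")
    # Cap the share of the fleet that a single request may touch.
    if len(requested) > max_fraction * len(fleet):
        raise UnsafeRebootError(
            f"{len(requested)} of {len(fleet)} hosts requested; "
            f"limit is {max_fraction:.0%} per request"
        )
    return requested


# Hypothetical fleet used only for illustration.
FLEET = {"cp1", "cp2", "web1", "web2", "web3", "web4", "db1", "db2"}
CONTROL_PLANE = {"cp1", "cp2"}

approved = validate_reboot_request(["web1"], FLEET, CONTROL_PLANE)
```

The point of the design is that the dangerous request fails loudly at the input boundary, so an operator's typo cannot turn into a fleet-wide reboot.
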
  12. HUMAN DATA
      DATA COLLECTION: WHEN?
      ▸ As soon as possible.
      ▸ Memory drops sharply within 20 minutes
      ▸ Susceptibility to “false memory” increases
  13. HUMAN DATA
      DATA SKEW/CORRUPTION
      ▸ Blame/Fear of punitive action
      ▸ Bias
      ▸ Anchoring
      ▸ Hindsight
      ▸ Outcome
      ▸ Availability
      ▸ Recency
  14. DATADOG POSTMORTEMS
      A FEW NOTES
      ▸ Postmortems emailed company-wide
      ▸ Scheduled recurring postmortem meetings
  15. DATADOG’S POSTMORTEM TEMPLATE (1/5)
      SUMMARY: WHAT HAPPENED?
      ▸ Describe what happened here at a high-level -- think of it as an abstract in a scientific paper.
      ▸ What was the impact on customers?
      ▸ What was the severity of the outage?
      ▸ What components were affected?
      ▸ What ultimately resolved the outage?
  16. DATADOG’S POSTMORTEM TEMPLATE (2/5)
      HOW WAS THE OUTAGE DETECTED?
      ▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
      ▸ Did we have a metric that showed the outage?
      ▸ Was there a monitor on that metric?
      ▸ How long did it take for us to declare an outage?
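
The question "How long did it take for us to declare an outage?" comes down to timestamp arithmetic over the incident timeline. The Python sketch below shows one way to compute it. The dictionary keys and the example timestamps are hypothetical and are not taken from the talk or from any Datadog API.

```python
from datetime import datetime, timedelta
from typing import Dict


def detection_gaps(timeline: Dict[str, datetime]) -> Dict[str, timedelta]:
    """Measure how long alerting and declaration lagged behind the start of impact."""
    started = timeline["impact_started"]
    return {
        "time_to_alert": timeline["monitor_alerted"] - started,
        "time_to_declare": timeline["outage_declared"] - started,
    }


# Illustrative timestamps only; real values would come from metrics,
# monitor history, and chat archives.
example_timeline = {
    "impact_started": datetime(2016, 8, 24, 14, 0),
    "monitor_alerted": datetime(2016, 8, 24, 14, 4),
    "outage_declared": datetime(2016, 8, 24, 14, 11),
}

gaps = detection_gaps(example_timeline)
for name, gap in gaps.items():
    print(f"{name}: {gap.total_seconds() / 60:.0f} min")
```

Tracking these two gaps separately helps distinguish a missing or slow monitor from a slow human decision to declare the outage.
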
  17. DATADOG’S POSTMORTEM TEMPLATE (3/5)
      HOW DID WE RESPOND?
      ▸ Who was the incident owner & who else was involved?
      ▸ Slack archive links and timeline of events!
      ▸ What went well?
      ▸ What didn’t go so well?
  18. DATADOG’S POSTMORTEM TEMPLATE (4/5)
      WHY DID IT HAPPEN?
      ▸ Deep dive into the cause
      ▸ Examples from this incident:
      ▸ http://bit.ly/dd-statuspage
      ▸ http://bit.ly/alq-postmortem
  19. DATADOG’S POSTMORTEM TEMPLATE (5/5)
      HOW DO WE PREVENT IT IN THE FUTURE?
      ▸ Link to GitHub issues and Trello cards
      ▸ Now?
      ▸ Next?
      ▸ Later?
      ▸ Follow up notes
  20. DATADOG’S POSTMORTEM TEMPLATE
      RECAP:
      ▸ What happened (summary)?
      ▸ How did we detect it?
      ▸ How did we respond?
      ▸ Why did it happen (deep dive)?
      ▸ Actionable next steps!
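
The five recap questions can be captured as a small data structure so that an incomplete postmortem is easy to spot before it is sent out. The Python sketch below is only one possible encoding. The section headings come from the recap slide, while the class, attribute names, and plain-text rendering are hypothetical and do not represent Datadog's actual template tooling.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Headings follow the recap slide; attribute names are hypothetical.
SECTIONS: List[Tuple[str, str]] = [
    ("What happened (summary)?", "summary"),
    ("How did we detect it?", "detection"),
    ("How did we respond?", "response"),
    ("Why did it happen (deep dive)?", "cause"),
    ("Actionable next steps", "next_steps"),
]


@dataclass
class Postmortem:
    title: str
    summary: str = ""
    detection: str = ""
    response: str = ""
    cause: str = ""
    next_steps: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the headings of sections that are still empty."""
        return [heading for heading, attr in SECTIONS if not getattr(self, attr)]

    def render(self) -> str:
        """Render the postmortem as plain text, one heading per section."""
        lines = [self.title, ""]
        for heading, attr in SECTIONS:
            value = getattr(self, attr)
            lines.append(heading)
            if isinstance(value, list):
                lines.extend(f"  - {item}" for item in value)
            elif value:
                lines.append(f"  {value}")
            lines.append("")
        return "\n".join(lines).rstrip() + "\n"


# Illustrative draft; the incident details are made up.
draft = Postmortem(title="Example outage", summary="API errors for 20 minutes")
still_missing = draft.missing_sections()
draft.next_steps.append("Add a monitor on the error-rate metric")
```

Here `still_missing` lists the four sections that remain unanswered, which could gate sending the postmortem company-wide.
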
  21. KEEP LEARNING
      MORE RESOURCES
      ▸ The Infinite Hows - John Allspaw http://bit.ly/infinite-hows
      ▸ “Blameless” Postmortems don’t work - J Paul Reed http://bit.ly/blameless-dont-work
      ▸ Monitoring 101 - Alexis Lê-Quôc http://dtdg.co/monitoring-101-data