Liz Sander - Lowering the Stakes of Failure with Pre-mortems and Post-mortems

Failure can be scary. There are real costs to a company and its users when software crashes, models are inaccurate, or systems go down. The emotional stakes feel high: no one wants to be responsible for a failure. We can lower the stakes by creating spaces to learn from failures and minimize their impact. This talk introduces two ways to address failure: blameless post-mortems, to learn from an incident, and pre-mortems, to identify modes of failure up front.

https://us.pycon.org/2019/schedule/presentation/214/

PyCon 2019

May 05, 2019

Transcript

  1. Liz Sander Senior Data Scientist, Civis Analytics GitHub: elsander @sander_liz

    www.lizsander.com Lowering the Stakes of Failure with Pre-mortems and Post-mortems
  2. What does failure look like? It depends on your job!

    • System downtime • Security vulnerability • Shipping a critical bug • Model is very wrong or unfair • Net loss on a consulting engagement • Missing a critical deadline
  3. Failure isn’t (just) about you. Mistakes happen within a context!

    Team • Time pressures • Incentives • Norms • Training • Expertise
  4. Failure isn’t (just) about you. Mistakes happen within a context!

    Team • Time pressures • Incentives • Norms • Training • Expertise
    Process • Testing (automated or manual) • Documentation • Time/issue tracking • Code/methods review
  5. Individuals will make mistakes. We need to think as teams to

    establish systems to catch and address them.
  6. What’s a post-mortem? • Process for documenting incidents, identifying root

    cause(s) of the incident, and determining action items to prevent/mitigate impact of future incidents • Post-mortems aren’t just for site reliability engineers! • Core process: meeting to discuss the incident • Core deliverable: post-mortem document and action items
  7. Why a “blameless” post-mortem? • Encourage people to report incidents

    and talk about them! • Focus on understanding and improving, rather than assigning blame • “Accountability, not responsibility”
  8. OK, but what if one person is directly responsible? This

    really isn’t true most of the time. “A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had.” - Site Reliability Engineering: How Google Runs Production Systems It’s great to reflect on how you individually can improve. But that’s a personal development issue, and not the point of a post-mortem. If someone has performance issues, their manager should address them, but not as part of a post-mortem.
  9. Before the post-mortem: who and what to bring? • Invite

    people who were directly involved • For major incidents, affected stakeholders Me Co-maintainer Facilitator IT Client Success PM
  10. Before the post-mortem Fill in: • Incident period (including initial

    trigger and resolution) • Status (usually resolved, but mention ongoing issues if any) • Summary • Impact from the perspective of users • Trigger (specific event(s) that caused the incident) • Detection • Resolution (what actions were taken, including unsuccessful ones) • Action items (items to be addressed)
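    The checklist above can be treated as a small record to fill in before the meeting. The following is only a minimal sketch, not a format from the talk: the class name `PostMortemDraft`, the helper `missing_sections`, and the sample values are all hypothetical, chosen to mirror the fields on the slide.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PostMortemDraft:
    """Hypothetical container mirroring the fields listed on the slide."""
    incident_period: str = ""   # initial trigger through resolution
    status: str = ""            # usually "resolved"; note ongoing issues
    summary: str = ""
    impact: str = ""            # from the perspective of users
    trigger: str = ""           # specific event(s) that caused the incident
    detection: str = ""
    resolution: str = ""        # actions taken, including unsuccessful ones
    action_items: List[str] = field(default_factory=list)


def missing_sections(draft: PostMortemDraft) -> List[str]:
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(draft) if not getattr(draft, f.name)]


# Illustrative values only; a real draft would be filled in by the team.
draft = PostMortemDraft(
    incident_period="12/5/17, 11AM to 6PM",
    status="Resolved",
    trigger="Tagged 2.0.2 release",
)
print(missing_sections(draft))
```

    Listing the empty sections before the meeting is one way to see what still needs to be written up; a shared document template would serve the same purpose.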
  11. During the post-mortem • Facilitator (probably not the person most

    directly involved) • Read through timeline

    Time          | Event
    12/5/17, 11AM | Tagged 2.0.2 release, Liz discovered bug
    12/5/17, 11-1 | Liz worked with IT to try and revert the docker images
    12/5/17, 1PM  | Liz brought co-maintainer in to start debugging
    12/5/17, 6PM  | Liz and co-maintainer tagged 2.0.3 release, fixing the bug
  12. During the post-mortem • Agree on trigger, impact, and root

    causes • Trigger: what is the immediate cause? • Impact: what was affected • Root Causes: What are the underlying systems that resulted in the problem?
  13. During the post-mortem • What went well? • What parts

    of our process should we keep/replicate elsewhere? • What went badly? • Identify areas that need attention • Where did we get lucky? • Identify areas that didn’t break this time, but need attention to avoid future problems
  14. During the post-mortem • Agree on action items and owners

    • Updates to release checklist • Instructions for running tests in an environment that exactly matches prod • Never make a release without two maintainers available • Lessons Learned • Test in the prod environment before release • Triage a critically buggy release by cutting a new version that reverts to the latest working version
  15. What happens next? • Follow up on action items •

    Make the document available to the company • Keep postmortems together for future reference/learnings • Today’s “action items” may be tomorrow’s “what went well”s!
  16. Post-mortems are a great way to iteratively improve and learn

    from past failure. What if you’re starting a new project and want to avoid pitfalls in the first place?
  17. My First Pre-mortem • Flask app for exploring “audiences” within

    customer base • Big project with lots of moving parts • Cross-functional team • Data & models • Different client needs • Hard deadlines and high uncertainty
  18. Pre-mortem Structure: Brainstorm Our project has failed. What happened?

    • Security breach • ETL problems • It’s too slow • It’s hard to deploy • It doesn’t actually solve the user problem • No one wants to buy it • Users don’t understand how to use the tool • It takes way too long to build • Models are bad
  19. Pre-mortem Structure: Organize Our project has failed. What happened?

    Risk Category • Performance • Security • Timeline • Feature Gap • Non-Use
  20. Pre-mortem Structure: Estimate importance & Discuss Our project has failed.

    What happened?

    Risk Category | Probability | Impact | P*I
    Performance   | 2           | 1.5    | 3
    Security      | 1           | 3      | 3
    Timeline      | 2.5         | 1.5    | 3.75
    Feature Gap   | 2.5         | 2.5    | 6.25
    Non-Use       | 1.5         | 3      | 4.5
  21. Pre-mortem Structure: Estimate importance & Discuss Our project has failed.

    What happened? What could we have done to avoid or mitigate the failure?

    Risk Category | Probability | Impact | P*I
    Performance   | 2           | 1.5    | 3
    Security      | 1           | 3      | 3
    Timeline      | 2.5         | 1.5    | 3.75
    Feature Gap   | 2.5         | 2.5    | 6.25
    Non-Use       | 1.5         | 3      | 4.5
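    The P*I column is the product of each category's probability and impact scores, and sorting by it shows which risks to discuss first. A minimal sketch of that arithmetic follows, using the numbers from the slide's table. The function name `rank_risks` is hypothetical, and the slide does not state the scoring scale, so the scores are taken as given.

```python
# Scores copied from the slide's table: category -> (probability, impact).
risks = {
    "Performance": (2.0, 1.5),
    "Security": (1.0, 3.0),
    "Timeline": (2.5, 1.5),
    "Feature Gap": (2.5, 2.5),
    "Non-Use": (1.5, 3.0),
}


def rank_risks(scores):
    """Return (category, probability * impact) pairs, highest score first.

    Ties keep their original order because Python's sort is stable.
    """
    scored = [(name, p * i) for name, (p, i) in scores.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_risks(risks)
for name, score in ranked:
    print(f"{name:12} {score:.2f}")
```

    With these numbers, Feature Gap (6.25) and Non-Use (4.5) come out on top, which suggests spending most of the discussion on whether the tool solves the user problem and whether anyone will adopt it.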
  22. Pre-mortem Structure: After the meeting • Send out notes •

    Check in on risks and action items regularly • Use your notes in retrospectives and post-mortems
  23. Why bother? • Team members (especially non-managers) can be reluctant

    to bring up concerns • Turns those concerns into a valuable asset • Reveals domain-specific issues to the whole team • Important to bring in both technical and non-technical stakeholders • Reflect on the project and processes before something fails • Helps get everyone on board
  24. Conclusion We can only learn from failure by bringing it

    into the open. But to do that, we need to lower the emotional stakes, both of failing and talking about failure. Pre-mortems and post-mortems are tools to do this, both before a project and after an incident. The most important thing is to focus on systems and processes, rather than blaming individuals.
  25. Resources Slides will be posted at www.lizsander.com

    • Site Reliability Engineering: How Google Runs Production Systems (especially ch. 15 on Postmortem Culture) - https://landing.google.com/sre/sre-book/chapters/postmortem-culture/ • PagerDuty’s Post-mortem process (lots of links to example post-mortems) - https://response.pagerduty.com/after/post_mortem_process/ • “The Pre-Mortem: A Simple Technique to Save Any Project from Failure” - https://www.riskology.co/pre-mortem-technique/ • Atlassian “Team Playbook” on pre-mortems - https://www.atlassian.com/team-playbook/plays/pre-mortem
  26. What if I don’t have a team? • You can

    still do pre-mortems and post-mortems • Do them on your own, and bring in other stakeholders for high-priority issues • Increased number/severity of incidents is a risk of single-person teams! It’s a tough situation to be in.
  27. How do I bring these to my workplace? • Talk

    to your team! • A department/team meeting is a good place • Buy-in from leads is really important • These strategies are fundamentally about evaluating failure points in systems, not maintaining server uptime