Detective Hat: Investigating Production Issues

Jenny Bramble
November 15, 2023

It was a dark and stormy night, and our production system was out cold. It was our job to find out what had gone wrong and set it right. But how? Our developer instincts weren't helping. Our tester's intuition saw risk everywhere. We needed a third way: the hard-nosed, boots-on-the-ground, skeptical mindset of the detective! Jenny Bramble, Director of Quality Engineering, and Adrian P. Dunston, Director of Backend Engineering at Papa, will give you a powerful mental model for tracking down production issues in your codebase and a set of tools to help you along the way. Using noir and TV-detective stories as a backdrop, you'll learn how to: take ownership of a production behavior, hit the streets to find the relevant facts and metrics, confront any hard truths you may find, bring your tools to bear, and follow through to make production a better place for everyone. With the detective mindset, grit, and determination, you can go beyond sleuthing out your systems' behaviors. You'll build better paths to observability and better processes to prevent and recover from production incidents. You might even help your colleagues find detective hats of their own.

Transcript

  1. Hey, dev. We're getting reports of timeouts in the logs. Looks like we got a dead connection.
  2. We seem to be getting a lot of dead connections recently. Seems like somebody should look into it.
  3. Yeah, maybe somebody should. But I got my own problems. Besides, I don't do that sort of thing anymore.
  4. Seems to me like, if a person lets too many dead connections go, eventually something inside them dies too.
  5. I'm a developer. I've got a sweet little epic going. Lots of little tickets to take care of. But darnit she's right. It's a dirty world out there.
  6. Hi, I'm Jenny!
    • Director of Quality Engineering at Papa
    • Test-based human for most of my career
    • Human interfacing is my favorite thing
    • Two cats: Dante and Dax
    • My pronouns are she/her
  8. The purpose of this presentation is to help develop a mental model for tracking down production issues in your codebase.
  9. The Detective Hat
    Prologue - Mindset
    1. Taking the Case
    2. Hitting the Streets
    3. Chase Scene!
    4. Putting them away…
  10. We have no idea what police-work is like. But we've got a soft spot for noir stories and detective shows.
  11. Production issues are a game you can win. The truth is out there. You can find it.
  12. This is also another opportunity to do what Andy Knight was saying with "Shift Right." It's a chance to learn your system better and deepen your mental model.
  13. Okay, my confession time. I love telling about old scars. They're shared history; even if you weren't there. You can also read the future in bumps, scars, and bruises.
  14. Testing is expansive, checking for edge cases and fanning out. Production bughunts are contractive, drawing a box around it and narrowing down.
  15. Someone comes in asking for help. There's something wrong, but they don't have a lot to go on, and there's little reward to be had from tracking it down.
  16. Naturally your first instinct is to tell them to buzz off. This brings us to the first rule of the detective hat.
  17. Believe them on the first report. Maybe it's not like they think it is. Maybe the problem isn't really in your area. But there IS a problem.
  18. Follow up on the impact and the reach of the problem. Maybe it doesn't need to be addressed. But remember, the decision isn't "is this really impacting people?" The decision is, "can I live with that impact?"
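One way to put numbers on impact and reach is to summarize parsed request logs. The sketch below is only an illustration: the `RequestRecord` shape, the `/visits` endpoint, and the sample data are all hypothetical, and a real system would read these fields from its own logs or event history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One request as it might appear in parsed logs (hypothetical shape)."""
    user_id: str
    endpoint: str
    timed_out: bool


def impact_and_reach(records):
    """Summarize how many requests failed (impact) and how many users saw it (reach)."""
    total = len(records)
    failures = [r for r in records if r.timed_out]
    affected_users = {r.user_id for r in failures}
    all_users = {r.user_id for r in records}
    return {
        "failed_requests": len(failures),
        "failure_rate": len(failures) / total if total else 0.0,
        "affected_users": len(affected_users),
        "user_reach": len(affected_users) / len(all_users) if all_users else 0.0,
    }


# Made-up sample data, standing in for a real log query.
sample = [
    RequestRecord("u1", "/visits", False),
    RequestRecord("u1", "/visits", True),
    RequestRecord("u2", "/visits", False),
    RequestRecord("u3", "/visits", True),
    RequestRecord("u4", "/visits", False),
]
summary = impact_and_reach(sample)
print(summary)
```

The numbers don't answer "can I live with that impact?" for you, but they make it a decision about facts instead of a guess.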
  19. What if it's a false positive? Well then, the problem is that your system is sending false positives.
  20. If they have a problem, and you tell them they don't, well now they have 2 problems.
  21. Now that you've taken the case, it's time to take a deep breath. They're not going to give you much to go on. There aren't footprints leading to the killer, and there's no DNA evidence to follow up on. This is the crux of why the detective hat is different.
  22. There's no design, no well-crafted tickets. It's not as simple as upper or lower bound. It's up to you, and you'd rather this wasn't happening at all. Assume no one is going to help you.
  23. So we took the case. We took a deep breath. Let's conduct some interviews.
  24. Somebody reported this problem. Ask them questions. When? How often? What does it look like to them? How can we make it happen again? Who else should I talk to?
  25. If there are no clear clues there, ask around the neighborhood. Rule 1 is believe there is a problem, so start there. If this was real, how would I prove it? Who would see it? Be creative. Be bold. Knock on doors. Check the data, the logs, the event history.
  26. Write down everything they tell you. Don't believe a word of it. We always believe that there is a problem. We never believe anything else about it until we can prove it. On X-Files they say, Trust No One.
  27. Write down everything they tell you. Don't believe a word of it. Verify the important facts. Hard data, that's what we want. Friday from Dragnet doesn't want to know how people feel about something. "Just the facts, ma'am." Find them.
  28. Collect, Doubt, and Rule-it-out: Write down everything they tell you. Don't believe a word of it. Verify the important facts. We'll be talking more about this in the next section.
  29. Detective's Notebook: Keep a record of the issue. Start it as soon as you start tracking. This is your incident narrative or "detective's notebook."
  30. Start with the problem as your users are seeing it. They reported something; write down specifically what that was. Also note the reach. This can evolve over time.
    • What's the problem?
  31. Here's where that collect, doubt, and rule it out comes in. You'll come up with a bunch of things that MIGHT be causing this. Other people will come assuming they know what it is. Consider it all "potential" and put it in your notes.
    • What's the problem?
    • Potential causes
  32. Similarly, there may be co-occurring oddities. Things that don't seem like a cause, but do seem related. This may help with triangulation later. Put it in the notes.
    • What's the problem?
    • Potential causes
    • Potential accessories (after-the-fact)
  33. Every fact is treasure, and you don't know what it will be worth. Write them down as you go. Each step you take, each fact you find, each unanswered question.
    • What's the problem?
    • Potential causes
    • Potential accessories (after-the-fact)
    • Unanswered questions
  34. At the bottom of your notebook, put in the things that aren't relevant, but will be later. When tracking one bug, we often find others.
    • Tomorrow's problems
      ◦ unrelated bugs
  35. Maybe you can see the logs and the data but there's some crucial bit of observability you're lacking. Write it down and ask your team to make it available for next time.
    • Tomorrow's problems
      ◦ unrelated bugs
      ◦ blind spots
  36. "I'm not sure what the problem is yet, but this definitely wouldn't have happened if we'd only…" Sure. Good. Write it down, and head off that conversation. That's tomorrow's problem.
    • Tomorrow's problems
      ◦ unrelated bugs
      ◦ blind spots
      ◦ preventative steps
  37. Here's your detective's notebook
    • What's the problem?
    • Potential causes
    • Potential accessories (after-the-fact)
    • Unanswered questions
    • Tomorrow's problems
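The notebook outline above could be kept as a plain document, but as a rough sketch it also maps onto a small data structure. The class name `DetectiveNotebook`, its field names, the `reach` default, and the sample entries are assumptions made for illustration; only the section headings come from the slides.

```python
from dataclasses import dataclass, field


def _tomorrow_sections():
    # The three sub-sections listed under "Tomorrow's problems".
    return {"unrelated bugs": [], "blind spots": [], "preventative steps": []}


@dataclass
class DetectiveNotebook:
    """Incident narrative with the sections from the outline (hypothetical class)."""
    problem: str
    reach: str = "unknown"
    potential_causes: list = field(default_factory=list)
    potential_accessories: list = field(default_factory=list)
    unanswered_questions: list = field(default_factory=list)
    tomorrows_problems: dict = field(default_factory=_tomorrow_sections)

    def render(self):
        """Return the notebook as plain text, one section after another."""
        lines = ["What's the problem?", f"  {self.problem} (reach: {self.reach})"]
        sections = [
            ("Potential causes", self.potential_causes),
            ("Potential accessories (after-the-fact)", self.potential_accessories),
            ("Unanswered questions", self.unanswered_questions),
        ]
        for title, entries in sections:
            lines.append(title)
            lines.extend(f"  - {entry}" for entry in entries)
        lines.append("Tomorrow's problems")
        for title, entries in self.tomorrows_problems.items():
            lines.append(f"  {title}")
            lines.extend(f"    - {entry}" for entry in entries)
        return "\n".join(lines)


# Made-up entries, loosely based on the dead-connection story in the opening slides.
notebook = DetectiveNotebook(problem="Timeouts in the logs from dead connections")
notebook.potential_causes.append("Connection pool hands out stale connections (unverified)")
notebook.unanswered_questions.append("When did the timeouts start?")
notebook.tomorrows_problems["blind spots"].append("No metric for connection age")
print(notebook.render())
```

Keeping "potential" in the field names is deliberate: nothing in the notebook counts as a cause until it has been verified.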
  38. Learn about the users' situation and motivations: Hercule Poirot solves a mystery by understanding the psychology of the people he's investigating. Very often a production issue is a misunderstanding between the user's expectation and what we THINK they're expecting. Ask "what does this mean to you?"
  39. Use a made-up example to illustrate where the problem might be. Go back and talk with your witnesses. Provide possible examples. What if it was like this? How would we know?
  40. Surveillance: Build up your logging and observability. You know where it's failing, but not why. Logging is cheap. Be creative and put some thought into how it would be most effective.
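As a minimal sketch of that kind of surveillance, the wrapper below logs one JSON line per call using Python's standard `logging` and `json` modules, so the fields can be searched later. The logger name, the event names, the `call_with_surveillance` helper, and the `flaky_fetch` stand-in are all hypothetical; the point is only that each log line carries enough context to explain a failure.

```python
import json
import logging
import time

logger = logging.getLogger("connection_audit")  # hypothetical logger name


def log_event(event, **context):
    """Emit one JSON line so the fields can be searched and aggregated later."""
    line = json.dumps({"event": event, **context}, sort_keys=True)
    logger.info(line)
    return line


def call_with_surveillance(operation, *, name, timeout_s):
    """Run `operation` and record how it ended and how long it took."""
    started = time.monotonic()
    try:
        result = operation()
    except TimeoutError as exc:
        elapsed = round(time.monotonic() - started, 3)
        log_event("call_failed", name=name, timeout_s=timeout_s,
                  elapsed_s=elapsed, error=repr(exc))
        raise  # surveillance only observes; it does not swallow the failure
    log_event("call_ok", name=name, elapsed_s=round(time.monotonic() - started, 3))
    return result


def flaky_fetch():
    # Stand-in for a real network call that hits a dead connection.
    raise TimeoutError("no response from upstream")


logging.basicConfig(level=logging.INFO)
try:
    call_with_surveillance(flaky_fetch, name="fetch_profile", timeout_s=2.0)
except TimeoutError:
    pass  # the failure is now on record with its context
```

Recording the configured timeout next to the elapsed time is one example of thinking ahead about what would be most useful: it shows whether a call failed at the limit or well before it.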
  41. The stakeout: Make a change, throw the switch, and watch the metrics. It's your codebase. If you have a theory about what's going on, why not act on that theory? Feature flags are your friends here. Design an experiment. Even if you don't find out where the bug is, you'll at least find out where it isn't.
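A stakeout like this can be sketched as a flag-guarded change plus a few counters to watch. Everything named here is an assumption for illustration: the `retry_dead_connections` flag, the in-memory `FLAGS` dictionary and `metrics` counter, and the theory being tested (that one retry clears a dead connection). A real setup would use your own flag service and metrics pipeline.

```python
from collections import Counter

FLAGS = {"retry_dead_connections": False}  # hypothetical in-memory flag store
metrics = Counter()  # stand-in for a real metrics client


def fetch(connect, flags=FLAGS):
    """Call `connect`; when the flag is on, retry once after a dropped connection."""
    attempts = 2 if flags["retry_dead_connections"] else 1
    for attempt in range(1, attempts + 1):
        try:
            result = connect()
        except ConnectionError:
            metrics["connection_errors"] += 1
            if attempt == attempts:
                metrics["requests_failed"] += 1
                raise
        else:
            metrics["requests_ok"] += 1
            return result


def make_flaky_connect():
    """Build a fake connection that is dead on first use and fine afterwards."""
    calls = {"n": 0}

    def connect():
        calls["n"] += 1
        if calls["n"] == 1:
            raise ConnectionError("dead connection")
        return "ok"

    return connect


def run_experiment(flag_on):
    """Run one request with the flag on or off and return the resulting counters."""
    metrics.clear()
    try:
        fetch(make_flaky_connect(), {"retry_dead_connections": flag_on})
    except ConnectionError:
        pass
    return dict(metrics)


print("flag off:", run_experiment(False))
print("flag on: ", run_experiment(True))
```

If the failure count does not move when the flag is on, the theory is ruled out, which is still progress.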
  42. Go back and ask the incisive follow-up question: "Oh and ahh, one more thing." Go back and re-interview the witness. Once you have verified facts in hand, the story they gave before will have new perspective. Listen again, and this time ask the follow-up questions. Lots of solutions to bugs start with "Oh, and one more thing…"
  43. In the noir stories, there's often a moment, when the detective is warned off the case. "I'm telling you, Dick. Don't follow through on this one…"
  44. Nope, I checked everything. You can't stop thinking you know what's right. But you can start adding "unless there's something I don't know about." ...unless there's something I don't know about.
  45. The detective usually gets a little banged up during the chase. Be okay with being wrong. This is where those scars come from. This is what builds seniors.
  46. Communicating Status: In a slow-burn issue, you want to report where you're at once or twice a day. In a major production incident, this is every half hour, whether you have something new to report or not.
  47. Sherlock Holmes has a brother Mycroft. "He has no ambition and no energy. He will not even go out of his way to verify his own solutions, and would rather be considered wrong than take the trouble to prove himself right." Don't be Mycroft.
  48. Handing it off to the D.A.: Don't stop until you can tell the whole story with supported facts.
  49. It's no use fixing a production issue if we don't learn from it. A production issue costs money. We paid for the thing. We may as well get our money's worth.
  50. Sometimes this means holding a post-mortem. Sometimes it means writing documentation or adding process or guardrails to ensure this doesn't happen again.
  51. Also review your blind-spots. Where were those holes in observability? What are the tools you WISH you had while you were in the thick of it?
  52. The Detective Hat
    Prologue - Mindset
    1. Taking the Case
    2. Hitting the Streets
    3. Chase Scene!
    4. Putting them away…
  53. There is a certain joy in having been there, done that, and being too old for this.
  54. Because all it takes is that one case your folks can't solve, and you're back in the game!
  55. Good work, detective. Our users will sleep better at night knowing their connections are safe.
  56. Thanks, I appreciate your support. There's just one thing where you're wrong. I quit the detective game. I'm a regular dev with regular dev problems.
  58. The answer to "How would we know if this works in prod?" is VERY OFTEN "We wouldn't." and "It doesn't."
  59. And the answer to "Why didn't you tell me?!" is often "We did." and "You weren't ready to hear it."
  60. Don't trust your devs. Don't trust your unit tests. Don't trust your ability to run something and say "there's no problem here." Prove it. Verify in prod. Check prod data. Keep asking, "What if I'm wrong? How would I know?" We are wrong all the time.
  61. Nobody expects code you tested in one environment to run the same in the world it's deployed to. And if that's your expectation, you're only going to get hurt.
  62. But take heart! Every production bug is a failure in process. Whether designer, reviewer, tester, or operations expert, we're all here to make quality software. And if the system doesn't work, we're all responsible. This isn't about you.
  63. We've been on the case a while now. We've asked questions, we've looked at logs, we've accepted hard truths. Now we're starting to get the picture, narrowing down suspects, closing in on the culprit.
  64. There are generations of techniques we can bring to bear to flush them out from here.