
Creating Awesome Change in SmartNews! En

Presentation deck for DevOpsDaysTokyo 2025, English version

Ikuo Suyama

April 14, 2025

Transcript

  1. Who am I? Ikuo Suyama • Staff Engineer • Ads

    Backend Expert • Nov. 2020~ SmartNews, Inc. • Interests: Fishing, Camping, Gunpla, Anime
  2. 4 Let me ask you to think for a moment...

    You show up to work in the morning, and your boss says, “Alright, starting today, go reduce incidents.” …What would you start with? That’s the kind of story I’ll be sharing today.
  3. 5 What I WILL talk about today 1. Lessons and

    insights from hands-on incident response 2. How I analyzed incidents and turned them into action 3. How we pushed a unified process across the org Disclaimer 1: Super context-dependent, N=1 case! My story from the trenches of a special task force.
  4. 6 What I won’t (can’t) be talking about 1. Dev/Ops collaboration …

    (Assumes Dev handles ops + incidents) 2. Applying “Best Practices” of SRE/DevOps … All about what I actually learned in the field. Disclaimer 2: I’m not a pro SRE or DevOps guru!
  5. Phase 1: Assemble! Task Force “ACT”! Phase 2: Get our

    hands dirty! Phase 3: Halving incidents!? Phase 4: What remains, and what’s next Agenda
  6. 9 1-1. The Beginning It all started back in September...

    Too many incidents! We’re going to cut them in half. Let’s build a task force! The Awesome Change Team— “ACT”! CTO
  7. 10 …Could it be because you told us to ship

    a massive number of changes? ME: 1-1. The Beginning
  8. 11 …Could it be because you told us to ship

    a massive number of changes? ME: 1-1. The Beginning Hold up!
  9. 12 • CTO: Are incidents really happening that often? •

    How do we even define “a lot” of incidents? • Ikuo: Are we actually making that many changes? • Are changes even the root cause of these incidents? • And what kind of changes are we talking about? At this point—it was all just gut feeling. Hold up! (Though I’ve learned to respect a senior engineer’s nose for trouble.) 1-1. The Beginning
  10. 13 1-2. Assemble the Strongest Team With a six-month mission,

    Assembling “the Strongest Team” —at top priority! Top-down advantage: This project came straight from the top tech leadership
  11. 14 Ads News Ranking Push Notification Core System (Infra) Mobile

    SmartView (Article) 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
  12. 15 Ads News Ranking Push Notification Core System (Infra) Mobile

    SmartView (Article) Ads Ikuo! News & Push D! Ranking R! CoreSystem T! Mobile M! SmartView T! VPoE K! * Let me call myself an “all-star” just for the sake of this story 🙏 (Manager) CTO Report To 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
  13. 16 Ads News Ranking Push Notification Core System (Infra) Mobile

    SmartView (Article) Ads Ikuo! News & Push D! Ranking R! CoreSystem T! Mobile M! SmartView T! VPoE K! (Manager) CTO Report To SRE wizard * Let me call myself an “all-star” just for the sake of this story 🙏 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
  14. 17 “We had just six months to succeed” —That pressure

    was real. Pulling aces from every team showed how serious the company was about this. At the same time... we had no excuses. The downside of top-down 1-2. Assemble the Strongest Team
  15. 18 1-3. Guiding the Team Taking on a messy, unsolved

    challenge • “Reduce incidents.” Sounds simple—turns out, it's massive. • Where do we even start? • What’s the real problem? What actually helps? • And... are there even that many incidents?
  16. 19 Set a clear goal • What does “Awesome Change”

    really mean? • Reduce critical incidents • Install SRE best practices into the org • Key KPIs to improve: • Mean Time Between Failures (MTBF) / Change Failure Rate (CFR) = # of incidents • Mean Time to Recover (MTTR) = recovery time The “why are we here?” got crystal clear Thanks to our awesome VPoE 1-3. Guiding the Team
  17. 20 Set clear priorities • P0: Support incident handling •

    P1: Crush unresolved critical action items • P2: Prevent incidents by fixing root causes What we need to do right now is clear! 1-3. Guiding the Team
  18. 22 2-1. P0: Supporting Incident Handling Get our Hands Dirty:

    Jump into every incident! • Whenever something went down, someone's PagerDuty in ACT went off. • Eventually, we invited the whole ACT squad to every active incident. • If it was in one’s home domain—they’d help fight the fire. • If not, they’d step in for status updates, bring in the right people, or handle comms with the business side. Brutal!!
  19. 24 It’s an anti-pattern… but it wasn’t all bad People

    started thinking: Incident = ACT. And we earned a lot of trust. ACT shows up when there’s trouble. ACT’s got your back during incidents. ACT’s getting stuff done! 2-1. P0: Supporting Incident Handling
  20. 25 2-2. P1. Crush unresolved critical action items The forgotten

    action items There were definitely ones that just got… forgotten • We had a culture of writing incident reports— • and even listing action items for prevention. • Awesome! • But those items weren’t being tracked. • No assignees. No due dates. No status. • WHAT?? • And the report format was different for each division… • Sometimes even different per person.
  21. 26 List out all the forgotten ones • Ideally, auto-generate

    Jira tickets and track them. • Maybe even send reminders. • But… each report had a totally different format. • Now what? Help me, ChatGPT… That’s not happening. 2-2. P1. Crush unresolved critical action items
  22. 27 Get our Hands Dirty: organize the data by hand Lesson

    #1: Always store data in machine-readable format!! Heck yeah! We manually moved a year’s worth of incident action items into a Notion database *Once it’s a database, you can pull it via API. Easy mode. Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission. 2-2. P1. Crush unresolved critical action items
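As an aside on “pull it via API”: here is a minimal sketch of querying a Notion database of incident action items for anything still open. The database ID, the “Status” property, and the “Done” value are hypothetical placeholders, not our actual schema.

```python
# Minimal sketch: pull open incident action items from a Notion database.
# NOTION_TOKEN, the database ID, and the "Status"/"Done" names are
# hypothetical placeholders for illustration.
import os
import requests

NOTION_TOKEN = os.environ["NOTION_TOKEN"]
DATABASE_ID = os.environ["INCIDENT_AI_DATABASE_ID"]

def fetch_open_action_items():
    """Query the action-item database for anything not marked Done."""
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID}/query"
    headers = {
        "Authorization": f"Bearer {NOTION_TOKEN}",
        "Notion-Version": "2022-06-28",
        "Content-Type": "application/json",
    }
    payload = {
        "filter": {"property": "Status", "select": {"does_not_equal": "Done"}}
    }
    items, cursor = [], None
    while True:
        if cursor:
            payload["start_cursor"] = cursor
        resp = requests.post(url, headers=headers, json=payload, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        items.extend(data["results"])
        if not data.get("has_more"):
            return items
        cursor = data["next_cursor"]
```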
  23. 28 Get our Hands Dirty: chase down everything still unfinished This

    isn’t done?! Crush it!! Lesson #3: People aren’t ignoring the work—they’re just too busy. Getting our hands dirty! 2-2. P1. Crush unresolved critical action items
  24. 29 2-3. P2: Root Fixes/ Rolling Out a Unified Incident

    Response Process • Each division had its own way of handling incidents. • We’d tried to build a company-wide protocol before— • the “Incident Response Framework (IRF)”. • But... it never really caught on. • It only reflected the needs of one division. Why didn’t we already have a unified company-wide process?
  25. 30 • The IRF wasn’t bad as a process— •

    So we rebuilt it as a company-wide framework, • backed by domain knowledge from our all-star team. • And it had to be lightweight. • We also pulled ideas from public frameworks • e.g. PagerDuty Incident Response How did we build a company-wide process and framework? Because our team covered every domain, we could actually make it happen. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  26. 31 IRF 2.0 Contents 1. Role, Playbook 2. Severity Definition

    3. Workflow 4. Communication Guideline 5. Incident Report Template, Postmortem Let me walk you through the key parts Please check the slide later for the details! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  27. 32 IRF 2.0: Role, Playbook • On-Call Engineer • The

    engineer on call. Triages alerts and escalates to the IC if necessary, initiating the IRF (declaring the incident). • Incident Commander(IC) • Leads the incident response. Brings in necessary people and organizes information. May also act as the CL (Communication Lead). • Usually a Tech Lead or Engineering Manager. • Their responsibility is not to directly fix the issue, but to organize and make decisions. • Responder • Handles the actual work—such as rollbacks, config changes, etc. • Communication Lead(CL) • Handles communication with external stakeholders (i.e., non-engineers). Key point: Separate responsibilities between IC and Responder 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  28. 33 IRF 2.0: Severity Definition The IC makes a tentative

    severity judgment when declaring the incident. The final severity level is determined during the postmortem. • 🔥 SEV-1 • Complete failure of core UX features (e.g., news reading becomes unavailable) • 🧨 SEV-2 • Partial failure of core UX features, or complete failure of sub-UX features • 🕯 SEV-3 • Partial failure of sub-UX features It's crucial to estimate severity early on— severe incidents should be resolved faster 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  29. 34 IRF 2.0: Workflow The flow from start to finish

    of an incident. 🩸 = Bleeding status (ongoing impact) 1. 🩸 Occurrence • An issue arises. Common triggers include deployments or config changes. 2. 🩸 Detection • The on-call engineer detects the issue via alerts. Triage begins. 3. 🩸 Declaration • The incident is officially declared. IRF begins under the IC's lead. External communication starts as needed. • While bleeding, updates must be continuously provided. 4. ❤🩹 Mitigation • Temporarily eliminate the cause (e.g., rollback) and stop further impact. 5. Resolution • Permanently fix the issue (e.g., bug fix, data correction). Bleeding is fully stopped. 6. Postmortem • Investigate root causes and discuss recurrence prevention based on the incident report. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  30. 35 IRF 2.0: Communication Guideline Defines where communication should take

    place (Slack channels): • #incident • Used for status updates to the entire company and for communication with external stakeholders. • #incident-irf-[incidentId]-[title] • For technical communication to resolve the issue. • All relevant discussions and information are gathered here. Having all discussions and info in one place makes writing the report much easier later 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  31. 36 IRF 2.0: Incident Report Template & Postmortem A unified

    company-wide template includes: • Summary • Impact • Direct Cause, Mitigation • Root Cause Analysis (5-whys) • It’s crucial to analyze direct and root causes separately. • Based on root causes, define action items to prevent recurrence • Timeline • Use a machine-readable format!!!! We standardized templates across divisions (super important!) and centralized all postmortems. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  32. 37 We built it—but how do we make it land?

    “Here’s our amazing IRF 2.0. It’s perfect—so just read it and follow it!” 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  33. 38 We built it—but how do we make it land?

    “Here’s our amazing IRF 2.0. It’s perfect—so just read it and follow it!” 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process Noooo!
  34. 39 Get Our Hands Dirty: Forcefully apply IRF2.0 by diving

    into every incident “Hello there, it’s me, the IRF guy 😎 Alright, I’ll be the Incident Commander this time! Everyone else, focus on firefighting!” Lesson #4: In an emergency, no one has time to learn a new protocol. Just do it! Lesson #5: Use it ourselves first, and build a feedback loop! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  35. 40 The three great virtues of a programmer: Laziness. Just

    run /incident in Slack, and I’ll take care of the rest. Automatically creates incident tickets, dedicated Slack channels, and posts links to playbooks. Automate what works well after you’ve done it manually! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
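For illustration, here is a minimal sketch of what a /incident slash command can look like with Slack Bolt for Python. The channel naming follows the IRF guideline above; the ticket ID, playbook URL, and environment variables are placeholders, and this is not the actual ACT bot.

```python
# Sketch of a /incident slash command using Slack Bolt (Python).
# Ticket creation, playbook URL, and env var names are illustrative.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.command("/incident")
def declare_incident(ack, command, client):
    ack("Declaring an incident...")
    incident_id = "123"  # in practice, created via your ticketing system's API
    title = (command.get("text") or "untitled").replace(" ", "-").lower()[:40]

    # Dedicated war-room channel: #incident-irf-[incidentId]-[title]
    channel = client.conversations_create(name=f"incident-irf-{incident_id}-{title}")
    channel_id = channel["channel"]["id"]

    client.conversations_invite(channel=channel_id, users=command["user_id"])
    client.chat_postMessage(
        channel=channel_id,
        text="IRF 2.0 declared. Playbook: https://example.com/irf-playbook",
    )

if __name__ == "__main__":
    app.start(port=3000)
```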
  36. 41 A critical realization So, did the number of incidents

    actually go down? What happened to the MTTR? CTO Well… ME 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  37. 42 A critical realization: We’re not tracking KPIs! How many

    incidents did we handle this month? How about last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  38. 43 A critical realization: We’re not tracking KPIs! How many

    incidents did we handle this month? How about last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity No clue!!
  39. 44 Let’s Look at the Data: What We Need Data

    Collection Visualization Do this, then that, and boom! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  40. 45 Data Collection Visualization Data Definition Is Key! Let’s Look

    at the Data: What We Need 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  41. Data Definition: Modeling Incidents • Attributes of an Incident •

    Title • Status • State Machine (we’ll get to this later) • Severity • SEV 1~3 (IRF 2.0) • Direct Cause • (explained later) • Direct Cause System • Group of components defined at the microservice level • Direct Cause Workload • Online Service, Offline Pipeline, … Define as many fields as possible using Enums! 46 Free-form input leads to high cardinality and makes proper analysis impossible 2-4. P2: Root Fixes/ Enhancing Incident Clarity
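A sketch of this incident model in code. The Enum members below are examples for illustration; the real taxonomy of causes, systems, and workloads is SmartNews-internal.

```python
# Sketch of the incident model: everything analyzable is an Enum so that
# cardinality stays low; only the title is free text.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # complete failure of core UX features
    SEV2 = 2  # partial failure of core UX / complete failure of sub-UX
    SEV3 = 3  # partial failure of sub-UX features

class Status(Enum):
    OCCURRED = "occurred"
    DETECTED = "detected"
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"

class DirectCause(Enum):        # example members only
    CODE_CHANGE = "code_change"
    CONFIG_CHANGE = "config_change"
    OFFLINE_BATCH = "offline_batch"

class Workload(Enum):
    ONLINE_SERVICE = "online_service"
    OFFLINE_PIPELINE = "offline_pipeline"

@dataclass
class Incident:
    title: str
    status: Status
    severity: Severity
    direct_cause: DirectCause
    direct_cause_system: str    # ideally also an Enum of microservice groups
    direct_cause_workload: Workload
```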
  42. Incident Modeling: Direct Cause The direct cause of the incident

    Define it in a way that makes it analyzable and actionable 47 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  43. Incident Modeling: Status -- State Machine Define how incidents move

    through different states. Model it as a State Machine and clean it up. Then you can define time metrics for each state transition! 48 2-4. P2: Root Fixes/ Enhancing Incident Clarity
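A minimal sketch of that state machine, assuming the five IRF 2.0 states and recording a timestamp whenever a state is entered, which is what makes the per-transition time metrics possible.

```python
# Sketch of the incident status state machine: legal transitions plus a
# timestamp per entered state, so every transition yields a time metric.
from datetime import datetime, timezone

TRANSITIONS = {
    "occurred": {"detected"},
    "detected": {"declared"},
    "declared": {"mitigated"},
    "mitigated": {"resolved"},
    "resolved": set(),  # terminal; the postmortem happens after this
}

class IncidentStatus:
    def __init__(self, occurred_at=None):
        self.state = "occurred"
        self.entered_at = {"occurred": occurred_at or datetime.now(timezone.utc)}

    def transition(self, new_state, at=None):
        """Move to a new state, rejecting anything the model does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        self.entered_at[new_state] = at or datetime.now(timezone.utc)
```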
  44. Data Collection: Incident Report Update the incident report template to

    include the defined data fields. If the data definition is solid, the source can be flexible. (As long as the data is trustworthy, of course.) 49 2-4. P2: Root Fixes/ Enhancing Incident Clarity • Make required fields into mandatory attributes. • Add a Notion Database for timelines. • Have people record when states change. • Make it machine-readable!!!!!
  45. The rest is easy: do this, then this, and boom!

    50 ChatGPT did it overnight 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  46. Incident Dashboard: Visualize the key metrics All green! 2-4. P2:

    Root Fixes/ Enhancing Incident Clarity Hold up!
  47. What about the past data? Reports before IRF 2.0 (the

    unified format) • Different formats across Divisions • Timelines as bullet points, missing data, and more… Give it three months and we’ll have plenty of data. Right? We only have six months!! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  48. What about past data? Of course—we Get Our Hands Dirty!

    Heck yeah! We manually migrated one year’s worth of incident reports We split the work across the team and powered through over a week 2-4. P2: Root Fixes/ Enhancing Incident Clarity Re: Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission.
  49. Key Metric Breakdown: Observing and Splitting MTTR 1. Occurred 2.

    Detected 3. Declared 4. Mitigated 5. Resolved, split into Time To Detect, Time To Mitigate, and Time To Resolve. Now we know where the time is going—and where we stand! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
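Here is a sketch of how those spans can be computed from the recorded state-change timestamps. The exact boundaries are my reading of the slide, not an official IRF definition, and the field names are illustrative.

```python
# Sketch: derive the three spans from per-state timestamps recorded in the
# incident report. ts maps state name -> datetime when that state was entered.
def mttr_breakdown(ts):
    def minutes(start, end):
        return (ts[end] - ts[start]).total_seconds() / 60

    return {
        "time_to_detect": minutes("occurred", "detected"),
        "time_to_mitigate": minutes("detected", "mitigated"),  # bleeding stops here
        "time_to_resolve": minutes("mitigated", "resolved"),   # permanent fix lands
    }
```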
  50. 58 3-1. What Does It Mean to Reduce Incidents? What we

    really want to reduce… the number of incidents? 🤔 No… it’s the impact caused by incidents, e.g. Revenue, Reputation, Effectiveness…. Right? Especially revenue loss…
  51. 59 Estimating the impact of an incident: (MTTD + MTTM: time to

    stop the bleeding) × Severity Factor (impact level of an incident) × Number of Incidents. In toC and ads, this pretty much defines the revenue impact • Shorten the time • Relatively easy to improve and quick to act on • Reduce severity • Ideal, but hard to control • Reduce incident count • Requires long-term efforts 3-1. What Does It Mean to Reduce Incidents?
  52. 60 Estimating the impact of an incident: (MTTD + MTTM: time to

    stop the bleeding) × Severity Factor (impact level of an incident) × Number of Incidents. In toC and ads, this pretty much defines the revenue impact • Shorten the time • Relatively easy to improve and quick to act on • Reduce severity • Ideal, but hard to control • Reduce incident count • Requires long-term efforts 3-1. What Does It Mean to Reduce Incidents? We’re starting here. It also aligns with the KPIs we set when ACT was first formed! But a few months in, we gained much better clarity.
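As a sketch, the rough impact model on this slide looks something like the following. The severity-factor weights are invented for illustration; the real ones are business-specific. Shrinking any of the three factors (count, severity, bleeding time) shrinks the estimate, which is exactly the three levers listed above.

```python
# Rough sketch of the impact model: sum over incidents of
# (time to stop the bleeding) x (severity factor). Weights are made up.
SEVERITY_FACTOR = {"SEV-1": 10.0, "SEV-2": 3.0, "SEV-3": 1.0}

def estimated_impact(incidents):
    return sum(
        (i["ttd_minutes"] + i["ttm_minutes"]) * SEVERITY_FACTOR[i["severity"]]
        for i in incidents
    )
```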
  53. 61 3-2. Approaching Incident Resolution Time How do we reduce

    MTTR? Seriously? Lesson #6: If a top-tier ace jumps into an incident, it gets resolved faster (…maybe?)
  54. Improving MTTR Clarity: State Machine 1. Occurred 2. Detected 3.

    Declared 4. Mitigated 5. Resolved. Each span has different significance — and needs a different solution! Time To Detect: mainly in the alerting domain. Time To Mitigate: bleeding, most critical — but also the easiest to improve. This is where IRF comes in. Time To Resolve: time spent on root fixes and data correction. Bleeding has stopped — now it’s about accuracy, not speed. 3-2. Approaching Incident Resolution Time
  55. 63 Approaching MTTD(Mean Time To Detect) Improve alerting • Just

    slapping alerts on detected issues doesn’t work • “over-monitoring is a harder problem to solve than under-monitoring.” — SRE: How Google Runs Production Systems • Too many false positives → real alerts get buried • Shift to SLO + Error Budget–based alerting • Ideal: if it pages, it’s an incident • Not something you can fix overnight Still an open challenge Come back to it in Chapter 4 3-2. Approaching Incident Resolution Time
  56. 64 Approaching MTTM(Mean Time To Mitigate) A unified framework: IRF

    2.0 • Clear definition of what qualifies as an incident • Standardized response flow and communication guidelines • Separation of Responder and Commander roles • Training, with aces leading by example Deploying top aces + rolling out IRF 2.0 had a huge impact! 3-2. Approaching Incident Resolution Time
  57. 65 3-3. Approaching the Number of Incidents How do we

    reduce incidents themselves? Nooo… Lesson #7: Even if top-tier aces jump into incidents… The number of incidents won’t go down!!
  58. 66 Tackle the Bottlenecks We’ve got the data, don’t we?!

    Come forth—Incident Dashboard!! 3-3. Approaching the Number of Incidents
  59. 67 We started to see where incidents happen—and why. Backed

    by data, we tackled each root cause head-on! 3-3. Approaching the Number of Incidents Tackle the Bottlenecks
  60. 68 #1 Incident Cause Lack of Testing 3-3. Approaching the

    Number of Incidents Tackle the Bottlenecks
  61. 69 Approach #1 to Lack of Testing: Why not test

    before production? Postmortem discussion… • Why was it deployed without testing? • → Because it could only be tested in production. • Why only in production? • → Lack of data, broken staging environment, etc… • …. Alright! Let’s fix the staging environment! 3-3. Approaching the Number of Incidents
  62. 70 Approach #1 to Lack of Testing: Building Out Staging

    Environments Still in Progress: Way harder than we thought! • There are tons of components. • Each division—News, Ads, Infra—has different needs and usage. • Ads is B2B, tied directly to revenue → needs to be solid and stable • News is B2C, speed of feature delivery is key Trying to build staging for everything? Not realistic, not even useful. So we started with Ads, where the demand was highest 3-3. Approaching the Number of Incidents
  63. 71 Approach #2 to Lack of Testing: What about Unit

    Tests? • Why didn’t we catch it with unit tests? • Because we didn’t have any… • … 😭 Alright! Let’s collect test coverage! 3-3. Approaching the Number of Incidents
  64. 72 Approach #2 to Lack of Testing: Analyzing Unit Test

    Coverage • We charged into systems with no coverage tracking, and cranked out PRs to generate coverage reports. • Then we plotted unit test coverage vs. number of incidents by system (chart axes: # of incidents vs. avg. coverage) 3-3. Approaching the Number of Incidents
  65. 73 • Is there a correlation between test coverage and

    incidents? → There was a correlation. • But… does higher coverage actually reduce incidents? • Not sure. Correlation ≠ causation. • Still, when we looked at individual systems and teams with domain knowledge, the places with low coverage and high incidents did have clear reasons: • Hard to write tests / no testing culture / etc… Alright! Let’s storm the low-coverage zones and crank out unit tests! Approach #2 to Lack of Testing: Analyzing Unit Test Coverage 3-3. Approaching the Number of Incidents
  66. 74 Approach #2 to Lack of Testing: Building Out Unit

    Tests Get our hands dirty! Add tests to everything we can get our hands on We hit 3–4 components… but it barely moved the needle We thought once we provided a few examples, others would follow… 3-3. Approaching the Number of Incidents 1. Use SonarQube to find files with high LOC and low coverage (sketched below) 2. Use LLMs to help generate tests 3. Repeat until the entire component hits 50%+ coverage
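Step 1 of that loop, sketched against the SonarQube web API. The host, project key, and thresholds are placeholders, and the ranking heuristic (biggest untested files first) is just one reasonable choice.

```python
# Sketch: list files with high LOC and low coverage via SonarQube's
# measures API. Host, project key, and thresholds are examples.
import os
import requests

SONAR_HOST = "https://sonarqube.example.com"
PROJECT_KEY = "ads-backend"          # hypothetical project key
TOKEN = os.environ["SONAR_TOKEN"]

def low_coverage_hotspots(min_loc=300, max_coverage=30.0):
    resp = requests.get(
        f"{SONAR_HOST}/api/measures/component_tree",
        params={
            "component": PROJECT_KEY,
            "metricKeys": "ncloc,coverage",
            "qualifiers": "FIL",     # files only
            "ps": 500,
        },
        auth=(TOKEN, ""),            # token as basic-auth username
        timeout=30,
    )
    resp.raise_for_status()
    hotspots = []
    for comp in resp.json()["components"]:
        measures = {m["metric"]: float(m.get("value", 0)) for m in comp["measures"]}
        loc = measures.get("ncloc", 0)
        cov = measures.get("coverage", 0.0)  # missing coverage treated as untested
        if loc >= min_loc and cov <= max_coverage:
            hotspots.append((comp["path"], loc, cov))
    # Biggest untested files first: good candidates for LLM-assisted tests.
    return sorted(hotspots, key=lambda t: t[1], reverse=True)
```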
  67. 75 • What if we just used LLMs to generate

    them? • Well… the accuracy just isn’t there yet. • Teams need a habit of writing unit tests continuously. • No incentive or shared sense of value • Everyone’s under deadline pressure • — no time left for tests(!) This is a team culture and organizational challenge. To be continued in Chapter 4… 3-3. Approaching the Number of Incidents Approach #2 to Lack of Testing: Building Out Unit Tests
  68. 77 Approaching Config Changes • Main types of “Config Changes”:

    • Mechanisms that control app behavior dynamically • A/B Testing and Feature Flags • Both rely on our custom-built platforms — but they’re complicated • Unintended A/B assignments and misconfigurations caused frequent issues Alright, let’s clean up A/B testing and feature flags! 3-3. Approaching the Number of Incidents
  69. 78 • Bulk deletion of unused (defaulted) feature flags •

    Establish usage guidelines for feature flags • Strengthen validation logic (sketched below) • (We were able to input configs that literally caused parse errors…) Teamed up with the A/B testing platform group and pushed for major improvements, including UX! 3-3. Approaching the Number of Incidents Approaching Config Changes
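The “strengthen validation” point, sketched as the kind of pre-save check that would have rejected those unparseable configs. The schema here is a made-up example, not the real flag platform’s format.

```python
# Sketch: validate a feature-flag config before it is saved, so a config
# that does not even parse can never reach production.
import json

REQUIRED_KEYS = {"flag_name", "enabled", "rollout_percentage"}  # example schema

def validate_flag_config(raw: str) -> dict:
    try:
        config = json.loads(raw)  # the bug class we saw: unparseable configs
    except json.JSONDecodeError as e:
        raise ValueError(f"config is not valid JSON: {e}") from e

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")

    pct = config["rollout_percentage"]
    if not isinstance(pct, (int, float)) or not 0 <= pct <= 100:
        raise ValueError("rollout_percentage must be a number between 0 and 100")
    return config
```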
  70. 79 #2 Incident Cause Offline Batch …basically, Flink 3-3. Approaching

    the Number of Incidents Tackle the Bottlenecks
  71. 80 Approaching Offline Batch • A bunch of “offline” Flink

    jobs that are actually streaming • Server → Kafka → Flink → Scylla, ClickHouse, etc. • A custom-built platform by a specialized team • Few Flink experts on the app teams • Led to frequent issues with performance and restarts Alright! Let’s revamp the Flink platform! 3-3. Approaching the Number of Incidents
  72. 81 • Improved the platform itself • Better UI, automated

    deployments, and more • Spread best practices • Cleaned up documentation • Released template projects (including tests!) • Sent direct refactor PRs to various components • Implemented best practices and tests Collaborated with the platform team. Together, we improved the platform and its docs! 3-3. Approaching the Number of Incidents Approaching Offline Batch
  73. 82 ACT era vs Before ACT Number of Incidents… +32%

    Increase!! MTTR… -48% Decrease!! Halved!!! 3-4. Results: Did We Really “Halve” Incidents?
  74. 3-4. Results: Did We Really “Halve” Incidents? 83 ACT era

    vs Before ACT Number of Incidents… +32% Increase!! MTTR… -48% Decrease!! Halved!!! Hold up!
  75. 84 Aren’t incidents actually on the rise…? • Seasonal spike?

    December was off the charts • Last-minute changes before the holidays? • A side effect of rolling out IRF2.0? • More sensitive detection due to tighter definitions • Maslow’s Hammer: • “If all you have is IRF, everything starts to look like an incident” • But after January, things started trending down Continuous improvement is still key 3-4. Results: Did We Really “Halve” Incidents?
  76. 85 On the other hand, MTTR was Halved! • Dramatic

    improvement in MTTM (Time to Mitigate) • Thanks to the power of IRF 2.0 • But MTTD (Time to Detect) barely improved • Detection is still an open challenge — Working on better alerting Definitely felt the momentum of change! 3-4. Results: Did We Really “Halve” Incidents?
  77. 86 General Comments No major changes in the severity breakdown—Impact

    levels showed a slight decrease We didn’t quite halve incidents… However, our challenges are clear, and we’ve laid the foundation for improvement! Let’s make it happen 3-4. Results: Did We Really “Halve” Incidents?
  78. 88 Again: We Want to Reduce Incidents… 4-1. Open Challenges:

    Risk Management & Alerting We’ve been laser-focused on reducing incidents… But we can never make them zero. No way…
  79. 89 Can we really make them zero? Should we? Let’s

    be real—Not happening To truly minimize incidents… • Just stop releasing? • A slow death 😇 • Throw infinite cost (people, time) at it? • More cost likely correlates with fewer incidents… • So do we keep testing until it feels 100% “safe”? 4-1. Open Challenges: Risk Management & Alerting
  80. 90 We want to balance delivery speed, quality, cost, and

    incident risk. • But the “right” balance is different for each system or project. • Required speed & release frequency • Cost we can throw in • Acceptable level of risk (≒ number of incidents, failure rate) • Ads is B2B and tied directly to revenue → needs to be rock-solid • News is B2C → speed of delivery comes first! Quantify our risk tolerance And use that to control how many incidents we accept. 4-1. Open Challenges: Risk Management & Alerting
  81. 91 Visualizing Risk Tolerance with SLOs and Error Budgets •

    SLO = Service Level Objective ~ How much failure is acceptable? • e.g. 99.9% available -> means 0.1% failure is “allowed” • Attach objectives to SLIs—metrics that reflect real UX harm • Error Budget = How much failure room we have left • When budget remains → we can step on the gas • Even bold releases are fair game • When budget runs out → UX is already suffering • No more risk—time to slow down — Ref: Implementing SLOs — Google SRE Error Budgets let us express risk tolerance—numerically. And in theory… this sounds pretty solid. 4-1. Open Challenges: Risk Management & Alerting
  82. 92 Improving Alerting — An Alert = an Incident Alright!

    Time to get those SLOs in place! • Alert based on how fast your Error Budget is burning: • Trigger alerts when the budget is being consumed too quickly • If you ignore it, you’ll run out — violate your SLO • That means real UX damage! • Don’t ignore it — it is an incident! — Ref: Alerting on SLOs Sounds Good 4-1. Open Challenges: Risk Management & Alerting
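A sketch of what “alert when the budget is being consumed too quickly” can look like in code. The 14.4 threshold and the 1-hour/5-minute window pair are the commonly cited example numbers from the Google SRE workbook’s multiwindow, multi-burn-rate alerts, not our production values.

```python
# Sketch of burn-rate alerting on an SLO. Thresholds follow the example
# commonly given in the Google SRE workbook; tune them for your service.
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent: 1.0 = exactly on budget."""
    budget = 1.0 - slo  # e.g. SLO 99.9% -> 0.1% of requests may fail
    return error_ratio / budget

def should_page(error_ratio_1h: float, error_ratio_5m: float, slo: float = 0.999) -> bool:
    # Page only when both a long and a short window burn fast, so a brief
    # blip does not wake anyone, but a sustained burn does.
    return burn_rate(error_ratio_1h, slo) > 14.4 and burn_rate(error_ratio_5m, slo) > 14.4
```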
  83. 93 4-2. Remaining Challenge: Shaping Org & Culture We tried

    rolling out SLOs in some places, but… • Defining effective SLOs isn’t easy • Biz and PdMs don’t always have the answers • There’s no time to implement them • They can’t even find time to write tests! • And even if we set them up— • if no one respects them, what’s the point? There’s no silver bullet…
  84. 94 How do we make SLOs actually work? • We

    need buy-in across the org • It’s not just about engineers—biz and PdMs have to be in too • Need a deliberate approach to culture • Ultimately, it’s about what we truly value • Do we believe that balancing cost and risk with SLOs is worth it? We want to embed SLOs— and ultimately, the mindset of SRE—into our engineering culture. 4-2. Remaining Challenge: Shaping Org & Culture
  85. 95 • So how do we approach this? • Bottom-Up:

    • Educate and train engineers, Biz, PdMs—get everyone involved • Top-Down: • Win support and direction from leadership SRE and DevOps are culture — They don’t take root in a day. It takes sweat, patience, and steady effort. 4-2. Remaining Challenge: Shaping Org & Culture How do we make SLOs actually work?
  86. 96 4-3. What Comes Next… Our 6-month mission as ACT

    was coming to an end. • The challenge remained: Install SRE into SmartNews’s engineering culture • Implement and uphold SLOs • And more… • Boost observability • Track and act on DORA metrics, etc. These require ongoing effort How can we keep the momentum going— and tackle the remaining challenges even after ACT disbands?
  87. 97 How should we disband ACT? — One idea: a

    “Distributed SRE Team” After ACT ends, ex-members return to their teams and continue SRE work using X% of their time It sounded reasonable to me… maybe? 4-3. What Comes Next…
  88. 98 • Rejected • No one wanted SRE to be

    their full-time job. • And allocating “X%” of time… yeah, that never really works. • Our decision: • Ex-ACTors would keep helping and promoting SRE, but we’d take the time to build a dedicated SRE team. We made that call as a team. There’s still plenty left unfinished—but no regrets! How should we disband ACT? — Team’s Call 4-3. What Comes Next…
  89. 99 Our Awesome Change! Our (tough!!) six-month mission as ACT

    has ended. Did we truly create an “Awesome Change”? Honestly… I don’t know. But we do feel like “We’ve taken the first step on a long journey toward SRE.” 4-3. What Comes Next… And a huge thanks to my teammates for fighting through these past six months!
  90. 100 Your Awesome Change! Let’s make it happen You go

    back to work after this conference— and your boss says, “Alright, starting today, go reduce incidents.” …What would you start with? 4-3. What Comes Next…
  91. 101 Our battle is just beginning!! Stay tuned for the

    future adventures of the ex-ACTors!
  92. 103 References • Seeking SRE: Conversations About Running Production Systems

    at Scale • Site Reliability Engineering: How Google Runs Production Systems • The Site Reliability Workbook (Google SRE) • Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale • Fearless Change: Patterns for Introducing New Ideas