
Creating Awesome Change in SmartNews! En

Presentation deck for DevOpsDaysTokyo 2025, English version

Ikuo Suyama

April 14, 2025

Transcript

  1. Who am I? Ikuo Suyama • Staff Engineer • Ads

    Backend Expert • Nov. 2020~ SmartNews, Inc. • Interests: Fishing, Camping, Gunpla, Anime
  2. 4 Let me ask you to think for a moment...

    You show up to work in the morning, and your boss says, “Alright, starting today, go reduce incidents.” …What would you start with? That’s the kind of story I’ll be sharing today.
  3. 5 What I WILL talk about today 1. Lessons and

    insights from hands-on incident response 2. How I analyzed incidents and turned them into action 3. How we pushed a unified process across the org Disclaimer 1: Super context-dependent, N=1 case! My story from the trenches of a special task force.
  4. 6 What I won’t (can’t) be talking about 1. Dev/Ops collaboration …

    (Assumes Dev handles ops + incidents) 2. Applying “Best Practices” of SRE/DevOps … All about what I actually learned in the field. Disclaimer 2: I’m not a pro SRE or DevOps guru!
  5. Phase 1: Assemble! Task Force “ACT”! Phase 2: Get our

    hands dirty! Phase 3: Halving incidents!? Phase 4: What remains, and what’s next Agenda
  6. 9 1-1. The Beginning It all started back in September...

    Too many incidents! We’re going to cut them in half. Let’s build a task force! The Awesome Change Team— “ACT”! CTO
  7. 10 …Could it be because you told us to ship

    a massive number of changes? ME: 1-1. The Beginning
  8. 11 …Could it be because you told us to ship

    a massive number of changes? ME: 1-1. The Beginning Hold up!
  9. 12 • CTO: Are incidents really happening that often? •

    How do we even define “a lot” of incidents? • Ikuo: Are we actually making that many changes? • Are changes even the root cause of these incidents? • And what kind of changes are we talking about? At this point—it was all just gut feeling. Hold up! (Though I’ve learned to respect a senior engineer’s nose for trouble.) 1-1. The Beginning
  10. 13 1-2. Assemble the Strongest Team With a six-month mission,

    Assembling “the Strongest Team” —at top priority! Top-down advantage: This project came straight from the top tech leadership
  11. 14 Ads News Ranking Push Notification Core System (Infra) Mobile

    SmartView (Article) 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
  12. 15 Ads News Ranking Push Notification Core System (Infra) Mobile

    SmartView (Article) Ads Ikuo! News & Push D! Ranking R! CoreSystem T! Mobile M! SmartView T! VPoE K! * Let me call myself an “all-star” just for the sake of this story 🙏 (Manager) CTO Report To 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
  13. 16 Ads News Ranking Push Notification Core System (Infra) Mobile

    SmartView (Article) Ads Ikuo! News & Push D! Ranking R! CoreSystem T! Mobile M! SmartView T! VPoE K! (Manager) CTO Report To SRE wizard * Let me call myself an “all-star” just for the sake of this story 🙏 1-2. Assemble the Strongest Team We pulled in the all-stars from every division…
  14. 17 “We had just six months to succeed” —That pressure

    was real. Pulling aces from every team showed how serious the company was about this. At the same time... we had no excuses. The downside of top-down 1-2. Assemble the Strongest Team
  15. 18 1-3. Guiding the Team Taking on a messy, unsolved

    challenge • “Reduce incidents.” Sounds simple—turns out, it's massive. • Where do we even start? • What’s the real problem? What actually helps? • And... are there even that many incidents?
  16. 19 Set a clear goal • What does “Awesome Change”

    really mean? • Reduce critical incidents • Install SRE best practices into the org • Key KPIs to improve: • Mean Time Between Failures (MTBF) / Change Failure Rate (CFR) = # of incidents • Mean Time to Recover (MTTR) = recovery time The “why are we here?” got crystal clear Thanks to our awesome VPoE 1-3. Guiding the Team
  17. 20 Set clear priorities • P0: Support incident handling •

    P1: Crush unresolved critical action items • P2: Prevent incidents by fixing root causes What we need to do right now is clear! 1-3. Guiding the Team
  18. 22 2-1. P0: Supporting Incident Handling Get our Hands Dirty:

    Jump into every incident! • Whenever something went down, someone's PagerDuty in ACT went off. • Eventually, we invited the whole ACT squad to every active incident. • If it was in one’s home domain—they’d help fight the fire. • If not, they’d step in for status updates, bring in the right people, or handle comms with the business side. Brutal!!
  19. 24 It’s an anti-pattern… but it wasn’t all bad People

    started thinking: Incident = ACT. And we earned a lot of trust. ACT shows up when there’s trouble. ACT’s got your back during incidents. ACT’s getting stuff done! 2-1. P0: Supporting Incident Handling
  20. 25 2-2. P1. Crush unresolved critical action items The forgotten

    action items There were definitely ones that just got… forgotten • We had a culture of writing incident reports— • and even listing action items for prevention. • Awesome! • But those items weren’t being tracked. • No assignees. No due dates. No status. • WHAT?? • And the report format was different for each division… • Sometimes even different per person.
  21. 26 List out all the forgotten ones • Ideally, auto-generate

    Jira tickets and track them. • Maybe even send reminders. • But… each report had a totally different format. • Now what? Help me, ChatGPT… That’s not happening. 2-2. P1. Crush unresolved critical action items
  22. 27 Get our Hands Dirty: organize the data by hand Lesson

    #1: Always store data in machine-readable format!! Heck yeah! We manually moved a year’s worth of incident action items into a Notion database *Once it’s a database, you can pull it via API. Easy mode. Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission. 2-2. P1. Crush unresolved critical action items
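As an aside on “pull it via API”: here is a minimal sketch of querying a Notion database of incident action items for anything still open. The database ID, the “Status” property, and the “Done” value are hypothetical placeholders, not our actual schema.

```python
# Minimal sketch: pull open incident action items from a Notion database.
# NOTION_TOKEN, the database ID, and the "Status"/"Done" names are
# hypothetical placeholders for illustration.
import os
import requests

NOTION_TOKEN = os.environ["NOTION_TOKEN"]
DATABASE_ID = os.environ["INCIDENT_AI_DATABASE_ID"]

def fetch_open_action_items():
    """Query the action-item database for anything not marked Done."""
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID}/query"
    headers = {
        "Authorization": f"Bearer {NOTION_TOKEN}",
        "Notion-Version": "2022-06-28",
        "Content-Type": "application/json",
    }
    payload = {
        "filter": {"property": "Status", "select": {"does_not_equal": "Done"}}
    }
    items, cursor = [], None
    while True:
        if cursor:
            payload["start_cursor"] = cursor
        resp = requests.post(url, headers=headers, json=payload, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        items.extend(data["results"])
        if not data.get("has_more"):
            return items
        cursor = data["next_cursor"]
```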
  23. 28 Get our Hands Dirty: chase down everything still unfinished This

    isn’t done?! Crush it!! Lesson #3: People aren’t ignoring the work—they’re just too busy. Getting our hands dirty! 2-2. P1. Crush unresolved critical action items
  24. 29 2-3. P2: Root Fixes/ Rolling Out a Unified Incident

    Response Process • Each division had its own way of handling incidents. • We’d tried to build a company-wide protocol before— • the “Incident Response Framework (IRF)”. • But... it never really caught on. • It only reflected the needs of one division. Why didn’t we already have a unified company-wide process?
  25. 30 • The IRF wasn’t bad as a process— •

    So we rebuilt it as a company-wide framework, • backed by domain knowledge from our all-star team. • And it had to be lightweight. • We also pulled ideas from public frameworks • e.g. PagerDuty Incident Response How did we build a company-wide process and framework? Because our team covered every domain, we could actually make it happen. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  26. 31 IRF 2.0 Contents 1. Role, Playbook 2. Severity Definition

    3. Workflow 4. Communication Guideline 5. Incident Report Template, Postmortem Let me walk you through the key parts Please check the slide later for the details! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  27. 32 IRF 2.0: Role, Playbook • On-Call Engineer • The

    engineer on call. Triages alerts and escalates to the IC if necessary, initiating the IRF (declaring the incident). • Incident Commander(IC) • Leads the incident response. Brings in necessary people and organizes information. May also act as the CL (Communication Lead). • Usually a Tech Lead or Engineering Manager. • Their responsibility is not to directly fix the issue, but to organize and make decisions. • Responder • Handles the actual work—such as rollbacks, config changes, etc. • Communication Lead(CL) • Handles communication with external stakeholders (i.e., non-engineers). Key point: Separate responsibilities between IC and Responder 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  28. 33 IRF 2.0: Severity Definition The IC makes a tentative

    severity judgment when declaring the incident. The final severity level is determined during the postmortem. • 🔥 SEV-1 • Complete failure of core UX features (e.g., news reading becomes unavailable) • 🧨 SEV-2 • Partial failure of core UX features, or complete failure of sub-UX features • 🕯 SEV-3 • Partial failure of sub-UX features It's crucial to estimate severity early on— severe incidents should be resolved faster 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  29. 34 IRF 2.0: Workflow The flow from start to finish

    of an incident. 🩸 = Bleeding status (ongoing impact) 1. 🩸 Occurrence • An issue arises. Common triggers include deployments or config changes. 2. 🩸 Detection • The on-call engineer detects the issue via alerts. Triage begins. 3. 🩸 Declaration • The incident is officially declared. IRF begins under the IC's lead. External communication starts as needed. • While bleeding, updates must be continuously provided. 4. ❤🩹 Mitigation • Temporarily eliminate the cause (e.g., rollback) and stop further impact. 5. Resolution • Permanently fix the issue (e.g., bug fix, data correction). Bleeding is fully stopped. 6. Postmortem • Investigate root causes and discuss recurrence prevention based on the incident report. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  30. 35 IRF 2.0: Communication Guideline Defines where communication should take

    place (Slack channels): • #incident • Used for status updates to the entire company and for communication with external stakeholders. • #incident-irf-[incidentId]-[title] • For technical communication to resolve the issue. • All relevant discussions and information are gathered here. Having all discussions and info in one place makes writing the report much easier later 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  31. 36 IRF 2.0: Incident Report Template & Postmortem A unified

    company-wide template includes: • Summary • Impact • Direct Cause, Mitigation • Root Cause Analysis (5-whys) • It’s crucial to analyze direct and root causes separately. • Based on root causes, define action items to prevent recurrence • Timeline • Use a machine-readable format!!!! We standardized templates across divisions (super important!) and centralized all postmortems. 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  32. 37 We built it—but how do we make it land?

    “Here’s our amazing IRF 2.0. It’s perfect—so just read it and follow it!” 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  33. 38 We built it—but how do we make it land?

    “Here’s our amazing IRF 2.0. It’s perfect—so just read it and follow it!” 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process Noooo!
  34. 39 Get Our Hands Dirty: Forcefully apply IRF2.0 by diving

    into every incident “Hello there, it’s me, the IRF guy 😎 Alright, I’ll be the Incident Commander this time! Everyone else, focus on firefighting!” Lesson #4: In an emergency, no one has time to learn a new protocol. Just do it! Lesson #5: Use it ourselves first, and build a feedback loop! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
  35. 40 The three great virtues of a programmer: Laziness. Just

    run /incident in Slack, and I’ll take care of the rest. Automatically creates incident tickets, dedicated Slack channels, and posts links to playbooks. Automate what works well after you’ve done it manually! 2-3. P2: Root Fixes/ Rolling Out a Unified Incident Response Process
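For illustration, here is a minimal sketch of what a /incident slash command can look like with Slack Bolt for Python. The channel naming follows the IRF guideline above; the ticket ID, playbook URL, and environment variables are placeholders, and this is not the actual ACT bot.

```python
# Sketch of a /incident slash command using Slack Bolt (Python).
# Ticket creation, playbook URL, and env var names are illustrative.
import os
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.command("/incident")
def declare_incident(ack, command, client):
    ack("Declaring an incident...")
    incident_id = "123"  # in practice, created via your ticketing system's API
    title = (command.get("text") or "untitled").replace(" ", "-").lower()[:40]

    # Dedicated war-room channel: #incident-irf-[incidentId]-[title]
    channel = client.conversations_create(name=f"incident-irf-{incident_id}-{title}")
    channel_id = channel["channel"]["id"]

    client.conversations_invite(channel=channel_id, users=command["user_id"])
    client.chat_postMessage(
        channel=channel_id,
        text="IRF 2.0 declared. Playbook: https://example.com/irf-playbook",
    )

if __name__ == "__main__":
    app.start(port=3000)
```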
  36. 41 A critical realization So, did the number of incidents

    actually go down? What happened to the MTTR? CTO Well… ME 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  37. 42 A critical realization: We’re not tracking KPIs! How many

    incidents did we handle this month? How about last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  38. 43 A critical realization: We’re not tracking KPIs! How many

    incidents did we handle this month? How about last month…? How long did it take to resolve each one? 2-4. P2: Root Fixes/ Enhancing Incident Clarity No clue!!
  39. 44 Let’s Look at the Data: What We Need Data

    Collection Visualization Do this, then that, and boom! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  40. 45 Data Collection Visualization Data Definition Is Key! Let’s Look

    at the Data: What We Need 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  41. Data Definition: Modeling Incidents • Attributes of an Incident •

    Title • Status • State Machine (we’ll get to this later) • Severity • SEV 1~3 (IRF 2.0) • Direct Cause • (explained later) • Direct Cause System • Group of components defined at the microservice level • Direct Cause Workload • Online Service, Offline Pipeline, … Define as many fields as possible using Enums! 46 Free-form input leads to high cardinality and makes proper analysis impossible 2-4. P2: Root Fixes/ Enhancing Incident Clarity
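A sketch of this incident model in code. The Enum members below are examples for illustration; the real taxonomy of causes, systems, and workloads is SmartNews-internal.

```python
# Sketch of the incident model: everything analyzable is an Enum so that
# cardinality stays low; only the title is free text.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    SEV1 = 1  # complete failure of core UX features
    SEV2 = 2  # partial failure of core UX / complete failure of sub-UX
    SEV3 = 3  # partial failure of sub-UX features

class Status(Enum):
    OCCURRED = "occurred"
    DETECTED = "detected"
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"

class DirectCause(Enum):        # example members only
    CODE_CHANGE = "code_change"
    CONFIG_CHANGE = "config_change"
    OFFLINE_BATCH = "offline_batch"

class Workload(Enum):
    ONLINE_SERVICE = "online_service"
    OFFLINE_PIPELINE = "offline_pipeline"

@dataclass
class Incident:
    title: str
    status: Status
    severity: Severity
    direct_cause: DirectCause
    direct_cause_system: str    # ideally also an Enum of microservice groups
    direct_cause_workload: Workload
```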
  42. Incident Modeling: Direct Cause The direct cause of the incident

    Define it in a way that makes it analyzable and actionable 47 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  43. Incident Modeling: Status -- State Machine Define how incidents move

    through different states. Model it as a State Machine and clean it up. Then you can define time metrics for each state transition! 48 2-4. P2: Root Fixes/ Enhancing Incident Clarity
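A minimal sketch of that state machine, assuming the five IRF 2.0 states and recording a timestamp whenever a state is entered, which is what makes the per-transition time metrics possible.

```python
# Sketch of the incident status state machine: legal transitions plus a
# timestamp per entered state, so every transition yields a time metric.
from datetime import datetime, timezone

TRANSITIONS = {
    "occurred": {"detected"},
    "detected": {"declared"},
    "declared": {"mitigated"},
    "mitigated": {"resolved"},
    "resolved": set(),  # terminal; the postmortem happens after this
}

class IncidentStatus:
    def __init__(self, occurred_at=None):
        self.state = "occurred"
        self.entered_at = {"occurred": occurred_at or datetime.now(timezone.utc)}

    def transition(self, new_state, at=None):
        """Move to a new state, rejecting anything the model does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        self.entered_at[new_state] = at or datetime.now(timezone.utc)
```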
  44. Data Collection: Incident Report Update the incident report template to

    include the defined data fields. If the data definition is solid, the source can be flexible. (As long as the data is trustworthy, of course.) 49 2-4. P2: Root Fixes/ Enhancing Incident Clarity • Make required fields into mandatory attributes. • Add a Notion Database for timelines. • Have people record when states change. • Make it machine-readable!!!!!
  45. The rest is easy: do this, then this, and boom!

    50 ChatGPT did it overnight 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  46. Incident Dashboard: Visualize the key metrics All green! 2-4. P2:

    Root Fixes/ Enhancing Incident Clarity Hold up!
  47. What about the past data? Reports before IRF 2.0 (the

    unified format) • Different formats across Divisions • Timelines as bullet points, missing data, and more… Give it three months and we’ll have plenty of data. Right? We only have six months!! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
  48. What about past data? Of course—we Get Our Hands Dirty!

    Heck yeah! We manually migrated one year’s worth of incident reports We split the work across the team and powered through over a week 2-4. P2: Root Fixes/ Enhancing Incident Clarity Re: Lesson #2: Don’t be afraid to get your hands dirty if it serves the mission.
  49. Key Metric Breakdown: Observing and Splitting MTTR 1. Occurred 2.

    Detected 3. Declared 4. Mitigated 5. Resolved, split into Time To Detect, Time To Mitigate, and Time To Resolve. Now we know where the time is going—and where we stand! 2-4. P2: Root Fixes/ Enhancing Incident Clarity
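Here is a sketch of how those spans can be computed from the recorded state-change timestamps. The exact boundaries are my reading of the slide, not an official IRF definition, and the field names are illustrative.

```python
# Sketch: derive the three spans from per-state timestamps recorded in the
# incident report. ts maps state name -> datetime when that state was entered.
def mttr_breakdown(ts):
    def minutes(start, end):
        return (ts[end] - ts[start]).total_seconds() / 60

    return {
        "time_to_detect": minutes("occurred", "detected"),
        "time_to_mitigate": minutes("detected", "mitigated"),  # bleeding stops here
        "time_to_resolve": minutes("mitigated", "resolved"),   # permanent fix lands
    }
```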
  50. 58 3-1. What Does It Mean to Reduce Incidents? What we

    really want to reduce… the number of incidents? 🤔 No… it’s the impact caused by incidents, e.g. Revenue, Reputation, Effectiveness…. Right? Especially revenue loss…
  51. 59 Estimating the impact of an incident: (MTTD + MTTM: time to

    stop the bleeding) × Severity Factor (impact level of an incident) × Number of Incidents. In toC and ads, this pretty much defines the revenue impact • Shorten the time • Relatively easy to improve and quick to act on • Reduce severity • Ideal, but hard to control • Reduce incident count • Requires long-term efforts 3-1. What Does It Mean to Reduce Incidents?
  52. 60 Estimating the impact of an incident: (MTTD + MTTM: time to

    stop the bleeding) × Severity Factor (impact level of an incident) × Number of Incidents. In toC and ads, this pretty much defines the revenue impact • Shorten the time • Relatively easy to improve and quick to act on • Reduce severity • Ideal, but hard to control • Reduce incident count • Requires long-term efforts 3-1. What Does It Mean to Reduce Incidents? We’re starting here. It also aligns with the KPIs we set when ACT was first formed! But a few months in, we gained much better clarity.
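As a sketch, the rough impact model on this slide looks something like the following. The severity-factor weights are invented for illustration; the real ones are business-specific. Shrinking any of the three factors (count, severity, bleeding time) shrinks the estimate, which is exactly the three levers listed above.

```python
# Rough sketch of the impact model: sum over incidents of
# (time to stop the bleeding) x (severity factor). Weights are made up.
SEVERITY_FACTOR = {"SEV-1": 10.0, "SEV-2": 3.0, "SEV-3": 1.0}

def estimated_impact(incidents):
    return sum(
        (i["ttd_minutes"] + i["ttm_minutes"]) * SEVERITY_FACTOR[i["severity"]]
        for i in incidents
    )
```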
  53. 61 3-2. Approaching Incident Resolution Time How do we reduce

    MTTR? Seriously? Lesson #6: If a top-tier ace jumps into an incident, it gets resolved faster (…maybe?)
  54. Improving MTTR Clarity: State Machine 1. Occurred 2. Detected 3.

    Declared 4. Mitigated 5. Resolved. Each span has different significance — and needs a different solution! Time To Detect: mainly in the alerting domain. Time To Mitigate: bleeding, most critical — but also the easiest to improve. This is where IRF comes in. Time To Resolve: time spent on root fixes and data correction. Bleeding has stopped — now it’s about accuracy, not speed. 3-2. Approaching Incident Resolution Time
  55. 63 Approaching MTTD(Mean Time To Detect) Improve alerting • Just

    slapping alerts on detected issues doesn’t work • “over-monitoring is a harder problem to solve than under-monitoring.” — SRE: How Google Runs Production Systems • Too many false positives → real alerts get buried • Shift to SLO + Error Budget–based alerting • Ideal: if it pages, it’s an incident • Not something you can fix overnight Still an open challenge Come back to it in Chapter 4 3-2. Approaching Incident Resolution Time
  56. 64 Approaching MTTM(Mean Time To Mitigate) A unified framework: IRF

    2.0 • Clear definition of what qualifies as an incident • Standardized response flow and communication guidelines • Separation of Responder and Commander roles • Training, with aces leading by example Deploying top aces + rolling out IRF 2.0 had a huge impact! 3-2. Approaching Incident Resolution Time
  57. 65 3-3. Approaching the Number of Incidents How do we

    reduce incidents themselves? Nooo… Lesson #7: Even if top-tier aces jump into incidents… The number of incidents won’t go down!!
  58. 66 Tackle the Bottlenecks We’ve got the data, don’t we?!

    Come forth—Incident Dashboard!! 3-3. Approaching the Number of Incidents
  59. 67 We started to see where incidents happen—and why. Backed

    by data, we tackled each root cause head-on! 3-3. Approaching the Number of Incidents Tackle the Bottlenecks
  60. 68 #1 Incident Cause Lack of Testing 3-3. Approaching the

    Number of Incidents Tackle the Bottlenecks
  61. 69 Approach #1 to Lack of Testing: Why not test

    before production? Postmortem discussion… • Why was it deployed without testing? • → Because it could only be tested in production. • Why only in production? • → Lack of data, broken staging environment, etc… • …. Alright! Let’s fix the staging environment! 3-3. Approaching the Number of Incidents
  62. 70 Approach #1 to Lack of Testing: Building Out Staging

    Environments Still in Progress: Way harder than we thought! • There are tons of components. • Each division—News, Ads, Infra—has different needs and usage. • Ads is B2B, tied directly to revenue → needs to be solid and stable • News is B2C, speed of feature delivery is key Trying to build staging for everything? Not realistic, not even useful. So we started with Ads, where the demand was highest 3-3. Approaching the Number of Incidents
  63. 71 Approach #2 to Lack of Testing: What about Unit

    Tests? • Why didn’t we catch it with unit tests? • Because we didn’t have any… • … 😭 Alright! Let’s collect test coverage! 3-3. Approaching the Number of Incidents
  64. 72 Approach #2 to Lack of Testing: Analyzing Unit Test

    Coverage • We charged into systems with no coverage tracking, and cranked out PRs to generate coverage reports. • Then we plotted unit test coverage vs. number of incidents by system (chart axes: # of incidents vs. avg. coverage) 3-3. Approaching the Number of Incidents
  65. 73 • Is there a correlation between test coverage and

    incidents? → There was a correlation. • But… does higher coverage actually reduce incidents? • Not sure. Correlation ≠ causation. • Still, when we looked at individual systems and teams with domain knowledge, the places with low coverage and high incidents did have clear reasons: • Hard to write tests / no testing culture / etc… Alright! Let’s storm the low-coverage zones and crank out unit tests! Approach #2 to Lack of Testing: Analyzing Unit Test Coverage 3-3. Approaching the Number of Incidents
  66. 74 Approach #2 to Lack of Testing: Building Out Unit

    Tests Get our hands dirty! Add tests to everything we can get our hands on We hit 3–4 components… but it barely moved the needle We thought once we provided a few examples, others would follow… 3-3. Approaching the Number of Incidents 1. Use SonarQube to find files with high LOC and low coverage (sketched below) 2. Use LLMs to help generate tests 3. Repeat until the entire component hits 50%+ coverage
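Step 1 of that loop, sketched against the SonarQube web API. The host, project key, and thresholds are placeholders, and the ranking heuristic (biggest untested files first) is just one reasonable choice.

```python
# Sketch: list files with high LOC and low coverage via SonarQube's
# measures API. Host, project key, and thresholds are examples.
import os
import requests

SONAR_HOST = "https://sonarqube.example.com"
PROJECT_KEY = "ads-backend"          # hypothetical project key
TOKEN = os.environ["SONAR_TOKEN"]

def low_coverage_hotspots(min_loc=300, max_coverage=30.0):
    resp = requests.get(
        f"{SONAR_HOST}/api/measures/component_tree",
        params={
            "component": PROJECT_KEY,
            "metricKeys": "ncloc,coverage",
            "qualifiers": "FIL",     # files only
            "ps": 500,
        },
        auth=(TOKEN, ""),            # token as basic-auth username
        timeout=30,
    )
    resp.raise_for_status()
    hotspots = []
    for comp in resp.json()["components"]:
        measures = {m["metric"]: float(m.get("value", 0)) for m in comp["measures"]}
        loc = measures.get("ncloc", 0)
        cov = measures.get("coverage", 0.0)  # missing coverage treated as untested
        if loc >= min_loc and cov <= max_coverage:
            hotspots.append((comp["path"], loc, cov))
    # Biggest untested files first: good candidates for LLM-assisted tests.
    return sorted(hotspots, key=lambda t: t[1], reverse=True)
```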
  67. 75 • What if we just used LLMs to generate

    them? • Well… the accuracy just isn’t there yet. • Teams need a habit of writing unit tests continuously. • No incentive or shared sense of value • Everyone’s under deadline pressure • — no time left for tests(!) This is a team culture and organizational challenge. To be continued in Chapter 4… 3-3. Approaching the Number of Incidents Approach #2 to Lack of Testing: Building Out Unit Tests
  68. 77 Approaching Config Changes • Main types of “Config Changes”:

    • Mechanisms that control app behavior dynamically • A/B Testing and Feature Flags • Both rely on our custom-built platforms — but they’re complicated • Unintended A/B assignments and misconfigurations caused frequent issues Alright, let’s clean up A/B testing and feature flags! 3-3. Approaching the Number of Incidents
  69. 78 • Bulk deletion of unused (defaulted) feature flags •

    Establish usage guidelines for feature flags • Strengthen validation logic (sketched below) • (We were able to input configs that literally caused parse errors…) Teamed up with the A/B testing platform group and pushed for major improvements, including UX! 3-3. Approaching the Number of Incidents Approaching Config Changes
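The “strengthen validation” point, sketched as the kind of pre-save check that would have rejected those unparseable configs. The schema here is a made-up example, not the real flag platform’s format.

```python
# Sketch: validate a feature-flag config before it is saved, so a config
# that does not even parse can never reach production.
import json

REQUIRED_KEYS = {"flag_name", "enabled", "rollout_percentage"}  # example schema

def validate_flag_config(raw: str) -> dict:
    try:
        config = json.loads(raw)  # the bug class we saw: unparseable configs
    except json.JSONDecodeError as e:
        raise ValueError(f"config is not valid JSON: {e}") from e

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")

    pct = config["rollout_percentage"]
    if not isinstance(pct, (int, float)) or not 0 <= pct <= 100:
        raise ValueError("rollout_percentage must be a number between 0 and 100")
    return config
```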
  70. 79 #2 Incident Cause Offline Batch …basically, Flink 3-3. Approaching

    the Number of Incidents Tackle the Bottlenecks
  71. 80 Approaching Offline Batch • A bunch of “offline” Flink

    jobs that are actually streaming • Server → Kafka → Flink → Scylla, ClickHouse, etc. • A custom-built platform by a specialized team • Few Flink experts on the app teams • Led to frequent issues with performance and restarts Alright! Let’s revamp the Flink platform! 3-3. Approaching the Number of Incidents
  72. 81 • Improved the platform itself • Better UI, automated

    deployments, and more • Spread best practices • Cleaned up documentation • Released template projects (including tests!) • Sent direct refactor PRs to various components • Implemented best practices and tests Collaborated with the platform team. Together, we improved the platform and its docs! 3-3. Approaching the Number of Incidents Approaching Offline Batch
  73. 82 ACT era vs Before ACT Number of Incidents… +32%

    Increase!! MTTR… -48% Decrease!! Halved!!! 3-4. Results: Did We Really “Halve” Incidents?
  74. 3-4. Results: Did We Really “Halve” Incidents? 83 ACT era

    vs Before ACT Number of Incidents… +32% Increase!! MTTR… -48% Decrease!! Halved!!! Hold up!
  75. 84 Aren’t incidents actually on the rise…? • Seasonal spike?

    December was off the charts • Last-minute changes before the holidays? • A side effect of rolling out IRF2.0? • More sensitive detection due to tighter definitions • Maslow’s Hammer: • “If all you have is IRF, everything starts to look like an incident” • But after January, things started trending down Continuous improvement is still key 3-4. Results: Did We Really “Halve” Incidents?
  76. 85 On the other hand, MTTR was Halved! • Dramatic

    improvement in MTTM (Time to Mitigate) • Thanks to the power of IRF 2.0 • But MTTD (Time to Detect) barely improved • Detection is still an open challenge — Working on better alerting Definitely felt the momentum of change! 3-4. Results: Did We Really “Halve” Incidents?
  77. 86 General Comments No major changes in the severity breakdown—Impact

    levels showed a slight decrease We didn’t quite halve incidents… However, our challenges are clear, and we’ve laid the foundation for improvement! Let’s make it happen 3-4. Results: Did We Really “Halve” Incidents?
  78. 88 Again: We Want to Reduce Incidents… 4-1. Open Challenges:

    Risk Management & Alerting We’ve been laser-focused on reducing incidents… But we can never make them zero. No way…
  79. 89 Can we really make them zero? Should we? Let’s

    be real—Not happening To truly minimize incidents… • Just stop releasing? • A slow death 😇 • Throw infinite cost (people, time) at it? • More cost likely correlates with fewer incidents… • So do we keep testing until it feels 100% “safe”? 4-1. Open Challenges: Risk Management & Alerting
  80. 90 We want to balance delivery speed, quality, cost, and

    incident risk. • But the “right” balance is different for each system or project. • Required speed & release frequency • Cost we can throw in • Acceptable level of risk (≒ number of incidents, failure rate) • Ads is B2B and tied directly to revenue → needs to be rock-solid • News is B2C → speed of delivery comes first! Quantify our risk tolerance And use that to control how many incidents we accept. 4-1. Open Challenges: Risk Management & Alerting
  81. 91 Visualizing Risk Tolerance with SLOs and Error Budgets •

    SLO = Service Level Objective ~ How much failure is acceptable? • e.g. 99.9% available -> means 0.1% failure is “allowed” • Attach objectives to SLIs—metrics that reflect real UX harm • Error Budget = How much failure room we have left • When budget remains → we can step on the gas • Even bold releases are fair game • When budget runs out → UX is already suffering • No more risk—time to slow down — Ref: Implementing SLOs — Google SRE Error Budgets let us express risk tolerance—numerically. And in theory… this sounds pretty solid. 4-1. Open Challenges: Risk Management & Alerting
  82. 92 Improving Alerting — An Alert = an Incident Alright!

    Time to get those SLOs in place! • Alert based on how fast your Error Budget is burning: • Trigger alerts when the budget is being consumed too quickly • If you ignore it, you’ll run out — violate your SLO • That means real UX damage! • Don’t ignore it — it is an incident! — Ref: Alerting on SLOs Sounds Good 4-1. Open Challenges: Risk Management & Alerting
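A sketch of what “alert when the budget is being consumed too quickly” can look like in code. The 14.4 threshold and the 1-hour/5-minute window pair are the commonly cited example numbers from the Google SRE workbook’s multiwindow, multi-burn-rate alerts, not our production values.

```python
# Sketch of burn-rate alerting on an SLO. Thresholds follow the example
# commonly given in the Google SRE workbook; tune them for your service.
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent: 1.0 = exactly on budget."""
    budget = 1.0 - slo  # e.g. SLO 99.9% -> 0.1% of requests may fail
    return error_ratio / budget

def should_page(error_ratio_1h: float, error_ratio_5m: float, slo: float = 0.999) -> bool:
    # Page only when both a long and a short window burn fast, so a brief
    # blip does not wake anyone, but a sustained burn does.
    return burn_rate(error_ratio_1h, slo) > 14.4 and burn_rate(error_ratio_5m, slo) > 14.4
```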
  83. 93 4-2. Remaining Challenge: Shaping Org & Culture We tried

    rolling out SLOs in some places, but… • Defining effective SLOs isn’t easy • Biz and PdMs don’t always have the answers • There’s no time to implement them • They can’t even find time to write tests! • And even if we set them up— • if no one respects them, what’s the point? There’s no silver bullet…
  84. 94 How do we make SLOs actually work? • We

    need buy-in across the org • It’s not just about engineers—biz and PdMs have to be in too • Need a deliberate approach to culture • Ultimately, it’s about what we truly value • Do we believe that balancing cost and risk with SLOs is worth it? We want to embed SLOs— and ultimately, the mindset of SRE—into our engineering culture. 4-2. Remaining Challenge: Shaping Org & Culture
  85. 95 • So how do we approach this? • Bottom-Up:

    • Educate and train engineers, Biz, PdMs—get everyone involved • Top-Down: • Win support and direction from leadership SRE and DevOps are culture — They don’t take root in a day. It takes sweat, patience, and steady effort. 4-2. Remaining Challenge: Shaping Org & Culture How do we make SLOs actually work?
  86. 96 4-3. What Comes Next… Our 6-month mission as ACT

    was coming to an end. • The challenge remained: Install SRE into SmartNews’s engineering culture • Implement and uphold SLOs • And more… • Boost observability • Track and act on DORA metrics, etc. These require ongoing effort How can we keep the momentum going— and tackle the remaining challenges even after ACT disbands?
  87. 97 How should we disband ACT? — One idea: a

    “Distributed SRE Team” After ACT ends, ex-members return to their teams and continue SRE work using X% of their time It sounded reasonable to me… maybe? 4-3. What Comes Next…
  88. 98 • Rejected • No one wanted SRE to be

    their full-time job. • And allocating “X%” of time… yeah, that never really works. • Our decision: • Ex-ACTors would keep helping and promoting SRE, but we’d take the time to build a dedicated SRE team. We made that call as a team. There’s still plenty left unfinished—but no regrets! How should we disband ACT? — Team’s Call 4-3. What Comes Next…
  89. 99 Our Awesome Change! Our (tough!!) six-month mission as ACT

    has ended. Did we truly create an “Awesome Change”? Honestly… I don’t know. But we do feel like “We’ve taken the first step on a long journey toward SRE.” 4-3. What Comes Next… And a huge thanks to my teammates for fighting through these past six months!
  90. 100 Your Awesome Change! Let’s make it happen You go

    back to work after this conference— and your boss says, “Alright, starting today, go reduce incidents.” …What would you start with? 4-3. What Comes Next…
  91. 101 Our battle is just beginning!! Stay tuned for the

    future adventures of the ex-ACTors!
  92. 103 References • Seeking SRE: Conversations About Running Production Systems

    at Scale • Site Reliability Engineering: How Google Runs Production Systems • The Site Reliability Workbook (Google SRE) • Effective DevOps: Building a Culture of Collaboration, Affinity, and Tooling at Scale • Fearless Change: Patterns for Introducing New Ideas