Creating Awesome Change in SmartNews

Ikuo Suyama

April 14, 2025

Transcript

  1. Who am I?

    Ikuo Suyama
    • Staff Engineer
    • Ads Backend Expert
    • At SmartNews, Inc. since Nov. 2020
    • Interests: Fishing, Camping, Gunpla, Anime
  2. Agenda

    1. Launch: Assemble! The special task force "ACT"!
    2. First moves: "Get our hands dirty"!
    3. Breakthrough: Cutting incidents in half!?
    4. Return: Remaining challenges and what comes next
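  3. 1-1. How it began: "Wait a minute!"

    • CTO: Are there really "many" incidents?
      • How do we even define "many incidents"?
    • Ikuo: Are there really "many" changes?
      • Is change even the cause of the many incidents?
      • Changes to what, exactly?
    At this point both sides were going on gut feeling, with no evidence.
    ※ That said, a senior engineer's nose is not to be underestimated.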
  4. 1-2. Assembling the strongest team

    Aces gather from every Division...!
    • Divisions: Ads, News, Ranking, Push Notification, Core System (Infra), Mobile, SmartView (Article)
    • Members: Ads: Ikuo / News & Push: D / Ranking: R / CoreSystem: T / Mobile: M / SmartView: T
    • Manager: VPoE K, reporting to the CTO
    ※ For the sake of the story, please allow me to call myself an ace too.
  5. 1-2. Assembling the strongest team

    Aces gather from every Division...!
    • Divisions: Ads, News, Ranking, Push Notification, Core System (Infra), Mobile, SmartView (Article)
    • Members: Ads: Ikuo / News & Push: D / Ranking: R / CoreSystem: T / Mobile: M / SmartView: T
    • Manager: VPoE K, reporting to the CTO
    • SREŧŔŕŪũƄŝſ
    ※ For the sake of the story, please allow me to call myself an ace too.
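  6. 1-3. Giving the team direction: clear goal setting

    • What "Awesome Change" means:
      • Reduce critical incidents
      • Install SRE best practices in the organization
    • KPIs to improve:
      • Mean Time Between Failure (MTBF) / Change Failure Rate (CFR) = number of incidents
      • Mean Time to Recover (MTTR) = time to resolve an incident
    Putting "why are we here?" into words! Our VPoE handled this well.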
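  7. 1-3. Giving the team direction: clear priority setting

    • P0: Support incident handling
    • P1: Clear out critical incident action items that are still open
    • P2: Fundamental system improvements that prevent incidents
    What to do next is clear!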
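  8. 2-1. P0: Supporting incident handling

    Get our Hands Dirty: step into every incident!
    • When an incident occurs, some ACT member's PagerDuty goes off first
    • In the end, every ACT member is invited to wherever the incident is happening
    • If it is in your home domain, join the firefighting
    • Even if it is not, volunteer for status updates, pulling in the right people, liaising with the business side, and so on
    Tough!!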
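  9. 2-2. P1: Clearing out action items

    The forgotten action items: some have certainly been left unaddressed and forgotten.
    • There was already a culture of writing incident reports
      • Action items for preventing recurrence were recorded too
      • Wonderful!!
    • But the action items were not managed
      • No owner, due date, or completion status
      • !!??!??!
    • And the report format differed from Division to Division
      • In fact, it differed from author to author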
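  10. 2-2. P1: Clearing out action items

    Get our Hands Dirty: organizing the data by hand
    • "Right! Let's migrate a full year of action items (AI) from past incident reports into a Notion Database, all by hand!"
    • ※ Once it is in a Database, the data can be pulled through the API, so anything becomes possible
    • Lesson 1: Keep data in a machine-processable format wherever possible!!
    • Lesson 2: Do not shy away from unglamorous means to reach the goal!!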
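  11. 2-3. P2: Fundamental improvement: introducing a unified incident response process

    Why was there no company-wide unified process in the first place?
    • Each Division has its own way and style of responding to incidents
    • There had been an earlier attempt at a company-wide protocol
      • "Incident Response Framework: IRF"
      • But it had not taken hold and was not in use
      • It only took one particular Division's requirements into account!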
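  12. 2-3. P2: Fundamental improvement: introducing a unified incident response process

    How do we build a company-wide unified process and framework?
    • IRF itself was well made as a process
    • Using it as the base, we aimed for:
      • A procedure that can be unified company-wide ... fed by each ace's domain knowledge and experience
      • Something lightweight
        • Complicated procedures get in the way in the middle of an incident!
    • We also borrowed the good parts of frameworks published by other companies
      • e.g. PagerDuty Incident Response
    This was possible precisely because the team covered every area.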
  13. 2-3. P2: Fundamental improvement: IRF 2.0 Contents

    1. Role, Playbook
    2. Severity Definition
    3. Workflow
    4. Communication Guideline
    5. Incident Report Template, Postmortem
    Only the key parts are introduced here. See the slides for details!
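  14. IRF 2.0: Role, Playbook

    • On-Call Engineer
      • The engineer who receives the page. Triages alerts and, if necessary, escalates to the IC and starts IRF (declares an Incident).
    • Incident Commander (IC)
      • Leads the incident response. Gathers the necessary people and organizes information. May double as Communication Lead (CL). Usually a Tech Lead or Engineering Manager.
      • Responsible for organizing information and status and for making decisions, not for the hands-on firefighting
    • Responder
      • Does the hands-on firefighting (rollbacks, configuration changes).
    • Communication Lead (CL)
      • Handles communication with external stakeholders (here, anyone other than engineers).
    Separating the responsibilities of the IC and the Responder is the key.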
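  15. IRF 2.0: Severity Definition

    The IC sets a provisional severity when declaring the incident; the final rating is decided in the postmortem.
    • 🔥 SEV-1
      • A core UX feature, such as news reading, is completely down
    • 🧨 SEV-2
      • A core UX feature is partially down, or a sub UX feature is completely down
    • 🕯 SEV-3
      • A sub UX feature is partially down
    It is essential to get a rough read on the SEV during the initial response (severe incidents should be resolved sooner).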
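  16. IRF 2.0: Workflow

    The flow from the start of an incident to its end. 🩸 = Bleeding: the impact is still ongoing.
    1. 🩸 Occurrence
      • The problematic event occurs. Triggers include deployments and configuration changes
    2. 🩸 Detection
      • The on-call engineer has detected the problem, for example through an alert. Triage begins
    3. 🩸 Declaration
      • The incident is "declared". Following IRF, work to stop the bleeding begins under the IC's command
      • The necessary external communication starts at the same time. Updates continue for as long as the bleeding lasts
    4. ❤🩹 Mitigation
      • The immediate cause is removed, for example by rolling back the change, and the damage stops spreading
    5. Resolution
      • The permanent fix is complete, such as a bug fix or data correction. The bleeding has fully stopped
    6. Postmortem
      • Based on the incident report, investigate the root cause and consider measures to prevent recurrence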
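The six workflow steps above read naturally as a small state machine. The Python sketch below is illustrative only: `State`, `advance`, and `is_bleeding` are hypothetical names, and the rule that an incident may only move to the immediately following step is an assumption, since the slide gives the order but does not say whether steps can be skipped.

```python
from enum import Enum


class State(Enum):
    OCCURRED = 1
    DETECTED = 2
    DECLARED = 3
    MITIGATED = 4
    RESOLVED = 5
    POSTMORTEM = 6


# Steps the slide marks with the "bleeding" icon (impact still ongoing).
BLEEDING = {State.OCCURRED, State.DETECTED, State.DECLARED}


def advance(current: State, target: State) -> State:
    """Move to the next step; assumption: no step may be skipped."""
    if target.value != current.value + 1:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def is_bleeding(state: State) -> bool:
    """True while the incident is still causing damage."""
    return state in BLEEDING
```

Recording the timestamp of every successful `advance` call would give exactly the per-state timeline that the later slides use to break down MTTR.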
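  17. IRF 2.0: Communication Guideline

    Defines where communication happens (Slack channels).
    • #incident
      • Status announcements to everyone, and communication with external stakeholders
    • #incident-irf-[incidentId]-[title]
      • Technical communication for solving the problem. Gather all related information and discussion here
      • If needed, open a WAR ROOM (online, Google Meet) and assemble there
    Having the discussion and information in a single channel is convenient when generating the report.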
  18. IRF 2.0: Incident Report Template & Postmortem

    Applying a company-wide unified format:
    • Summary
    • Impact
    • Direct Cause, Mitigation
    • Root Cause Analysis (5-whys)
      • It is important to analyze the direct cause and the root cause separately!
      • Set action items against these to prevent recurrence
    • Action Items
    • Timeline
      • In a machine-processable format!!!!
    Templates that differed by Division were unified (important), and postmortems were centralized.
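  19. 2-4. P2: Fundamental improvement: raising the resolution of incidents

    Data definition: modeling an incident
    • Attributes of an Incident
      • Title
      • Status
        • State Machine (described later)
      • Severity
        • SEV 1~3 (IRF 2.0)
      • Direct Cause
        • Described later
      • Direct Cause System
        • Groups of components, in units that are meaningful as a MicroService
      • Direct Cause Workload
        • Online Service, Offline Pipeline, …
    Define Enums wherever possible: with free-form input the cardinality gets too high for any meaningful analysis.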
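As a rough illustration of "define Enums wherever possible", the attributes listed above (Title, Status, Severity, Direct Cause, Direct Cause System, Direct Cause Workload) could be modelled as follows. This is a sketch under assumptions, not the team's actual schema: the `System` members are invented placeholders, the `Status` values are taken from the state list shown elsewhere in the deck, and Direct Cause is left as free text because its categories are not given in this transcript.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OCCURRED = "occurred"
    DETECTED = "detected"
    DECLARED = "declared"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


class Workload(Enum):
    ONLINE_SERVICE = "online_service"
    OFFLINE_PIPELINE = "offline_pipeline"


class System(Enum):
    # Hypothetical component groups; the real list is not in the slides.
    ADS_DELIVERY = "ads_delivery"
    NEWS_FEED = "news_feed"


@dataclass(frozen=True)
class Incident:
    title: str
    status: Status
    severity: Severity
    direct_cause: str  # categories are "described later" in the deck; free text here
    direct_cause_system: System
    direct_cause_workload: Workload


def count_by(incidents, attribute):
    """Count incidents per value of one enum-typed attribute."""
    return Counter(getattr(incident, attribute) for incident in incidents)
```

Because every grouping key comes from a closed set, a call such as `count_by(incidents, "direct_cause_workload")` yields a handful of buckets instead of one bucket per spelling variant, which is the cardinality problem the slide warns about.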
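  20. 2-4. P2: Fundamental improvement: raising the resolution of incidents

    Data collection: incident reports
    The incident report template was updated to include the data items defined earlier.
    • Required items became mandatory Attributes
    • Added a Notion Database for entering the EventTimeline
      • Reporters record the time of each State change
      • In a machine-processable format!!!!!
    With a solid data definition, the source can be flexible (provided, of course, that the data is trustworthy).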
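  21. 2-4. P2: Fundamental improvement: raising the resolution of incidents

    Observing key metrics: observing MTTR and breaking it down
    • States: 1. Occurred → 2. Detected → 3. Declared → 4. Mitigated → 5. Resolved
    • Intervals: Time To Detect, Time To Mitigate, Time To Resolve
    Now we know where the time goes and where we currently stand!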
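Once each state change carries a timestamp, the three intervals fall out by subtraction. The sketch below assumes a timeline stored as a dict keyed by lower-case state names (a hypothetical layout), and uses the interval boundaries the deck describes: Time To Detect runs from Occurred to Detected, Time To Mitigate from Detected to Mitigated, and Time To Resolve from Mitigated to Resolved.

```python
from datetime import datetime, timedelta
from statistics import mean


def breakdown(timeline: dict) -> dict:
    """Split one incident's timeline into the three intervals."""
    return {
        "time_to_detect": timeline["detected"] - timeline["occurred"],
        "time_to_mitigate": timeline["mitigated"] - timeline["detected"],
        "time_to_resolve": timeline["resolved"] - timeline["mitigated"],
    }


def mean_minutes(timelines: list, interval: str) -> float:
    """Mean length of one interval, in minutes, across incidents."""
    seconds = [breakdown(t)[interval].total_seconds() for t in timelines]
    return mean(seconds) / 60


def at(hour: int, minute: int) -> datetime:
    """Helper for the example data: a time on an arbitrary fixed day."""
    return datetime(2025, 1, 1, hour, minute)


# Made-up example timelines, for illustration only.
EXAMPLE = [
    {"occurred": at(10, 0), "detected": at(10, 10), "mitigated": at(10, 40), "resolved": at(12, 40)},
    {"occurred": at(9, 0), "detected": at(9, 30), "mitigated": at(9, 40), "resolved": at(10, 40)},
]
```

Calling `mean_minutes(EXAMPLE, "time_to_mitigate")` then reports the mean mitigation time for the sample, and the same call with the other two keys shows which interval dominates.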
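  22. 3-1. What does "reducing incidents" mean?

    Estimating the impact of incidents:
    impact ≈ Σ over incidents of (MTTD + MTTM) × Severity Factor
    For toC and advertising businesses, this roughly determines the revenue impact.
    • MTTD + MTTM (time taken to stop the bleeding): make it as short as possible
      • Comparatively easy to improve and easy to get started on
    • Severity Factor (how badly the incident hurts): reduce large incidents as much as possible
      • But this is hard to control
    • Number of incidents: reduce the count as much as possible
      • Requires medium- to long-term work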
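The impact estimate on this slide, a sum over incidents of bleeding time (time to detect plus time to mitigate) multiplied by a severity factor, can be sketched in a few lines. The numeric weights in `SEVERITY_FACTOR` are purely hypothetical; the deck does not give values.

```python
# Hypothetical weights per SEV level; the deck does not give numbers.
SEVERITY_FACTOR = {1: 10.0, 2: 3.0, 3: 1.0}


def estimated_impact(incidents) -> float:
    """Sum (time to detect + time to mitigate) * severity factor.

    Each incident is a (ttd_minutes, ttm_minutes, sev) tuple.
    """
    return sum(
        (ttd + ttm) * SEVERITY_FACTOR[sev]
        for ttd, ttm, sev in incidents
    )
```

The three levers named on the slide map directly onto this expression: shorter bleeding time shrinks each term, lower severity shrinks the weight, and fewer incidents means fewer terms.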
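  23. 3-1. What does "reducing incidents" mean?

    Estimating the impact of incidents:
    impact ≈ Σ over incidents of (MTTD + MTTM) × Severity Factor
    • MTTD + MTTM (time taken to stop the bleeding): make it as short as possible
      • Comparatively easy to improve and easy to get started on
    • Severity Factor (how badly the incident hurts): reduce large incidents as much as possible
      • But this is hard to control
    • Number of incidents: reduce the count as much as possible
      • Requires medium- to long-term work
    This is where we start.
    It also matches the KPIs set when ACT was formed! But a few months of experience have sharpened the picture.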
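  24. 3-2. Approaching incident resolution time

    Raising the resolution of MTTR: the state machine
    • States: 1. Occurred → 2. Detected → 3. Declared → 4. Mitigated → 5. Resolved
    • Time To Detect: time taken to detect. Mainly the domain of Alerting
    • Time To Mitigate: time from detection to stopping the bleeding. The most critical, yet easy to approach. The domain of IRF
    • Time To Resolve: time from stopping the bleeding through the root fix, data correction, and other follow-up. The bleeding has already stopped, so accuracy matters more than speed
    Each has a different importance and calls for different measures!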
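  25. 3-2. Approaching incident resolution time: MTTD (Mean Time To Detect)

    Getting alerts in order
    • "Add a new alert for every problem we missed or detected late" does not work
      • "over-monitoring is a harder problem to solve than under-monitoring." (SRE: How Google Runs Production Systems)
      • Too many false-positive alerts, so real problems get buried
    • Shift to alerting based on SLO + Error Budget
      • The ideal: if you get paged, it is an incident
      • This cannot be done overnight!
    A remaining challenge, covered in chapter 4!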
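  26. 3-2. Approaching incident resolution time: MTTM (Mean Time To Mitigate)

    Unified framework: IRF 2.0
    • Make the criteria for an incident explicit
    • Unify the response flow and communication guidelines
    • Separate Responder and Commander
    • Plus training, with the aces leading by example
    Deploying the aces and rolling out IRF 2.0 worked remarkably well!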
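  27. 3-3. Approaching the number of incidents

    Approach 1 to insufficient testing: test before going to production?
    In a postmortem...
    • Why is it released to production without being tested?
      • Because it can only be tested in production
    • Why can it only be tested in production?
      • Not enough data, and inadequate test environments such as Staging
    • ...
    Right! Let's get the Staging environment in order!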
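  28. 3-3. Approaching the number of incidents

    Approach 1 to insufficient testing: building out the Staging environment
    Only halfway there: ten times harder than imagined!
    • There is an overwhelming number of components
    • Requirements and usage differ by Division (News, Ads, Infra)
      • Ads is toB and tied directly to money! Careful and rigorous
      • News is toC and puts speed of feature delivery first!
    "Let's just set up STG for everything" is unrealistic and seems pointless. We are starting with Ads, where demand is comparatively strong.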
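  29. 3-3. Approaching the number of incidents

    Approach 2 to insufficient testing: analyzing unit tests (UT)
    • Does test coverage correlate with failures? → A correlation did show up.
    • But whether raising test coverage reduces failures is unknown (correlation is not causation)
    • Still, looking system by system and team by team with domain knowledge, the places with low coverage and many incidents do seem to have reasons
      • UT is hard to write there, there is no culture of writing UT, and so on
    Right! Let's charge into wherever UT is scarce and write UT like crazy!!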
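  30. 3-3. Approaching the number of incidents

    Approach 2 to insufficient testing: building up unit tests
    Get our Hands Dirty: adding UT to everything in sight
    2. Use SonarQube to find files with many lines and low coverage
    3. Write UT like crazy with the help of an LLM
    4. Repeat until the component as a whole is above 50%
    We did 3 to 4 components, but it was a drop in the ocean. We had assumed that once samples existed, the teams would carry on from there...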
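  31. 3-3. Approaching the number of incidents

    Approach 2 to insufficient testing: building up unit tests
    • What about auto-generating with an LLM?
      • For now the accuracy is not good enough
    • Fundamentally, the team needs a habit of writing UT continuously
      • But there is no incentive or shared value for doing so
      • Teams are chasing deadlines and cannot spare time for UT (!)
    An approach to organization and culture is needed! A remaining challenge, continued in chapter 4...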
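  32. 3-3. Approaching the number of incidents

    Approach to configuration changes
    • The main "configuration changes"
      • Mechanisms that control application behavior online
      • A/B tests, Feature Flags
    • Both run on in-house platforms, which are complex
    • Problems were frequent: A/B tests applied to unintended users, and incorrect settings
    Right! Let's get A/B testing and feature flags in order!!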
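  33. 3-3. Approaching the number of incidents

    Approach to configuration changes
    • Bulk deletion of unused feature flags (ones that had effectively become the default)
    • Defined criteria for using feature flags
    • Strengthened validation
      • (It had been possible to enter settings that cause parse errors...)
    Working together with the A/B test platform team, we pushed through major improvements, usability included!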
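  34. 3-3. Approaching the number of incidents

    Approach to Offline Batch
    • Many offline batches are Flink streaming jobs
      • Server → Kafka → Flink → Scylla, ClickHouse, …
    • The platform is developed in-house by a dedicated team
    • Application teams have few Flink experts, so problems with performance and at restart were frequent
    Right! Let's get the Flink platform in order!!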
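  35. 3-3. Approaching the number of incidents

    Approach to Offline Batch
    • Improve the platform itself
      • Better UI, automated deployment, …
    • Spread best practices
      • Improve documentation
      • Publish a template project that includes tests
    • Send refactoring PRs directly to each component
      • Implement the best practices and tests
    With the platform team's cooperation, we improved the platform and its documentation!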
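  36. 3-4. Results: were incidents cut in "half"!?

    Incidents are actually increasing...?
    • Seasonality? December stands out as far higher
      • Accidents from changes squeezed in before the holidays?
    • A side effect of IRF 2.0 taking hold?
      • Defining what an incident is made detection more sensitive
      • Maslow's hammer: "If all you have is IRF, everything looks like an incident"
    • A downward trend from January onward
    Continuous improvement work is needed.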
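  37. 3-4. Results: were incidents cut in "half"!?

    On the other hand, MTTR was cut in half!
    • Dramatic improvement in MTTMitigate in particular
      • We attribute this to IRF 2.0
    • MTTDetect, however, shows no major improvement
      • Detection is a challenge for the future. We will work on improving alerts
    A solid sense that the improvements are working!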
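  38. 4-1. Remaining challenges: risk management, alert improvement

    Do we even want to (and can we) bring incidents to zero? Realistically, neither option is feasible...
    • How would we minimize incidents?
      • Stop releasing?
        • → A slow death 😇
      • Pour in unlimited cost (resources, time)?
        • Resources invested and incident rate are (probably) correlated
        • So: keep testing until it can be considered 100% safe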
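  39. 4-1. Remaining challenges: risk management, alert improvement

    We want to balance delivery dates against quality, and cost against incidents.
    • But the right balance differs by system and project
      • Required speed and release frequency
      • Cost that can be invested
      • Acceptable risk (≒ number of incidents, change failure rate)
    • Examples:
      • Ads is toB and tied directly to money! Careful and rigorous
      • News is toC and puts speed of feature delivery first!
    = We want to quantify risk tolerance and keep incidents under control.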
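  40. 4-1. Remaining challenges: risk management, alert improvement

    Making risk tolerance explicit: SLO and Error Budget
    • SLO = Service Level Objective ~ how much error is allowed?
      • e.g. 99.9% available -> 0.1% is tolerated
      • Set the Objective on an SLI (Indicator) that reflects real harm to UX
    • Error Budget = how much tolerable error is still left
      • Error Budget remaining = we can step on the accelerator
        • Even somewhat reckless releases are acceptable
      • Error Budget exhausted = UX has been damaged to an unacceptable level
        • No more risk may be taken. Slow down
    Ref: Implementing SLOs, Google SRE
    The Error Budget can express risk tolerance. In theory, it looks good.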
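To make the 99.9% example concrete, here is a minimal sketch of the error-budget arithmetic, assuming an event-based SLI (good requests versus bad requests within a window). The function names are hypothetical.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Bad events the SLO tolerates in a window of total_events events."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_events)
    return (budget - bad_events) / budget
```

With a 99.9% target and one million requests in the window, the budget is about 1,000 failed requests; after 250 failures roughly three quarters of it remains, and a negative result means the budget is exhausted and it is time to slow down.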
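  41. 4-1. Remaining challenges: risk management, alert improvement

    Improving alerts: alert = incident
    Right! Let's get SLOs in order!!
    • Alert on the Burn Rate, the speed at which the Error Budget is consumed
      • Alert on rapid Error Budget consumption
      • Left alone, the budget runs out, which means the SLO is violated
        • → Real harm to UX!
        • → It must not be left alone! = an incident
    Ref: Alerting on SLOs
    Looks promising.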
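Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below is a single-window check with hypothetical function names; the default threshold of 14.4 is an illustrative choice (at that rate, one hour consumes 14.4 × 1 h / 720 h = 2% of a 30-day budget), not a value taken from the deck.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # no traffic in the window, nothing burned
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def should_page(bad_events: int, total_events: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page when the budget is burning at or above the threshold rate."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold
```

For a 99.9% SLO, 20 failures in 1,000 requests is a 2% error rate against an allowed 0.1%, a burn rate of about 20, so `should_page(20, 1000, 0.999)` pages; a single failure in 1,000 requests burns at about 1 and does not.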
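  42. 4-2. Remaining challenges: approaching organization and culture

    What does it take to make SLOs work?
    • Understanding and cooperation across the whole organization is essential
      • Business and PdM need to be involved, not only engineers
    • An approach to culture is needed
      • Ultimately it is a question of what we value
      • Can we believe in the value of balancing cost and risk through SLOs, and act on it?
    We want to install SLOs, and by extension SRE, into the engineering culture.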
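  43. 4-2. Remaining challenges: approaching organization and culture

    What does it take to make SLOs work?
    • How to approach it
      • Bottom-up: evangelism and training for stakeholders in Eng, Biz, and PdM
      • Top-down: backing and direction from senior leadership
    SRE and DevOps are culture; they are not built in a day. Steady, continuous, hands-on effort is needed.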
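  44. 4-3. What comes next...

    ACT's fixed six-month term was coming to an end.
    • Challenge: install SRE into the SmartNews (SN) engineering culture
      • Implement SLOs and adhere to them
      • And more...
        • Improve Observability
        • Collect, monitor, and adhere to DORA Metrics, etc.
    → Continuous work is needed. How do we keep the momentum and face the remaining challenges after ACT disbands?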
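  45. 4-3. What comes next...

    How should ACT disband? The team's answer:
    • Rejected!
      • Not everyone wants to make SRE their full-time job
      • Allocating "X%" of people's time has never worked
    • Conclusion
      • Ex-ACTors will keep advocating for and helping with SRE, and a dedicated SRE team will be stood up even if it takes time
    The team discussed it and decided together. Much was left undone, but there are no regrets!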
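  46. 4-3. What comes next...

    Our Awesome Change!
    Our six-month (tough!!) term as ACT is over. Honestly, we do not know whether we created an Awesome Change, but we do feel we have taken the first step on the long journey that is SRE! And thank you to the teammates who fought through these six months!!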
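  47. References

    • SREをはじめよう ―個人と組織による信頼性獲得への第一歩 (Becoming SRE)
    • SRE サイトリライアビリティエンジニアリング ―Googleの信頼性を支えるエンジニアリングチーム (Site Reliability Engineering: How Google Runs Production Systems)
    • SRE Google Workbook
    • Effective DevOps ―4本柱による持続可能な組織文化の育て方 (Effective DevOps)
    • Fearless Change アジャイルに効く アイデアを組織に広めるための48のパターン (Fearless Change)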