Creating Awesome Change in SmartNews

Ikuo Suyama

April 14, 2025

Transcript

  1. Who am I?
     Ikuo Suyama
     • Staff Engineer
     • Ads Backend Expert
     • Nov. 2020~ SmartNews, Inc.
     • Interest: Fishing, Camping, Gunpla, Anime
  2. Agenda
     1. Launch: Assemble! The special task force "ACT"!
     2. First moves: "Get our hands dirty"!
     3. Breakthrough: Cutting incidents in half!?
     4. Return: Remaining challenges and what comes next
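  3. 1-1. The beginning
     Wait a minute!
     • CTO: Are there really "too many" incidents?
       • How do we even define "too many incidents"?
     • Ikuo: Are there really too many changes?
       • Are changes even the reason there are so many incidents?
       • Changes to what, exactly?
     At this point both sides were going on a hunch, with no evidence.
     * That said, a senior engineer's nose should not be underestimated.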
  4. 1-2. Assembling the strongest team
     Aces gather from every Division...!
     • Divisions: Ads, News, Ranking, Push Notification, Core System (Infra), Mobile, SmartView (Article)
     • Members: Ads: Ikuo! / News & Push: D! / Ranking: R! / CoreSystem: T! / Mobile: M! / SmartView: T!
     • Manager: VPoE K!, reporting to the CTO
     * For the sake of the story, please allow me to call myself an ace too.
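  5. 1-2. Assembling the strongest team
     Aces gather from every Division...!
     • Divisions: Ads, News, Ranking, Push Notification, Core System (Infra), Mobile, SmartView (Article)
     • Members: Ads: Ikuo! / News & Push: D! / Ranking: R! / CoreSystem: T! / Mobile: M! / SmartView: T!
     • Manager: VPoE K!, reporting to the CTO
     • Label: "knows a bit of SRE"
     * For the sake of the story, please allow me to call myself an ace too.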
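  6. 1-3. Giving the team direction: clear goal setting
     • What "Awesome Change" means
       • Reduce critical incidents
       • Install SRE best practices in the organization
     • KPIs to improve:
       • Mean Time Between Failure (MTBF) / Change Failure Rate (CFR) = number of incidents
       • Mean Time to Recover (MTTR) = incident resolution time
     Putting "why are we here?" into words! The VPoE handled this well.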
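  7. 1-3. Giving the team direction: clear priority setting
     • P0: Support incident handling
     • P1: Clear out critical incident action items that are still unaddressed
     • P2: Fundamental system improvements that prevent incidents from occurring
     What to do right in front of us is clear!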
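  8. 2-1. P0: Supporting incident handling
     Get our Hands Dirty: step into every incident!
     • When an incident occurs, some ACT member's PagerDuty goes off first
     • In the end, all ACT members get invited to wherever the incident is happening
     • If it is your home domain, join the firefighting
     • If not, volunteer for status updates, securing the necessary people, acting as liaison with the business side, and so on
     Tough!!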
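  9. 2-2. P1: Clearing out action items
     The forgotten action items: some have surely been left unaddressed and forgotten
     • There was already a culture of writing incident reports
       • Action items for preventing recurrence were also written down
       • Wonderful!!
     • But the action items were not managed
       • Owner, due date, completion status
       • !!??!??!
     • And the report format differed per Division
       • In fact, it differed per person in charge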
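  10. 2-2. P1: Clearing out action items
      Get our Hands Dirty: organizing the data by hand
      • "I know! Let's migrate every action item (AI) from the past year of incident reports into a Notion Database, all by hand!"
        * Once it is in a Database, the data can be fetched via the API, so anything becomes possible
      • Lesson 1: Keep data in a machine-processable format wherever possible!!
      • Lesson 2: Do not shy away from scrappy, hands-on means if they serve the goal!!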
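  11. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      Why was there no company-wide unified process in the first place?
      • Different Divisions have different ways and customs of responding to incidents
      • There had been a past attempt to create a company-wide unified protocol
        • "Incident Response Framework: IRF"
        • But it had not spread and was not being used
        • It only considered the requirements of one particular Division!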
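  12. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      How do we build a company-wide unified process and framework?
      • IRF itself was well made as a process
      • Using it as the base, we aimed for:
        • A procedure that can be unified across the company, built by pouring in domain knowledge and experience from each ace
        • And something lightweight
          • Complicated procedures get in the way in the middle of an incident!
      • We also referred to frameworks published by other companies and adopted their good parts
        • e.g. PagerDuty Incident Response
      This was possible precisely because the team covered every area.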
  13. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      IRF 2.0 Contents (the important parts are introduced here)
      1. Role, Playbook
      2. Severity Definition
      3. Workflow
      4. Communication Guideline
      5. Incident Report Template, Postmortem
      Please check the slides for details!
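  14. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      IRF 2.0: Role, Playbook
      • On-Call Engineer
        • The engineer who receives the on-call page. Triages alerts and, if necessary, escalates to the IC and starts IRF (declares an Incident).
      • Incident Commander (IC)
        • The person who directs the incident response. Gathers the necessary people and organizes information. May also take on external communication (CL). Usually a Tech Lead / Engineering Manager.
        • Responsible for organizing information and the situation and for making decisions, not for the actual firefighting
      • Responder
        • Does the actual firefighting (rollbacks, configuration changes).
      • Communication Lead (CL)
        • Handles communication with external stakeholders (here, anyone other than engineers).
      The key is separating the responsibilities of the IC and the Responder.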
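  15. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      IRF 2.0: Severity Definition
      The IC decides the severity provisionally when declaring the incident. The final assessment is settled in the postmortem.
      • 🔥 SEV-1
        • Complete outage of a core UX feature such as news reading
      • 🧨 SEV-2
        • Partial outage of a core UX feature, or complete outage of a sub UX feature
      • 🕯 SEV-3
        • Partial outage of a sub UX feature
      It is essential to have a rough estimate of the SEV at the very first response
      (severe incidents should be resolved sooner).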
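The three severity levels on this slide can be read as a small decision table. The sketch below is only an illustration of that reading; the function name `provisional_severity` and its two boolean inputs are assumptions and do not come from IRF 2.0 itself.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def provisional_severity(core_ux: bool, full_outage: bool) -> Severity:
    """Provisional SEV as the IC might pick it at declaration time.

    core_ux:     True if a core UX feature is affected, False for a sub UX feature.
    full_outage: True for a complete outage, False for a partial one.
    (Both parameters are hypothetical names for this sketch.)
    """
    if core_ux and full_outage:
        return Severity.SEV1  # core UX feature completely down
    if core_ux or full_outage:
        return Severity.SEV2  # core UX partially down, or sub UX completely down
    return Severity.SEV3      # sub UX partially down
```

As the slide notes, this value is provisional; the final severity is decided in the postmortem.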
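  16. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      IRF 2.0: Workflow
      The flow from the start of an incident to its end. 🩸: Bleeding, a status in which damage is still ongoing
      1. 🩸 Occurrence
         • The problematic event occurs. Triggers include deployments and configuration changes
      2. 🩸 Detection
         • The on-caller has detected the problem through alerts or other means. Triage starts
      3. 🩸 Declaration
         • The incident is "declared". Following IRF, the response to stop the bleeding starts under the IC's direction
         • The necessary external communication starts at the same time. Updates continue for as long as the bleeding lasts
      4. ❤🩹 Mitigation
         • The cause is removed temporarily, for example by rolling back the change, and the damage stops spreading
      5. Resolution
         • The permanent fix is complete: the defect is fixed, data is corrected, and so on. Bleeding has fully stopped
      6. Postmortem
         • Based on the incident report, investigate the root cause and consider measures to prevent recurrence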
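  17. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      IRF 2.0: Communication Guideline
      Defines the places (Slack channels) used for communication
      • #incident
        • Status announcements to everyone, and communication with external stakeholders
      • #incident-irf-[incidentId]-[title]
        • Technical communication for solving the problem. Gather all related information and discussion here
        • If needed, set up a WAR ROOM (online, Google Meet) and gather there
      Having the discussion and information in a single channel is handy when generating the report.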
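A per-incident channel name following the `#incident-irf-[incidentId]-[title]` pattern could be derived mechanically. This is a minimal sketch: the helper name `incident_channel_name`, the integer ID, and the slug rule (lowercase, non-alphanumeric runs collapsed to `-`) are assumptions, since the slide does not specify how the ID or title are formatted.

```python
import re


def incident_channel_name(incident_id: int, title: str) -> str:
    """Build a channel name of the form #incident-irf-[incidentId]-[title].

    The slug rule below is an assumption made for this sketch.
    """
    # Lowercase the title and collapse anything that is not a-z or 0-9 into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"#incident-irf-{incident_id}-{slug}"
```

For example, `incident_channel_name(42, "Ads API 5xx spike")` yields `#incident-irf-42-ads-api-5xx-spike`.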
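  18. 2-3. P2: Fundamental improvement: introducing a unified incident response process
      IRF 2.0: Incident Report Template & Postmortem
      Applying a company-wide unified format
      • Summary
      • Impact
      • Direct Cause, Mitigation
      • Root Cause Analysis (5-whys)
        • It is important to analyze the direct cause and the root cause separately!
        • Set action items against the root cause to prevent recurrence
      • Action Items
      • Timeline
        • In a machine-processable format!!!!
      Unify the templates that differed per Division (important), and centralize postmortems.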
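  19. 2-4. P2: Fundamental improvement: raising the resolution of incidents
      Data definition: modeling an incident
      • Attributes of an Incident
        • Title
        • Status
          • State Machine (described later)
        • Severity
          • SEV 1~3 (IRF 2.0)
        • Direct Cause
          • Described later
        • Direct Cause System
          • Groups of components, at the level of meaningful microservice units
        • Direct Cause Workload
          • Online Service, Offline Pipeline, …
      Define Enums wherever possible: with free-form input the cardinality gets too high for any decent analysis.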
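The "define Enums wherever possible" advice can be illustrated with a small data model. This is a sketch under assumptions, not the team's actual schema: the class and field names are invented for illustration, the `Status` values are taken from the state names that appear on the later state-machine slides, and `direct_cause` and `direct_cause_system` are left as plain strings because the slide does not list their allowed values.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OCCURRED = "Occurred"
    DETECTED = "Detected"
    DECLARED = "Declared"
    MITIGATED = "Mitigated"
    RESOLVED = "Resolved"


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


class Workload(Enum):
    ONLINE_SERVICE = "Online Service"
    OFFLINE_PIPELINE = "Offline Pipeline"


@dataclass(frozen=True)
class Incident:
    title: str
    status: Status
    severity: Severity
    direct_cause: str            # allowed values not given on the slide
    direct_cause_system: str     # would ideally be an Enum of component groups
    direct_cause_workload: Workload


def count_by_workload(incidents: list[Incident]) -> Counter:
    """Low-cardinality Enum fields make group-by style analysis trivial."""
    return Counter(i.direct_cause_workload for i in incidents)
```

Because `Workload("Offline Pipeline")` succeeds while an unknown string raises `ValueError`, free-form values are rejected at input time instead of polluting later analysis.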
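  20. 2-4. P2: Fundamental improvement: raising the resolution of incidents
      Data collection: incident reports
      Update the incident report template so that it includes the data items defined earlier
      • Make the necessary items required Attributes
      • Add a Notion Database for entering the EventTimeline
        • Have people record the time at which the State changed
        • In a machine-processable format!!!!!
      If the data definition is solid, the source can be flexible
      (assuming, of course, that the data can be trusted).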
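  21. 2-4. P2: Fundamental improvement: raising the resolution of incidents
      Observing key metrics: observing and breaking down MTTR
      States: 1. Occurred → 2. Detected → 3. Declared → 4. Mitigated → 5. Resolved
      Intervals: Time To Detect, Time To Mitigate, Time To Resolve
      We could now see where the time goes and where we currently stand!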
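  22. 3-1. What does "reducing incidents" mean?
      Estimating the impact of incidents:
        Impact ≈ Σ over all incidents of (MTTD + MTTM) × Severity Factor
      For toC and advertising businesses, this roughly determines the revenue impact.
      • MTTD + MTTM (time taken to stop the bleeding): we want it as short as possible
        • Relatively easy to improve, easy to get started on
      • Severity Factor (how severe the incident's impact is): we want fewer large incidents
        • But it is hard to control
      • Number of incidents: we want as few as possible
        • Requires mid- to long-term work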
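  23. 3-1. What does "reducing incidents" mean?
      Estimating the impact of incidents:
        Impact ≈ Σ over all incidents of (MTTD + MTTM) × Severity Factor
      • MTTD + MTTM (time taken to stop the bleeding): we want it as short as possible
        • Relatively easy to improve, easy to get started on (we start here)
      • Severity Factor (how severe the incident's impact is): we want fewer large incidents
        • But it is hard to control
      • Number of incidents: we want as few as possible
        • Requires mid- to long-term work
      This also matches the KPIs set when ACT was first formed! But a few months of experience sharpened our resolution.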
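The impact estimate on this slide, the sum over incidents of (time to stop the bleeding) × (severity factor), can be written out as a short calculation. This is a sketch under assumptions: the slide gives no concrete severity factors, so the weights in `SEVERITY_FACTOR` are hypothetical placeholders, and the function name and input shape are invented for illustration.

```python
# Hypothetical weights; the talk does not give concrete values.
SEVERITY_FACTOR = {"SEV-1": 1.0, "SEV-2": 0.5, "SEV-3": 0.1}


def estimated_impact(incidents: list[tuple[float, float, str]]) -> float:
    """Sum (time to detect + time to mitigate) x severity factor over incidents.

    Each incident is (minutes_to_detect, minutes_to_mitigate, severity label).
    """
    return sum(
        (ttd + ttm) * SEVERITY_FACTOR[sev]
        for ttd, ttm, sev in incidents
    )
```

The three levers on the slide map directly onto this expression: shortening `ttd + ttm`, lowering the severity mix, and shrinking the length of the `incidents` list.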
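  24. 3-2. Approaching incident resolution time
      Raising the resolution of MTTR: the state machine
      States: 1. Occurred → 2. Detected → 3. Declared → 4. Mitigated → 5. Resolved
      Each interval differs in importance and in countermeasures!
      • Time To Detect: the time it takes to detect the problem. Mainly the domain of alerting.
      • Time To Mitigate: the time from detection until the bleeding stops. The most critical interval, but easy to approach. The domain of IRF.
      • Time To Resolve: the time from mitigation to the root fix, data correction, and other follow-up work. The bleeding has already stopped, so accuracy matters more than speed.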
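Given recorded timestamps for each state, the three intervals can be computed directly. The sketch below assumes the interval boundaries described on this slide (detection measured from occurrence, mitigation from detection, resolution from mitigation); the function name and the dict-of-timestamps input are assumptions for illustration.

```python
from datetime import datetime, timedelta


def time_breakdown(ts: dict[str, datetime]) -> dict[str, timedelta]:
    """Split one incident's lifetime into the three intervals.

    ts maps state names ("Occurred", "Detected", "Mitigated", "Resolved")
    to the time the incident entered that state.
    """
    return {
        "time_to_detect": ts["Detected"] - ts["Occurred"],
        "time_to_mitigate": ts["Mitigated"] - ts["Detected"],
        "time_to_resolve": ts["Resolved"] - ts["Mitigated"],
    }
```

Averaging each interval across incidents gives the MTTD, MTTM, and remaining resolution time that the later slides track separately.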
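  25. 3-2. Approaching incident resolution time
      Approach to MTTD (Mean Time To Detect): getting alerts in order
      • "Let's add a new alert for each problem we missed or detected late" does not work
        • "over-monitoring is a harder problem to solve than under-monitoring." (SRE: How Google Runs Production Systems)
        • Too many false-positive alerts, so real problems get buried
      • Shift to alerting based on SLO + Error Budget
        • The ideal: if you get paged, it is an incident
        • This cannot be done overnight!
      A remaining challenge, covered in chapter 4!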
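  26. 3-2. Approaching incident resolution time
      Approach to MTTM (Mean Time To Mitigate): the unified framework, IRF 2.0
      • Make the criteria for an incident explicit
      • Unify the response flow and the communication guideline
      • Separate Responder and Commander
      • Plus training, with the aces leading by example
      Deploying the aces and spreading IRF 2.0 worked immediately!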
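  27. 3-3. Approaching the number of incidents
      Approach 1 to insufficient testing: testing before going to production?
      At the postmortem...
      • Why is it released to production without being tested?
        • Because it can only be tested in production
      • Why can it only be tested in production?
        • Insufficient data, and inadequate test environments such as Staging
      • ….
      All right! Let's get the Staging environment in order!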
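  28. 3-3. Approaching the number of incidents
      Approach 1 to insufficient testing: building out the Staging environment
      Still halfway there: ten times harder than imagined!
      • There is an overwhelming number of components
      • News, Ads, and Infra each have different requirements and usage per Division
        • Ads is toB and tied directly to money! Solid and strict
        • News is toC and prioritizes speed of feature delivery!
      "Let's just build STG for everything" is impossible and probably pointless.
      We are currently working on it starting with Ads, where demand is comparatively high.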
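  29. 3-3. Approaching the number of incidents
      Approach 2 to insufficient testing: analyzing unit tests
      • Do test coverage and failures correlate? → A correlation did show up.
        • But whether raising test coverage reduces failures is unknown (correlation is not causation)
      • Still, looking system by system and team by team with domain knowledge, the places with low coverage and many incidents do seem to have reasons
        • UT is hard to implement, there is no culture of implementing UT, and so on
      All right! Let's charge into wherever UT is scarce and write UT like crazy!!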
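  30. 3-3. Approaching the number of incidents
      Approach 2 to insufficient testing: building up unit tests
      Get our Hands Dirty: add UT to everything in sight
      2. Use Sonarqube to find files with many lines and low Coverage
      3. Write UT like crazy with the help of an LLM
      4. Repeat until the component as a whole is > 50%
      We did 3 to 4 components, but it was a drop in the bucket.
      We had assumed that once samples existed, the teams would take it from there...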
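  31. 3-3. Approaching the number of incidents
      Approach 2 to insufficient testing: building up unit tests
      • What about generating them automatically with an LLM?
        • For now the accuracy is not good enough
      • Fundamentally, teams need a habit of writing UT continuously
        • But there is no incentive or shared value for doing so
        • They are chased by deadlines and cannot spare time for UT (!)
      An approach to the organization and culture is needed!
      A remaining challenge. Continued in chapter 4...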
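  32. 3-3. Approaching the number of incidents
      Approach to configuration changes
      • The main "configuration changes"
        • Mechanisms that control application behavior online
        • A/B tests, Feature Flags
      • Both have in-house platforms, but they are complex
      • Problems occur frequently: A/B tests applied to unintended users, incorrect settings, and so on
      All right! Let's get A/B testing and feature flags in order!!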
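  33. 3-3. Approaching the number of incidents
      Approach to configuration changes
      • Bulk deletion of unused feature flags (those that have become the default)
      • Defining criteria for using feature flags
      • Stronger validation
        • (Settings that cause parse errors could be entered...)
      We also worked with the A/B test platform team and pushed through major improvements, including usability!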
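  34. 3-3. Approaching the number of incidents
      Approach to Offline Batch
      • Many Flink "offline batch" jobs that are really stream processing
        • Server → Kafka → Flink → Scylla, ClickHouse, …
      • An in-house platform developed by a dedicated team
      • Application teams have few Flink experts, so problems with performance and with restarts occur frequently
      All right! Let's get the Flink platform in order!!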
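  35. 3-3. Approaching the number of incidents
      Approach to Offline Batch
      • Improving the platform itself
        • UI improvements, automated deployment, …
      • Evangelizing best practices
        • Improving documentation
        • Publishing a template project that includes tests
      • Sending refactoring PRs directly to each component
        • Implementing best practices and tests
      With the platform team's cooperation, we improved the platform and its documentation!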
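  36. 3-4. Result: were incidents "halved"!?
      Incidents are actually increasing...?
      • Seasonality? December is far higher than other months
        • Accidents from last-minute changes squeezed in before the holidays?
      • A side effect of IRF 2.0 taking hold?
        • Higher detection sensitivity thanks to the incident definition
        • Maslow's hammer: "If all you have is IRF, everything looks like an incident"
      • A downward trend from January onward
      Continuous improvement work is needed.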
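  37. 3-4. Result: were incidents "halved"!?
      On the other hand, MTTR was halved!
      • A dramatic improvement in MTTMitigate in particular
        • We attribute this to IRF 2.0
      • Meanwhile, no major improvement in MTTDetect
        • Detection is a challenge for the future. We will work on improving alerts
      A solid sense that the improvements are working!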
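  38. 4-1. Remaining challenges: risk management, alert improvement
      Do we even want to (and can we) bring incidents down to zero?
      • How would we minimize incidents?
        • Stop releasing?
          • → A slow death 😇
        • Pour in unlimited cost (resources, time)?
          • Invested resources and the incident rate (probably) correlate
          • So: keep testing until it can be considered 100% safe
      Realistically, neither is feasible...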
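  39. 4-1. Remaining challenges: risk management, alert improvement
      We want to balance deadlines against quality, and cost against incidents
      • But the right balance differs per system and per project
        • Required speed and release frequency
        • Cost that can be invested
        • Acceptable risk (≒ number of incidents, change failure rate)
      • Examples:
        • Ads is toB and tied directly to money! Solid and strict
        • News is toC and prioritizes speed of feature delivery!
      = We want to quantify risk tolerance and keep incidents under control.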
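  40. 4-1. Remaining challenges: risk management, alert improvement
      Making risk tolerance explicit: SLO and Error Budget
      • SLO = Service Level Objective ~ how much error is allowed?
        • e.g. 99.9% available -> 0.1% is tolerated
        • Attach an Objective (target) to an SLI (Indicator) that reflects real harm to the UX
      • Error Budget = how much tolerable error is still left
        • Error Budget remains = we can step on the accelerator
          • Even somewhat reckless releases are acceptable
        • Error Budget exhausted = the UX has been damaged to an unacceptable level
          • No more risk may be taken. Slow down
      (Ref: Implementing SLOs, Google SRE)
      Risk tolerance can be expressed through the Error Budget. In theory, it looks good.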
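The arithmetic behind "99.9% available → 0.1% is tolerated" can be made concrete with a small error-budget calculation. This is a sketch assuming a request-based SLI (good requests / total requests); the function name and parameters are illustrative and not taken from the talk.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still left in the current SLO window.

    slo_target:   e.g. 0.999 for "99.9% available"
    total_events: requests served so far in the window
    bad_events:   requests that violated the SLI
    Returns 1.0 when nothing is spent, 0.0 when exactly exhausted,
    and a negative number when the SLO is already violated.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        raise ValueError("no error budget: target is 100% or there is no traffic")
    return (allowed_bad - bad_events) / allowed_bad
```

With a 99.9% target and 1,000,000 requests, about 1,000 failures are tolerated; after 250 failures roughly three quarters of the budget remains, which in the slide's terms means there is still room to step on the accelerator.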
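  41. 4-1. Remaining challenges: risk management, alert improvement
      Improving alerts: alert = incident
      All right! Let's get SLOs in order!!
      • Alert on the speed at which the Error Budget is consumed, the Burn Rate
        • Alert on rapid Error Budget consumption
        • If left alone, the budget runs out, which means the SLO is violated
          • → The actual UX is being harmed!
          • → It must not be left alone! = incident
      (Ref: Alerting on SLOs)
      Looks good.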
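Burn-rate alerting can be sketched in a few lines. Burn rate here means the observed error rate divided by the error rate the SLO allows, so a value of 1 spends the budget exactly over the SLO window. The function names are assumptions; the default threshold of 14.4 is the example figure used in the Google SRE Workbook's "Alerting on SLOs" chapter for a 1-hour window against a 30-day SLO, and would need tuning for any real service.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO tolerates."""
    if total_events == 0:
        return 0.0  # no traffic in the window, nothing was burned
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_page(bad_events: int, total_events: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page when the budget is being consumed faster than the threshold rate."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold
```

A window with 2% errors against a 99.9% SLO burns budget at roughly 20 times the sustainable rate and would page, matching the slide's "must not be left alone = incident" reasoning.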
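  42. 4-2. Remaining challenges: approaching the organization and culture
      What does it take to make SLOs work?
      • Understanding and cooperation from the whole organization are essential
        • Not only engineers: business and PdM need to be involved too
      • An approach to the culture is needed
        • Ultimately, it comes down to what we value
        • Can we believe in the value of balancing cost and risk through SLOs, and act on it?
      We want to install SLOs, and by extension SRE, into the engineering culture.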
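  43. 4-2. Remaining challenges: approaching the organization and culture
      What does it take to make SLOs work?
      • How to approach it
        • BottomUp:
          • Evangelizing to and training stakeholders such as Eng, Biz, and PdM
        • TopDown:
          • Support and direction from senior management
      SRE and DevOps are culture, and they are not built in a day.
      Steady, continuous, hands-on effort is required.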
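  44. 4-3. From here on...
      ACT's fixed six-month term was about to end
      • Challenge: install SRE into SN's engineering culture
        • Implementing and complying with SLOs
      • And more...
        • Improving Observability
        • Collecting, monitoring, and complying with DORA Metrics, and so on
      → Continuous activity is needed
      How do we keep up the momentum and tackle the remaining challenges after ACT is dissolved?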
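  45. 4-3. From here on...
      How should ACT be dissolved? The team's answer
      • Rejected!
        • Not everyone wants SRE as a full-time job
        • "Allocate X% of your time" has never once worked
      • Conclusion
        • Ex-ACTors will keep promoting and helping with SRE, but a dedicated SRE team will be started, even if it takes time.
      This was discussed and decided as a team.
      Much was left undone, but there are no regrets!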
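  46. 4-3. From here on...
      Our Awesome Change!
      Our six-month (tough!!) term as ACT is over.
      Honestly, we do not know whether we created an Awesome Change,
      but we do feel that we took the first step on the long journey called SRE!
      And thanks to the teammates who fought through these six months!!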
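  47. References
      • SREをはじめよう (Japanese edition of Becoming SRE: First Steps Toward Reliability for You and Your Organization)
      • SRE サイトリライアビリティエンジニアリング (Japanese edition of Site Reliability Engineering: How Google Runs Production Systems)
      • SRE Google Workbook
      • Effective DevOps (Japanese edition)
      • Fearless Change (Japanese edition)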