#9 “Automating chaos experiments in production”

cafenero_777

June 14, 2023
Transcript

  1. $ which • Automating chaos experiments in production • Ali Basiri, Lorin Hochstein, Nora Jones, Haley Tucker • Netflix • ACM/IEEE ICSE-SEIP ’19 • (International Conference on Software Engineering, Software Engineering in Practice) • https://2019.icse-conferences.org/track/icse-2019-Software-Engineering-in-Practice?track=ICSE%20Software%20Engineering%20in%20Practice#program • https://arxiv.org/abs/1905.04648
  2. Agenda • Overview and why I read this paper • Introduction • CONTEXT: NETFLIX • ChAP • MONOCLE • EXPERIMENT GENERATION • RESULTS • CHALLENGES AND LESSONS LEARNED • CONCLUSION
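  3. Overview and why I read this paper • Overview • An overview of a system for keeping distributed systems healthy (Chaos Monkey) and how it is operated in practice • Why I read it • Stable operation of distributed systems (what perspective should one take in the first place?) • Chaos Monkey in real-world operation • Found via a podcast • https://misreading.chat/2019/07/09/episode-64-automating-chaos-experiments-in-production/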
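  4. Introduction • Internet services = distributed systems • From 2 machines to several thousand • Strategies for recovering from unexpected behavior • Timeouts, retries, fallbacks • In general, it is unclear how well these resiliency mechanisms actually work • Chaos Engineering (an engineering approach) • Built a platform that automatically generates chaos experiments and operated it for 3 years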
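  5. CONTEXT: NETFLIX • Availability matters • Multi-device video streaming service • Whether customers can stream successfully matters most (e.g. 99.99%, four nines) • This is a business requirement, not the kind of availability expected of a telecom carrier. • Microservices • A set of services that communicate with each other over RPC • e.g. search, metadata display (HD, 5.1), +1 button • Visualized with Vizceral • Services can be deployed independently • Fault domains can be kept small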
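  6. CONTEXT: NETFLIX (Cont.) • Resilience through timeouts, retries, and fallbacks • The control plane runs on a public cloud • Hardware faults and network faults occur • Timeouts, retries, and fallbacks are configured on every RPC • The Java Hystrix library is used as a command wrapper • Fallback example: if the suggest feature breaks, show default results • Because the "failure" paths are rarely exercised, there is little confidence that they behave as expected • Hence: build a platform that can exercise them frequently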
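The timeout, retry, and fallback pattern on this slide can be sketched as follows. This is a minimal Python illustration, not Hystrix itself (which is a Java library); `call_with_resilience`, `broken_suggest`, and the parameter values are hypothetical names chosen for the example.

```python
import concurrent.futures


def call_with_resilience(fn, timeout_s, retries, fallback):
    """Run fn with a per-attempt timeout, retry on failure, then fall back."""
    for _ in range(1 + retries):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn)
        try:
            # Wait at most timeout_s for this attempt.
            return future.result(timeout=timeout_s)
        except Exception:
            # Timeout or error from the dependency: retry if attempts remain.
            continue
        finally:
            # Do not block on a slow attempt that is still running.
            pool.shutdown(wait=False)
    # Every attempt failed: degrade gracefully instead of raising.
    return fallback()


def broken_suggest():
    # Hypothetical stand-in for a suggest service that is down.
    raise RuntimeError("suggest service is down")


suggestions = call_with_resilience(
    broken_suggest, timeout_s=0.1, retries=2, fallback=lambda: ["default title"]
)
```

The fallback branch only runs when the dependency actually fails, which is why it tends to go untested unless failures are injected on purpose.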
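  7. ChAP: Chaos Automation Platform • Overview • Evaluates whether the system as a whole stays healthy when a single service degrades • Failure model • A service becomes slow (response time increases): e.g. hardware resource exhaustion • A service fails (returns errors): e.g. a bug is pushed • FIT (Failure Injection Testing) system • Fault injection is hooked into the common Java libraries used at Netflix and driven by metadata embedded in the request • Failure: throw an exception without executing the call • Latency: deliberately delay before executing the call • REST, gRPC, Hystrix, EVCache, Cassandra client, etc.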
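The two injected fault types on this slide (fail without executing, delay before executing) can be sketched as an injection point wrapped around an outbound call. This is a hypothetical Python illustration; FIT lives in Netflix's Java libraries, and the metadata layout (`"fault_injection"`, `"type"`, `"delay_s"`) and function names here are assumptions made for the example.

```python
import time


class InjectedFailure(Exception):
    """Raised instead of calling the dependency when a failure is injected."""


def call_dependency(name, request_metadata, do_call, sleep=time.sleep):
    """Injection point around an outbound call to the dependency `name`."""
    fault = request_metadata.get("fault_injection", {}).get(name)
    if fault is not None:
        if fault["type"] == "failure":
            # Failure: raise without executing the real call.
            raise InjectedFailure(f"injected failure for {name}")
        if fault["type"] == "latency":
            # Latency: delay on purpose, then execute the real call.
            sleep(fault["delay_s"])
    return do_call()


# A request tagged so that calls to "bookmark" fail (hypothetical metadata).
tagged = {"fault_injection": {"bookmark": {"type": "failure"}}}
try:
    call_dependency("bookmark", tagged, lambda: {"position_s": 1234})
    outcome = "called"
except InjectedFailure:
    outcome = "failed"
```

Untagged requests pass straight through, so the same code path serves normal traffic and experiment traffic.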
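  8. ChAP: Chaos Automation Platform (Cont.) • Example: make the bookmark service fail • bookmark: the service that stores the seek position of previously watched videos, so that playback can resume partway through. • Even if bookmark is broken, streaming successfully matters more • Example experiment: use 1% of active users as the canary • Operated from the UI: make calls to the bookmark service fail and observe the API • Separate VIPs are assigned for baseline and canary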
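  9. ChAP: Chaos Automation Platform (Cont.) • Metrics • Atlas: telemetry system • Final aggregation happens here; lag of 5 minutes • Mantis: stream processing system • Processes events from each microservice • Counts video plays and downloads and sends them to ChAP every second • If an anomaly is detected, the experiment is stopped immediately • Request flow during a test • Zuul: front (reverse proxy) • Filters the target traffic (1%) and changes which API is called (API-baseline, API-canary) • Attaches fault injection metadata to the request (throw an exception when the bookmark service is called)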
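The routing step at the proxy can be sketched as below. This is a hypothetical Python illustration, not Zuul's API: the hashing scheme, the `route` function, the metadata layout, and the choice to give baseline and canary equally sized slices are all assumptions made for the example.

```python
import hashlib

BUCKETS = 10_000


def bucket(user_id):
    """Map a user id to a stable bucket so routing is sticky per user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def route(user_id, metadata, fraction=0.01, dependency="bookmark"):
    """Return (target cluster, request metadata) for one request."""
    size = round(fraction * BUCKETS)
    b = bucket(user_id)
    if b < size:
        # Canary slice: tag the request so the named dependency call fails.
        tagged = dict(metadata)
        tagged["fault_injection"] = {dependency: {"type": "failure"}}
        return "API-canary", tagged
    if b < 2 * size:
        # Baseline slice: same-sized control group, no fault injected.
        return "API-baseline", dict(metadata)
    # Everyone else goes to the regular cluster untouched.
    return "API", dict(metadata)


cluster, request_metadata = route("user-42", {"device": "tv"})
```

Comparing canary against a baseline of the same size, rather than against all production traffic, keeps the two groups comparable.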
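  10. [Architecture diagram. Labels: Web UI • baseline & canary provisioning • telemetry system (lag of about 5 minutes) • streaming processing system (lag of 1 second) • Front (Reverse Proxy) • event monitor • dashboard management • dashboard • canary analysis system • CD • traffic splitting]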
  11. ChAP: Chaos Automation Platform (Cont.) • Lumen: dashboard management • Compares baseline and canary • Key performance KPI (SPS: successful stream starts per second) • Health metrics: request rate, latency, error rate, CPU utilization
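  12. ChAP: Chaos Automation Platform (Cont.) • Safety measures • Business hours: 9:00-17:00 only, when engineers should be able to respond quickly • Automation stop: the experiment is stopped automatically and early if customer impact is large • Total Traffic: experiments may cover at most 5% of total traffic • Failover: no experiments while a cross-region failover is in progress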
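The four safeguards on this slide can be sketched as two checks: one before an experiment starts and one while it runs. This is a hypothetical Python illustration; the function names and the 5% relative-drop threshold in `should_stop` are assumptions for the example (only the 9:00-17:00 window and the 5% traffic cap come from the slide).

```python
from datetime import datetime

MAX_TOTAL_TRAFFIC_PCT = 5.0  # cap on traffic under experiment, from the slide


def may_start(now, traffic_pct_after_start, failover_in_progress):
    """Pre-flight check: business hours, traffic budget, no regional failover."""
    in_business_hours = 9 <= now.hour < 17
    within_budget = traffic_pct_after_start <= MAX_TOTAL_TRAFFIC_PCT
    return in_business_hours and within_budget and not failover_in_progress


def should_stop(baseline_sps, canary_sps, max_relative_drop=0.05):
    """Stop early if the canary's stream starts fall too far below baseline."""
    if baseline_sps <= 0:
        return False  # nothing to compare against
    return (baseline_sps - canary_sps) / baseline_sps > max_relative_drop


ok_to_start = may_start(datetime(2023, 6, 14, 10, 0), 2.0, False)
abort = should_stop(baseline_sps=1000.0, canary_sps=900.0)
```

Here `ok_to_start` is true and `abort` is true, since a 10% drop exceeds the assumed 5% threshold.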
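  13. MONOCLE • MONOCLE: a set of tools that generate experiments automatically so that users (engineers) can run them easily • Shift: engineers defining and running their own experiments in ChAP -> the ChAP team generating experiments automatically for engineers to use. • Service introspection • Collects dependency information from RPC clients and Hystrix commands • Does the dependency look safe to fail? Is a fallback configured? etc. • Collects timeout values and related data from the telemetry system and other sources • Latency over the past 2 weeks (mean, 90th, 99th, 99.5th percentile), etc. • Automatic experiment generation • Failure, added latency (below the timeout or above the timeout) • Results can be checked in the Web UI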
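The generation step described on this slide (one failure experiment plus added latency below and above the configured timeout) can be sketched as follows. This is a hypothetical Python illustration; `generate_experiments`, the dictionary layout, and the 50 ms margin are assumptions for the example and are not taken from the slides.

```python
def generate_experiments(dependency, timeout_ms, margin_ms=50):
    """Build candidate experiments for one dependency from its timeout."""
    return [
        # The call fails outright.
        {"dependency": dependency, "type": "failure"},
        # Added latency stays under the timeout, so the call should succeed.
        {
            "dependency": dependency,
            "type": "latency",
            "delay_ms": max(timeout_ms - margin_ms, 0),
        },
        # Added latency exceeds the timeout, so the call should time out.
        {
            "dependency": dependency,
            "type": "latency_causing_failure",
            "delay_ms": timeout_ms + margin_ms,
        },
    ]


experiments = generate_experiments("bookmark", timeout_ms=500)
```

For a 500 ms timeout this yields a failure experiment, a 450 ms latency experiment, and a 550 ms latency experiment.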
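  14. EXPERIMENT GENERATION • Criticality score: a score for how likely an experiment is to fail • Scored heuristically • Dependency priority (RPC client = 1, Hystrix command = 100) • Share of requests that exercised the dependency from inside the service over the past 7 days (<0.1% = 0, 1% = 10, <10% = 100, else = 1000) • Retry factor (1 + number of configured retries) • Number of interactions (how many times it is called)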
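  15. EXPERIMENT GENERATION (Cont.) • Prioritization score: a score for whether each experiment should be run • The product of the following • Criticality score: is it likely to fail? • Safety score (safe = 1, unsafe = -1) • Unsafe cases: dependency on a prohibited call, stale dependency data, no fallback on failure, etc. • Experimental weight (failure = 3, latency = 2, latency causing failure = 1) • Experiments with a score > 0 are run in descending order of score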
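The prioritization rule on this slide (a product of three factors, run only positive scores, highest first) can be sketched as follows. The weights and the safe/unsafe values come from the slide; the function names, the dictionary layout, and the sample criticality values are hypothetical.

```python
EXPERIMENT_WEIGHT = {"failure": 3, "latency": 2, "latency_causing_failure": 1}


def prioritization_score(criticality, safe, kind):
    """Product of criticality, safety (+1 or -1), and experiment weight."""
    safety = 1 if safe else -1
    return criticality * safety * EXPERIMENT_WEIGHT[kind]


def runnable_in_order(experiments):
    """Keep experiments with a positive score, highest score first."""
    scored = [
        (prioritization_score(e["criticality"], e["safe"], e["kind"]), e)
        for e in experiments
    ]
    positive = [pair for pair in scored if pair[0] > 0]
    positive.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in positive]


candidates = [
    {"name": "a", "criticality": 100, "safe": True, "kind": "latency"},
    {"name": "b", "criticality": 1000, "safe": False, "kind": "failure"},
    {"name": "c", "criticality": 100, "safe": True, "kind": "failure"},
    {"name": "d", "criticality": 0, "safe": True, "kind": "failure"},
]
order = [e["name"] for e in runnable_in_order(candidates)]
```

With these sample inputs, `c` scores 300 and `a` scores 200, while `b` (unsafe, so negative) and `d` (zero) are never run. Multiplying by -1 for unsafe experiments guarantees they are excluded however critical the dependency is.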
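  16. CHALLENGES AND LESSONS LEARNED • The failure model is too simple • FIT injects only one kind of fault at a time, whereas in reality several can occur together • Limits of injecting faults inside the application • Rolling out the Java library takes a long time (several months) • Use of languages other than Java (Node.js, etc.) is growing • Supporting each language separately is laborious • Perhaps an Istio / service-mesh style approach? • Low adoption • The self-service model did not catch on • Improved accuracy (lowered the false-positive rate) so that automatically generated experiments would be used • This took considerable effort • Handling minor devices • Problems cannot be tracked from the overall success rate alone • Error counts • Because error rates are generally low, a specific user (device) can contribute disproportionately • When there are many errors, the results can no longer be analyzed • A side effect of visualization • Merely collecting the information and generating the UI in MONOCLE was enough to uncover misconfigurations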
  17. EoP