#13 “Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Cloud-Scale Infrastructure”

cafenero_777

June 14, 2023

Transcript

  1. Research Paper Introduction #13 “Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Cloud-Scale Infrastructure”

    #50 overall • @cafenero_777 • 2020/09/03
  2. $ which

    • Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Cloud-Scale Infrastructure
    • Ze Li†, Qian Cheng†, Ken Hsieh†, Yingnong Dang†, Peng Huang∗, Pankaj Singh†, Xinsheng Yang†, Qingwei Lin‡, Youjiang Wu†, Sebastien Levy†, Murali Chintalapati†
    • †Microsoft Azure ∗Johns Hopkins University ‡Microsoft Research
    • NSDI '20
    • https://www.usenix.org/conference/nsdi20/presentation/li
  3. Agenda

    • Overview and why I read this paper • Introduction • Background and Problem Statement • Gandalf System Design • Gandalf Algorithm Design • Evaluation • Discussion • Conclusion
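  4. Overview and why I read this paper

    • Overview
      • Evaluates the impact of a release in real time and decides whether the release should continue
      • Operated in Azure for 1.5 years; its highly accurate verdicts have helped prevent incidents
    • Why I read it
      • I was interested in how releases are handled on very large infrastructure
      • Makabe-san's talk at Cloud Operator Days Tokyo 2020 was covered in an Impress article
      • https://cloud.watch.impress.co.jp/docs/event/1268357.html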
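  5. Introduction

    • Large-scale cloud (MS Azure)
      • Many teams frequently update infrastructure code and software
      • Large and complex, and failures have a big impact, e.g. VMs become unusable
    • Bugs slip through
      • HW/SW, OS/Lib, prod/dev, component interactions
      • stage/canary/pilot/light region/heavier region/half region pairs/other half region pairs
    • Can misconfigurations and similar faults be found automatically?
      • watchdogs -> can only detect problems per component
      • Problems between components (API contract violations) go undetected
      • API load and network congestion problems during a release -> releases get blocked and trust in the monitoring system erodes (it becomes a formality)
    • Most releases target multiple clusters
    • Rolling out to every machine takes time
    • The number of daily releases is large and keeps growing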
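  6. Introduction (cont.)

    • Gandalf: a suite of analytics services that safeguards releases
      • Analyzes metrics and logs during a release
        • Service level, performance counters, process events
      • Modeled as anomaly detection, correlation analysis, and impact assessment
      • Detects system anomalies, stops the rollout (no-go), and has it rolled back
      • Outputs detailed results
    • Time window: Real-time (1h) & Batch (30days)
    • Azure compute/NW agent, in production for over 18 months
      • Caught 155 critical failures in 8 months (data-plane: 92% precision, 100% recall; control-plane: 94%, 99.8%)
      • The release process itself also improved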
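  7. Background and Problem Statement

    • Deployment in IaaS
      • Many components in a multi-layer stack (Fig.4)
      • All kinds of releases are rolled out continuously with minimal user impact (E.g. 2018 Meltdown/Spectre)
    • Deployment monitoring system
      • Anything manual does not scale -> large-scale automation is needed
      • Before: email approvals, ad hoc build tests, experience-based processes
      • After: progress monitoring and detection of adverse effects (fault analysis, bug or not), automatic stop (or not)
    • Many things can go wrong; which ones are in scope?
      • HW (firmware/temperature)
      • Periodic load and similar effects from SW contention
      • HW-induced outages (power failure, cable cuts)
      • code/config defects <- the scope here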
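  8. Gandalf System Design (design challenges)

    • Constantly changing systems and signals
      • On top of the huge number of signals from existing components, new components and signals must be supported
    • Separating signal from ambient noise
      • Faults that are always present (HW, NW timeout, gray failures)
      • create-VM API usage differs between weekdays and weekends
    • Balancing speed and coverage
      • Some impacts show up immediately; others (e.g. accumulated user operations) only appear after time has passed
      • Both must be covered
    • Identifying the cause
      • Components and failures have an N:M mapping
      • 1 component -> N failures, N components -> 1 failure
    • Things go wrong even when nothing is being released
    • Impact may or may not show up right away
    • Multiple components are deployed in a single day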
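  9. Gandalf System Design (overview and data sources)

    • Data sources: in-node HW metrics, various signals, deployment contents (diff)
    • Processing is split into a speed layer and a batch layer (to raise coverage of failure types)
    • Results are stored and used by the various UIs and apps
    • The deployment engine polls the results and stops the release if necessary
    • Aggregated data is passed to the deployment engine side (E.g. timestamp/node-ID/service-ID)
    • Data schema for onboarding new components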
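  10. Gandalf System Design (Processing, orchestration/actions, monitoring and diagnosis)

    • Data processing uses a lambda architecture
      • Combines real-time processing and batch processing
      • http://lambda-architecture.net/
    • MS Kusto
      • Data source latency of a few minutes, query latency of a few seconds
      • Targets the hour before and after a deployment and is specialized for lightweight analysis algorithms
    • MS Cosmos
      • Hadoop FS + SQL-like queries (Spark?)
      • Higher latency, but can run complex models over large volumes of data (C++)
    • Runs every 5 minutes and every hour; because processing is incremental, it can resume partway through
    • http://www.intellilink.co.jp/article/column/bigdata-kk03.html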
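  11. Gandalf System Design (Orchestration/actions, monitoring and diagnosis)

    • Orchestration
      • Hosted on the Azure Service Fabric framework (microservices) so it can scale
      • Fast (stream) vs. slow (batch)
        • The results almost always agree
        • When they disagree, it is more likely that only batch has the correct result
      • No-Go -> stakeholders are notified automatically and an incident ticket is filed; DevOps apps consume the result
    • Front end
      • Real-time rollout KPI display (rollout progress, NodeFaults, container faults, OS Crashes, Allocation Failures, etc.)
      • Gandalf surfaces what the problem cases have in common (HW SKU, instances with specific attributes, etc.)
    • UI
      • A binary decision page showing what was deployed to each environment and how
      • Diagnostic results from batch processing
      • Diagnostic information on affected nodes, clusters, etc.
      • Deployment progress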
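
The interaction between the two layers and the deployment engine can be sketched as follows. This is a minimal illustration under stated assumptions, not Gandalf's actual interface: `Verdict`, `reconcile`, `run_rollout`, and the `poll` callback are hypothetical names, and the rule "use the batch verdict whenever one exists" is just one simple way to encode the observation that batch is more likely to be right when the layers disagree.

```python
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class Verdict(Enum):
    GO = "go"
    NO_GO = "no-go"


def reconcile(speed: Verdict, batch: Optional[Verdict]) -> Verdict:
    """Use the batch verdict when it exists; otherwise fall back to the speed layer."""
    return speed if batch is None else batch


def run_rollout(
    stages: Sequence[str],
    poll: Callable[[str], Tuple[Verdict, Optional[Verdict]]],
) -> Tuple[List[str], Optional[str]]:
    """Deploy stage by stage and halt at the first stage that yields a no-go.

    `poll(stage)` is a hypothetical callback returning (speed_verdict, batch_verdict).
    Returns the stages that were deployed and the stage where the rollout halted.
    """
    deployed: List[str] = []
    for stage in stages:
        deployed.append(stage)  # placeholder for the real deployment action
        if reconcile(*poll(stage)) is Verdict.NO_GO:
            return deployed, stage  # stop here; later stages are never touched
    return deployed, None
```

In a real system the no-go branch would also notify the owners and file an incident ticket, as the slide describes; that part is omitted here.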
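  12. Gandalf Algorithm Design (overview)

    • Any single technique on its own falls short
      • Supervised learning
        • Patterns (system behavior, operations, failures) change constantly, so it cannot be operated in practice
        • Estimates based on the past are not necessarily useful
      • Anomaly detection
        • Multiple components are deployed at the same time, so it is unclear which one is anomalous
      • Correlation analysis
        • The scenarios are too complex for it to work
    • The Gandalf model is a combination of techniques
      1. Detect system-level failures from raw data with anomaly detection (impact in isolation: temporal correlation)
      2. Apply correlation analysis to the failures detected across multiple component deployments to identify the cause and contributing factors (scope of impact: spatial correlation)
      3. Assess the scope of impact and decide whether to stop the release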
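  13. Gandalf Algorithm Design (anomaly detection)

    • Generate concise fault signatures from raw data
      • A single generic error code may map to several faults
        • HTTP API 500 (can have many different causes, so it cannot be used directly)
      • Error messages are (generally) unstructured
        • Text clustering turns them into fault signatures
        • "Null References" and "NullReferenceException" are treated as the same group
    • Anomaly detection based on fault signatures
      • HW/NW failures and gray failures are routine in a large-scale cloud, so simple threshold-based detection does not work
      • There are thousands of signatures, so setting a threshold for each one is unrealistic
      • Forecast from the baseline (Holt-Winters forecasting); an observation 4 sigma or more away from the expected value is treated as an anomaly
      • API errors can be far more frequent than system errors -> standardize (normalize?) with z-scores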
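
The forecast-and-compare step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a hand-written additive Holt-Winters smoother, the smoothing parameters and the `min_sigma` floor are assumptions, and the hourly fault counts in the usage note are made up for the example.

```python
from statistics import pstdev
from typing import List


def holt_winters_one_step(series: List[float], season_len: int,
                          alpha: float = 0.3, beta: float = 0.05,
                          gamma: float = 0.2) -> List[float]:
    """One-step-ahead additive Holt-Winters forecasts for series[season_len:]."""
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons to initialise")
    first, second = series[:season_len], series[season_len:2 * season_len]
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonal = [x - level for x in first]
    forecasts = []
    for t in range(season_len, len(series)):
        s = seasonal[t % season_len]
        forecasts.append(level + trend + s)  # forecast made before seeing series[t]
        prev_level = level
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (series[t] - level) + (1 - gamma) * s
    return forecasts


def is_anomalous(history: List[float], observed: float, season_len: int,
                 n_sigma: float = 4.0, min_sigma: float = 1.0) -> bool:
    """Flag `observed` if it deviates from the forecast by more than n_sigma."""
    forecasts = holt_winters_one_step(history + [observed], season_len)
    residuals = [y - f for y, f in zip(history[season_len:], forecasts[:-1])]
    sigma = max(pstdev(residuals), min_sigma)  # floor is an assumption, avoids sigma = 0
    return abs(observed - forecasts[-1]) > n_sigma * sigma
```

For example, with a made-up count series for one fault signature, `history = [10, 12, 15, 11, 11, 12, 14, 11, 10, 13, 15, 12, 10, 12, 16, 11, 11, 12, 15, 11]` and `season_len=4`, a next count of 11 is not flagged while a next count of 60 is.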
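  14. Gandalf Algorithm Design (correlation analysis)

    • Need to determine whether the cause is the deployment (rollout)
      • A random HW issue? The effect of a concurrent deployment?
    • 1: Ensemble voting
      • Within a given time window, vote on whether that component caused the error
    • Legend: c: component; e: fault; t^f: timestamp of the fault; t^d: timestamp of the deployment; i: time window (WD1=1, WD2=24, WD3=72, WD4=all, WD-1=72); k: the nodes deployed at that time
    • Vote in favor / veto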
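  15. Gandalf Algorithm Design (correlation analysis)

    • 2-1: Temporal and spatial correlation
      • Temporal correlation ST
      • Spatial correlation SS
      • Identify the suspect component c
    • 2-2: Temporal decay
      • Newer deployments are more likely to cause failures -> apply time decay to older deployments
    • Wi are exponential weights (chosen empirically), W1>W2>W3>W4
    • Impact tends to appear soon after a deployment
    • Uses Pi and B (votes for and votes against)
    • Nf: number of nodes that were deployed and had a fault between t1 and t2
    • Ndf: number of nodes that had a fault between t1 and t2 regardless of deployment
    • SS<β: ignored, with β=90% or 99%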
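
The scoring step can be sketched as follows. The formulas were images on the slide and are not reproduced in this transcript, so the forms below are assumptions: ST is taken as the weighted sum of blame votes minus a weighted veto count, and SS as the share of faulty nodes that had the component deployed. The halving weights and all function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def temporal_score(blames: Dict[int, int], vetoes: int,
                   weights: Dict[int, float], veto_weight: float) -> float:
    """Assumed form: ST = sum_i W_i * P_i - W_b * B."""
    return sum(weights[i] * blames[i] for i in blames) - veto_weight * vetoes


def spatial_score(faulty_with_deploy: int, faulty_total: int) -> float:
    """Assumed form: share of faulty nodes on which the component was deployed."""
    return faulty_with_deploy / faulty_total if faulty_total else 0.0


def blame_component(candidates: Dict[str, Tuple[float, float]],
                    beta: float = 0.9) -> Optional[str]:
    """Pick the candidate with the highest ST among those with SS >= beta.

    `candidates` maps a component name to its (ST, SS) pair.
    """
    eligible = {name: st for name, (st, ss) in candidates.items() if ss >= beta}
    return max(eligible, key=eligible.get) if eligible else None


# Hypothetical exponential weights with W1 > W2 > W3 > W4.
EXAMPLE_WEIGHTS = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.125}
```

With blame votes `{1: 1, 2: 2, 3: 3, 4: 3}` and one veto at weight 1.0, `temporal_score` gives 1 + 1 + 0.75 + 0.375 - 1 = 2.125. A component with a higher ST but SS below β is skipped, which is how the spatial filter keeps a widely deployed but unrelated component from being blamed.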
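  16. Gandalf Algorithm Design (correlation analysis)

    • 3: Decision process
      • Instead of static thresholds, a Gaussian discriminant classifier (GDC) is trained dynamically
      • Evaluates the scope affected by the rollout (number of affected clusters, nodes, customers, etc.) and makes the Go/No-Go decision
    • 4: Adding domain knowledge
      • A weight (0 - 100) can be set for each fault signal
      • For example, set a noisy signal to 0.01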
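
The decision step can be sketched with a small Gaussian discriminant classifier. This is a simplified illustration, not the paper's model: it assumes a diagonal covariance per class (which makes it equivalent to Gaussian naive Bayes), the three features (affected clusters, nodes, customers) follow the slide, and the training rows are made-up numbers for the example only.

```python
import math
from typing import Dict, List, Sequence, Tuple

ClassStats = Tuple[float, List[float], List[float]]  # prior, means, variances


def fit_gaussian_classes(samples: Sequence[Sequence[float]],
                         labels: Sequence[str]) -> Dict[str, ClassStats]:
    """Estimate a prior and per-feature mean/variance for each class."""
    model: Dict[str, ClassStats] = {}
    for label in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-6)  # avoid zero variance
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(samples), means, variances)
    return model


def log_posterior(model: Dict[str, ClassStats], label: str,
                  x: Sequence[float]) -> float:
    """Unnormalised log posterior of `label` given feature vector `x`."""
    prior, means, variances = model[label]
    score = math.log(prior)
    for v, m, var in zip(x, means, variances):
        score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return score


def decide(model: Dict[str, ClassStats], x: Sequence[float]) -> str:
    """Return the class ("go" or "no-go") with the highest posterior."""
    return max(model, key=lambda label: log_posterior(model, label, x))


# Made-up training rows: (affected clusters, affected nodes, affected customers).
TRAIN_X = [(1, 2, 0), (1, 3, 1), (2, 4, 1), (1, 1, 0),
           (5, 40, 12), (6, 55, 15), (4, 35, 9), (7, 60, 20)]
TRAIN_Y = ["go"] * 4 + ["no-go"] * 4
```

Refitting the model on recent labeled rollouts is what stands in for "dynamic" training here; a static threshold would need to be retuned by hand whenever the fleet changes.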
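  17. Evaluation (business impact)

    • In active use across Azure infra
      • 19 data-plane (host/guest updates, agent updates, etc.)
      • 4 control-plane (X resource provider, front end, etc.)
    • Scale
      • Processes 270,000 events/day on average, logs 600 million API calls, analyzes 20TB/day
    • Deployment speed
      • It got faster. Why?
        • When a failure alert fired, the release manager, the owner, and the owners of dependencies had to go back and forth
        • Gandalf removes that communication cost and provides evidence for the alert
      • Can it be made even faster?
        • Reducing streaming latency -> deployments get faster but accuracy drops (because some waiting is needed)
        • Quality prediction might help?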
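  18. Evaluation (accurately preventing bad rollouts)

    • Stage->Canary->Pilot->Prod (early phase). Stage/Canary have many noisy signals
    • January 2018 to November 2018
    • Data-Plane: 92.4% precision with 100% recall
      • agent failures, OS crashes, node failures, unhealthy containers, VM reboots
    • Control-Plane: 94.9% precision, 99.8% recall @ 1200+ region
      • Only 2 misses
        • False negative due to incomplete logs
        • A generic timeout was emitted (when it was really a specific error)
    • What was most common among the caught failures?
      • Compatibility issues: the HW SKU/OS/Lib version on the deployed nodes was (unexpectedly) old
      • Contract violations: a component did not follow the API spec and broke its dependent components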
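  19. Evaluation (cases actually prevented)

    • 1: Cross-component impact (a.k.a. the "blame the other component" problem)
      • A failure during a CRP release was blamed on FC, and Gandalf's stop at Canary and Pilot was manually overridden
      • The rollout was then stopped, with Gandalf pointing out the strong correlation with the CRP release. CRP was fixed and the issue was resolved
    • 2: Region-specific impact (a.k.a. the "divergent environments hurt" problem)
      • After Pilot, rollouts tend to hit compatibility issues rather than code quality issues
      • Only one region (France South) had the latest DiskRP deployed
      • Only 3 customers were affected
    • 3: Latent impact (a.k.a. the "slow-burning issues are hard to find" problem)
      • A network agent (update script for NIC firmware/driver) was deployed with the wrong driver version specified
      • OS crashes 24-72 hours later, Sev2 alert (large-scale customer impact)
      • Identified correctly (even though other deployments were in progress at the same time)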
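  20. Evaluation (effect of the correlation algorithm)

    • Effect of each part
      • Without EW (exponential weights), precision drops while recall is unchanged
        • Who is to blame for an error becomes unstable, but the fact that there is an error does not change
      • Spatial correlation has a large effect
        • Impact spreads to other nodes in the same way
      • Without decay, a deployment made after the fix would still pick up the earlier signals
      • The veto mechanism means only the component currently being deployed gets a say
      • Each algorithm matters on its own, but it is the combination of spatial correlation and time decay that is effective
    • Window size has little effect!
      • Its purpose is to separate updates from anomalies in time
    • Effect of weights
      • Large effect on recall. During the trial period, the 18 new weights were adjusted over 4 rounds
      • Settings: all weights 1, only important ones 1, only minor ones 1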
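  21. Discussion

    • Impressions after 18 months (engineers and release managers)
      • Collecting scattered data and coordinating used to be hard; this has improved
      • They came to trust Gandalf
      • From ad hoc investigation to interactive troubleshooting
        • Drill down from the UI; potential causes are highlighted (E.g. SKU Gen2.3)
    • Lessons learned
      • Black boxes are not trusted. Modeling, processing, and result output are organized to match the decision-making process
      • Continuously incorporating domain knowledge improves effectiveness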
  22. EoP