Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
#14 “The Network is Reliable"
Search
cafenero_777
June 14, 2023
Technology
0
87
#14 “The Network is Reliable"
ACM queue: July 23, 2014 Volume 12, issue 7
https://queue.acm.org/detail.cfm?id=2655736
cafenero_777
June 14, 2023
Tweet
Share
More Decks by cafenero_777
See All by cafenero_777
#51 “Empowering Azure Storage with RDMA”
cafenero_777
3
460
#49 “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems”
cafenero_777
2
110
#50 “Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction”
cafenero_777
0
110
#33 “Destroying networks for fun (and profit)”
cafenero_777
0
91
#34 “MTPSA: Multi-Tenant Programmable Switches”
cafenero_777
0
54
#37 “Bluebird: High-performance SDN for Bare-metal Cloud Services”
cafenero_777
1
120
#39 “Profiling a warehouse-scale computer”
cafenero_777
0
40
#23 “VFP: A Virtual Switch Platform for Host SDN in the Public Cloud”
cafenero_777
0
220
#24 “Ananta: Cloud Scale Load Balancing”
cafenero_777
0
240
Other Decks in Technology
See All in Technology
Running JavaScript within Ruby
hmsk
3
400
「経験の点」の位置を意識したキャリア形成 / Career development with an awareness of the “point of experience” position
pauli
4
110
30代からでも遅くない! 内製開発の世界に飛び込み、最前線で戦うLLMアプリ開発エンジニアになろう
minorun365
PRO
15
4.7k
ガバクラのAWS長期継続割引 ~次の4/1に慌てないために~
hamijay_cloud
1
500
React ABC Questions
hirotomoyamada
0
560
Linuxのパッケージ管理とアップデート基礎知識
go_nishimoto
0
630
Dynamic Reteaming And Self Organization
miholovesq
3
670
Web Intelligence and Visual Media Analytics
weblyzard
PRO
1
5.9k
Porting PicoRuby to Another Microcontroller: ESP32
yuuu
4
490
ここはMCPの夜明けまえ
nwiizo
32
12k
PicoRabbit: a Tiny Presentation Device Powered by Ruby
harukasan
PRO
2
270
Conquering PDFs: document understanding beyond plain text
inesmontani
PRO
1
330
Featured
See All Featured
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
52
2.4k
Intergalactic Javascript Robots from Outer Space
tanoku
270
27k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Producing Creativity
orderedlist
PRO
344
40k
Done Done
chrislema
184
16k
Large-scale JavaScript Application Architecture
addyosmani
512
110k
4 Signs Your Business is Dying
shpigford
183
22k
A better future with KSS
kneath
239
17k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
32
5.4k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
760
We Have a Design System, Now What?
morganepeng
52
7.5k
Building Flexible Design Systems
yeseniaperezcruz
329
39k
Transcript
Research Paper Introduction #14 “The Network is Reliable ~An informal
survey of real-world communications failures~” ௨ࢉ#52 @cafenero_777 2020/09/24
Agenda • ରจ • ֓ཁͱಡ͏ͱͨ͠ཧ༝ • Introduction • େنΠϯϑϥ •
DC NW • ΫϥυNW • ϗεςΟϯάϓϩόΠμʔ • ҬNW • Global Routing Error • NIC/Driver • ΞϓϦέʔγϣϯ • ·ͱΊ
$ which • The Network is Reliable: An informal survey
of real-world communications failures • Peter Bailis, UC Berkeley, Kyle Kingsbury, Jepsen Networks • ACM queue: July 23, 2014 Volume 12, issue 7 • https://queue.acm.org/detail.cfm?id=2655736
֓ཁͱಡ͏ͱͨ͠ཧ༝ • ֓ཁ • ༷ʑͳࣄۀऀͰNWোʢpartitionʣ͕ൃੜ͍ͯ͠Δ • ্هʹىҼ͢ΔࢄγεςϜোͷࣄྫΛ͘ௐࠪɻ • ઃܭ࣌ʹΑ͘ߟ͑ͯʂʢઃܭޙৼΓฦͬͯʂʣ •
ಡ͏ͱͨ͠ཧ༝ • ࣮ࡍʹى͖ͨোʹ͍ͭͯ͘ɾ۩ମతʹݴٴ͞Ε͍ͯͨͨΊɻ • Կ͔ͷจͰҾ༻͞Ε͍ͯͨʢΕͨɻɻʣ
ࢄίϯϐϡʔςΟϯά҆શਆʁ http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf
ࢄίϯϐϡʔςΟϯά҆શਆʁ • “ଟ͘ͷਓࢄγεςϜΛ࠷ॳʹߏங͢Δ ࣌ɺҎԼͷ8߲Λఆͯ͠͠·͏ɻ͜ΕΒ Քಇதʹඞͣൃੜ͠ɺେΛҾ͖ى͜ ͠ɺ௧͍ࣦഊܦݧΛҾ͖ى͜͢ɻ” • NWোͰͳ͘ਓͷࢥ͍ࠐΈ՝ɾɾɾ • ͘ɾͨ·ʹਂ͘ݟΔඞཁੑ
http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf
http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf Where you are?
ཧ৭ʑ͋Δ͕ɺɺ • ݱঢ় • ີ݁߹ɾ৫ڞ༗͞Εͣ • ΞϓϦϨΠϠʔͰ݁ • ਪଌͱᷚ •
ݱ࣮ͷোࣄྫΛ·ͱΊΔ • ͔ͦ͜ΒֶͿ
େنࢄΠϯϑϥͷࣄྫ (1/2) • MS Data Center by MS research •
5.2ݸͷσόΠεނো/day, 40.8ݸͷϦϯΫো/day • म෮࣌ؒͷதԝ5(࠷େ1िؒ), ύέϩεதԝ59,000ύέοτ • NWੑͰ43%্->NWোͷҰൠతͳݪҼഉআʹࢸΒͣ • HPΤϯλʔϓϥΠζཧNW By HP labͰͷαϙʔτνέοτੳ • ଓؔ࿈νέοτ11.4%, ͦͷ͏ͪ14%࠷ߴ༏ઌ • ࠷ߴ༏ઌͷରԠ࣌ؒͷதԝ2.75࣌ؒɺશதԝ18 • Google Chubby (ࢄϩοΫγεςϜ for খ༰ྔࢄετϨʔδ) • 61ݸͷఀࢭ@700Λௐࠪɺ9ͭ30ඵҎ্ఀࢭɿ4ͭNWىҼɺ2ͭ”NWଓىҼΒ͍͠”
େنࢄΠϯϑϥͷࣄྫ (2/2) • googleࢄγεςϜͷlesson & advice by Je ff Deanޚେͷ৽Ϋϥελௐࠪʢ࠷ॳͷ1ʣ
• 5ϥοΫ͕ෆ҆ఆʢ50%ύέϩεʣ • 8ϝϯςʢ4ϝϯς30ؒϥϯμϜͳύέϩεͷՄೳੑʣ • 3ϧʔλোʢ1࣌ؒܧଓʣ • ΞυόΠε • ෳόʔδϣϯڝ߹Λߋ৽Ͱ͖ΔநԽϨΠϠʔͰٵऩͤ͞Δ • NWো(partition)෮چޙʹϨϓϦέʔτௐ • Amazon Dynamo (KVS) • ”ैདྷͷෳ͞ΕͨRDBͰNW partitionʹରॲͰ͖ͳ͍”ͱݴٴɻconsistencyΛ٘ਜ਼ʹͯ͠ͰavailabilityΛऔͬͨ • Yahoo! PNUTS/Sherpa (ཧࢄDB) • ڧ͍߹ੑͰ͋ΔλΠϜϥΠϯ߹ੑʢશϨίʔυ͕શϨϓϦΧಉ͡ॱংͰߋ৽ॲཧ͞ΕΔʣΛαϙʔτ • ->NWো(partition)ɾαʔόোͰ͠ΜͲ͍ͷͰऑ͍߹ੑʹมߋɾɾɾ https://research.cs.cornell.edu/ladis2009/talks/dean-keynote-ladis2009.pdf
DC NWো • ిݯো • ToR͕ยํམͪΔɺͳ͔ͥ͏ยํͷToRམͪΔɺϥοΫؒ௨৴͕ग़དྷͳ͍ΦϯσϚϯυαʔϏε͕ఀࢭɾɾɾ • ->ඞͣ͠ϦϯΫোΛ͙ͷͰͳ͍ʢMS SIGCOM paper͕ࣔࠦʣ
• BPDUϑϥου • ϝϯςதʹSTPϑϥοϓ͠ʢBPDUن֨తʹൃੜ͠ͳ͍ʣBPDUϑϥουൃੜɻ2࣌ؒαʔϏεఀࢭ • Bridge loop/Miscon fi guration/Broken MAC cache (github) • ʢଟஈSWͰͳ͘ʣूεΠονΛಋೖ -> ϧʔϓൃੜ->ϦϯΫແޮԽ->Կނ͔ར༻ଳҬ100%ுΓ͖ɾɾ • ઃఆϛεঢ়ଶͰ1ຊམͱ͢->োݕग़ػߏ͕શஅͤ͞Δ->18μϯ • εΠον͕MACΞυϨεͷΩϟογϡΛਖ਼͘͠ߋ৽Ͱ͖ͳ͍ͨΊϒϩʔυΩϟετ͢ΔϑΝʔϜόάɾɾɾ • MLAG/STP/STONITH (github) • ϕϯμʔ͕ूεΠονͷಛఆagentΛఀࢭ->linkΛshutग़དྷͣ->ਖ਼ৗʹLAG/STP/L2ϓϩτίϧॲཧͰ͖ͣ->STP࠶ܭࢉͰ90ඵϒϩοΫ • ϑΝΠϧαʔό (Pacemaker/DRBD)͕͓ޓ͍ఀࢭͯ͠ͱஅ->STONITHʢ૬खΛڧ੍rebootʣ->NW෮چޙʹ྆ܥdown->ϑΝΠϧΞΫηεग़དྷͣ • खಈ෮چʢϓϥΠϚϦϊʔυʹ߹ΘͤΔɻ྆ܥϓϥΠϚϦͳΒϩάௐࠪɺɺʣʹ5͔͔࣌ؒͬͨ
ΫϥυωοτϫʔΫࣄྫ (1/2) • Isolated MongoDB primary on EC2 • EC2
WestϦʔδϣϯͰNWো-> 1primary/2secondary͕->෮چޙʹݹ͍primary͕”্ॻ͖”ͨ͠->̎࣌ؒͷ ॻ͖ࠐΈଛࣦɻ • োࣗମҰൠతͳͷɺɺ • Amnesia split-brain on EC2 • Ұ൩Ͱsplit brain, ӡ༻νʔϜ͕ยܥΛrestartͰղܾ • MongoDB/ElasticSearch on EC2 • NWো->ಛఆϊʔυӨڹ->αʔϏεશͯʹӨڹՄೳੑ • ඵɾ݄ճͷbackendఀࢭ -> -45ͷαʔϏεఀࢭͱESΠϯσοΫεഁଛɺఀࢭ12-4ճ·ͰΤεΧϨʔτ
ΫϥυωοτϫʔΫࣄྫ (2/2) • AWS EBSఀࢭ (2011/04/21) • US West AZ͔ΒEBSͷτϥϑΟοΫΛγϑτ
-> ϧʔςΟϯάϙϦγʔϛε->primary/secondary NWΛಉ࣌ʹஅ -> EBSϛϥʔετʔϜൃੜ-> -> EC2 12࣌ؒఀࢭ, EBS 80࣌ؒఀࢭ • RDSఀࢭʢAZ failoverόάͷͨΊɺ2.5%͕ࣦഊʣɺHeroku 16-60࣌ؒఀࢭ • Isolated Redis Primary on EC2 • NWো->Twilio՝ۚPrimary Redis͕->secondaryঢ֨ͳ͠Ͱprimaryʹॻ͖ࠐΈ->෮چޙͷ࠶ ಉظͰprimaryߴෛՙ->primaryΛखಈͰ࠶ىಈ->ޡͬͨcon fi gͰىಈ͠ɺread onlyͰىಈ->Twilio APIݺͼग़͠Ͱސ٬ʹ࠶νϟʔδ͢Δ->40Ͱ1.1%ސ٬ʹաٻ->SMS/௨ͰΫϨΧ500υϧ ٻ+3500υϧΛ͑ΔͱٻडෆՄ
ϗεςΟϯάϓϩόΠμʔ • ҆ՁͰ৴པੑ͕ߴ͍ʢϋζʣ+NW/ServerཧऀͰ͋Δඞཁ͕͋Δ • GlusterFS split brain (Freistil IT) •
ϧʔλϑΝʔϜΣΞόάͰ50%-100%ύέϩε->GluasterFS͕split brain-> ෮چޙ2ͭ ͷσʔληοτΛෆ߹Λղܾग़དྷͣ ->म෮ޙʹτϥϑΟοΫٸ૿͠Webϊʔυߴෛՙ • ಗ໊ͳओཁϗεςΟϯάۀऀʢ100-200nodeنʣ • 90ؒͰ5ͭpartition͕࣌ؒൃੜ • ෦ͱ֎෦Λͭͳ͙NWোɺ෦ͱཧNWΛো
ҬNW • WANোɿϧʔτ͕গͳ͍߹ɺෳDCͳDRඞཁͳͲ • CENICʢCorporation for Education Network Initiatives in
Californiaʣௐࠪ • ΧϦϑΥϧχΞશͷϧʔλΛ̑ௐࠪʢϦϯΫোɺeBGP/tracerouteσʔλʣ • 500Ҏ্ͷNW partitionΛൃݟ • SW 6ʢதԝ2.7ɺ19.9@95%ileʣ • HW 8.2࣌ؒʢதԝ32ɺ3.7@95%ileʣ • PagerDuty on 2 EC2 region/Linode • CA෦AWSϐΞϦϯά͕ྼԽ->EC2ϊʔυ͕ଓྼԽ͠latency૿Ճ->ΫΦϥϜશஅ->ϝοηʔδͷσΟεύονఀࢭ • ઃܭతʹߟྀ͞Ε͍͕ͯͨɺ݁Ռతʹ18ར༻ෆՄɺAPIϦΫΤετυϩοϓࢯɺΫΦϥϜ෮چ·ͰϖʔδԆ
Global Routing Error • Cloud fl are • path/AnycastΛۦ͢Δ23ͷDC •
DDoSରࡦͱͯ͠ಛఆαΠζͷύέοτΛdropͤ͞ΔΑ͏FlowSpecͰશΤοδϧʔλʹୡ->ύέοτʹҰகͤͣ->Ϋϥογϡ͢Δ·ͰRAMফඅ͠ଓ͚ͨ->ࣗ ಈ࠶ىಈ͠ͳ͍ɺmgmtΞΫηεͰ͖ͳ͍->Ұ෦෮چτϥϑΟοΫूதͯ͠ߴෛՙ->·ͩϑΥʔϧόοΫ->ݱखಈ࠶ىಈʢ30ޙʹ։࢝ɺ1࣌ؒར༻ෆՄʣ • Level3 2011 • JuniperͷϑΝʔϜΣΞόάͷͨΊɺόοΫϘʔϯఀࢭ • Time Warner Cable RIM BlackBerry, UK ISP͕ΦϑϥΠϯ • Global BGP outages • 2008ʹύΩελϯςϨίϜ͕youtubeΛϒϩοΫ->ͦͷʢϒϩοΫ͞ΕͨʣϧʔτΛଞͷISPʹใ->ΞΫηεෆՄʢBGP hijackʣ • 2010ʹσϡʔΫେֶ͕BGPͷ࣮ݧతͳϑϥάΛςετ͢Δ͜ͱͰಉ༷ͷޮՌΛ֬ೝ • ൃදऀऍɿBGP hijackʢඇਖ਼نͳAS͔ΒউखʹBGPใʣ݁ߏى͖ͯΔ • ྫɿ2018ʹGoogle͕ʢࣗಈԽ͞ΕͨγεςϜͷόάʁͰʣىͨ݅͜͠ https://gigazine.net/news/20180711-shutting-down-bgp-hijack-factory/
NIC/Driver • Broadcom BCM5709 and Friends • BCM5709 • ड৴ύέοτdrop͢Δ͕ɺૹ৴drop͠ͳ͍ϑΝʔϜόά
• ->primaryॲཧʢड৴ʣͰ͖ͳ͍͕ɺsecondaryprimary͕ੜ͖ͯΔͱࢥ͍ࠐΉ->secondaryʹfallback͠ͳ͍->5࣌ؒఀࢭ -> Sven Ulland ࢯ͕Linux 2.6.32Ͱใࠂɺ2.6.38·ͰղܾͰ͖ͣɻ • BCM5709ʢαʔόʣ͕crash/bu ff erᷓΕ࣌ʹແؔʹPAUSEϑϨʔϜΛग़͢->ToR(BCM56314/BCM56820)͕֦େ->NWશମ͕ো • BCM57711ͰδϟϯϘϑϨʔϜͰߴෛՙ࣌ʹϨΠςϯγѱԽ->ESX on iSCSIͰݦࡏԽ • Intel 82574: Packet of Death • EEPROM͕ਖ਼͘͠ fl ashग़དྷͣ->SIPͷड৴ύέοτΛNIC͕ແޮԽʢௐ͕ࠪඇৗʹ͍͠ɺɺʣ->cold restartͰ෮׆ɾɾ • DriverىҼͷGlusterFS partition • ΞοϓάϨʔυޙʹFlusterFSϖΞͰNWো->LAGແޮԽͯ͠෮چ->12࣌ؒޙʹ࠶ൃ->υϥΠόىҼͱಛఆͯ͠͠->σʔλෆ߹/VMͷϑΝ ΠϧγεςϜσʔλഁଛൃੜ
ΞϓϦϨϕϧͷো (1/3) • ཧNWىҼ͚ͩͰͳ͍ɿϓϩάϥϜɾOSεέδϡʔϥԆɺߴෛՙϓϩηε • CPUߴෛՙͱαʔϏε • ElasticSearch࠶ىಈ->Ϋϥελׂ->split brain->Ϋϥελ෮چ->indexՃআग़དྷͣ->αʔόׂ͕ΓͯΒΕ͍ͯͳ͍indexΛ෮چ ͠Α͏ͱ͢ΔʢͰ͖ͳ͍ͷͰCPUۭճΓʣ->20ར༻ఀࢭɺ6࣌ؒαʔϏεԼ
• ͍GCఀࢭͱI/O • ESΫϥελͰGCى͖Δ->secondary node͕primary deadએݴ->ຊprimaryࢮΜͰ͍ͳ͍ͷͰsplit brain(dual master) • I/OݪҼͰGCఀࢭ->IO_WAIT͕࣌ؒ૿Ճ->split brain/write loss/indexഁଛ • MySQL overload & pacemaker segfault (github) • MySQL primaryෛՙ -> secondaryΛঢ֨->secondaryͷcold cache͕͔ͬͨ->primaryʹfailoverͨ͠Α͏ͱ͕ͨ͠खಈͰఀࢭ • ཌprimaryଆͷมߋ͕secondaryʹө͞Ε͍ͯͳ͍͜ͱΛൃݟ->Replication ManagerͰͷखಈ෮چதʹsegfault ->ࣗಈɾखಈϨϓ Ϧέʔγϣϯڝ߹ɺʢ֎෦ΩʔʹҰ؏ੑ͕ͳ͔ͬͨͨΊʣଞਓͷprivate repoΛදࣔɾɾɾ
ΞϓϦϨϕϧͷো (2/3) • DRDB split brain • 2nodeͷ߹ʢNW partition͞Εͨ߹ʹʣ͕࣮ࣗ֬ʹprimaryͰ͋Δͱݴ͑ͳ͍-> ྆node͕primary/onlineঢ়ଶͰॻ͖ࠐΈΛड͚ೖΕɺϑΝΠϧγεςϜϨϕϧͰ૬ҧൃੜ
• VoldDB on EC2 • NWো->split brain->dual primary->replica૬ҧ->ॏେͳσʔλଛࣦ • Mystery RabbitMQ Partition • ࠶ૹগͳ͘ɺϝοηʔδ҆ఆɺϊʔυؒଓ҆ఆɺͰpartition͢Δɻɻ • partitionݕग़timeoutΛ2ʹ͢ΔͱසݮΔ͕ɺpartitonશʹ͙͜ͱग़དྷͣɻṖɻ
ΞϓϦϨϕϧͷো (3/3) • ElasticSearch Discovery Failure on EC2 • 2node
ESΫϥελͰσΟεΧόϦϝοηʔδަʹ3ඵҎ্͔͔Δͱ1/ 10ͷ֬Ͱdual masterʹɻɻʢ߱֨खಈͷΈʣ • timeout15ඵʹͯ͠ղܾ
݁ɿզʑͲ͜ʹ͔͏ͷ͔ʁ • ࢀরͱͯ͠ͷ·ͱΊ • ϓϩηεɾαʔόɾNICɾεΠονɾϩʔΧϧɾάϩʔόϧ • NWো”ಥવ”དྷΔɻex: ఆظupdate࣌, ϝϯς࣌ •
ҰํͰɺpartiton͕ى͖ͳ͍NW/γεςϜ͋Δɻex: ۚ༥ܥʢ৻ॏͳΤϯδχΞϦϯά+NWٕज़ਐԽ+͓ۚʣ • Google/Amazon (ലେͳنͷͨΊɺҰͭҰͭͷHWίετ)Startup(༧ࢉ͕ݶΒΕΔ) • ༷ʑͳো͕ى͖ΔɻHuman errorΛؚΉݱ࣮ͷࢄγεςϜͷ͕ى͖Δ • ʢpartition͕ى͖ΔલʹʣϦεΫΛ࠶ߟ͢Δ͜ͱ͕ॏཁ • ϗϫΠτϘʔυͰಈ͖Λ͍ͳ͕Βɻ • PartitionରԠ͢Δͱଟ͘ͷ߹ϝϦοτ͕ಘΒΕΔ • partitionରԠͷՃϨΠςϯγͱɺʢࣄޙʹ͔͔Δௐ࣌ؒݮͷʣϝϦοτ
ิɿࠓͲ͖ʢ2020ʣͰΔͳΒʁ • ܗࣜख๏ • Unknownͳstateʢͦͷఆ্ٛʣଘࡏ͠ͳ͍͜ͱΛ୲อͰ͖Δ • Chaos Engineering • ࣮ϛεɺγεςϜ݁߹࣌ͷෆ߹Λൃݟɾमਖ਼Ͱ͖Δ
EOP