Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
#14 “The Network is Reliable"
Search
cafenero_777
June 14, 2023
Technology
0
98
#14 “The Network is Reliable"
ACM queue: July 23, 2014 Volume 12, issue 7
https://queue.acm.org/detail.cfm?id=2655736
cafenero_777
June 14, 2023
Tweet
Share
More Decks by cafenero_777
See All by cafenero_777
#51 “Empowering Azure Storage with RDMA”
cafenero_777
3
490
#49 “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems”
cafenero_777
2
120
#50 “Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction”
cafenero_777
0
130
#33 “Destroying networks for fun (and profit)”
cafenero_777
0
95
#34 “MTPSA: Multi-Tenant Programmable Switches”
cafenero_777
0
64
#37 “Bluebird: High-performance SDN for Bare-metal Cloud Services”
cafenero_777
1
130
#39 “Profiling a warehouse-scale computer”
cafenero_777
0
46
#23 “VFP: A Virtual Switch Platform for Host SDN in the Public Cloud”
cafenero_777
0
240
#24 “Ananta: Cloud Scale Load Balancing”
cafenero_777
0
280
Other Decks in Technology
See All in Technology
自作JSエンジンに推しプロポーザルを実装したい!
sajikix
1
180
【NoMapsTECH 2025】AI Edge Computing Workshop
akit37
0
210
Modern Linux
oracle4engineer
PRO
0
140
CDK CLIで使ってたあの機能、CDK Toolkit Libraryではどうやるの?
smt7174
4
190
会社紹介資料 / Sansan Company Profile
sansan33
PRO
6
380k
AI開発ツールCreateがAnythingになったよ
tendasato
0
130
roppongirb_20250911
igaiga
1
240
AI時代を生き抜くエンジニアキャリアの築き方 (AI-Native 時代、エンジニアという道は 「最大の挑戦の場」となる) / Building an Engineering Career to Thrive in the Age of AI (In the AI-Native Era, the Path of Engineering Becomes the Ultimate Arena of Challenge)
jeongjaesoon
0
200
下手な強制、ダメ!絶対! 「ガードレール」を「檻」にさせない"ガバナンス"の取り方とは?
tsukaman
2
450
Automating Web Accessibility Testing with AI Agents
maminami373
0
1.3k
Django's GeneratedField by example - DjangoCon US 2025
pauloxnet
0
150
プラットフォーム転換期におけるGitHub Copilot活用〜Coding agentがそれを加速するか〜 / Leveraging GitHub Copilot During Platform Transition Periods
aeonpeople
1
170
Featured
See All Featured
Music & Morning Musume
bryan
46
6.8k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
30
9.7k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.4k
Producing Creativity
orderedlist
PRO
347
40k
Agile that works and the tools we love
rasmusluckow
330
21k
Art, The Web, and Tiny UX
lynnandtonic
303
21k
Balancing Empowerment & Direction
lara
3
620
Building Adaptive Systems
keathley
43
2.7k
GraphQLとの向き合い方2022年版
quramy
49
14k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.6k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
127
53k
Transcript
Research Paper Introduction #14 “The Network is Reliable ~An informal
survey of real-world communications failures~” ௨ࢉ#52 @cafenero_777 2020/09/24
Agenda • ରจ • ֓ཁͱಡ͏ͱͨ͠ཧ༝ • Introduction • େنΠϯϑϥ •
DC NW • ΫϥυNW • ϗεςΟϯάϓϩόΠμʔ • ҬNW • Global Routing Error • NIC/Driver • ΞϓϦέʔγϣϯ • ·ͱΊ
$ which • The Network is Reliable: An informal survey
of real-world communications failures • Peter Bailis, UC Berkeley, Kyle Kingsbury, Jepsen Networks • ACM queue: July 23, 2014 Volume 12, issue 7 • https://queue.acm.org/detail.cfm?id=2655736
֓ཁͱಡ͏ͱͨ͠ཧ༝ • ֓ཁ • ༷ʑͳࣄۀऀͰNWোʢpartitionʣ͕ൃੜ͍ͯ͠Δ • ্هʹىҼ͢ΔࢄγεςϜোͷࣄྫΛ͘ௐࠪɻ • ઃܭ࣌ʹΑ͘ߟ͑ͯʂʢઃܭޙৼΓฦͬͯʂʣ •
ಡ͏ͱͨ͠ཧ༝ • ࣮ࡍʹى͖ͨোʹ͍ͭͯ͘ɾ۩ମతʹݴٴ͞Ε͍ͯͨͨΊɻ • Կ͔ͷจͰҾ༻͞Ε͍ͯͨʢΕͨɻɻʣ
ࢄίϯϐϡʔςΟϯά҆શਆʁ http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf
ࢄίϯϐϡʔςΟϯά҆શਆʁ • “ଟ͘ͷਓࢄγεςϜΛ࠷ॳʹߏங͢Δ ࣌ɺҎԼͷ8߲Λఆͯ͠͠·͏ɻ͜ΕΒ Քಇதʹඞͣൃੜ͠ɺେΛҾ͖ى͜ ͠ɺ௧͍ࣦഊܦݧΛҾ͖ى͜͢ɻ” • NWোͰͳ͘ਓͷࢥ͍ࠐΈ՝ɾɾɾ • ͘ɾͨ·ʹਂ͘ݟΔඞཁੑ
http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf
http://nighthacks.com/jag/res/Fallacies.html http://www.rgoarchitects.com/Files/fallacies.pdf Where you are?
ཧ৭ʑ͋Δ͕ɺɺ • ݱঢ় • ີ݁߹ɾ৫ڞ༗͞Εͣ • ΞϓϦϨΠϠʔͰ݁ • ਪଌͱᷚ •
ݱ࣮ͷোࣄྫΛ·ͱΊΔ • ͔ͦ͜ΒֶͿ
େنࢄΠϯϑϥͷࣄྫ (1/2) • MS Data Center by MS research •
5.2ݸͷσόΠεނো/day, 40.8ݸͷϦϯΫো/day • म෮࣌ؒͷதԝ5(࠷େ1िؒ), ύέϩεதԝ59,000ύέοτ • NWੑͰ43%্->NWোͷҰൠతͳݪҼഉআʹࢸΒͣ • HPΤϯλʔϓϥΠζཧNW By HP labͰͷαϙʔτνέοτੳ • ଓؔ࿈νέοτ11.4%, ͦͷ͏ͪ14%࠷ߴ༏ઌ • ࠷ߴ༏ઌͷରԠ࣌ؒͷதԝ2.75࣌ؒɺશதԝ18 • Google Chubby (ࢄϩοΫγεςϜ for খ༰ྔࢄετϨʔδ) • 61ݸͷఀࢭ@700Λௐࠪɺ9ͭ30ඵҎ্ఀࢭɿ4ͭNWىҼɺ2ͭ”NWଓىҼΒ͍͠”
େنࢄΠϯϑϥͷࣄྫ (2/2) • googleࢄγεςϜͷlesson & advice by Je ff Deanޚେͷ৽Ϋϥελௐࠪʢ࠷ॳͷ1ʣ
• 5ϥοΫ͕ෆ҆ఆʢ50%ύέϩεʣ • 8ϝϯςʢ4ϝϯς30ؒϥϯμϜͳύέϩεͷՄೳੑʣ • 3ϧʔλোʢ1࣌ؒܧଓʣ • ΞυόΠε • ෳόʔδϣϯڝ߹Λߋ৽Ͱ͖ΔநԽϨΠϠʔͰٵऩͤ͞Δ • NWো(partition)෮چޙʹϨϓϦέʔτௐ • Amazon Dynamo (KVS) • ”ैདྷͷෳ͞ΕͨRDBͰNW partitionʹରॲͰ͖ͳ͍”ͱݴٴɻconsistencyΛ٘ਜ਼ʹͯ͠ͰavailabilityΛऔͬͨ • Yahoo! PNUTS/Sherpa (ཧࢄDB) • ڧ͍߹ੑͰ͋ΔλΠϜϥΠϯ߹ੑʢશϨίʔυ͕શϨϓϦΧಉ͡ॱংͰߋ৽ॲཧ͞ΕΔʣΛαϙʔτ • ->NWো(partition)ɾαʔόোͰ͠ΜͲ͍ͷͰऑ͍߹ੑʹมߋɾɾɾ https://research.cs.cornell.edu/ladis2009/talks/dean-keynote-ladis2009.pdf
DC NWো • ిݯো • ToR͕ยํམͪΔɺͳ͔ͥ͏ยํͷToRམͪΔɺϥοΫؒ௨৴͕ग़དྷͳ͍ΦϯσϚϯυαʔϏε͕ఀࢭɾɾɾ • ->ඞͣ͠ϦϯΫোΛ͙ͷͰͳ͍ʢMS SIGCOM paper͕ࣔࠦʣ
• BPDUϑϥου • ϝϯςதʹSTPϑϥοϓ͠ʢBPDUن֨తʹൃੜ͠ͳ͍ʣBPDUϑϥουൃੜɻ2࣌ؒαʔϏεఀࢭ • Bridge loop/Miscon fi guration/Broken MAC cache (github) • ʢଟஈSWͰͳ͘ʣूεΠονΛಋೖ -> ϧʔϓൃੜ->ϦϯΫແޮԽ->Կނ͔ར༻ଳҬ100%ுΓ͖ɾɾ • ઃఆϛεঢ়ଶͰ1ຊམͱ͢->োݕग़ػߏ͕શஅͤ͞Δ->18μϯ • εΠον͕MACΞυϨεͷΩϟογϡΛਖ਼͘͠ߋ৽Ͱ͖ͳ͍ͨΊϒϩʔυΩϟετ͢ΔϑΝʔϜόάɾɾɾ • MLAG/STP/STONITH (github) • ϕϯμʔ͕ूεΠονͷಛఆagentΛఀࢭ->linkΛshutग़དྷͣ->ਖ਼ৗʹLAG/STP/L2ϓϩτίϧॲཧͰ͖ͣ->STP࠶ܭࢉͰ90ඵϒϩοΫ • ϑΝΠϧαʔό (Pacemaker/DRBD)͕͓ޓ͍ఀࢭͯ͠ͱஅ->STONITHʢ૬खΛڧ੍rebootʣ->NW෮چޙʹ྆ܥdown->ϑΝΠϧΞΫηεग़དྷͣ • खಈ෮چʢϓϥΠϚϦϊʔυʹ߹ΘͤΔɻ྆ܥϓϥΠϚϦͳΒϩάௐࠪɺɺʣʹ5͔͔࣌ؒͬͨ
ΫϥυωοτϫʔΫࣄྫ (1/2) • Isolated MongoDB primary on EC2 • EC2
WestϦʔδϣϯͰNWো-> 1primary/2secondary͕->෮چޙʹݹ͍primary͕”্ॻ͖”ͨ͠->̎࣌ؒͷ ॻ͖ࠐΈଛࣦɻ • োࣗମҰൠతͳͷɺɺ • Amnesia split-brain on EC2 • Ұ൩Ͱsplit brain, ӡ༻νʔϜ͕ยܥΛrestartͰղܾ • MongoDB/ElasticSearch on EC2 • NWো->ಛఆϊʔυӨڹ->αʔϏεશͯʹӨڹՄೳੑ • ඵɾ݄ճͷbackendఀࢭ -> -45ͷαʔϏεఀࢭͱESΠϯσοΫεഁଛɺఀࢭ12-4ճ·ͰΤεΧϨʔτ
ΫϥυωοτϫʔΫࣄྫ (2/2) • AWS EBSఀࢭ (2011/04/21) • US West AZ͔ΒEBSͷτϥϑΟοΫΛγϑτ
-> ϧʔςΟϯάϙϦγʔϛε->primary/secondary NWΛಉ࣌ʹஅ -> EBSϛϥʔετʔϜൃੜ-> -> EC2 12࣌ؒఀࢭ, EBS 80࣌ؒఀࢭ • RDSఀࢭʢAZ failoverόάͷͨΊɺ2.5%͕ࣦഊʣɺHeroku 16-60࣌ؒఀࢭ • Isolated Redis Primary on EC2 • NWো->Twilio՝ۚPrimary Redis͕->secondaryঢ֨ͳ͠Ͱprimaryʹॻ͖ࠐΈ->෮چޙͷ࠶ ಉظͰprimaryߴෛՙ->primaryΛखಈͰ࠶ىಈ->ޡͬͨcon fi gͰىಈ͠ɺread onlyͰىಈ->Twilio APIݺͼग़͠Ͱސ٬ʹ࠶νϟʔδ͢Δ->40Ͱ1.1%ސ٬ʹաٻ->SMS/௨ͰΫϨΧ500υϧ ٻ+3500υϧΛ͑ΔͱٻडෆՄ
ϗεςΟϯάϓϩόΠμʔ • ҆ՁͰ৴པੑ͕ߴ͍ʢϋζʣ+NW/ServerཧऀͰ͋Δඞཁ͕͋Δ • GlusterFS split brain (Freistil IT) •
ϧʔλϑΝʔϜΣΞόάͰ50%-100%ύέϩε->GluasterFS͕split brain-> ෮چޙ2ͭ ͷσʔληοτΛෆ߹Λղܾग़དྷͣ ->म෮ޙʹτϥϑΟοΫٸ૿͠Webϊʔυߴෛՙ • ಗ໊ͳओཁϗεςΟϯάۀऀʢ100-200nodeنʣ • 90ؒͰ5ͭpartition͕࣌ؒൃੜ • ෦ͱ֎෦Λͭͳ͙NWোɺ෦ͱཧNWΛো
ҬNW • WANোɿϧʔτ͕গͳ͍߹ɺෳDCͳDRඞཁͳͲ • CENICʢCorporation for Education Network Initiatives in
Californiaʣௐࠪ • ΧϦϑΥϧχΞશͷϧʔλΛ̑ௐࠪʢϦϯΫোɺeBGP/tracerouteσʔλʣ • 500Ҏ্ͷNW partitionΛൃݟ • SW 6ʢதԝ2.7ɺ19.9@95%ileʣ • HW 8.2࣌ؒʢதԝ32ɺ3.7@95%ileʣ • PagerDuty on 2 EC2 region/Linode • CA෦AWSϐΞϦϯά͕ྼԽ->EC2ϊʔυ͕ଓྼԽ͠latency૿Ճ->ΫΦϥϜશஅ->ϝοηʔδͷσΟεύονఀࢭ • ઃܭతʹߟྀ͞Ε͍͕ͯͨɺ݁Ռతʹ18ར༻ෆՄɺAPIϦΫΤετυϩοϓࢯɺΫΦϥϜ෮چ·ͰϖʔδԆ
Global Routing Error • Cloud fl are • path/AnycastΛۦ͢Δ23ͷDC •
DDoSରࡦͱͯ͠ಛఆαΠζͷύέοτΛdropͤ͞ΔΑ͏FlowSpecͰશΤοδϧʔλʹୡ->ύέοτʹҰகͤͣ->Ϋϥογϡ͢Δ·ͰRAMফඅ͠ଓ͚ͨ->ࣗ ಈ࠶ىಈ͠ͳ͍ɺmgmtΞΫηεͰ͖ͳ͍->Ұ෦෮چτϥϑΟοΫूதͯ͠ߴෛՙ->·ͩϑΥʔϧόοΫ->ݱखಈ࠶ىಈʢ30ޙʹ։࢝ɺ1࣌ؒར༻ෆՄʣ • Level3 2011 • JuniperͷϑΝʔϜΣΞόάͷͨΊɺόοΫϘʔϯఀࢭ • Time Warner Cable RIM BlackBerry, UK ISP͕ΦϑϥΠϯ • Global BGP outages • 2008ʹύΩελϯςϨίϜ͕youtubeΛϒϩοΫ->ͦͷʢϒϩοΫ͞ΕͨʣϧʔτΛଞͷISPʹใ->ΞΫηεෆՄʢBGP hijackʣ • 2010ʹσϡʔΫେֶ͕BGPͷ࣮ݧతͳϑϥάΛςετ͢Δ͜ͱͰಉ༷ͷޮՌΛ֬ೝ • ൃදऀऍɿBGP hijackʢඇਖ਼نͳAS͔ΒউखʹBGPใʣ݁ߏى͖ͯΔ • ྫɿ2018ʹGoogle͕ʢࣗಈԽ͞ΕͨγεςϜͷόάʁͰʣىͨ݅͜͠ https://gigazine.net/news/20180711-shutting-down-bgp-hijack-factory/
NIC/Driver • Broadcom BCM5709 and Friends • BCM5709 • ड৴ύέοτdrop͢Δ͕ɺૹ৴drop͠ͳ͍ϑΝʔϜόά
• ->primaryॲཧʢड৴ʣͰ͖ͳ͍͕ɺsecondaryprimary͕ੜ͖ͯΔͱࢥ͍ࠐΉ->secondaryʹfallback͠ͳ͍->5࣌ؒఀࢭ -> Sven Ulland ࢯ͕Linux 2.6.32Ͱใࠂɺ2.6.38·ͰղܾͰ͖ͣɻ • BCM5709ʢαʔόʣ͕crash/bu ff erᷓΕ࣌ʹແؔʹPAUSEϑϨʔϜΛग़͢->ToR(BCM56314/BCM56820)͕֦େ->NWશମ͕ো • BCM57711ͰδϟϯϘϑϨʔϜͰߴෛՙ࣌ʹϨΠςϯγѱԽ->ESX on iSCSIͰݦࡏԽ • Intel 82574: Packet of Death • EEPROM͕ਖ਼͘͠ fl ashग़དྷͣ->SIPͷड৴ύέοτΛNIC͕ແޮԽʢௐ͕ࠪඇৗʹ͍͠ɺɺʣ->cold restartͰ෮׆ɾɾ • DriverىҼͷGlusterFS partition • ΞοϓάϨʔυޙʹFlusterFSϖΞͰNWো->LAGແޮԽͯ͠෮چ->12࣌ؒޙʹ࠶ൃ->υϥΠόىҼͱಛఆͯ͠͠->σʔλෆ߹/VMͷϑΝ ΠϧγεςϜσʔλഁଛൃੜ
ΞϓϦϨϕϧͷো (1/3) • ཧNWىҼ͚ͩͰͳ͍ɿϓϩάϥϜɾOSεέδϡʔϥԆɺߴෛՙϓϩηε • CPUߴෛՙͱαʔϏε • ElasticSearch࠶ىಈ->Ϋϥελׂ->split brain->Ϋϥελ෮چ->indexՃআग़དྷͣ->αʔόׂ͕ΓͯΒΕ͍ͯͳ͍indexΛ෮چ ͠Α͏ͱ͢ΔʢͰ͖ͳ͍ͷͰCPUۭճΓʣ->20ར༻ఀࢭɺ6࣌ؒαʔϏεԼ
• ͍GCఀࢭͱI/O • ESΫϥελͰGCى͖Δ->secondary node͕primary deadએݴ->ຊprimaryࢮΜͰ͍ͳ͍ͷͰsplit brain(dual master) • I/OݪҼͰGCఀࢭ->IO_WAIT͕࣌ؒ૿Ճ->split brain/write loss/indexഁଛ • MySQL overload & pacemaker segfault (github) • MySQL primaryෛՙ -> secondaryΛঢ֨->secondaryͷcold cache͕͔ͬͨ->primaryʹfailoverͨ͠Α͏ͱ͕ͨ͠खಈͰఀࢭ • ཌprimaryଆͷมߋ͕secondaryʹө͞Ε͍ͯͳ͍͜ͱΛൃݟ->Replication ManagerͰͷखಈ෮چதʹsegfault ->ࣗಈɾखಈϨϓ Ϧέʔγϣϯڝ߹ɺʢ֎෦ΩʔʹҰ؏ੑ͕ͳ͔ͬͨͨΊʣଞਓͷprivate repoΛදࣔɾɾɾ
ΞϓϦϨϕϧͷো (2/3) • DRDB split brain • 2nodeͷ߹ʢNW partition͞Εͨ߹ʹʣ͕࣮ࣗ֬ʹprimaryͰ͋Δͱݴ͑ͳ͍-> ྆node͕primary/onlineঢ়ଶͰॻ͖ࠐΈΛड͚ೖΕɺϑΝΠϧγεςϜϨϕϧͰ૬ҧൃੜ
• VoldDB on EC2 • NWো->split brain->dual primary->replica૬ҧ->ॏେͳσʔλଛࣦ • Mystery RabbitMQ Partition • ࠶ૹগͳ͘ɺϝοηʔδ҆ఆɺϊʔυؒଓ҆ఆɺͰpartition͢Δɻɻ • partitionݕग़timeoutΛ2ʹ͢ΔͱසݮΔ͕ɺpartitonશʹ͙͜ͱग़དྷͣɻṖɻ
ΞϓϦϨϕϧͷো (3/3) • ElasticSearch Discovery Failure on EC2 • 2node
ESΫϥελͰσΟεΧόϦϝοηʔδަʹ3ඵҎ্͔͔Δͱ1/ 10ͷ֬Ͱdual masterʹɻɻʢ߱֨खಈͷΈʣ • timeout15ඵʹͯ͠ղܾ
݁ɿզʑͲ͜ʹ͔͏ͷ͔ʁ • ࢀরͱͯ͠ͷ·ͱΊ • ϓϩηεɾαʔόɾNICɾεΠονɾϩʔΧϧɾάϩʔόϧ • NWো”ಥવ”དྷΔɻex: ఆظupdate࣌, ϝϯς࣌ •
ҰํͰɺpartiton͕ى͖ͳ͍NW/γεςϜ͋Δɻex: ۚ༥ܥʢ৻ॏͳΤϯδχΞϦϯά+NWٕज़ਐԽ+͓ۚʣ • Google/Amazon (ലେͳنͷͨΊɺҰͭҰͭͷHWίετ)Startup(༧ࢉ͕ݶΒΕΔ) • ༷ʑͳো͕ى͖ΔɻHuman errorΛؚΉݱ࣮ͷࢄγεςϜͷ͕ى͖Δ • ʢpartition͕ى͖ΔલʹʣϦεΫΛ࠶ߟ͢Δ͜ͱ͕ॏཁ • ϗϫΠτϘʔυͰಈ͖Λ͍ͳ͕Βɻ • PartitionରԠ͢Δͱଟ͘ͷ߹ϝϦοτ͕ಘΒΕΔ • partitionରԠͷՃϨΠςϯγͱɺʢࣄޙʹ͔͔Δௐ࣌ؒݮͷʣϝϦοτ
ิɿࠓͲ͖ʢ2020ʣͰΔͳΒʁ • ܗࣜख๏ • Unknownͳstateʢͦͷఆ্ٛʣଘࡏ͠ͳ͍͜ͱΛ୲อͰ͖Δ • Chaos Engineering • ࣮ϛεɺγεςϜ݁߹࣌ͷෆ߹Λൃݟɾमਖ਼Ͱ͖Δ
EOP