Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
#24 “Ananta: Cloud Scale Load Balancing”
Search
cafenero_777
June 19, 2023
Technology
0
320
#24 “Ananta: Cloud Scale Load Balancing”
ACM SIGCOM ’13
https://dl.acm.org/doi/10.1145/2534169.2486026
cafenero_777
June 19, 2023
Tweet
Share
More Decks by cafenero_777
See All by cafenero_777
#51 “Empowering Azure Storage with RDMA”
cafenero_777
3
530
#49 “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems”
cafenero_777
2
120
#50 “Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction”
cafenero_777
0
150
#33 “Destroying networks for fun (and profit)”
cafenero_777
0
110
#34 “MTPSA: Multi-Tenant Programmable Switches”
cafenero_777
0
77
#37 “Bluebird: High-performance SDN for Bare-metal Cloud Services”
cafenero_777
1
150
#39 “Profiling a warehouse-scale computer”
cafenero_777
0
57
#23 “VFP: A Virtual Switch Platform for Host SDN in the Public Cloud”
cafenero_777
0
260
#25 “Swift: Delay is Simple and Effective for Congestion Control in the Datacenter”
cafenero_777
0
180
Other Decks in Technology
See All in Technology
エンジニアとマネジメントの距離/Engineering and Management
ikuodanaka
3
650
JuliaTokaiとしてはこれが最後かもしれない(仮) for NGK2026S
antimon2
0
120
The Engineer with a Three-Year Cycle - 2
e99h2121
0
190
BiDiってなんだ?
tomorrowkey
2
490
漸進的過負荷の原則
sansantech
PRO
3
400
SOC2は、取った瞬間よりその後が面白い
3flower
1
230
Amazon Bedrock AgentCore EvaluationsでAIエージェントを評価してみよう!
yuu551
0
180
かわいい身体と声を持つ そういうものに私はなりたい
yoshimura_datam
0
520
Claude Codeベストプラクティスまとめ
minorun365
49
27k
Hardware/Software Co-design: Motivations and reflections with respect to security
bcantrill
1
260
3リポジトリーを2ヶ月でモノレポ化した話 / How I turned 3 repositories into a monorepo in 2 months
kubode
0
120
開発メンバーが語るFindy Conferenceの裏側とこれから
sontixyou
2
250
Featured
See All Featured
Why You Should Never Use an ORM
jnunemaker
PRO
61
9.7k
Fireside Chat
paigeccino
41
3.8k
Code Review Best Practice
trishagee
74
19k
The agentic SEO stack - context over prompts
schlessera
0
600
Rails Girls Zürich Keynote
gr2m
96
14k
Money Talks: Using Revenue to Get Sh*t Done
nikkihalliwell
0
140
Into the Great Unknown - MozCon
thekraken
40
2.2k
Design in an AI World
tapps
0
130
The innovator’s Mindset - Leading Through an Era of Exponential Change - McGill University 2025
jdejongh
PRO
1
85
Navigating Algorithm Shifts & AI Overviews - #SMXNext
aleyda
0
1.1k
First, design no harm
axbom
PRO
2
1.1k
DevOps and Value Stream Thinking: Enabling flow, efficiency and business value
helenjbeal
1
86
Transcript
Research Paper Introduction #24 “Ananta: Cloud Scale Load Balancing” ௨ࢉ#75
@cafenero_777 2021/06/24 1
Agenda • ରจ • ֓ཁͱಡ͏ͱͨ͠ཧ༝ 1. INTRODUCTION 2. BACKGROUND 3.
DESIGN 4. IMPLEMENTATION 5. MEASUREMENTS 6. OPERATIONAL EXPERIENCE 7. RELATED WORK 8. CONCLUSION 2
ରจ • Ananta: Cloud Scale Load Balancing • Parveen Patel,
Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, Changhoon Kim, Naveen Karri • Microsoft • ACM SIGCOM ’13 • https://dl.acm.org/doi/10.1145/2534169.2486026 3
֓ཁͱಡ͏ͱͨ͠ཧ༝ • ֓ཁ • Ananta: Scalable L4LB (DSR/NAT) • ୯ҰVIPͰ100Gbps,
߹ܭͰ1TbpsҎ্ͷଳҬ෯ • Azure্Ͱಈ࡞ • ಡ͏ͱͨ͠ཧ༝ͱײ • AzureͷVFPจʹҾ༻ • ଞͷLB (Maglevͱ͔)Ͱ݁ߏҾ༻͞Ε͍ͯͨͷͰɻ 4 https://www.connectedpapers.com/main/5c295df1a7f302c97f6f379eab6abba592811d42/Ananta-cloud-scale-load-balancing/graph Ananta Maglev SilkRoad Beamer Faild Middleboxܥ ࢄɾߴޮܥ
1. Introduction • ΫϥυίϯϐϡʔςΟϯάͷීٴ • ߴ͍Քಇ (SLA), ϚϧνςφϯτɺେنτϥϑΟοΫ • 1VIP
100Gbps, 1000host/VIP, 6~60ճૢ࡞/1min • શSLAҧ/োͷ36%LBؔ࿈ • Ananta (αϯεΫϦοτޠͰແݶ) • Scalable L4 LB (NAT/DSR) • D-plane: ECMP (in NW), LB, NAT (in VFP/HV) • C-plane: SDN/Paxos, S-NAT࿈ܞ • 2011/09ʹAzureʹ100ಋೖ, 1Tbps, 100k VIPs • L4LB@Cloud, NWࢄγεςϜͱsclaingʹ͍ͭͯɺଌఆ݁Ռͱӡ༻݁ՌΛհ 5
2. BACKGROUND • Data Center Clos NW: 10Gαʔό*40kɺoversub 1:4, 400Gbps@Border
• VIPτϥϑΟοΫͷੑ࣭ • 44%VIPτϥϑΟοΫʢDC:DC֎=2:1ʣ • DCؒin/out1:1, σʔλಉظܥ • ཁ݅·ͱΊ 1. “Scale, Scale and Scale”: • ίετʢαʔόίετͷ1%, 400ଟ͗͢ɻʣ • ࠷େ1VIP 100Gbps & 1M current-conn, 100ճઃఆมߋ/ 2. ৴པੑ: N+1ߏͰͷࣗಈճ෮ɺϝϯςରԠ 3. “Any Service Anywhere”: L2υϝΠϯ੍ݶʹറΒΕͳ͍Α͏ʹ͢Δ 4. ςφϯτ: LBڞ༗ʹΑΔDoSӨڹʢଞͷސ٬͕ଳҬΛୣΘΕΔʣͷରࡦ 6 400Gbps 100Tbps
3. DESIGN (1/5) Principles & Architecture 7 • Scale outͰ͖ΔΑ͏ʹ
• RouterͷΑ͏ʹϑϩʔҡ࣋ػߏΛ࣋ͨͳ͍Α͏ʹ͢Δ • શಉظ͕ඞཁͳͷΘͳ͍ • i.e. WRR (Weighted Round Robin)ͱWeighted Random • ͍͠ॲཧHVଆͰΔʢΦϑϩʔυ͢Δʣ • ACL, Rate Limit, Metering • Ananta Manager (AM), Multiplexer (Mux), Host Agent (HA) • Inbound: IP-in-IP, NAT and DSR • Outbound: DIP->VIPVIP:sportͷmappingΛMUXͱHAͰಉظ͓ͯ͘͠ VIPใ/ECMP Selection/IP-in-IP L3 Routing Decap/DNAT NAT͠ DSR (Encapͳ͠) sportͱVIPΛཁٻ sportͱVIPΛઃఆ dportͱVIPͰ VMʹৼΓ͚
3. DESIGN (2/5) Principles & Architecture 8 • Fastpath: VIP
to VIP௨৴ɿLBΛbypath͠ɺHVؒͰ௨৴ͤ͞Δ • ࠷ॳLBΛ௨ͯ͠௨৴ • 3WHSྃ͢ΔͱDIP mappingใΛϦμΠϨΫτ • HA͕௨৴ͤ͞Δ • ҎޙLBΛ௨Βͳ͍ • ͬऔΓରࡦඞཁ ͜ͷ௨৴DIP2ͱmapping͞ΕͯΔΑ DIP1ඥ͚ DIP1/DIP2௨৴
3. DESIGN (3/5) Mux/Host Agent 9 • Mux Pool (Muxͷηοτ)
• Mux: BGP Speaker: VIPΛใɻো࣌ܦ࿏ॖୀɻTCP MD5ೝূ • AM͕VIP/DIP mappingΛMuxʹσϓϩΠɻ5tupleͰselection, hashؔɾseedશMuxͰڞ௨ʢECMPͰͲͷMuxʹ౸ୡͯ͠ಉ͡ॲཧΛอূʣ • ҰmappingΛࢀর͞ΕΔͱϑϩʔΛอ࣋ɻͨͩ͠ϝϞϦׂྔผʢSYN-Flood߈ܸରࡦʣ • ৴པͰ͖Δϑϩʔɿෳύέοτ->timeoutΊʹ͢Δ • ৴པͰ͖ͳ͍ϑϩʔɿ1ύέοτ->timeoutΊʹ͢Δ • Mux͕μϯ͢ΔͱECMPΨϥΨϥϙϯ • μϯதʹmappingมߋ͞ΕΔͱϑϩʔҡ࣋Ͱ͖ͳ͍ ->DHT (Distributed hash table)Λར༻ • Host Agent: શHV্ʹଘࡏɺFastpath, NAT, Health checkΛߦ͏ ʢP.8ͷઆ໌ʣ • ϙʔτͷ࠶ར༻ػೳ • Health checkMuxͰͳ͘HAଆͰΔɻ
3. DESIGN (4/5) Ananta Manager/Tenant Isolation 10 • Ananta Manager
(AM) • Paxosϕʔεͷࢄίϯτϩʔϥ • 5ϨϓϦΧͰՔಇɺ3ϨϓϦΧҎ্Ͱਖ਼ৗॲཧ • S-NAT: portׂΛόϧΫॲཧ • ςφϯτ • Muxຖʹಠཱ֤ͯ͠ςφϯτΞΠιϨʔγϣϯΛ࣮͢Εྑ͍ • AM: ཁٻFCFS(ઌணॱ: fi rst-come- fi rst-serve)͞ΕΔɻ͔ͭɺಉ͡Α͏ͳ৽نϦΫΤετऔΓԼ͛ɻ(2) • Mux: దͳଳҬ෯Λ͑ͨ߹ɺաଳҬʹൺྫͨ֬͠Ͱdrop and rate limit͢Δ • Top talker(Ұ൪௨৴͍ͯ͠Δ) VIPΛMux͔ΒҠಈͤ͞Δ
3. DESIGN (5/5) Alternatives 11 • DNS-based LB • ෛՙࢄͷࣄલ༧ଌ͕͍͠ʢClient͔ΒͷϦΫΤετ͕ภΔʣ???
• DNSΩϟογϡফ͑Δ·Ͱ͕͔͔࣌ؒΔ • stateful (NATͳͲ)͕Ͱ͖ͳ͍ • OpenFlow-based LB • ࢢൢOpenFlowσόΠεͰ2-4kϑϩʔ·ͰʢMux~Mϑϩʔঢ়ଶΛอ͍࣋ͨ͠ʣ • ςφϯτͷػೳ • BGPใͰ͖ͳ͍ʢAMʹͤΔʁʣ
4. IMPLEMENTATION • AM: Ԡੑॏཁ • SEDA (Staged event-driven Arch.)తͳϩοΫϑϦʔઃܭ
• thread poolڞ༗ʢ૯੍ݶʣ • ༏ઌʢྫɿVIP࡞༏ઌʣ • Paxos SDK + Discovery + Health MonitoringͰ࣮ • ϓϥΠϚϦ͕ॲཧΛߦ͏͜ͱΛอূ • upgrade࣌ʹAMΠϯελϯε͕1ͭҎ্མͪͳ͍͜ͱΛอূ • Mux: ΧʔωϧʢυϥΠόʣͰͷύέοτॲཧ + ϢʔβϞʔυͷBGPॲཧ • ΧʔωϧػೳΛͦͷ··͏: IPIP/RSS/IPv6 etc • 1VIPͰ20k DIP, 1.6M SNAT port mapping. ~Mͷಉ࣌ίωΫγϣϯใΛอ࣋ 12 *5 *8 *all
5. MEASUREMENTS Micro-benchmark 13 10VM * 2 tenantͰ1MB௨৴/connection ͔ᷮʹHostෛՙ͕૿͑Δ͕ɺMuxෛՙେ෯ʹԼΔ 10VM
* 5 tenant (baseτϥϑΟοΫ+SYN- fl ood * 10ճ) தʙߴෛՙͰDoSͷݟ͚͕ͭ͘ʹ͘͘ͳΔɻ ΄΅શͯ75msҎʹऩ·Δ ϙʔτ֬อͰ͖ͳ͍߹Ճ͕࣌ؒlong-tailͰ͔͔Δ Fastpath༗Γແ͠ͰͷCPUෛՙൺֱ SYN- fl ood Attack Mitigation S-NAT·ͰͷϨΠςϯγʔ
5. MEASUREMENTS Real World Data (1/2) • ߹ܭ1Tbps, 3ӡ༻ɺinter/intranet, ༻్ɿblob,
table/queue, storage 14 %ileతʹࠔΔγφϦΦ΄΅ແ͍ɻ <- 50ms <- 200ms <- max 2s req/5min@test tenant ฏۉՔಇ99.95% Muxߴෛՙ ʢSYN- fl oodʣ NW ޡݕ <- 75ms@50%ile <- max 2s ςφϯτɾMuxͷنʹґଘɻ SLAʹऩ·͍ͬͯΔɻ S-NATͷϦΫΤετ࣌ؒ Մ༻ੑ con fi gྃ࣌ؒ
5. MEASUREMENTS Real World Data (2/2) 15 800Mbps (220Kpps) /
core ॲཧ͕͔֬ʹECMP͞Ε͍ͯΔ ߹ܭ33.6Gbps: 2.4Gbps*14 Mux14ͷଳҬͱෛՙঢ়گ (25%)
6. OPERATIONAL EXPERIENCE • 3ؒΫϥυͰӡ༻ • HW LBʹ”ݟΓΛ͚ͭͨ”ཧ༝ɿDoS߈ܸରԠ͕Ͱ͖ͳ͍ɺྗੑ (elasticity)͕ͳ͍ɺଳҬ૿ՃɾՁ֨ѹྗʹݟ߹Θͳ͍ •
SW LBͷى͖࣮ͨࡍʹى͖ͨͱ՝ • AM dual primary: ݹ͍primaryػ͔ΒMuxϦΫΤετɺMuxଆ͜ΕΛڋ൱ɻ • Muxଆ͕ϦΫΤετڋ൱Λͨ͠ΒτϥϯβΫγϣϯΛ࣮ߦɺͰղܾ • IP-in-IPͷͨΊMTUมߋ, HA͕MSSௐ͢Δ͕ͣԿނ͔֎ΕͯMTU͑Ͱdrop • ͋ΔϗʔϜϧʔλʹMSS͕ fi x͞ΕΔόά • ͋ΔϞόΠϧOSͷTCPόάͰTCP࠶ଓ࣌ʹϑϧαΠζͷηάϝϯτΛͦͷ··͏όά • NWશମͷMTUΛ্͛ͨ • BGPͱLB͕ಉډ͍ͯ͠ΔͷͰɺଳҬ͋;ΕΔͱڞΕɻ͔͠1མͪΔͱτϥϑΟοΫ͕دΔͷͰ࿈োͷՄೳੑ • BGP/LBͰI/FΛ͚ΔɾϧʔλଆͰτϥϑΟοΫϨʔτΛߜΔɻBGP/LBಉډͷ΄͏͕ઃܭ͕γϯϓϧ • HW LBͷΞΠυϧίωΫγϣϯλΠϜΞτʢ̒̌ඵʣ • SW LBͰstateᷓΕରࡦʢDoSରࡦʣΛҾ͖ܧ͍ͩɻϞόΠϧ௨৴ுΓͬͺͳ͕͠ଟ͍ -> ͦͦVIP mapping͕͋ΔͷͰstateΛ࡞Βͳͯ͘ྑ͍->ແࣄλΠϜΞτΛ͘ Ͱ͖ͨ 16
7. RELATED WORK • HW LBεέʔϧΞοϓܕʢ1+1ܕʣ • धཁʹԠͨ͡εέʔϧΞοϓɾμϯ͕Ͱ͖ͳ͍ • ΫϥυڥͰՔಇཁ݅ͷͨΊN+1ͷੑ͕ඞཁ
• ԾΞϓϥΠΞϯεOSS (HAProxyͳͲ) • N+1͕Ͱ͖ͳ͍ɻNWো࣌εϖΞIP (I/F)Λ͏ҝL2υϝΠϯ੍ݶ • 1VIPΛεέʔϧͰ͖ͳ͍ • Embrace: ϗετଆͰಈ࡞ɺEgiΒ/RouteBricks: ίϞσΟςΟHWͰߴੑೳϧʔλΛ࣮ݱ • ETTM: શͯͷΤϯυϗετ͕ύέοτॲཧɻAnantaLB͚ͩઐ༻ͷαʔόɻ 17
8. CONCLUSION • Ananta • ࢄܕL4LB/NAT • Ϛϧνςφϯτɺߴ৴པੑɺӡ༻ͷ(Azureͷ)ཁ݅Λຬͨ͢Α͏ʹઃܭ • AzureҎ֎Ͱʹཱͭϋζ
• େن༻్Ͱίετʹݟ߹͏ɾscale-outͰ͖Δઃܭ͕ඞཁ • ECMP, BGP, DSR, Fastpath, HostଆNAT, rate limit • LB100ɺ10ສਓҎ্ʹVIPαʔϏεΛఏڙ 18
3ߦ·ͱΊ • ઃܭࢥMaglev (google)VPPLB (YNWLB2)ͱಉ͡ • BGP/ECMP/Consistent-hash/L4LB/DSR • εέʔϧͤ͞ΔͨΊͷ •
LBͰඞཁͳॲཧʢNATॲཧɾϔϧενΣοΫͳͲʣΛHVଆʹΦϑϩʔυ͢Δ • 1%ϧʔϧʢLBϊʔυΫϥελͷαʔόͷ1%·Ͱʣ • LBઃఆมߋരʢ75msʣ • Fastpath (్த͔ΒLBΛհ͞ͳ͍௨৴ʹΓସ͑Δ)ͰߋʹޮԽ 19
EoP 20