Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
#24 “Ananta: Cloud Scale Load Balancing”
Search
cafenero_777
June 19, 2023
Technology
0
300
#24 “Ananta: Cloud Scale Load Balancing”
ACM SIGCOM ’13
https://dl.acm.org/doi/10.1145/2534169.2486026
cafenero_777
June 19, 2023
Tweet
Share
More Decks by cafenero_777
See All by cafenero_777
#51 “Empowering Azure Storage with RDMA”
cafenero_777
3
510
#49 “Gray Failure: The Achilles’ Heel of Cloud-Scale Systems”
cafenero_777
2
120
#50 “Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction”
cafenero_777
0
140
#33 “Destroying networks for fun (and profit)”
cafenero_777
0
97
#34 “MTPSA: Multi-Tenant Programmable Switches”
cafenero_777
0
68
#37 “Bluebird: High-performance SDN for Bare-metal Cloud Services”
cafenero_777
1
140
#39 “Profiling a warehouse-scale computer”
cafenero_777
0
51
#23 “VFP: A Virtual Switch Platform for Host SDN in the Public Cloud”
cafenero_777
0
240
#25 “Swift: Delay is Simple and Effective for Congestion Control in the Datacenter”
cafenero_777
0
180
Other Decks in Technology
See All in Technology
段階的に進める、 挫折しない自宅サーバ入門
yu_kod
4
1.6k
その意思決定、まだ続けるんですか? ~痛みを超えて未来を作る、AI時代の撤退とピボットの技術~
applism118
43
24k
巨大モノリスのリプレイス──機能整理とハイブリッドアーキテクチャで挑んだ再構築戦略
zozotech
PRO
0
390
AI 時代のデータ戦略
na0
3
600
レガシーシステム刷新における TypeSpec スキーマ駆動開発のすゝめ
tsukuha
4
850
AI駆動開発を実現するためのアーキテクチャと取り組み
baseballyama
17
15k
プラットフォームエンジニアリングとは何であり、なぜプラットフォームエンジニアリングなのか
doublemarket
0
350
AI時代のインシデント対応 〜時代を切り抜ける、組織アーキテクチャ〜
jacopen
4
170
Contract One Engineering Unit 紹介資料
sansan33
PRO
0
9.7k
リアーキテクティングのその先へ 〜品質と開発生産性の壁を越えるプラットフォーム戦略〜 / architecture-con2025
visional_engineering_and_design
0
8.5k
確実に伝えるHealth通知 〜半自動システムでほどよく漏れなく / JAWS-UG 神戸 #9 神戸へようこそ!LT会
genda
0
160
事業状況で変化する最適解。進化し続ける開発組織とアーキテクチャ
caddi_eng
1
8.9k
Featured
See All Featured
GraphQLの誤解/rethinking-graphql
sonatard
73
11k
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
508
140k
Mobile First: as difficult as doing things right
swwweet
225
10k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.5k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.6k
Fireside Chat
paigeccino
41
3.7k
Balancing Empowerment & Direction
lara
5
770
Product Roadmaps are Hard
iamctodd
PRO
55
12k
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.7k
The Language of Interfaces
destraynor
162
25k
Optimising Largest Contentful Paint
csswizardry
37
3.5k
Reflections from 52 weeks, 52 projects
jeffersonlam
355
21k
Transcript
Research Paper Introduction #24 “Ananta: Cloud Scale Load Balancing” ௨ࢉ#75
@cafenero_777 2021/06/24 1
Agenda • ରจ • ֓ཁͱಡ͏ͱͨ͠ཧ༝ 1. INTRODUCTION 2. BACKGROUND 3.
DESIGN 4. IMPLEMENTATION 5. MEASUREMENTS 6. OPERATIONAL EXPERIENCE 7. RELATED WORK 8. CONCLUSION 2
ରจ • Ananta: Cloud Scale Load Balancing • Parveen Patel,
Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, Changhoon Kim, Naveen Karri • Microsoft • ACM SIGCOM ’13 • https://dl.acm.org/doi/10.1145/2534169.2486026 3
֓ཁͱಡ͏ͱͨ͠ཧ༝ • ֓ཁ • Ananta: Scalable L4LB (DSR/NAT) • ୯ҰVIPͰ100Gbps,
߹ܭͰ1TbpsҎ্ͷଳҬ෯ • Azure্Ͱಈ࡞ • ಡ͏ͱͨ͠ཧ༝ͱײ • AzureͷVFPจʹҾ༻ • ଞͷLB (Maglevͱ͔)Ͱ݁ߏҾ༻͞Ε͍ͯͨͷͰɻ 4 https://www.connectedpapers.com/main/5c295df1a7f302c97f6f379eab6abba592811d42/Ananta-cloud-scale-load-balancing/graph Ananta Maglev SilkRoad Beamer Faild Middleboxܥ ࢄɾߴޮܥ
1. Introduction • ΫϥυίϯϐϡʔςΟϯάͷීٴ • ߴ͍Քಇ (SLA), ϚϧνςφϯτɺେنτϥϑΟοΫ • 1VIP
100Gbps, 1000host/VIP, 6~60ճૢ࡞/1min • શSLAҧ/োͷ36%LBؔ࿈ • Ananta (αϯεΫϦοτޠͰແݶ) • Scalable L4 LB (NAT/DSR) • D-plane: ECMP (in NW), LB, NAT (in VFP/HV) • C-plane: SDN/Paxos, S-NAT࿈ܞ • 2011/09ʹAzureʹ100ಋೖ, 1Tbps, 100k VIPs • L4LB@Cloud, NWࢄγεςϜͱsclaingʹ͍ͭͯɺଌఆ݁Ռͱӡ༻݁ՌΛհ 5
2. BACKGROUND • Data Center Clos NW: 10Gαʔό*40kɺoversub 1:4, 400Gbps@Border
• VIPτϥϑΟοΫͷੑ࣭ • 44%VIPτϥϑΟοΫʢDC:DC֎=2:1ʣ • DCؒin/out1:1, σʔλಉظܥ • ཁ݅·ͱΊ 1. “Scale, Scale and Scale”: • ίετʢαʔόίετͷ1%, 400ଟ͗͢ɻʣ • ࠷େ1VIP 100Gbps & 1M current-conn, 100ճઃఆมߋ/ 2. ৴པੑ: N+1ߏͰͷࣗಈճ෮ɺϝϯςରԠ 3. “Any Service Anywhere”: L2υϝΠϯ੍ݶʹറΒΕͳ͍Α͏ʹ͢Δ 4. ςφϯτ: LBڞ༗ʹΑΔDoSӨڹʢଞͷސ٬͕ଳҬΛୣΘΕΔʣͷରࡦ 6 400Gbps 100Tbps
3. DESIGN (1/5) Principles & Architecture 7 • Scale outͰ͖ΔΑ͏ʹ
• RouterͷΑ͏ʹϑϩʔҡ࣋ػߏΛ࣋ͨͳ͍Α͏ʹ͢Δ • શಉظ͕ඞཁͳͷΘͳ͍ • i.e. WRR (Weighted Round Robin)ͱWeighted Random • ͍͠ॲཧHVଆͰΔʢΦϑϩʔυ͢Δʣ • ACL, Rate Limit, Metering • Ananta Manager (AM), Multiplexer (Mux), Host Agent (HA) • Inbound: IP-in-IP, NAT and DSR • Outbound: DIP->VIPVIP:sportͷmappingΛMUXͱHAͰಉظ͓ͯ͘͠ VIPใ/ECMP Selection/IP-in-IP L3 Routing Decap/DNAT NAT͠ DSR (Encapͳ͠) sportͱVIPΛཁٻ sportͱVIPΛઃఆ dportͱVIPͰ VMʹৼΓ͚
3. DESIGN (2/5) Principles & Architecture 8 • Fastpath: VIP
to VIP௨৴ɿLBΛbypath͠ɺHVؒͰ௨৴ͤ͞Δ • ࠷ॳLBΛ௨ͯ͠௨৴ • 3WHSྃ͢ΔͱDIP mappingใΛϦμΠϨΫτ • HA͕௨৴ͤ͞Δ • ҎޙLBΛ௨Βͳ͍ • ͬऔΓରࡦඞཁ ͜ͷ௨৴DIP2ͱmapping͞ΕͯΔΑ DIP1ඥ͚ DIP1/DIP2௨৴
3. DESIGN (3/5) Mux/Host Agent 9 • Mux Pool (Muxͷηοτ)
• Mux: BGP Speaker: VIPΛใɻো࣌ܦ࿏ॖୀɻTCP MD5ೝূ • AM͕VIP/DIP mappingΛMuxʹσϓϩΠɻ5tupleͰselection, hashؔɾseedશMuxͰڞ௨ʢECMPͰͲͷMuxʹ౸ୡͯ͠ಉ͡ॲཧΛอূʣ • ҰmappingΛࢀর͞ΕΔͱϑϩʔΛอ࣋ɻͨͩ͠ϝϞϦׂྔผʢSYN-Flood߈ܸରࡦʣ • ৴པͰ͖Δϑϩʔɿෳύέοτ->timeoutΊʹ͢Δ • ৴པͰ͖ͳ͍ϑϩʔɿ1ύέοτ->timeoutΊʹ͢Δ • Mux͕μϯ͢ΔͱECMPΨϥΨϥϙϯ • μϯதʹmappingมߋ͞ΕΔͱϑϩʔҡ࣋Ͱ͖ͳ͍ ->DHT (Distributed hash table)Λར༻ • Host Agent: શHV্ʹଘࡏɺFastpath, NAT, Health checkΛߦ͏ ʢP.8ͷઆ໌ʣ • ϙʔτͷ࠶ར༻ػೳ • Health checkMuxͰͳ͘HAଆͰΔɻ
3. DESIGN (4/5) Ananta Manager/Tenant Isolation 10 • Ananta Manager
(AM) • Paxosϕʔεͷࢄίϯτϩʔϥ • 5ϨϓϦΧͰՔಇɺ3ϨϓϦΧҎ্Ͱਖ਼ৗॲཧ • S-NAT: portׂΛόϧΫॲཧ • ςφϯτ • Muxຖʹಠཱ֤ͯ͠ςφϯτΞΠιϨʔγϣϯΛ࣮͢Εྑ͍ • AM: ཁٻFCFS(ઌணॱ: fi rst-come- fi rst-serve)͞ΕΔɻ͔ͭɺಉ͡Α͏ͳ৽نϦΫΤετऔΓԼ͛ɻ(2) • Mux: దͳଳҬ෯Λ͑ͨ߹ɺաଳҬʹൺྫͨ֬͠Ͱdrop and rate limit͢Δ • Top talker(Ұ൪௨৴͍ͯ͠Δ) VIPΛMux͔ΒҠಈͤ͞Δ
3. DESIGN (5/5) Alternatives 11 • DNS-based LB • ෛՙࢄͷࣄલ༧ଌ͕͍͠ʢClient͔ΒͷϦΫΤετ͕ภΔʣ???
• DNSΩϟογϡফ͑Δ·Ͱ͕͔͔࣌ؒΔ • stateful (NATͳͲ)͕Ͱ͖ͳ͍ • OpenFlow-based LB • ࢢൢOpenFlowσόΠεͰ2-4kϑϩʔ·ͰʢMux~Mϑϩʔঢ়ଶΛอ͍࣋ͨ͠ʣ • ςφϯτͷػೳ • BGPใͰ͖ͳ͍ʢAMʹͤΔʁʣ
4. IMPLEMENTATION • AM: Ԡੑॏཁ • SEDA (Staged event-driven Arch.)తͳϩοΫϑϦʔઃܭ
• thread poolڞ༗ʢ૯੍ݶʣ • ༏ઌʢྫɿVIP࡞༏ઌʣ • Paxos SDK + Discovery + Health MonitoringͰ࣮ • ϓϥΠϚϦ͕ॲཧΛߦ͏͜ͱΛอূ • upgrade࣌ʹAMΠϯελϯε͕1ͭҎ্མͪͳ͍͜ͱΛอূ • Mux: ΧʔωϧʢυϥΠόʣͰͷύέοτॲཧ + ϢʔβϞʔυͷBGPॲཧ • ΧʔωϧػೳΛͦͷ··͏: IPIP/RSS/IPv6 etc • 1VIPͰ20k DIP, 1.6M SNAT port mapping. ~Mͷಉ࣌ίωΫγϣϯใΛอ࣋ 12 *5 *8 *all
5. MEASUREMENTS Micro-benchmark 13 10VM * 2 tenantͰ1MB௨৴/connection ͔ᷮʹHostෛՙ͕૿͑Δ͕ɺMuxෛՙେ෯ʹԼΔ 10VM
* 5 tenant (baseτϥϑΟοΫ+SYN- fl ood * 10ճ) தʙߴෛՙͰDoSͷݟ͚͕ͭ͘ʹ͘͘ͳΔɻ ΄΅શͯ75msҎʹऩ·Δ ϙʔτ֬อͰ͖ͳ͍߹Ճ͕࣌ؒlong-tailͰ͔͔Δ Fastpath༗Γແ͠ͰͷCPUෛՙൺֱ SYN- fl ood Attack Mitigation S-NAT·ͰͷϨΠςϯγʔ
5. MEASUREMENTS Real World Data (1/2) • ߹ܭ1Tbps, 3ӡ༻ɺinter/intranet, ༻్ɿblob,
table/queue, storage 14 %ileతʹࠔΔγφϦΦ΄΅ແ͍ɻ <- 50ms <- 200ms <- max 2s req/5min@test tenant ฏۉՔಇ99.95% Muxߴෛՙ ʢSYN- fl oodʣ NW ޡݕ <- 75ms@50%ile <- max 2s ςφϯτɾMuxͷنʹґଘɻ SLAʹऩ·͍ͬͯΔɻ S-NATͷϦΫΤετ࣌ؒ Մ༻ੑ con fi gྃ࣌ؒ
5. MEASUREMENTS Real World Data (2/2) 15 800Mbps (220Kpps) /
core ॲཧ͕͔֬ʹECMP͞Ε͍ͯΔ ߹ܭ33.6Gbps: 2.4Gbps*14 Mux14ͷଳҬͱෛՙঢ়گ (25%)
6. OPERATIONAL EXPERIENCE • 3ؒΫϥυͰӡ༻ • HW LBʹ”ݟΓΛ͚ͭͨ”ཧ༝ɿDoS߈ܸରԠ͕Ͱ͖ͳ͍ɺྗੑ (elasticity)͕ͳ͍ɺଳҬ૿ՃɾՁ֨ѹྗʹݟ߹Θͳ͍ •
SW LBͷى͖࣮ͨࡍʹى͖ͨͱ՝ • AM dual primary: ݹ͍primaryػ͔ΒMuxϦΫΤετɺMuxଆ͜ΕΛڋ൱ɻ • Muxଆ͕ϦΫΤετڋ൱Λͨ͠ΒτϥϯβΫγϣϯΛ࣮ߦɺͰղܾ • IP-in-IPͷͨΊMTUมߋ, HA͕MSSௐ͢Δ͕ͣԿނ͔֎ΕͯMTU͑Ͱdrop • ͋ΔϗʔϜϧʔλʹMSS͕ fi x͞ΕΔόά • ͋ΔϞόΠϧOSͷTCPόάͰTCP࠶ଓ࣌ʹϑϧαΠζͷηάϝϯτΛͦͷ··͏όά • NWશମͷMTUΛ্͛ͨ • BGPͱLB͕ಉډ͍ͯ͠ΔͷͰɺଳҬ͋;ΕΔͱڞΕɻ͔͠1མͪΔͱτϥϑΟοΫ͕دΔͷͰ࿈োͷՄೳੑ • BGP/LBͰI/FΛ͚ΔɾϧʔλଆͰτϥϑΟοΫϨʔτΛߜΔɻBGP/LBಉډͷ΄͏͕ઃܭ͕γϯϓϧ • HW LBͷΞΠυϧίωΫγϣϯλΠϜΞτʢ̒̌ඵʣ • SW LBͰstateᷓΕରࡦʢDoSରࡦʣΛҾ͖ܧ͍ͩɻϞόΠϧ௨৴ுΓͬͺͳ͕͠ଟ͍ -> ͦͦVIP mapping͕͋ΔͷͰstateΛ࡞Βͳͯ͘ྑ͍->ແࣄλΠϜΞτΛ͘ Ͱ͖ͨ 16
7. RELATED WORK • HW LBεέʔϧΞοϓܕʢ1+1ܕʣ • धཁʹԠͨ͡εέʔϧΞοϓɾμϯ͕Ͱ͖ͳ͍ • ΫϥυڥͰՔಇཁ݅ͷͨΊN+1ͷੑ͕ඞཁ
• ԾΞϓϥΠΞϯεOSS (HAProxyͳͲ) • N+1͕Ͱ͖ͳ͍ɻNWো࣌εϖΞIP (I/F)Λ͏ҝL2υϝΠϯ੍ݶ • 1VIPΛεέʔϧͰ͖ͳ͍ • Embrace: ϗετଆͰಈ࡞ɺEgiΒ/RouteBricks: ίϞσΟςΟHWͰߴੑೳϧʔλΛ࣮ݱ • ETTM: શͯͷΤϯυϗετ͕ύέοτॲཧɻAnantaLB͚ͩઐ༻ͷαʔόɻ 17
8. CONCLUSION • Ananta • ࢄܕL4LB/NAT • Ϛϧνςφϯτɺߴ৴པੑɺӡ༻ͷ(Azureͷ)ཁ݅Λຬͨ͢Α͏ʹઃܭ • AzureҎ֎Ͱʹཱͭϋζ
• େن༻్Ͱίετʹݟ߹͏ɾscale-outͰ͖Δઃܭ͕ඞཁ • ECMP, BGP, DSR, Fastpath, HostଆNAT, rate limit • LB100ɺ10ສਓҎ্ʹVIPαʔϏεΛఏڙ 18
3ߦ·ͱΊ • ઃܭࢥMaglev (google)VPPLB (YNWLB2)ͱಉ͡ • BGP/ECMP/Consistent-hash/L4LB/DSR • εέʔϧͤ͞ΔͨΊͷ •
LBͰඞཁͳॲཧʢNATॲཧɾϔϧενΣοΫͳͲʣΛHVଆʹΦϑϩʔυ͢Δ • 1%ϧʔϧʢLBϊʔυΫϥελͷαʔόͷ1%·Ͱʣ • LBઃఆมߋരʢ75msʣ • Fastpath (్த͔ΒLBΛհ͞ͳ͍௨৴ʹΓସ͑Δ)ͰߋʹޮԽ 19
EoP 20