Evolution of Cloud Block Store

What's the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store
Weidong Zhang, et al.
Presented by Andrey Satarin (@asatarin)
May 2024
https://asatarin.github.io/talks/2024-05-evolution-of-cloud-block-store/

Outline
• EBS Evolution from EBS1 to EBS3 / EBSX
• Elasticity in latency and throughput
• Availability
• Conclusions and references

Cloud Block Store aka Elastic Block Store

Cloud Block Store
• Persistent Virtual Disk (VD) in the cloud
• Can attach to a virtual machine
• Can scale IOPS / throughput / capacity in a wide range

EBS Architecture Evolution

Timeline (EBS1 + EBS2)
2012: EBS1 (TCP / HDD)
2015: EBS2 (Luna + RDMA / SSD)
2016: Background Erasure Coding / Compression

Timeline (EBS3 + EBSX)
2019: EBS3 (Solar + RDMA / SSD)
2020: Foreground Erasure Coding / Compression
2021: AutoPerformanceLevel (AutoPL)
2021: Logical Failure Domain
2022: EBSX (One Hop Solar / PMem + SSD)
2024: Federated Block Manager

EBS1: An Initial Foray

EBS1

EBS1 Architecture
• BlockManager (Paxos) maintains metadata about Virtual Disk (VD)
• BlockClient caches VD to block mappings
• Data abstraction of chunk: 64 MB of data
• ChunkManager (Paxos) stores metadata about chunks
• 3-way replicated on top of local Ext4 file system
• In-place updates
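The chunk abstraction above can be illustrated with a minimal sketch of how a byte range on a VD might be split into per-chunk extents. The names `ChunkExtent` and `split_request` are hypothetical, and the chunk size is assumed to be 64 MiB (binary megabytes); the slides do not show the real addressing code.

```python
from dataclasses import dataclass

# Assumption: a chunk is 64 MiB; the slide only says "64" megabytes.
CHUNK_SIZE = 64 * 1024 * 1024


@dataclass(frozen=True)
class ChunkExtent:
    chunk_index: int  # which chunk of the VD the bytes fall into
    offset: int       # byte offset inside that chunk
    length: int       # number of request bytes inside that chunk


def split_request(vd_offset: int, length: int) -> list[ChunkExtent]:
    """Split a VD byte range into one extent per chunk it touches."""
    if vd_offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    extents = []
    pos, end = vd_offset, vd_offset + length
    while pos < end:
        index, offset = divmod(pos, CHUNK_SIZE)
        take = min(CHUNK_SIZE - offset, end - pos)
        extents.append(ChunkExtent(index, offset, take))
        pos += take
    return extents
```

A request that straddles a chunk boundary yields two extents, which a client would then route according to its cached mappings.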

EBS1 Limitations
• 3x space overhead due to replication
• Limits in performance and efficiency
• VD performance is bound by a single BlockServer performance
• Might suffer from hotspots
• Hard to quantify and guarantee SLO with HDD and kernel TCP/IP

EBS2: Speedup with Space Efficiency

EBS2 Overview
• Does not directly handle persistence or consensus
• Built on top of distributed file system Pangu
• Log-Structured design of BlockServers translates writes into appends
• Traffic split into frontend (client I/O) and backend (GC, compression)
• Failover at the granularity of a segment instead of VD

EBS2

Disk

Log-Structured Block Device (LSBD)

Garbage Collection
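The log-structured idea (writes become appends, with garbage collection reclaiming overwritten records) can be sketched in a few lines. This is a toy in-memory model, not the LSBD from the paper: the class name, the per-LBA index, and the GC policy are all assumptions made for illustration.

```python
class LogStructuredDevice:
    """Toy log-structured block device: every write is an append."""

    def __init__(self):
        self.log = []    # append-only list of (lba, data) records
        self.index = {}  # lba -> position of the newest record in the log

    def write(self, lba: int, data: bytes) -> None:
        # Overwrites never touch old records; they only move the index.
        self.index[lba] = len(self.log)
        self.log.append((lba, data))

    def read(self, lba: int):
        pos = self.index.get(lba)
        return None if pos is None else self.log[pos][1]

    def garbage_ratio(self) -> float:
        """Fraction of log records that are stale (overwritten)."""
        if not self.log:
            return 0.0
        return 1 - len(self.index) / len(self.log)

    def collect_garbage(self) -> None:
        """Rewrite only live records into a fresh log, keeping log order."""
        live = sorted(self.index.items(), key=lambda kv: kv[1])
        new_log = [(lba, self.log[pos][1]) for lba, pos in live]
        self.log = new_log
        self.index = {lba: i for i, (lba, _) in enumerate(new_log)}
```

In this model garbage collection is pure background work: it reads and rewrites live data without changing what `read` returns, which is why the slides classify it as backend traffic.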

EBS2 by the Numbers
• Max IOPS 1M (10**6), 50x compared to EBS1
• Max throughput 4000 MiB/s, 13x compared to EBS1
• Heavy network amplification of 4.69x, compared to 3x in EBS1
• Average space amplification of 1.29x, compared to 3x in EBS1


EBS3: Reducing Network Amplification

EBS3 Overview
• Adds compression and Erasure Coding (EC) on the write path
• Batches small writes with Fusion Write Engine (FWE)
• Uses FPGA to offload compression from CPU
• Network amplification ~1.59x (drops from 4.69x)
• Space amplification ~0.77x
• 7.3 GiB/s throughput per card
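The write path above (batch small writes, compress, then erasure-code) can be sketched as follows. Everything here is a stand-in chosen for illustration: `zlib` replaces the FPGA compressor, a single XOR parity shard replaces the real erasure code, and `k=4` is an arbitrary parameter, not a value from the paper.

```python
import zlib


def xor_parity(shards: list[bytes]) -> bytes:
    """XOR equal-length shards together (single-parity stand-in for EC)."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return bytes(parity)


def fuse_and_encode(writes: list[bytes], k: int = 4):
    """Fuse small writes, compress once, split into k data shards + 1 parity."""
    fused = b"".join(writes)
    compressed = zlib.compress(fused)
    shard_len = -(-len(compressed) // k)  # ceiling division
    padded = compressed.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    shards.append(xor_parity(shards))
    return shards, len(compressed)


def recover(shards: list, missing: int, compressed_len: int) -> bytes:
    """Rebuild the fused payload when exactly one shard (index `missing`) is lost."""
    present = [s for i, s in enumerate(shards) if i != missing]
    data = list(shards[:-1])
    if missing < len(data):
        data[missing] = xor_parity(present)
    return zlib.decompress(b"".join(data)[:compressed_len])


def network_amplification(writes: list[bytes], shards: list[bytes]) -> float:
    """Bytes sent over the network divided by bytes the user wrote."""
    return sum(len(s) for s in shards) / sum(len(w) for w in writes)
```

With compressible data the amplification in this sketch drops below the `(k + 1) / k` overhead of the parity scheme, which is the same direction of effect the slide reports for compression plus EC on the write path.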

Write

Write

EBS3: Evaluation
• 4,000 MiB/s throughput and 1M IOPS per VD, which is 13x and 50x higher than EBS1
• Huge performance improvements over EBS1 in FIO microbenchmark, RocksDB with YCSB and MySQL with Sysbench application workloads

EBS3: Elasticity

Elasticity: 4 Metrics
• Latency, both average and 99.999th %ile
• Throughput and IOPS
• Capacity
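To make the difference between the two latency metrics concrete, here is a small sketch using the nearest-rank percentile definition. The definition and the sample values are assumptions for illustration; the slides do not say how the percentile is computed.

```python
import math
from fractions import Fraction


def percentile(samples: list, pct) -> float:
    """Nearest-rank percentile; pct is e.g. 50 or 99.999."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Exact rational arithmetic avoids float rounding at 99.999.
    rank = math.ceil(Fraction(str(pct)) / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def mean(samples: list) -> float:
    return sum(samples) / len(samples)


# Hypothetical latencies in microseconds: two slow requests out of 100,000.
latencies = [100] * 99_998 + [5000] * 2
```

In this made-up sample the two slow requests barely move the average but fully determine the 99.999th percentile, which is why the talk treats the two metrics separately.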

Elasticity: Latency

Latency: Average and 99.999th %ile

Latency: Average

Average Latency: EBSX
• Average latency is spent mostly in the hardware (network + disk)
• Developed EBSX: storing data in PMem and skipping 2nd hop to Pangu
• Data in PMem eventually flushes to Pangu

Latency: Average

Latency: 99.999th %ile

Tail Latency (99.999th %ile)
Main causes:
• Contention with background tasks (scrubbing, compaction)
• Non-IO RPC destruction in IO thread
Solutions:
• Move background tasks to a separate thread
• Speculative retry to another replica
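Speculative retry can be sketched with `asyncio`: if the first replica has not answered within a short delay, the same read is issued to the next replica and whichever finishes first wins. The function name, the `hedge_after` parameter, and the replica callables are hypothetical; error handling is deliberately left out.

```python
import asyncio


async def read_with_speculative_retry(replicas, hedge_after: float):
    """Return the first completed result, hedging to the next replica on delay.

    replicas: list of zero-argument coroutine functions, one per replica.
    hedge_after: seconds to wait before also asking the next replica.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    tasks = []
    try:
        for read in replicas:
            tasks.append(asyncio.create_task(read()))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_after, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
        # All replicas have been asked; wait for whichever answers first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished
```

The trade-off is extra load: every hedged request costs a second read, so the delay is usually chosen so that only the slowest fraction of requests triggers a retry.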

Latency: 99.999th %ile

Elasticity: Throughput and IOPS

Throughput and IOPS: BlockClient
• Move IO processing to the user space
• Offload IO to FPGA: bypass CPU, CRC calculations, packet transmissions
• 2x100G network shifts bottleneck to PCIe bandwidth

Throughput and IOPS: BlockServer
• Reducing data sector size to 128 KiB allows 1000 IOPS per 1 GB (parallelism)
• Base+Burst strategy:
  • Priority-based congestion control (Base/Burst priority)
  • Server-wide dynamic resource allocation
  • Cluster-wide hot spot mitigation
• Max Base capacity 50K IOPS, max Burst 1M IOPS
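The Base+Burst idea can be sketched as a two-pass allocation on one server: base demand is served first, and spare capacity is then shared among VDs that want to burst. The function `allocate_iops`, the proportional sharing rule, and the per-server capacity are assumptions for illustration; only the 50K base and 1M burst limits come from the slide.

```python
MAX_BASE_IOPS = 50_000      # from the slide: max Base capacity
MAX_BURST_IOPS = 1_000_000  # from the slide: max Burst


def allocate_iops(server_capacity: int, demands: dict) -> dict:
    """Grant IOPS per VD: base priority first, then burst from spare capacity.

    demands maps a VD name to (demanded_iops, base_iops).
    """
    granted = {}
    # Pass 1: base-priority traffic is always served, up to each VD's base.
    for vd, (demand, base) in demands.items():
        granted[vd] = min(demand, base, MAX_BASE_IOPS)
    spare = server_capacity - sum(granted.values())
    if spare < 0:
        raise ValueError("base reservations exceed server capacity")
    # Pass 2: burst-priority traffic shares whatever capacity is left.
    unmet = {
        vd: max(0, min(demand, MAX_BURST_IOPS) - granted[vd])
        for vd, (demand, _) in demands.items()
    }
    total_unmet = sum(unmet.values())
    for vd, extra in unmet.items():
        if total_unmet <= spare:
            granted[vd] += extra
        else:
            # Integer proportional share of the spare capacity.
            granted[vd] += extra * spare // total_unmet
    return granted
```

In this sketch a bursting VD can never take capacity away from another VD's base allocation, which mirrors the priority split between Base and Burst traffic described above.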

Availability

Availability: Blast Radius
• Global, e.g. abnormal behavior of BlockManager
• Regional: several VDs, e.g. BlockServer crash. More severe in EBS2 / EBS3 since BlockServer is responsible for more VDs
• Individual: single VD. Can cascade into a regional event, e.g. “poison pill”

Availability: Control Plane

Federated BlockManager

Federated BlockManager
• CentralManager manages other BlockManagers
• Each BlockManager manages hundreds of VD-level partitions
• On BlockManager failure partitions are redistributed
Compare to Millions of Tiny Databases / AWS Physalia.
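Partition redistribution after a BlockManager failure can be sketched as a simple least-loaded reassignment. The function name and the balancing rule are assumptions; the slides only state that partitions are redistributed.

```python
def redistribute(assignment: dict, failed: str) -> dict:
    """Move the failed manager's partitions to the least-loaded survivors.

    assignment maps a BlockManager name to the set of partitions it owns.
    Returns a new assignment that no longer contains the failed manager.
    """
    survivors = {m: set(p) for m, p in assignment.items() if m != failed}
    if not survivors:
        raise RuntimeError("no surviving BlockManager to take over partitions")
    for partition in sorted(assignment.get(failed, ())):
        # Pick the survivor with the fewest partitions; break ties by name.
        target = min(survivors, key=lambda m: (len(survivors[m]), m))
        survivors[target].add(partition)
    return survivors
```

Because each manager owns only a slice of the partitions, a single failure in this model affects only that slice until it is reassigned, which is the blast-radius argument for federation.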

Availability: Data Plane

Logical Failure Domain
• Addresses “poison pill” problem in software. Core idea is to isolate suspicious segments into a small number of BlockServers
• Token bucket algorithm for segment migration. Capacity 3, +1 token every 30 minutes
• Once tokens are depleted, a segment only migrates to a fixed small (3 nodes) subset of BlockServers, the “Logical Failure Domain”
• Future failure domains merge into one
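The token bucket rule above can be sketched as follows. The capacity of 3 and the 30-minute refill come from the slide; the class and method names are hypothetical, and the assumption that one bucket tracks one segment is made for illustration.

```python
import random


class SegmentMigrationLimiter:
    """Token bucket deciding where a (possibly poisonous) segment may migrate."""

    CAPACITY = 3                 # from the slide
    REFILL_SECONDS = 30 * 60     # from the slide: +1 token every 30 minutes

    def __init__(self, now: float = 0.0):
        self.tokens = self.CAPACITY
        self.last_refill = now

    def _refill(self, now: float) -> None:
        earned = int((now - self.last_refill) // self.REFILL_SECONDS)
        if earned > 0:
            self.tokens = min(self.CAPACITY, self.tokens + earned)
            self.last_refill += earned * self.REFILL_SECONDS

    def pick_target(self, now, all_servers, failure_domain, rng=random):
        """Return the BlockServer the segment should migrate to."""
        self._refill(now)
        if self.tokens > 0:
            # Normal failover: any BlockServer in the cluster is allowed.
            self.tokens -= 1
            return rng.choice(all_servers)
        # Tokens depleted: confine the segment to its logical failure domain.
        return rng.choice(failure_domain)
```

A segment that keeps crashing its BlockServer burns through its three tokens quickly and is then pinned to the small domain, so it can take down at most those nodes rather than cascading across the cluster.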

Conclusions

Conclusions
• Evolution of architecture from EBS1 to EBS3 / EBSX
• Discusses lessons, tradeoffs and various design attempts
• Talks about availability, elasticity, hardware offload

References

References
• Self reference for this talk (slides, video, transcript, etc) https://asatarin.github.io/talks/2024-05-evolution-of-cloud-block-store/
• Paper “What’s the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store”
• Millions of Tiny Databases paper

Contacts
• Follow me on Twitter @asatarin
• Follow me on Mastodon https://discuss.systems/@asatarin
• Contact me on LinkedIn https://www.linkedin.com/in/asatarin/
• Watch my public talks https://asatarin.github.io/talks/
• Up-to-date contacts https://asatarin.github.io/about/
