Building Cloud Block Store Weidong Zhang, et al. Presented by Andrey Satarin (@asatarin) May 2024 https://asatarin.github.io/talks/2024-05-evolution-of-cloud-block-store/
(VD)
• BlockClient caches VD to block mappings
• Data abstraction of chunk: 64 MB of data
• ChunkManager (Paxos) stores metadata about chunks
• 3-way replicated on top of local Ext4 file system
• In-place updates
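The chunk abstraction and the client-side cache above can be illustrated with a minimal sketch. It assumes the chunk size is 64 MiB (binary units) and that the cache is keyed by VD; `locate`, `MappingCache`, and the `lookup` callback are hypothetical names for illustration, not the actual EBS interfaces.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumption: the 64 MB chunk is read as 64 MiB


def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a virtual disk to (chunk index, offset inside the chunk)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, CHUNK_SIZE)


class MappingCache:
    """Hypothetical client-side cache: VD id -> whatever the metadata lookup returns."""

    def __init__(self, lookup):
        self._lookup = lookup  # stands in for a remote metadata query
        self._cache = {}
        self.misses = 0

    def resolve(self, vd_id):
        # Only the first request for a VD pays for the remote lookup.
        if vd_id not in self._cache:
            self.misses += 1
            self._cache[vd_id] = self._lookup(vd_id)
        return self._cache[vd_id]
```

With this sketch, `locate(CHUNK_SIZE + 5)` lands at byte 5 of the second chunk, and repeated `resolve` calls for the same VD reach the lookup only once.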
Limits in performance and efficiency
• VD performance is bound by the performance of a single BlockServer
• Might suffer from hotspots
• Hard to quantify and guarantee SLO with HDD and kernel TCP/IP
• Built on top of the distributed file system Pangu
• Log-structured design of BlockServers translates writes into appends
• Traffic split into frontend (client I/O) and backend (GC, compression)
• Failover at the granularity of a segment instead of a VD
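How a log-structured design turns overwrites into appends, and why that creates backend GC work, can be shown with a toy in-memory sketch. The class and method names are hypothetical; the real BlockServer persists its log in Pangu and is far more involved.

```python
class LogStructuredSegment:
    """Toy append-only segment: every write is appended, an index tracks the newest copy."""

    def __init__(self):
        self.log = []    # append-only list of (lba, data) records
        self.index = {}  # lba -> position of the newest record in the log

    def write(self, lba, data):
        # An overwrite never touches the old record; it appends a new one.
        self.log.append((lba, data))
        self.index[lba] = len(self.log) - 1

    def read(self, lba):
        pos = self.index.get(lba)
        return None if pos is None else self.log[pos][1]

    def garbage_ratio(self):
        # Fraction of log records that are stale (superseded by later writes).
        if not self.log:
            return 0.0
        return 1 - len(self.index) / len(self.log)

    def compact(self):
        # Backend GC: keep only live records, preserving their log order.
        live = sorted(self.index.items(), key=lambda kv: kv[1])
        self.log = [(lba, self.log[pos][1]) for lba, pos in live]
        self.index = {lba: i for i, (lba, _) in enumerate(self.log)}
```

Overwriting the same block leaves a stale record behind until `compact` runs, which is the kind of work the slide assigns to backend traffic.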
50x compared to EBS1
• Max throughput 4000 MiB/s, 13x compared to EBS1
• Heavy network amplification of 4.69x, compared to 3x in EBS1
• Average space amplification of 1.29x, compared to 3x in EBS1
the write path
• Batches small writes with Fusion Write Engine (FWE)
• Uses FPGA to offload compression from CPU
• Network amplification ~1.59x (drops from 4.69x)
• Space amplification ~0.77x
• 7.3 GiB/s throughput per card
VD which is 13x and 50x higher than EBS1
• Huge performance improvements over EBS1 in the FIO microbenchmark and in application workloads (RocksDB with YCSB, MySQL with Sysbench)
tasks (scrubbing, compaction)
• Non-IO RPC destruction in IO thread
Solutions:
• Move background tasks to a separate thread
• Speculative retry to another replica
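Speculative retry can be sketched as a hedged request: send the read to one replica and, if it has not answered within a short delay, send the same read to the next replica and take whichever finishes first. This is a minimal asyncio sketch under assumptions; `speculative_read`, `hedge_delay`, and the replica callables are hypothetical, and a failing replica is not handled separately (its exception propagates).

```python
import asyncio


async def speculative_read(replicas, hedge_delay):
    """replicas: list of zero-argument coroutine functions, tried in order.

    Starts the first replica, then adds one more replica each time
    `hedge_delay` seconds pass without any answer. Returns the first result.
    """
    tasks = [asyncio.create_task(replicas[0]())]
    try:
        for next_replica in replicas[1:]:
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
            # Nobody answered in time: hedge with another replica.
            tasks.append(asyncio.create_task(next_replica()))
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        # Cancel requests that lost the race.
        for task in tasks:
            if not task.done():
                task.cancel()
```

The hedge delay trades extra load for tail latency: a shorter delay issues more duplicate reads but hides slow replicas sooner.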
BlockManager
• Regional: several VDs, e.g. BlockServer crash. More severe in EBS2 / EBS3 since a BlockServer is responsible for more VDs
• Individual: single VD. Can cascade into a regional event, e.g. “poison pill”
manages hundreds of VD-level partitions
• On BlockManager failure, partitions are redistributed
Compare to Millions of Tiny Databases / AWS Physalia.
Core idea is to isolate suspicious segments into a small number of BlockServers
• Token bucket algorithm for segment migration. Capacity 3, +1 token every 30 minutes
• Once tokens are depleted, the segment only migrates to a fixed small subset (3 nodes) of BlockServers, the “Logical Failure Domain”
• Future failure domains merge into one
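The token bucket rule above can be sketched as follows. The sketch assumes one bucket per segment and a failure domain chosen up front as the first three servers; both are illustrative assumptions, and `MigrationLimiter` and `pick_target` are hypothetical names. Merging of failure domains is not modeled.

```python
import random


class MigrationLimiter:
    """Token bucket for one segment: capacity 3, +1 token every 30 minutes."""

    def __init__(self, servers, capacity=3, refill_seconds=30 * 60, domain_size=3):
        self.servers = list(servers)
        self.capacity = capacity
        self.refill_seconds = refill_seconds
        self.tokens = capacity
        self.last_refill = 0.0
        # Assumption: the fixed small subset is simply the first few servers.
        self.failure_domain = self.servers[:domain_size]

    def _refill(self, now):
        earned = int((now - self.last_refill) // self.refill_seconds)
        if earned > 0:
            self.tokens = min(self.capacity, self.tokens + earned)
            self.last_refill += earned * self.refill_seconds

    def pick_target(self, now, rng=random):
        """Return the BlockServer this segment should migrate to at time `now` (seconds)."""
        self._refill(now)
        if self.tokens > 0:
            # Tokens left: migrate anywhere in the cluster.
            self.tokens -= 1
            return rng.choice(self.servers)
        # Tokens depleted: confine the segment to the logical failure domain.
        return rng.choice(self.failure_domain)
```

A segment that keeps crashing its hosts burns through its three tokens quickly and is then pinned to the same three BlockServers, so a “poison pill” cannot walk across the whole cluster.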
etc) https://asatarin.github.io/talks/2024-05-evolution-of-cloud-block-store/
• Paper “What’s the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store”
• Millions of Tiny Databases paper
on Mastodon https://discuss.systems/@asatarin
• Contact me on LinkedIn https://www.linkedin.com/in/asatarin/
• Watch my public talks https://asatarin.github.io/talks/
• Up-to-date contacts https://asatarin.github.io/about/