
WiscKey: Separating Keys from Values in SSD-conscious Storage


WiscKey is an LSM-tree-based key-value store with a performance-oriented data layout that separates keys from values to minimize I/O amplification. The design is highly optimized for SSDs, leveraging both the sequential and the random performance characteristics of these devices. The paper also discusses the challenges this design introduces and proposes feasible solutions to them.

Arjun Sunil Kumar

May 22, 2024



Transcript

  1. Paper: WiscKey: Separating Keys from Values in SSD-Conscious Storage Authors:

    Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison Presenter: Arjun Sunil Kumar
  2. Agenda

    1. Motivation & Use Case; 2. Introduction (a. LSM Tree, b. Write and Read Amplification, c. SSD); 3. WiscKey (a. Key Idea, b. Data Flow, c. Challenges); 4. Variants (a. CouchBase, b. TerarkDB)
  3. Motivation

    - Large write/read amplification in LSM-tree-based key-value stores that hold large values. - LSM-trees are optimized for HDDs; they are not optimal for SSDs.
  4. Use Case: Large Values

    Typically, values are not very large: for big payloads (PDF documents, images) we prefer to store the data in an object store and only the metadata (a URL) in the database. But there are use cases where large values end up in the database itself: - large JSON stored as a BLOB in the database - vector embeddings stored as a BLOB in the database
  5. In-place vs Out-of-place updates

    (Diagram: the same keys (12, 15, 18, 20, 25) held either in a single data structure that is updated in place, or in a collection of data structures that is updated out of place.)
  6. In-place vs Out-of-place updates

    In-place: traverse the tree and update the leaf node in place (might require coarse-grained or fine-grained locking). Out-of-place: to delete 12, insert a tombstone entry for 12 instead of modifying the existing node (a relatively faster update).
  7. In-place vs Out-of-place updates

    In-place: reads are faster because the update has already been applied in place. Out-of-place: reads are slower because reading key 12 must merge its versions across structures (merge-on-read, MoR). A small merge-on-read sketch follows.
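
To make the merge-on-read cost concrete, here is a minimal Go sketch (the run type and get function are illustrative, not from any real store): with out-of-place updates a key can live in several sorted runs, so a read has to consult them newest-first.

package main

import "fmt"

// run is one immutable batch of updates; newer runs shadow older ones.
type run map[int]string

// get scans runs from newest to oldest and returns the first match it finds;
// this is the "merge on read" work that in-place structures avoid.
func get(runs []run, key int) (string, bool) {
	for _, r := range runs { // runs[0] is the newest
		if v, ok := r[key]; ok {
			return v, ok
		}
	}
	return "", false
}

func main() {
	runs := []run{
		{12: "v3"},            // newest run: latest update of key 12
		{12: "v1", 18: "old"}, // older run: stale version of key 12
	}
	v, _ := get(runs, 12)
	fmt.Println(v) // "v3": the newest version wins, but several runs may be consulted
}
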
  8. LSM Tree - Components

    (Diagram: memtable and immutable memtable in memory; WAL and leveled SSTs (L0, L1, …, L7) on disk.)
  9. LSM Tree - Components

    (Same components diagram, build step.)
  10. LSM Tree - Components

    (Same components diagram, build step.)
  11. LSM Tree - Components

    (Same components diagram, build step.)
  12. LSM Tree Ops

    (Diagram: the same memtable/WAL/SST layout, annotated with the three operations covered next: Write, Compaction, Read.)
  13. LSM Tree Ops - Write

    (Diagram, built up across slides 13-16: an incoming <k,v> is first appended to the WAL, then inserted into the memtable; when the memtable fills it becomes an immutable memtable and is flushed to disk as a new L0 SST. A code sketch of this write path follows slide 16.)
  14. LSM Tree Ops - Write

    (Write diagram, build step.)
  15. LSM Tree Ops - Write

    (Write diagram, build step.)
  16. LSM Tree Ops - Write

    (Write diagram, build step: a new L0 SST appears on disk.)
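
A minimal Go sketch of this write path, with illustrative names rather than the actual LevelDB/WiscKey API: append to the WAL, insert into the memtable, and flush a full memtable as a new L0 SST.

package main

import "fmt"

type kv struct{ k, v string }

type lsm struct {
	wal      []kv                // write-ahead log (sequential appends)
	memtable map[string]string   // in-memory sorted buffer (map for brevity)
	l0       []map[string]string // flushed SSTs at level 0
	memLimit int                 // flush threshold
}

func (t *lsm) put(k, v string) {
	t.wal = append(t.wal, kv{k, v})    // 1. durable, sequential WAL append
	t.memtable[k] = v                  // 2. insert into the memtable
	if len(t.memtable) >= t.memLimit { // 3. full memtable becomes immutable
		t.l0 = append(t.l0, t.memtable) //    and is flushed as a new L0 SST
		t.memtable = map[string]string{}
	}
}

func main() {
	t := &lsm{memtable: map[string]string{}, memLimit: 2}
	t.put("a", "1")
	t.put("b", "2") // triggers a flush to L0
	fmt.Println(len(t.l0), "L0 SSTs,", len(t.wal), "WAL records")
}
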
  17. LSM Tree Ops - Compaction

    (Diagram, built up across slides 17-20: compaction merge-sorts overlapping SSTs from adjacent levels into a new SST and then deletes the inputs, reducing the number of files to search. A code sketch follows slide 20.)
  18. LSM Tree Ops - Compaction

    (Compaction diagram, build step.)
  19. LSM Tree Ops - Compaction

    (Compaction diagram, build step: the merged SST is written.)
  20. LSM Tree Ops - Compaction

    (Compaction diagram, build step: the input SSTs are removed.)
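
A minimal Go sketch of the compaction step, under the simplifying assumption that runs fit in memory: entries from the newer (upper-level) run shadow entries from the older (lower-level) run, and the merged output replaces both inputs.

package main

import (
	"fmt"
	"sort"
)

// compact merges an upper-level run into a lower-level run; entries in
// `upper` are newer and therefore shadow entries in `lower`.
func compact(upper, lower map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range lower {
		out[k] = v
	}
	for k, v := range upper { // newer versions overwrite older ones
		out[k] = v
	}
	return out
}

func main() {
	l0 := map[string]string{"a": "new", "c": "3"}
	l1 := map[string]string{"a": "old", "b": "2"}
	merged := compact(l0, l1) // the old l0/l1 files would now be deleted

	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "=", merged[k]) // a=new, b=2, c=3: the stale "a" was dropped
	}
}
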
  21. LSM Tree Ops - Read

    (Diagram, built up across slides 21-23: a read checks the memtable, then the immutable memtable, then the SSTs level by level from L0 downward; key-range metadata such as [1,5] and [3,5] determines which SSTs may hold the key. A code sketch follows slide 23.)
  22. LSM Tree Ops - Read

    (Read diagram, build step.)
  23. LSM Tree Ops - Read

    (Read diagram, build step.)
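
A minimal Go sketch of this read path (types and names are illustrative): check the memtable, then the immutable memtable, then each level's SSTs, skipping SSTs whose key range cannot contain the key.

package main

import "fmt"

type sst struct {
	min, max int            // key range of this SST, e.g. [1,5]
	data     map[int]string // simplified file contents
}

type tree struct {
	memtable, immutable map[int]string
	levels              [][]sst // levels[0] is L0, levels[1] is L1, ...
}

func (t *tree) get(k int) (string, bool) {
	if v, ok := t.memtable[k]; ok {
		return v, true
	}
	if v, ok := t.immutable[k]; ok {
		return v, true
	}
	for _, level := range t.levels { // search newest level first
		for _, s := range level {
			if k < s.min || k > s.max { // key-range check prunes SSTs
				continue
			}
			if v, ok := s.data[k]; ok {
				return v, true
			}
		}
	}
	return "", false
}

func main() {
	t := &tree{
		memtable:  map[int]string{},
		immutable: map[int]string{},
		levels: [][]sst{{
			{min: 1, max: 5, data: map[int]string{3: "v"}},
		}},
	}
	fmt.Println(t.get(3)) // "v true", found in an L0 SST with range [1,5]
}
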
  24. SSD vs HDD

    HDD: magnetic platters on a rotating disk | SSD: flash memory chips
    HDD: random reads are slower due to seek time | SSD: random reads are faster since there are no rotating disks
    HDD: slower reads/writes (~100 MB/s) | SSD: faster reads/writes (~500 MB/s)
    HDD: no write-cycle limit | SSD: limited number of write (program/erase) cycles
    HDD: a new write can overwrite soft-deleted data blocks | SSD: flash must erase soft-deleted data blocks before new data can be written
    HDD: poor for random writes due to seek time | SSD: poor for random writes due to garbage collection
    Img: https://www.backblaze.com/blog/ssd-vs-hdd-future-of-storage/
  25. Write Amplification

    Write amplification occurs when the amount of physical data written to a storage medium is greater than the amount of logical data intended to be written. Image Ref: https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf
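
As a rough formula (not from the slide itself): write amplification = bytes physically written to the device / bytes logically written by the application. For example, if an application writes 10 GB of key-value data but compaction causes about 130 GB to reach the SSD, the write amplification is about 13. In a leveled LSM-tree where each level is roughly 10x larger than the one above it, compacting data from one level into the next can rewrite around 10x its size, so data that migrates through several levels can accumulate write amplification in the tens, which is the effect the paper measures for LevelDB.
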
  26. Read Amplification

    Read amplification refers to the phenomenon where the amount of data read from the storage system is greater than the amount of data actually requested by a read operation. In the case of LSM-trees, multiple versions of a record exist across different levels or SSTs, so a single read may require reading multiple data blocks or files to retrieve the most recent version of the requested record.
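
As a rough formula: read amplification = data read from the device / data returned to the application. In the worst case, a LevelDB-style lookup may have to check the memtable, the immutable memtable, several overlapping L0 files, and one file per deeper level; each SST probed can cost several block reads (index block, bloom-filter block, data block), so fetching a small key-value pair can translate into tens of block reads and, as the paper quantifies, a read amplification in the hundreds for small values.
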
  27. Design Goals

    1. Low write amplification 2. Low read amplification 3. SSD-optimized (sequential writes and parallel random reads) 4. Feature-rich API: Get/Put/Delete, Scan, Snapshot 5. Realistic key-value sizes
  28. Inserts - 100 GB of data

    Loading 100 GB causes intensive compaction: the same data is repeatedly read and written, foreground I/Os stall, and the tree grows many levels. A small LSM-tree means less compaction, fewer levels to search, and better caching. Image Ref: http://csl.snu.ac.kr/courses/4190.568/2019-1/30-wisckey.pdf
  29. 1. WiscKey Ops - Write

    (Diagram: 1. the value is appended to the vLog as a <k,v> record; 2-3. only the key and the value's address, <k,vptr>, are written to the WAL and the memtable of the LSM-tree. A code sketch follows.)
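
A minimal Go sketch of this write path (the wisckey and vptr types are illustrative, not the paper's code): the value goes to the append-only vLog, and only the key plus a small value pointer goes into the LSM-tree.

package main

import "fmt"

// vptr locates a value inside the value log.
type vptr struct {
	offset int // byte offset of the value in the vLog
	size   int // length of the value
}

type wisckey struct {
	vlog []byte          // value log: sequential appends of <key, value>
	lsm  map[string]vptr // stands in for the real LSM-tree of <key, vptr>
}

func (w *wisckey) put(key, value string) {
	record := key + value                      // simplified vLog record
	off := len(w.vlog)
	w.vlog = append(w.vlog, []byte(record)...) // 1. append the value to the vLog
	w.lsm[key] = vptr{offset: off + len(key), size: len(value)} // 2-3. index <key, vptr>
}

func main() {
	w := &wisckey{lsm: map[string]vptr{}}
	w.put("user42", "a fairly large value ...")
	p := w.lsm["user42"]
	fmt.Println(string(w.vlog[p.offset : p.offset+p.size])) // value read back via the pointer
}
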
  30. 2. WiscKey Ops - Delete

    (Diagram: a delete writes only a tombstone <k,nil> for the key; no value is written, and the stale value left in the vLog becomes garbage for the garbage collector to reclaim.)
  31. 3. WiscKey Ops - Read

    (Diagram, built up across slides 31-33: 1. the read searches the LSM-tree (memtable, immutable memtable, then SSTs, using key ranges such as [1,5] and [3,5]) for <k,vptr>; 2. the value pointer is then used to read the value from the vLog. Slide 33 adds the vLog buffer, which is checked before going to the vLog on disk. A code sketch follows slide 33.)
  32. 3. WiscKey Ops - Read

    (Read diagram, build step.)
  33. 3. WiscKey Ops - Read

    (Read diagram, build step: the vLog buffer is added.)
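
A minimal Go sketch of this read path (illustrative names; the vLog buffer from slide 33 is modeled as a simple map): find the value pointer in the LSM-tree, check the vLog buffer, then fall back to a random read of the vLog.

package main

import (
	"errors"
	"fmt"
)

type vptr struct{ offset, size int }

type store struct {
	lsm      map[string]vptr   // <key, vptr> index (the LSM-tree)
	vlogBuf  map[string]string // recent, not-yet-fsynced vLog writes
	vlogDisk []byte            // persisted value log
}

func (s *store) get(key string) (string, error) {
	p, ok := s.lsm[key] // 1. find the value pointer in the LSM-tree
	if !ok {
		return "", errors.New("not found")
	}
	if v, ok := s.vlogBuf[key]; ok { // 2. serve from the vLog buffer if present
		return v, nil
	}
	return string(s.vlogDisk[p.offset : p.offset+p.size]), nil // 3. random read from the vLog
}

func main() {
	s := &store{
		lsm:      map[string]vptr{"k1": {offset: 0, size: 5}},
		vlogBuf:  map[string]string{},
		vlogDisk: []byte("hello"),
	}
	fmt.Println(s.get("k1")) // "hello <nil>"
}
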
  34. 4. WiscKey Ops - Key Compaction

    (Diagram: compaction merge-sorts only the keys and value pointers stored in the SSTs; the values stay in place in the vLog and are not rewritten, which keeps compaction I/O small.)
  35. 5. Crash Consistency

    Separating keys and values raises consistency concerns: 1. What if the value was written but we crashed before the key was written? 2. Can the vLog contain corrupted (i.e., random) values? NOTE: values are appended to the vLog sequentially, and each record stores <K, V, KSz, VSz>, so a lookup can detect a record that never became durable. A code sketch of this check follows.
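
A minimal Go sketch of how the <K, V, KSz, VSz> record format supports that check (the encoding here is illustrative, not WiscKey's actual on-disk layout): a lookup only trusts a value if its pointer lies within the durable part of the vLog and the record's key matches.

package main

import (
	"encoding/binary"
	"fmt"
)

// appendRecord appends <KSz, VSz, K, V> to the vLog and returns the record offset.
func appendRecord(vlog []byte, key, val string) ([]byte, int) {
	off := len(vlog)
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:], uint32(len(key)))
	binary.LittleEndian.PutUint32(hdr[4:], uint32(len(val)))
	vlog = append(vlog, hdr[:]...)
	vlog = append(vlog, key...)
	vlog = append(vlog, val...)
	return vlog, off
}

// readVerified returns the value only if the pointer is inside the vLog's
// valid range and the record's key matches; otherwise the key is treated as
// lost in the crash.
func readVerified(vlog []byte, off int, key string) (string, bool) {
	if off+8 > len(vlog) {
		return "", false // pointer past the durable tail of the vLog
	}
	kSz := int(binary.LittleEndian.Uint32(vlog[off:]))
	vSz := int(binary.LittleEndian.Uint32(vlog[off+4:]))
	end := off + 8 + kSz + vSz
	if end > len(vlog) || string(vlog[off+8:off+8+kSz]) != key {
		return "", false // truncated record or key mismatch
	}
	return string(vlog[off+8+kSz : end]), true
}

func main() {
	vlog, off := appendRecord(nil, "k1", "value-1")
	fmt.Println(readVerified(vlog, off, "k1"))         // "value-1 true"
	fmt.Println(readVerified(vlog[:off+4], off, "k1")) // simulated crash mid-append: "false"
}
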
  36. Further Optimizations

    1. vLog buffer: a. not used for synchronous inserts; b. writes are buffered in memory and then flushed together with a single write()/fsync(); c. during a lookup, the vLog buffer is checked first. 2. Reuse the vLog as the WAL: record the head of the vLog periodically in the LSM-tree as a <K,V> pair <"head", head-vptr>, so recovery only needs to replay vLog records written after the last recorded head. A sketch of both optimizations follows.
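
A minimal Go sketch of both optimizations together (store, flush, and the reserved "!vlog-head" key are all made-up names): writes are staged in a vLog buffer and flushed in one go, and the durable head offset is periodically checkpointed into the LSM-tree so the vLog can double as the WAL.

package main

import "fmt"

type store struct {
	vlog []byte            // persisted value log (stands in for the file)
	buf  []byte            // in-memory vLog buffer, checked first on lookups
	lsm  map[string]string // stands in for the LSM-tree
}

// flush writes the whole buffer in one go (one write + fsync in the real system).
func (s *store) flush() {
	s.vlog = append(s.vlog, s.buf...)
	s.buf = s.buf[:0]
}

// checkpointHead records the durable head of the vLog in the LSM-tree so that,
// on recovery, only vLog records after this offset need to be replayed.
func (s *store) checkpointHead() {
	s.flush()
	s.lsm["!vlog-head"] = fmt.Sprint(len(s.vlog)) // "!vlog-head" is a made-up reserved key
}

func main() {
	s := &store{lsm: map[string]string{}}
	s.buf = append(s.buf, "record-1record-2"...)
	s.checkpointHead()
	fmt.Println("head checkpoint:", s.lsm["!vlog-head"]) // 16 bytes durable
}
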
  37. TerarkDB - from ByteDance

    In its LSM-tree, TerarkDB stores, for each large value, only the key and the number of the file where the value lives, <key, fileno>, without storing the value's offset. A read first finds the v-SST file corresponding to the key in the LSM-tree, then follows the dependency relationship to the latest v-SST, looks up the key in that v-SST's index, and finally reads the value. Image Reference: https://www.skyzh.dev/blog/2023-12-31-lsm-kv-separation-overview/#terarkdb
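
For contrast, an illustrative Go sketch of the two pointer layouts (neither struct is the real WiscKey or TerarkDB type): WiscKey's pointer is self-sufficient, while TerarkDB's defers the offset to the v-SST's own index, so v-SSTs can be merged by GC and reached through the dependency mapping without rewriting LSM-tree entries.

package main

// WiscKey-style pointer: the LSM-tree entry alone is enough to read the value.
type wisckeyPtr struct {
	fileNo uint32 // which vLog segment holds the value
	offset uint64 // byte offset of the value
	size   uint32 // value length
}

// TerarkDB-style pointer: one extra lookup in the v-SST's internal index is
// needed, but the entry is smaller and stays valid across v-SST merges.
type terarkPtr struct {
	fileNo uint32 // which v-SST holds the value; the offset comes from that file's index
}

// No runtime behavior; this file only contrasts the two index layouts.
func main() {}
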
  38. GC (vLog garbage collection)
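
WiscKey reclaims vLog space with a tail-to-head scheme: read a chunk of records from the tail, ask the LSM-tree whether each key still points at that record, re-append the live records at the head, and then free the scanned tail region. A minimal Go sketch under those assumptions (the data structures are illustrative, not the real implementation):

package main

import "fmt"

type record struct {
	key, val string
}

type store struct {
	vlog []record       // index in this slice stands in for the vLog offset
	lsm  map[string]int // key -> offset of its current value in the vLog
	tail int            // everything before tail has been reclaimed
}

// gc processes up to n records starting from the tail.
func (s *store) gc(n int) {
	for i := 0; i < n && s.tail < len(s.vlog); i++ {
		rec := s.vlog[s.tail]
		if off, ok := s.lsm[rec.key]; ok && off == s.tail {
			// Still live: re-append at the head and update the LSM-tree pointer.
			s.vlog = append(s.vlog, rec)
			s.lsm[rec.key] = len(s.vlog) - 1
		}
		// Dead (overwritten or deleted): simply skip it.
		s.tail++ // the real system would now free this vLog space (e.g. hole punching)
	}
}

func main() {
	s := &store{lsm: map[string]int{}}
	s.vlog = []record{{"a", "old"}, {"b", "v1"}}
	s.lsm["a"] = 0
	s.lsm["b"] = 1
	// Overwrite "a": append the new value, leaving offset 0 as garbage.
	s.vlog = append(s.vlog, record{"a", "new"})
	s.lsm["a"] = 2
	s.gc(2)
	fmt.Println(s.tail, s.lsm["a"], s.lsm["b"]) // tail=2; "a" -> 2, "b" relocated to 3
}
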