
WiscKey: Separating Keys from Values in SSD-conscious Storage


WiscKey is an LSM-tree-based key-value store with a performance-oriented data layout that separates keys from values to minimize I/O amplification. The design is highly optimized for SSDs, leveraging both the sequential and the random performance characteristics of these devices. The paper also discusses the challenges this design introduces and proposes feasible solutions to them.

Arjun Sunil Kumar

May 22, 2024



Transcript

  1. Paper: WiscKey: Separating Keys from Values in SSD-Conscious Storage Authors:

    Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison Presenter: Arjun Sunil Kumar
  2. Agenda

    1. Motivation & Use Case; 2. Introduction (a. LSM Tree, b. Write and Read Amplification, c. SSD); 3. WiscKey (a. Key Idea, b. Data Flow, c. Challenges); 4. Variants (a. CouchBase, b. TerarkDB)
  3. Motivation

    - Large write/read amplification in LSM-tree-based key-value stores that hold large values. - LSM-trees are optimized for HDDs; they are not optimal for SSDs.
  4. Use Case: Large Values

    Typically, values are not very large: for big payloads (PDF documents, images) we prefer to store the data in an object store and only the metadata (a URL) in the database. But there are use cases where large values end up in the database itself: - large JSON stored as a BLOB in the database - vector embeddings stored as a BLOB in the database
  5. In-place vs Out-of-place updates

    (Diagram: the same keys (12, 15, 18, 20, 25) held either in a single data structure that is updated in place, or in a collection of data structures that is updated out of place.)
  6. In-place vs Out-of-place updates

    In-place: traverse the tree and update the leaf node in place (might require coarse-grained or fine-grained locking). Out-of-place: to delete 12, insert a tombstone entry for 12 instead of modifying the existing node (a relatively faster update).
  7. In-place vs Out-of-place updates

    In-place: reads are faster because the update has already been applied in place. Out-of-place: reads are slower because reading key 12 must merge its versions across structures (merge-on-read, MoR). A small merge-on-read sketch follows.
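
To make the merge-on-read cost concrete, here is a minimal Go sketch (the run type and get function are illustrative, not from any real store): with out-of-place updates a key can live in several sorted runs, so a read has to consult them newest-first.

package main

import "fmt"

// run is one immutable batch of updates; newer runs shadow older ones.
type run map[int]string

// get scans runs from newest to oldest and returns the first match it finds;
// this is the "merge on read" work that in-place structures avoid.
func get(runs []run, key int) (string, bool) {
	for _, r := range runs { // runs[0] is the newest
		if v, ok := r[key]; ok {
			return v, ok
		}
	}
	return "", false
}

func main() {
	runs := []run{
		{12: "v3"},            // newest run: latest update of key 12
		{12: "v1", 18: "old"}, // older run: stale version of key 12
	}
	v, _ := get(runs, 12)
	fmt.Println(v) // "v3": the newest version wins, but several runs may be consulted
}
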
  8. LSM Tree - Components

    (Diagram: memtable and immutable memtable in memory; WAL and leveled SSTs (L0, L1, …, L7) on disk.)
  9. LSM Tree - Components

    (Same components diagram, build step.)
  10. LSM Tree - Components

    (Same components diagram, build step.)
  11. LSM Tree - Components

    (Same components diagram, build step.)
  12. LSM Tree Ops

    (Diagram: the same memtable/WAL/SST layout, annotated with the three operations covered next: Write, Compaction, Read.)
  13. LSM Tree Ops - Write

    (Diagram, built up across slides 13-16: an incoming <k,v> is first appended to the WAL, then inserted into the memtable; when the memtable fills it becomes an immutable memtable and is flushed to disk as a new L0 SST. A code sketch of this write path follows slide 16.)
  14. LSM Tree Ops - Write

    (Write diagram, build step.)
  15. LSM Tree Ops - Write

    (Write diagram, build step.)
  16. LSM Tree Ops - Write

    (Write diagram, build step: a new L0 SST appears on disk.)
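
A minimal Go sketch of this write path, with illustrative names rather than the actual LevelDB/WiscKey API: append to the WAL, insert into the memtable, and flush a full memtable as a new L0 SST.

package main

import "fmt"

type kv struct{ k, v string }

type lsm struct {
	wal      []kv                // write-ahead log (sequential appends)
	memtable map[string]string   // in-memory sorted buffer (map for brevity)
	l0       []map[string]string // flushed SSTs at level 0
	memLimit int                 // flush threshold
}

func (t *lsm) put(k, v string) {
	t.wal = append(t.wal, kv{k, v})    // 1. durable, sequential WAL append
	t.memtable[k] = v                  // 2. insert into the memtable
	if len(t.memtable) >= t.memLimit { // 3. full memtable becomes immutable
		t.l0 = append(t.l0, t.memtable) //    and is flushed as a new L0 SST
		t.memtable = map[string]string{}
	}
}

func main() {
	t := &lsm{memtable: map[string]string{}, memLimit: 2}
	t.put("a", "1")
	t.put("b", "2") // triggers a flush to L0
	fmt.Println(len(t.l0), "L0 SSTs,", len(t.wal), "WAL records")
}
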
  17. LSM Tree Ops - Compaction

    (Diagram, built up across slides 17-20: compaction merge-sorts overlapping SSTs from adjacent levels into a new SST and then deletes the inputs, reducing the number of files to search. A code sketch follows slide 20.)
  18. LSM Tree Ops - Compaction

    (Compaction diagram, build step.)
  19. LSM Tree Ops - Compaction

    (Compaction diagram, build step: the merged SST is written.)
  20. LSM Tree Ops - Compaction

    (Compaction diagram, build step: the input SSTs are removed.)
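
A minimal Go sketch of the compaction step, under the simplifying assumption that runs fit in memory: entries from the newer (upper-level) run shadow entries from the older (lower-level) run, and the merged output replaces both inputs.

package main

import (
	"fmt"
	"sort"
)

// compact merges an upper-level run into a lower-level run; entries in
// `upper` are newer and therefore shadow entries in `lower`.
func compact(upper, lower map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range lower {
		out[k] = v
	}
	for k, v := range upper { // newer versions overwrite older ones
		out[k] = v
	}
	return out
}

func main() {
	l0 := map[string]string{"a": "new", "c": "3"}
	l1 := map[string]string{"a": "old", "b": "2"}
	merged := compact(l0, l1) // the old l0/l1 files would now be deleted

	keys := make([]string, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, "=", merged[k]) // a=new, b=2, c=3: the stale "a" was dropped
	}
}
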
  21. LSM Tree Ops - Read

    (Diagram, built up across slides 21-23: a read checks the memtable, then the immutable memtable, then the SSTs level by level from L0 downward; key-range metadata such as [1,5] and [3,5] determines which SSTs may hold the key. A code sketch follows slide 23.)
  22. LSM Tree Ops - Read

    (Read diagram, build step.)
  23. LSM Tree Ops - Read

    (Read diagram, build step.)
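
A minimal Go sketch of this read path (types and names are illustrative): check the memtable, then the immutable memtable, then each level's SSTs, skipping SSTs whose key range cannot contain the key.

package main

import "fmt"

type sst struct {
	min, max int            // key range of this SST, e.g. [1,5]
	data     map[int]string // simplified file contents
}

type tree struct {
	memtable, immutable map[int]string
	levels              [][]sst // levels[0] is L0, levels[1] is L1, ...
}

func (t *tree) get(k int) (string, bool) {
	if v, ok := t.memtable[k]; ok {
		return v, true
	}
	if v, ok := t.immutable[k]; ok {
		return v, true
	}
	for _, level := range t.levels { // search newest level first
		for _, s := range level {
			if k < s.min || k > s.max { // key-range check prunes SSTs
				continue
			}
			if v, ok := s.data[k]; ok {
				return v, true
			}
		}
	}
	return "", false
}

func main() {
	t := &tree{
		memtable:  map[int]string{},
		immutable: map[int]string{},
		levels: [][]sst{{
			{min: 1, max: 5, data: map[int]string{3: "v"}},
		}},
	}
	fmt.Println(t.get(3)) // "v true", found in an L0 SST with range [1,5]
}
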
  24. SSD vs HDD

    HDD: magnetic platters on a rotating disk | SSD: flash memory chips
    HDD: random reads are slower due to seek time | SSD: random reads are faster since there are no rotating disks
    HDD: slower reads/writes (~100 MB/s) | SSD: faster reads/writes (~500 MB/s)
    HDD: no write-cycle limit | SSD: limited number of write (program/erase) cycles
    HDD: a new write can overwrite soft-deleted data blocks | SSD: flash must erase soft-deleted data blocks before new data can be written
    HDD: poor for random writes due to seek time | SSD: poor for random writes due to garbage collection
    Img: https://www.backblaze.com/blog/ssd-vs-hdd-future-of-storage/
  25. Write Amplification

    Write amplification occurs when the amount of physical data written to a storage medium is greater than the amount of logical data intended to be written. Image Ref: https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf
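
As a rough formula (not from the slide itself): write amplification = bytes physically written to the device / bytes logically written by the application. For example, if an application writes 10 GB of key-value data but compaction causes about 130 GB to reach the SSD, the write amplification is about 13. In a leveled LSM-tree where each level is roughly 10x larger than the one above it, compacting data from one level into the next can rewrite around 10x its size, so data that migrates through several levels can accumulate write amplification in the tens, which is the effect the paper measures for LevelDB.
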
  26. Read Amplification

    Read amplification refers to the phenomenon where the amount of data read from the storage system is greater than the amount of data actually requested by a read operation. In the case of LSM-trees, multiple versions of a record exist across different levels or SSTs, so a single read may require reading multiple data blocks or files to retrieve the most recent version of the requested record.
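
As a rough formula: read amplification = data read from the device / data returned to the application. In the worst case, a LevelDB-style lookup may have to check the memtable, the immutable memtable, several overlapping L0 files, and one file per deeper level; each SST probed can cost several block reads (index block, bloom-filter block, data block), so fetching a small key-value pair can translate into tens of block reads and, as the paper quantifies, a read amplification in the hundreds for small values.
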
  27. Design Goals

    1. Low write amplification 2. Low read amplification 3. SSD-optimized (sequential writes and parallel random reads) 4. Feature-rich API: Get/Put/Delete, Scan, Snapshot 5. Realistic key-value sizes
  28. Inserts - 100 GB of data

    Loading 100 GB causes intensive compaction: the same data is repeatedly read and written, foreground I/Os stall, and the tree grows many levels. A small LSM-tree means less compaction, fewer levels to search, and better caching. Image Ref: http://csl.snu.ac.kr/courses/4190.568/2019-1/30-wisckey.pdf
  29. 1. WiscKey Ops - Write

    (Diagram: 1. the value is appended to the vLog as a <k,v> record; 2-3. only the key and the value's address, <k,vptr>, are written to the WAL and the memtable of the LSM-tree. A code sketch follows.)
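
A minimal Go sketch of this write path (the wisckey and vptr types are illustrative, not the paper's code): the value goes to the append-only vLog, and only the key plus a small value pointer goes into the LSM-tree.

package main

import "fmt"

// vptr locates a value inside the value log.
type vptr struct {
	offset int // byte offset of the value in the vLog
	size   int // length of the value
}

type wisckey struct {
	vlog []byte          // value log: sequential appends of <key, value>
	lsm  map[string]vptr // stands in for the real LSM-tree of <key, vptr>
}

func (w *wisckey) put(key, value string) {
	record := key + value                      // simplified vLog record
	off := len(w.vlog)
	w.vlog = append(w.vlog, []byte(record)...) // 1. append the value to the vLog
	w.lsm[key] = vptr{offset: off + len(key), size: len(value)} // 2-3. index <key, vptr>
}

func main() {
	w := &wisckey{lsm: map[string]vptr{}}
	w.put("user42", "a fairly large value ...")
	p := w.lsm["user42"]
	fmt.Println(string(w.vlog[p.offset : p.offset+p.size])) // value read back via the pointer
}
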
  30. 2. WiscKey Ops - Delete

    (Diagram: a delete writes only a tombstone <k,nil> for the key; no value is written, and the stale value left in the vLog becomes garbage for the garbage collector to reclaim.)
  31. 3. WiscKey Ops - Read

    (Diagram, built up across slides 31-33: 1. the read searches the LSM-tree (memtable, immutable memtable, then SSTs, using key ranges such as [1,5] and [3,5]) for <k,vptr>; 2. the value pointer is then used to read the value from the vLog. Slide 33 adds the vLog buffer, which is checked before going to the vLog on disk. A code sketch follows slide 33.)
  32. 3. WiscKey Ops - Read

    (Read diagram, build step.)
  33. 3. WiscKey Ops - Read

    (Read diagram, build step: the vLog buffer is added.)
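
A minimal Go sketch of this read path (illustrative names; the vLog buffer from slide 33 is modeled as a simple map): find the value pointer in the LSM-tree, check the vLog buffer, then fall back to a random read of the vLog.

package main

import (
	"errors"
	"fmt"
)

type vptr struct{ offset, size int }

type store struct {
	lsm      map[string]vptr   // <key, vptr> index (the LSM-tree)
	vlogBuf  map[string]string // recent, not-yet-fsynced vLog writes
	vlogDisk []byte            // persisted value log
}

func (s *store) get(key string) (string, error) {
	p, ok := s.lsm[key] // 1. find the value pointer in the LSM-tree
	if !ok {
		return "", errors.New("not found")
	}
	if v, ok := s.vlogBuf[key]; ok { // 2. serve from the vLog buffer if present
		return v, nil
	}
	return string(s.vlogDisk[p.offset : p.offset+p.size]), nil // 3. random read from the vLog
}

func main() {
	s := &store{
		lsm:      map[string]vptr{"k1": {offset: 0, size: 5}},
		vlogBuf:  map[string]string{},
		vlogDisk: []byte("hello"),
	}
	fmt.Println(s.get("k1")) // "hello <nil>"
}
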
  34. 4. WiscKey Ops - Key Compaction

    (Diagram: compaction merge-sorts only the keys and value pointers stored in the SSTs; the values stay in place in the vLog and are not rewritten, which keeps compaction I/O small.)
  35. 5. Crash Consistency

    Separating keys and values raises consistency concerns: 1. What if the value was written but we crashed before the key was written? 2. Can the vLog contain corrupted (i.e., random) values? NOTE: values are appended to the vLog sequentially, and each record stores <K, V, KSz, VSz>, so a lookup can detect a record that never became durable. A code sketch of this check follows.
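
A minimal Go sketch of how the <K, V, KSz, VSz> record format supports that check (the encoding here is illustrative, not WiscKey's actual on-disk layout): a lookup only trusts a value if its pointer lies within the durable part of the vLog and the record's key matches.

package main

import (
	"encoding/binary"
	"fmt"
)

// appendRecord appends <KSz, VSz, K, V> to the vLog and returns the record offset.
func appendRecord(vlog []byte, key, val string) ([]byte, int) {
	off := len(vlog)
	var hdr [8]byte
	binary.LittleEndian.PutUint32(hdr[0:], uint32(len(key)))
	binary.LittleEndian.PutUint32(hdr[4:], uint32(len(val)))
	vlog = append(vlog, hdr[:]...)
	vlog = append(vlog, key...)
	vlog = append(vlog, val...)
	return vlog, off
}

// readVerified returns the value only if the pointer is inside the vLog's
// valid range and the record's key matches; otherwise the key is treated as
// lost in the crash.
func readVerified(vlog []byte, off int, key string) (string, bool) {
	if off+8 > len(vlog) {
		return "", false // pointer past the durable tail of the vLog
	}
	kSz := int(binary.LittleEndian.Uint32(vlog[off:]))
	vSz := int(binary.LittleEndian.Uint32(vlog[off+4:]))
	end := off + 8 + kSz + vSz
	if end > len(vlog) || string(vlog[off+8:off+8+kSz]) != key {
		return "", false // truncated record or key mismatch
	}
	return string(vlog[off+8+kSz : end]), true
}

func main() {
	vlog, off := appendRecord(nil, "k1", "value-1")
	fmt.Println(readVerified(vlog, off, "k1"))         // "value-1 true"
	fmt.Println(readVerified(vlog[:off+4], off, "k1")) // simulated crash mid-append: "false"
}
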
  36. Further Optimizations

    1. vLog buffer: a. not used for synchronous inserts; b. writes are buffered in memory and then flushed together with a single write()/fsync(); c. during a lookup, the vLog buffer is checked first. 2. Reuse the vLog as the WAL: record the head of the vLog periodically in the LSM-tree as a <K,V> pair <"head", head-vptr>, so recovery only needs to replay vLog records written after the last recorded head. A sketch of both optimizations follows.
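
A minimal Go sketch of both optimizations together (store, flush, and the reserved "!vlog-head" key are all made-up names): writes are staged in a vLog buffer and flushed in one go, and the durable head offset is periodically checkpointed into the LSM-tree so the vLog can double as the WAL.

package main

import "fmt"

type store struct {
	vlog []byte            // persisted value log (stands in for the file)
	buf  []byte            // in-memory vLog buffer, checked first on lookups
	lsm  map[string]string // stands in for the LSM-tree
}

// flush writes the whole buffer in one go (one write + fsync in the real system).
func (s *store) flush() {
	s.vlog = append(s.vlog, s.buf...)
	s.buf = s.buf[:0]
}

// checkpointHead records the durable head of the vLog in the LSM-tree so that,
// on recovery, only vLog records after this offset need to be replayed.
func (s *store) checkpointHead() {
	s.flush()
	s.lsm["!vlog-head"] = fmt.Sprint(len(s.vlog)) // "!vlog-head" is a made-up reserved key
}

func main() {
	s := &store{lsm: map[string]string{}}
	s.buf = append(s.buf, "record-1record-2"...)
	s.checkpointHead()
	fmt.Println("head checkpoint:", s.lsm["!vlog-head"]) // 16 bytes durable
}
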
  37. TerarkDB - from ByteDance

    In its LSM-tree, TerarkDB stores, for each large value, only the key and the number of the file where the value lives, <key, fileno>, without storing the value's offset. A read first finds the v-SST file corresponding to the key in the LSM-tree, then follows the dependency relationship to the latest v-SST, looks up the key in that v-SST's index, and finally reads the value. Image Reference: https://www.skyzh.dev/blog/2023-12-31-lsm-kv-separation-overview/#terarkdb
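
For contrast, an illustrative Go sketch of the two pointer layouts (neither struct is the real WiscKey or TerarkDB type): WiscKey's pointer is self-sufficient, while TerarkDB's defers the offset to the v-SST's own index, so v-SSTs can be merged by GC and reached through the dependency mapping without rewriting LSM-tree entries.

package main

// WiscKey-style pointer: the LSM-tree entry alone is enough to read the value.
type wisckeyPtr struct {
	fileNo uint32 // which vLog segment holds the value
	offset uint64 // byte offset of the value
	size   uint32 // value length
}

// TerarkDB-style pointer: one extra lookup in the v-SST's internal index is
// needed, but the entry is smaller and stays valid across v-SST merges.
type terarkPtr struct {
	fileNo uint32 // which v-SST holds the value; the offset comes from that file's index
}

// No runtime behavior; this file only contrasts the two index layouts.
func main() {}
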
  38. GC (vLog garbage collection)
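
WiscKey reclaims vLog space with a tail-to-head scheme: read a chunk of records from the tail, ask the LSM-tree whether each key still points at that record, re-append the live records at the head, and then free the scanned tail region. A minimal Go sketch under those assumptions (the data structures are illustrative, not the real implementation):

package main

import "fmt"

type record struct {
	key, val string
}

type store struct {
	vlog []record       // index in this slice stands in for the vLog offset
	lsm  map[string]int // key -> offset of its current value in the vLog
	tail int            // everything before tail has been reclaimed
}

// gc processes up to n records starting from the tail.
func (s *store) gc(n int) {
	for i := 0; i < n && s.tail < len(s.vlog); i++ {
		rec := s.vlog[s.tail]
		if off, ok := s.lsm[rec.key]; ok && off == s.tail {
			// Still live: re-append at the head and update the LSM-tree pointer.
			s.vlog = append(s.vlog, rec)
			s.lsm[rec.key] = len(s.vlog) - 1
		}
		// Dead (overwritten or deleted): simply skip it.
		s.tail++ // the real system would now free this vLog space (e.g. hole punching)
	}
}

func main() {
	s := &store{lsm: map[string]int{}}
	s.vlog = []record{{"a", "old"}, {"b", "v1"}}
	s.lsm["a"] = 0
	s.lsm["b"] = 1
	// Overwrite "a": append the new value, leaving offset 0 as garbage.
	s.vlog = append(s.vlog, record{"a", "new"})
	s.lsm["a"] = 2
	s.gc(2)
	fmt.Println(s.tail, s.lsm["a"], s.lsm["b"]) // tail=2; "a" -> 2, "b" relocated to 3
}
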