Velocity London 2017 - A tour of sketching data structures

As the scale of data our systems produce continues to increase, the techniques our systems use to process it must evolve. Kiran Bhattaram explains why sketches are a good option when more sophisticated data structures are called for.

Sketching data structures are probabilistic structures that store a summary of the full dataset. They’re specialized to answer specific questions (e.g., how many unique values a large dataset contains, or what its p95 is). By leveraging some neat mathematical properties, sketching data structures trade accuracy for a significant increase in both computational and storage efficiency.

Kiran covers real-world use cases of a few basic sketching data structures and explores the statistical underpinnings that make them work.

Kiran Bhattaram

October 20, 2017

Transcript

  1. 1 SKETCHING DATA STRUCTURES Velocity London
  3. 4 Timeline: Where to go from here; Solving problems; Motivation; Systems in Production!
  10. 9 If you can tolerate error… how many IP addresses have we seen? 4 x 10^9 IPv4 addresses => 0.5 GiB to store, vs. 1.5 kB with a 2% error (~358,000x less)
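A quick check of the slide's arithmetic, as a sketch (the 1.5 kB figure is the HyperLogLog size quoted later in the talk):

```python
# A bitmap with one bit per possible IPv4 address needs 2**32 bits.
bitmap_bytes = 2**32 // 8           # 536,870,912 bytes = 0.5 GiB
sketch_bytes = 1500                 # ~1.5 kB sketch with ~2% error
ratio = bitmap_bytes / sketch_bytes
print(f"{bitmap_bytes / 2**30:.1f} GiB vs 1.5 kB: {ratio:,.0f}x smaller")
# -> 0.5 GiB vs 1.5 kB: 357,914x smaller (the slide rounds to 358,000)
```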
  13. 11 how they work: stream of data -> hash (uniform distribution) -> data structure -> estimator -> guess +/- ε
  14. 12 Estimators & Observables ✦ Order statistics: [10, 11, 10, 01], ex: smallest value seen so far ✦ Bit-pattern: ex: longest run of contiguous 0s in 10001010 ✦ Presence: ex: is the bit set?
  16. 15 Editor: Features 1. Feed of short stories without duplicates

    2. Working vocabulary size (# of unique words) 3. Word length statistics
  17. 16 Editor: Analytics Requirements. Fast: want real-time statistics. Okay to be good ~enough. Cheap to run: no data analytics team!
  20. 18 The Problem: is this element in this set? Google Chrome: "is this URL known to be malicious?" Databases/LSM trees: "is this data on disk?" Story Feed: "have I read this short story?"
  22. 21 Hash Set — Insertion: hash to a bitmap; test for presence. Array of size m, indexed by hash(item) mod m
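The insertion scheme on these slides, as a minimal Python sketch (the array size and Python's builtin `hash` are illustrative stand-ins for the slide's unspecified hash function):

```python
m = 16                              # size of the bit array
bits = [0] * m

def insert(item):
    # hash to a bitmap: set the bit at hash(item) mod m
    bits[hash(item) % m] = 1

def maybe_contains(item):
    # a set bit may be someone else's collision, so "yes" means "maybe"
    return bits[hash(item) % m] == 1

insert("10.0.0.1")
assert maybe_contains("10.0.0.1")   # an inserted item is always found
```

Python salts string hashes per process, but within one run the same item always maps to the same bit, which is all this sketch needs.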
  29. 26 Intuition 1: don’t store the entire object! But: false positives. P(bit = 0) = (1 - 1/m)^n, where n = number of elements inserted and m = bits in the array
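Plugging illustrative numbers (not from the talk) into the slide's formula:

```python
# After inserting n items into m bits, a given bit is still 0 with
# probability (1 - 1/m)**n, so a lookup of an absent item hits a set bit
# (a false positive) with probability 1 - (1 - 1/m)**n.
m, n = 1000, 100
p_zero = (1 - 1 / m) ** n           # ~0.905
p_false_positive = 1 - p_zero       # ~0.095 for a single hash function
```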
  32. 27 Intuition 2 — Multiply Hashing! Run through k independent hash functions: h1(x), h2(x), h3(x). Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors"
  35. 27 Run through k independent hash functions: a Bloom Filter! Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors"
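A minimal Bloom filter in the spirit of these slides. The k "independent" hash functions are simulated here by salting a single SHA-256 with the function index; m, k, and the example URL are illustrative choices, not from the talk:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # simulate k independent hashes by salting one hash with the index
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # all k bits set => "probably present"; any 0 bit => definitely absent
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("malicious-url.example")
assert "malicious-url.example" in bf    # no false negatives, ever
```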
  36. 31 Bloom Filter — Error Rates! [plot: false positive probability vs. number of hash functions (k); there is an optimal k]
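The curve on this slide has a well-known closed form: for m bits and n items the false-positive probability is roughly (1 - e^(-kn/m))^k, minimized at k = (m/n) ln 2. A worked example with assumed m and n:

```python
from math import exp, log

m, n = 9600, 1000                    # bits in the filter, items inserted
k_opt = round((m / n) * log(2))      # optimal number of hash functions
fp = (1 - exp(-k_opt * n / m)) ** k_opt
print(k_opt, fp)                     # 7 hash functions, roughly a 1% false-positive rate
```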
  37. 32 Bloom Filters: a summary. No false negatives. Smaller memory footprint (store 4-8 bits vs. the entire object). Small (and tunable!) false positive rate. Can’t retrieve or delete items.
  41. 33 how they work: Bloom Filters. stream of data -> hash (uniform distribution) -> data structure: bitmap -> estimator: presence -> guess +/- ε
  42. 36 An extension: Counting Bloom Filters, which allow for deletions: each bit becomes a small counter, incremented on insert and decremented on delete. Fan, Li et al. (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol"
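The counters animated on these slides, sketched in Python (the array size and the salted-hash scheme are illustrative, not from Fan et al.'s paper):

```python
import hashlib

m, k = 16, 3                        # counters and hash functions (illustrative)
counters = [0] * m

def positions(item):
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % m
            for i in range(k)]

def insert(item):
    for p in positions(item):
        counters[p] += 1            # increment instead of setting a bit

def delete(item):
    for p in positions(item):
        counters[p] -= 1            # decrement: this is what plain Bloom filters can't do

def maybe_contains(item):
    return all(counters[p] > 0 for p in positions(item))

insert("a"); insert("b")
delete("a")
assert maybe_contains("b")          # deleting "a" doesn't disturb "b"
```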
  47. 37 An extension: Count-Min Sketch. Keep a count of the frequency of items seen: rows of counters under hash functions h1, h2, h3, with min() as the estimator. Cormode, Graham (2009), "Count-min sketch"
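A tiny count-min sketch matching the slide's picture: k rows of counters, one hash per row, min() across rows as the estimate (row count and width are illustrative):

```python
import hashlib

k, w = 3, 32                        # rows (hash functions) and row width
table = [[0] * w for _ in range(k)]

def row_pos(item, row):
    return int(hashlib.sha256(f"{row}:{item}".encode()).hexdigest(), 16) % w

def add(item):
    for row in range(k):
        table[row][row_pos(item, row)] += 1

def estimate(item):
    # collisions only inflate counters, never deflate them, so the min
    # across rows is an upper bound on the true frequency
    return min(table[row][row_pos(item, row)] for row in range(k))

for word in ["the", "the", "cat", "the"]:
    add(word)
assert estimate("the") >= 3         # never under-counts
```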
  51. 38 Bloom Filters: A Summary • Hash Sets -> Bloom Filters • bits & multiple hashing! • Extensions: Counting • Extensions: Count-min sketch
  54. 41 The Problem: Cardinality, the number of unique values in a collection. Advertising: number of “uniques”; traffic modeling: # of unique IP addresses; natural language processing: number of unique words
  55. 47 Bit patterns! 0101 1010 0010 0001 1100 1011 0101

    1011 1010 run of 3 0s => likely seen 8 numbers!
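The slide's bit-pattern estimate, checked in code: the hash values are the ones shown on the slide, and a longest run of 3 leading zeros suggests about 2^3 = 8 distinct values:

```python
def leading_zeros(bits: str) -> int:
    # count the 0s before the first 1 in a fixed-width bit string
    return len(bits) - len(bits.lstrip("0"))

hashes = ["0101", "1010", "0010", "0001", "1100", "1011", "0101", "1011", "1010"]
longest = max(leading_zeros(h) for h in hashes)
assert longest == 3                 # "0001" has a run of 3 leading zeros
assert 2 ** longest == 8            # => likely seen ~8 numbers
```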
  56. 49 But the cardinality estimate could be so wrong! Techniques for increasing accuracy: across 3 trials, ~8 friends, ~4 friends, ~4 friends averages to ~5.33 friends
  62. 50 The Algorithm: an array of m=8 registers, indexed 000 001 010 011 100 101 110 111. For the hash 010 00010: bucket by the first log2(8) = 3 bits (010), then count the run of leading 0s in the rest (00010 => 3).
  63. 51 The Algorithm: hashes 110 01111, 111 00100, 111 00111, 110 01010, 011 00000, 100 00100, 101 00011, 101 01010, 010 00011, 000 01001, 001 00111, 001 01111 fill registers 000 001 010 011 100 101 110 111 with values 1 2 3 5 2 3 1 2
  67. 52 The Algorithm: registers hold 1 2 3 5 2 3 1 2 (buckets 000 001 010 011 100 101 110 111). Take the harmonic mean of all of these! = 8 * 3.93 = 31.5 (I used 28 values) Plus corrections for small and large values!
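The whole pipeline from these slides as a toy HyperLogLog. The slides use m=8 registers; this sketch uses m=64 (with the standard bias constant alpha_64 = 0.709 from Flajolet et al.) purely to tighten the estimate, and SHA-256 as a stand-in uniform hash:

```python
import hashlib

m, b = 64, 6                         # m = 2**b registers
registers = [0] * m

def add(item):
    h = bin(int(hashlib.sha256(str(item).encode()).hexdigest(), 16))[2:].zfill(256)
    idx = int(h[:b], 2)              # bucket on the first b bits
    rest = h[b:b + 32]
    rho = len(rest) - len(rest.lstrip("0")) + 1   # position of leftmost 1-bit
    registers[idx] = max(registers[idx], rho)

def estimate():
    alpha = 0.709                    # bias-correction constant for m = 64
    # harmonic mean of 2**register, scaled by alpha * m
    return alpha * m * m / sum(2.0 ** -r for r in registers)

for i in range(1000):
    add(f"item-{i}")
est = estimate()                     # expect ~1000, within ~13% (1.04/sqrt(64))
```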
  68. 53 Merging Hyper Log Logs: take the max() of each pair of registers
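Merging as on this slide is a register-wise max(), and the merged sketch is identical to the sketch of the combined stream. The register values below are made up for illustration:

```python
regs_a = [1, 2, 3, 5, 2, 3, 1, 2]   # registers of HLL A (illustrative)
regs_b = [2, 1, 1, 8, 1, 2, 4, 1]   # registers of HLL B (illustrative)
merged = [max(a, b) for a, b in zip(regs_a, regs_b)]
print(merged)                        # -> [2, 2, 3, 8, 2, 3, 4, 2]
```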
  69. 54 Hyper Log Log — Error Rates! (over- and under-estimating) Cardinality: 10^9; Space required (m): 1.5 kB (vs. 0.5 GiB!); Error: ~2%
  73. 55 how they work: Hyper Log Logs. stream of data -> hash (uniform distribution) -> data structure: registers 1 2 3 5 2 3 1 2 (buckets 000 001 010 011 100 101 110 111) -> estimator (run of 0s) -> guess +/- ε
  74. 59 Editor: Text Analytics. Denote read stories (Bloom filters!), count unique words used, estimate percentiles for word length
  77. 63 how they work: t-digests. stream of data -> data structure -> estimator -> guess +/- ε
  78. 78 A brief list of other sketches • Skip Lists • frequency: count-min sketch, heavy hitters, etc. • membership: Bloom filters, Cuckoo hashing • cardinality: HyperLogLog • geometric data: coresets, locality-sensitive hashing
  79. 79 tl;dr — error is a tradeoff in algorithms approximations

    are often Good Enough and a hell of a lot cheaper
  80. 83 Small Value Corrections: with register values 1 3 2 set and the rest zero, Estimate = m * log(m / # of un-init registers) = ~3.75 values
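The slide's small-range correction is linear counting. With m=8 registers, of which 3 are initialized (the 1, 3, 2 on the slide) and 5 are still zero:

```python
from math import log

m, zero_registers = 8, 5
lc_estimate = m * log(m / zero_registers)   # m * ln(m / # of un-init registers)
print(round(lc_estimate, 2))                # -> 3.76, the slide's ~3.75 values
```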
  82. 84 Large Value Corrections as the number of unique values

    approaches 2^(2^m), you start seeing hash collisions! => use a 64 bit hash & more bits in the registers!