Count-Min Sketch, Bloom Filter, TopK: Efficient probabilistic data structures

A Count-Min Sketch, a Bloom Filter, and a TopK might sound fancy, but they’re just smart ways to work with huge amounts of data using very little memory.

In this talk, we’ll explore three powerful probabilistic data structures that trade a bit of accuracy for a lot of speed and scalability. You’ll learn:

What Count-Min Sketch, Bloom Filter, and TopK actually are
How each of them works under the hood
How I used them together to build an efficient version of Trending Topics for Bluesky

By the end, you’ll see how these tools help you process large data streams without blowing up your memory, and how to apply them in real-world systems where being fast matters more than being perfect.

Raphael De Lio

April 17, 2025
Transcript

  1. Deterministic vs Probabilistic
     Feature            | Deterministic                    | Probabilistic
     Accuracy           | Always exact                     | May have false positives/negatives
     Memory consumption | Dynamic                          | Fixed
     Examples           | Sets, Lists, Maps, Stacks, Trees | Bloom Filters, Count-Min Sketch, HyperLogLog, TopK, T-Digest
  2. A Data Structure Server
     • String • List • Set • Sorted Set • Vector Set • Hash • Stream • Geo
     • Bitmaps • Bitfield • JSON • TimeSeries • Bloom Filter • Count-Min Sketch
     • TopK • Cuckoo Filter • T-Digest • HyperLogLog
  3. Building our own Trending Topics
     1. Counting words: Count-Min Sketch
     2. Deduplicating messages: Bloom Filter
     3. Tracking spikes: TopK
  4. Count-Min Sketch
     • Probabilistic
     • Used to estimate the frequency of elements in a data stream
     • Somewhat similar to a Sorted Set
     • Fixed memory
     • Trade-off: might give wrong estimations
  5. Count-Min Sketch
     • Internally it's a fixed-size grid (sketch) of w (width) and d (depth)
     • The rows (d) represent the number of hash functions; the columns (w)
       represent the counter array for each of the hash functions
  6. Count-Min Sketch: Incrementing
     CMS.INCRBY terms redis 1
     Hash1("redis") % 5 = 2, Hash2("redis") % 5 = 4, Hash3("redis") % 5 = 1
     Each hash function picks one column in its own row, and that counter is
     incremented by 1.
  7. Count-Min Sketch: Incrementing
     CMS.INCRBY terms pets 1
     Hash1("pets") % 5 = 0, Hash2("pets") % 5 = 3, Hash3("pets") % 5 = 1
     Hash3("pets") lands on the same column as Hash3("redis"), so that
     counter now reads 2.
  8. Count-Min Sketch: Incrementing
     CMS.INCRBY terms cats 1
     Hash1("cats") % 5 = 3, Hash2("cats") % 5 = 4, Hash3("cats") % 5 = 0
     Hash2("cats") collides with Hash2("redis") in the second row, so that
     counter now reads 2.
  9. Count-Min Sketch: Incrementing
     CMS.INCRBY terms dogs 1
     Hash1("dogs") % 5 = 2, Hash2("dogs") % 5 = 1, Hash3("dogs") % 5 = 3
     Hash1("dogs") collides with Hash1("redis") in the first row, so that
     counter now reads 2.
  10. Count-Min Sketch: Querying
      CMS.QUERY terms dogs
      Hash1("dogs") % 5 = 2, Hash2("dogs") % 5 = 1, Hash3("dogs") % 5 = 3
      The query reads the three counters (2, 1, 1) and returns their
      minimum: 1, the correct count.
  11. Count-Min Sketch: Querying
      CMS.QUERY terms redis
      Hash1("redis") % 5 = 2, Hash2("redis") % 5 = 4, Hash3("redis") % 5 = 1
      All three counters read 2 because of collisions, so the minimum is 2:
      an overestimate, since "redis" was only incremented once.
  12. Count-Min Sketch: Probability
      • The width determines the error rate.
      • The depth determines the confidence in this error rate.
      For a sketch of 5/3 (width/depth):
      • Error rate: 40%
      • Confidence in this error rate: 99.87%
      99.87% of the time, the counter will be within 40% of the true value.
      For a sketch of 2000/10:
      • Error rate: 0.1%
      • Confidence in this error rate: 99.99%
      99.99% of the time, the counter will be within 0.1% of the true value.
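The increment-and-query walkthrough in the slides above can be sketched in a few lines. This is a minimal, illustrative Count-Min Sketch, not the RedisBloom implementation; the seeded SHA-256 hashing is an assumption made to keep the example self-contained and deterministic:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch: a d x w grid of counters (illustrative sketch)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.grid = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column per row, using a differently seeded hash for each row.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def incrby(self, item, count=1):  # analogous to CMS.INCRBY
        for row, col in self._columns(item):
            self.grid[row][col] += count

    def query(self, item):  # analogous to CMS.QUERY
        # Taking the minimum across rows limits collision over-counting:
        # the estimate can overestimate, but never underestimate.
        return min(self.grid[row][col] for row, col in self._columns(item))

cms = CountMinSketch(width=2000, depth=10)
for word in ["redis", "pets", "cats", "dogs", "redis"]:
    cms.incrby(word)
estimate = cms.query("redis")  # at least 2
```

With a 2000x10 grid, collisions are rare, so the estimate for "redis" will almost always be exactly 2; shrink the width to 5, as in the slides, and overestimates start to appear.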
  13. Bloom Filter
      • Probabilistic
      • Used to test whether an element is possibly in a set
      • Somewhat similar to a Set
      • Fixed memory
      • Trade-off: might give false positives
  14. Bloom Filter
      • Internally it's a fixed-size 1D bit array.
      • It also works with multiple hash functions.
  15. Bloom Filter: Adding member
      BF.ADD stop-words "I'm"
      Hash1("I'm") % 8 = 2, Hash2("I'm") % 8 = 4, Hash3("I'm") % 8 = 1
      Bits 1, 2 and 4 are set to 1.
  16. Bloom Filter: Adding member
      BF.ADD stop-words "lol"
      Hash1("lol") % 8 = 0, Hash2("lol") % 8 = 3, Hash3("lol") % 8 = 1
      Bits 0 and 3 are newly set; bit 1 was already set by "I'm".
  17. Bloom Filter: Checking if exists
      BF.EXISTS stop-words "I'm"
      Hash1("I'm") % 8 = 2, Hash2("I'm") % 8 = 4, Hash3("I'm") % 8 = 1
      All three bits are set, so the answer is "possibly in the set".
  18. Bloom Filter: Checking if exists
      BF.EXISTS stop-words "devoxx"
      Hash1("devoxx") % 8 = 5, Hash2("devoxx") % 8 = 4, Hash3("devoxx") % 8 = 1
      Bit 5 is not set, so "devoxx" is definitely not in the set.
  19. Bloom Filter: Checking if exists
      BF.EXISTS stop-words "Brazil"
      Hash1("Brazil") % 8 = 3, Hash2("Brazil") % 8 = 4, Hash3("Brazil") % 8 = 1
      All three bits happen to be set even though "Brazil" was never added:
      a false positive.
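The add/exists behavior shown in the slides above can be sketched as follows. Again a minimal, illustrative version rather than the RedisBloom implementation; the seeded SHA-256 hashing is an assumption for self-containedness:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fixed-size bit array with k hash functions."""

    def __init__(self, size, hashes):
        self.size = size
        self.hashes = hashes
        self.bits = [0] * size

    def _positions(self, item):
        # One bit position per hash function, each seeded differently.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):  # analogous to BF.ADD
        for pos in self._positions(item):
            self.bits[pos] = 1

    def exists(self, item):  # analogous to BF.EXISTS
        # Any clear bit means "definitely not in the set";
        # all bits set means only "possibly in the set".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(size=8, hashes=3)
bf.add("I'm")
bf.add("lol")
```

With only 8 bits, unrelated words like "Brazil" can easily hit three already-set bits, which is exactly the false-positive case on the slide; real filters size the array from the expected item count and target error rate.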
  20. TopK
      • Probabilistic
      • Used to track the most frequent elements in a data stream
      • Somewhat similar to a Sorted Set and a Count-Min Sketch
      • Fixed memory
      • Trade-off: might give imprecise results
  21. Top K: How it works
      • It's a 1D array of K slots
      • It also works with multiple hash functions
      • It receives a decay factor (here, 0.9)
  22. Top K: Incrementing counter
      TOPK.INCRBY spiking-now "redis" 1
      Hash("redis") % 5 = 1, so slot 1 now holds redis: 1 (decay: 0.9)
  23. Top K: Incrementing counter
      TOPK.INCRBY spiking-now "devoxx" 1
      Hash("devoxx") % 5 = 2, so slot 2 now holds devoxx: 1 (decay: 0.9)
  24. Top K: Incrementing counter
      TOPK.INCRBY spiking-now "devoxx" 1
      Hash("devoxx") % 5 = 2; slot 2 already holds devoxx, so its counter
      goes to devoxx: 2
  25. Top K: Incrementing counter
      TOPK.INCRBY spiking-now "java" 1
      Hash("java") % 5 = 2, but slot 2 is held by devoxx: 2.
      The occupant is decayed: 2 * 0.9 = 1.8, so devoxx drops to 1.
  26. Top K: Incrementing counter
      TOPK.INCRBY spiking-now "java" 1
      Hash("java") % 5 = 2 again; devoxx decays: 1 * 0.9 = 0.9, which rounds
      down to 0, so "java" takes over the slot with java: 1.
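The decay-and-evict behavior walked through on the slides above can be sketched like this. It is a deliberately simplified single-row version of the idea; the real RedisBloom TopK (based on HeavyKeeper) uses several hashed rows and probabilistic decay:

```python
import hashlib

class TopK:
    """Simplified TopK: K slots; a colliding item decays the current
    occupant's counter and takes over the slot once it reaches zero."""

    def __init__(self, k, decay=0.9):
        self.k = k
        self.decay = decay
        self.slots = [None] * k  # each slot: [item, count] or None

    def _slot(self, item):
        digest = hashlib.sha256(item.encode()).hexdigest()
        return int(digest, 16) % self.k

    def incrby(self, item, count=1):  # analogous to TOPK.INCRBY
        i = self._slot(item)
        for _ in range(count):
            slot = self.slots[i]
            if slot is None:
                self.slots[i] = [item, 1]
            elif slot[0] == item:
                slot[1] += 1
            else:
                # Collision: decay the occupant; evict it once it hits zero.
                slot[1] = int(slot[1] * self.decay)
                if slot[1] == 0:
                    self.slots[i] = [item, 1]

    def list(self):  # analogous to TOPK.LIST
        return [slot[0] for slot in self.slots if slot]
```

Replaying the slides with k=1 (so every item collides): "devoxx" counted twice reaches 2; two "java" increments then decay it 2 -> 1 -> 0, and "java" takes the slot.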
  27. Detecting Spikes
      • Calculate the average count of every term over the past three minutes
      • Compare it with the current minute
      Example for "devoxx":
      • current minute = 455 times
      • -1 min = 400 times, -2 min = 350 times, -3 min = 300 times
      • average = 350 times
      • (455 - 350) / 350 = 0.3, a 30% spike
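The spike heuristic described above is just an average and a ratio. A minimal version (the function name is mine, not from the talk); note that the average of 400, 350 and 300 is 350, so the "devoxx" example works out to a 30% increase:

```python
def spike_ratio(current, history):
    """Relative increase of the current window over the average of past windows."""
    baseline = sum(history) / len(history)
    return (current - baseline) / baseline

# "devoxx": 455 mentions this minute vs 400, 350 and 300 in the previous three.
# Baseline is (400 + 350 + 300) / 3 = 350, so the ratio is 105 / 350 = 0.3.
ratio = spike_ratio(455, [400, 350, 300])
```

A trending-topics pipeline would flag any term whose ratio crosses a chosen threshold (say, 0.3) and feed it into the TopK of currently spiking terms.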
  28. Counting Individual Terms
      Period     | Sorted Set | Count-Min Sketch
      per minute | ~2MB       | 156KB
      per hour   | ~120MB     | 9.3MB
      per day    | ~2.8GB     | 223MB
      per month  | ~87GB      | 6.7GB
      per year   | ~1TB       | 80GB
  29. Deduplicating
      Period     | Set    | Bloom Filter
      per minute | ~2.6MB |
      per hour   | ~156MB | 135KB
      per day    | ~3.7GB | 3.2MB
      per month  | ~111GB | 97MB
      per year   | ~1.3TB | 1.2GB
  30. Tracking Spikes
      Period        | Sorted Set | TopK
      per minute    | ~2.6MB     |
      per 5 minutes |            | 284KB
      per hour      | ~156MB     | 3.4MB
      per day       | ~3.7GB     | 82MB
      per month     | ~111GB     | 2.4GB
      per year      | ~1.3TB     | 288GB