
Count-Min Sketch, Bloom Filter, TopK: Efficient probabilistic data structures

A Count-Min Sketch, a Bloom Filter, and a TopK might sound fancy, but they’re just smart ways to work with huge amounts of data using very little memory.

In this talk, we’ll explore three powerful probabilistic data structures that trade a bit of accuracy for a lot of speed and scalability. You’ll learn:

What Count-Min Sketch, Bloom Filter, and TopK actually are
How each of them works under the hood
How I used them together to build an efficient version of Trending Topics for Bluesky

By the end, you’ll see how these tools help you process large data streams without blowing up your memory, and how to apply them in real-world systems where being fast matters more than being perfect.

Raphael De Lio

April 17, 2025


Transcript

  1. Building my own Trending Topics
     1. Listen to Bluesky’s JetStream
     2. Parse each message and extract individual terms
     3. Filter noise
     4. Count these terms
     5. Sort them by frequency
     6. Detect spikes in usage
  2. Sorted Set
     • Every minute of data (~22,000 terms) consumes around ~2MB.
     • I also want to keep historical data so that I can track the spikes of trending topics and do further analysis.
     • This translates to (2 × 60 × 24)MB per day (~2.8GB), or ~1TB per year.
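A minimal redis-cli sketch of this sorted-set baseline (the per-minute key name is hypothetical):

     ZINCRBY terms:2025-04-17T10:42 1 "redis"             # count one occurrence in this minute's bucket
     ZRANGE terms:2025-04-17T10:42 0 9 REV WITHSCORES     # top 10 terms for that minute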
  3. That’s when I heard of Probabilistic Data Structures
     “Do you know what probabilistic data structures are?” “No.” “Good for you! You’ll have to prepare a presentation about it!” “I would prefer not to.”
  4. Deterministic vs Probabilistic
     • Deterministic structures always give the exact answer.
     • Probabilistic ones trade some accuracy for speed and space.
     • Think: HashMap (deterministic) vs Bloom Filter (probabilistic).
     • Probabilistic is great when “probably correct” is good enough and we’re dealing with huge streams of data.
  5. Count-Min Sketch
     • A probabilistic data structure included in Redis 8.
     • Used to estimate the frequency of elements in a data stream.
     • Operates space-efficiently, using a fixed amount of memory regardless of data scale.
     • The advantage is that it can consume far less memory by giving up some accuracy.
     “How many times have I been mentioned?”
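Before any increments, the sketch has to be created; a minimal sketch matching the 5×3 toy example in the next slides (key name taken from the slides):

     CMS.INITBYDIM terms 5 3        # width 5, depth 3
     CMS.INCRBY terms redis 1       # add 1 to "redis"; returns the updated estimate
     CMS.QUERY terms redis          # estimated count for "redis"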
  6. Count-Min Sketch
     • Internally it’s a grid (sketch) of w (width) and d (depth).
     • The rows (d) represent the number of hash functions; the columns (w) represent the counter array for each hash function.
     [Diagram: a fixed-size 3×5 grid of counters, all initialized to 0]
  7. Count-Min Sketch: Incrementing
     CMS.INCRBY terms redis 1
     Hash1(“redis”) % 5 = 2
     Hash2(“redis”) % 5 = 4
     Hash3(“redis”) % 5 = 1
     [Diagram: the counters at (row 1, col 2), (row 2, col 4), and (row 3, col 1) go from 0 to 1]
  8. Count-Min Sketch: Incrementing
     CMS.INCRBY terms pets 1
     Hash1(“pets”) % 5 = 0
     Hash2(“pets”) % 5 = 3
     Hash3(“pets”) % 5 = 1
     [Diagram: (row 1, col 0) and (row 2, col 3) become 1; (row 3, col 1) collides with “redis” and becomes 2]
  9. Count-Min Sketch: Incrementing
     CMS.INCRBY terms cats 1
     Hash1(“cats”) % 5 = 3
     Hash2(“cats”) % 5 = 4
     Hash3(“cats”) % 5 = 0
     [Diagram: (row 1, col 3) and (row 3, col 0) become 1; (row 2, col 4) collides with “redis” and becomes 2]
  10. Count-Min Sketch: Incrementing
      CMS.INCRBY terms dogs 1
      Hash1(“dogs”) % 5 = 2
      Hash2(“dogs”) % 5 = 1
      Hash3(“dogs”) % 5 = 3
      [Diagram: (row 2, col 1) and (row 3, col 3) become 1; (row 1, col 2) collides with “redis” and becomes 2]
  11. Count-Min Sketch: Querying
      CMS.QUERY terms dogs
      Hash1(“dogs”) % 5 = 2
      Hash2(“dogs”) % 5 = 1
      Hash3(“dogs”) % 5 = 3
      [Diagram: the three hashed counters hold 2, 1, and 1; the query returns the minimum: 1]
  12. Count-Min Sketch: Querying
      CMS.QUERY terms redis
      Hash1(“redis”) % 5 = 2
      Hash2(“redis”) % 5 = 4
      Hash3(“redis”) % 5 = 1
      [Diagram: all three hashed counters hold 2 because of collisions, so the query returns 2 even though “redis” was only incremented once]
  13. Count-Min Sketch: Probability
      • The width determines the error rate.
      • The depth determines the confidence in this error rate.
      For a sketch of 5/3:
      • Error rate: 40%
      • Confidence in this error rate: 99.87%
      99.87% of the time, the counter will be within 40% of the true value.
      For a sketch of 2000/10:
      • Error rate: 0.1%
      • Confidence in this error rate: 99.99%
      99.99% of the time, the counter will be within 0.1% of the true value.
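To get the larger configuration from this slide, a minimal example (Redis also offers CMS.INITBYPROB to size the sketch from a target error rate instead of explicit dimensions):

      CMS.INITBYDIM terms 2000 10    # width 2000, depth 10
      CMS.INFO terms                 # reports width, depth, and total count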
  15. Bloom Filter
      • A probabilistic data structure included in Redis 8.
      • Used to test whether an element is part of a set.
      • It operates with high space-efficiency, using a fixed amount of memory no matter how many items are added.
      • The advantage is that it can use much less memory than a regular set by allowing occasional false positives.
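A minimal sketch of reserving and using a filter (the error rate and capacity here are illustrative; BF.ADD also auto-creates a filter with default settings if none exists):

      BF.RESERVE stop-words 0.01 1000    # ~1% false-positive rate, ~1,000 expected items
      BF.ADD stop-words "I'm"            # returns 1 if the item was newly added
      BF.EXISTS stop-words "I'm"         # 1 = probably present, 0 = definitely absent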
  16. Bloom Filter
      • Internally it’s a 1D bit array.
      • It also works with multiple hash functions.
      [Diagram: a fixed-size 8-bit array, all bits 0]
  17. Bloom Filter: Adding member
      BF.ADD stop-words “I’m”
      Hash1(“I’m”) % 8 = 2
      Hash2(“I’m”) % 8 = 4
      Hash3(“I’m”) % 8 = 1
      [Diagram: bits 1, 2, and 4 are set to 1]
  18. Bloom Filter: Adding member
      BF.ADD stop-words “lol”
      Hash1(“lol”) % 8 = 0
      Hash2(“lol”) % 8 = 3
      Hash3(“lol”) % 8 = 1
      [Diagram: bits 0 and 3 are set to 1; bit 1 was already set]
  19. Bloom Filter: Checking if exists
      BF.EXISTS stop-words “I’m”
      Hash1(“I’m”) % 8 = 2
      Hash2(“I’m”) % 8 = 4
      Hash3(“I’m”) % 8 = 1
      [Diagram: bits 2, 4, and 1 are all set, so the filter answers 1: “I’m” is probably in the set]
  20. Bloom Filter: Checking if exists
      BF.EXISTS stop-words “devoxx”
      Hash1(“devoxx”) % 8 = 5
      Hash2(“devoxx”) % 8 = 4
      Hash3(“devoxx”) % 8 = 1
      [Diagram: bit 5 is 0, so the filter answers 0: “devoxx” is definitely not in the set]
  21. Bloom Filter: Checking if exists
      BF.EXISTS stop-words “Brazil”
      Hash1(“Brazil”) % 8 = 3
      Hash2(“Brazil”) % 8 = 4
      Hash3(“Brazil”) % 8 = 1
      [Diagram: bits 3, 4, and 1 were all set by other members, so the filter answers 1 even though “Brazil” was never added: a false positive]
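Note the asymmetry this last slide demonstrates: a 0 answer is guaranteed correct, while a 1 answer is only probably correct. In redis-cli terms:

      BF.EXISTS stop-words "devoxx"    # 0: definitely never added
      BF.EXISTS stop-words "Brazil"    # 1: a false positive here, since "Brazil" was never added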
  22. Top K
      • A probabilistic data structure included in Redis 8.
      • Used to track the most frequent items in a stream without storing everything.
      • It operates with fixed memory, no matter how much data flows through.
      • The advantage is that it can keep only the top items while using much less memory than a full sorted set, at the cost of approximate results.
      Step 3: Detecting spikes
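A minimal sketch of creating the structure (the width/depth shown are illustrative; the decay of 0.9 matches the walkthrough below):

      TOPK.RESERVE spiking-now 10 8 7 0.9    # track the top 10 items; width 8, depth 7, decay 0.9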
  23. Top K: How it works
      • It’s a 1D array of K slots.
      • It also works with multiple hash functions.
      • It applies a decay factor.
      [Diagram: a 5-slot array, indices 0–4; decay: 0.9]
  24. Top K: Incrementing counter
      TOPK.INCRBY spiking-now “redis” 1
      Hash(“redis”) % 5 = 1
      [Diagram: slot 1 now holds redis: 1; decay: 0.9]
  25. Top K: Incrementing counter
      TOPK.INCRBY spiking-now “devoxx” 1
      Hash(“devoxx”) % 5 = 2
      [Diagram: slot 2 now holds devoxx: 1; slot 1 still holds redis: 1]
  26. Top K: Incrementing counter
      TOPK.INCRBY spiking-now “devoxx” 1
      Hash(“devoxx”) % 5 = 2
      [Diagram: slot 2 updates to devoxx: 2]
  27. Top K: Incrementing counter
      TOPK.INCRBY spiking-now “java” 1
      Hash(“java”) % 5 = 2
      [Diagram: “java” collides with devoxx in slot 2; devoxx’s count decays: 2 × 0.9 = 1.8, so devoxx keeps the slot, now at 1]
  28. Top K: Incrementing counter
      TOPK.INCRBY spiking-now “java” 1
      Hash(“java”) % 5 = 2
      [Diagram: another collision in slot 2; devoxx decays again: 1 × 0.9 = 0.9, dropping below 1, and java takes the slot with java: 1]
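To read the structure back, a minimal sketch (all three commands are part of the TopK API):

      TOPK.INCRBY spiking-now java 1     # returns any item evicted from the top-k, or nil
      TOPK.LIST spiking-now WITHCOUNT    # current top-k items with approximate counts
      TOPK.QUERY spiking-now devoxx      # 1 if the item is currently in the top-k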
  29. Detecting Spikes
      • Calculate the average count of every term over the past three minutes.
      • Compare it with the current minute.
      “devoxx”:
      • current min = 455 times
      • -1 min = 400 times
      • -2 min = 350 times
      • -3 min = 300 times
      avg: 350 times
      (455 − 350) / 350 = 0.3, a 30% spike over the recent average
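A minimal sketch of the lookups behind this calculation, assuming one Count-Min Sketch per minute bucket (key names hypothetical); the averaging and the spike ratio are then computed in application code:

      CMS.QUERY terms:10:42 devoxx    # current minute → 455
      CMS.QUERY terms:10:41 devoxx    # -1 min → 400
      CMS.QUERY terms:10:40 devoxx    # -2 min → 350
      CMS.QUERY terms:10:39 devoxx    # -3 min → 300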
  30. Concluding: Memory comparison for one month of data streamed in 1-minute buckets

      Purpose               | Probabilistic (approx.)    | Deterministic (approx.)
      ----------------------|----------------------------|------------------------
      Counting words        | ~2.3GB (Count-Min Sketch)  | ~87GB (Sorted Sets)
      Filtering stop terms  | ~2KB (Bloom Filter)        | ~59KB (Set)
      Ranking spikes        | ~3KB (TopK)                | ~30MB (Sorted Set)
      Tracking unique words | ~36MB (Set)                | –
      Total                 | ~2.34GB                    | ~87GB