at Twitter -- co-author of scala+hadoop library @Scalding -- co-author of realtime analytics system @Summingbird •Former Assistant Professor of Electrical + Computer Engineering at Univ. Florida -- Physics Ph.D. Tuesday, February 11, 14
min c) • (a max b) max c = a max (b max c) • (a or b) or c = a or (b or c) • int addition: (a + b) + c = a + (b + c) • set union: (a u b) u c = a u (b u c) • harmonic sum: 1/(1/a + 1/b) • and vectors: [a1, a2] max [b1, b2] = [a1 max b1, a2 max b2] Example Monoids Tuesday, February 11, 14
0 such that 0+a=a +0=a for all a, they are called monoids. •Many computations are associative, or can be expressed that way. •Lack of associativity increases latency exponentially. Tuesday, February 11, 14
those with 1, read by taking AND. •writing uses boolean OR, that’s a monoid, so we can do this in parallel => lowers latency. Reading also a monoid (AND)! •We can tune false prob by tuning m(bits) and k(hashes), •p~exp(-m/(2n)) for n items, k=0.7m/n Tuesday, February 11, 14
on the site? Actions (look at Tweet x, follow user y, etc...) Users (>10^8) To count Set size, we may need to store the whole set (maybe all users?) for all these pairs of actions (HUGE!) Tuesday, February 11, 14
bucket holds max of ~1/m values, so each bucket estimates size: S/m ~ 2^r Harmonic mean estimates total size ~ 1/(1/m sum(1/(m2^r))) Tuesday, February 11, 14
r, MAX that with existing, read by taking HARMONIC_SUM of all buckets. • writing uses MAX, that’s a monoid, so we can do this in parallel => lowers latency. reading also uses monoid! (HARMONIC_SUM) • We can tune size error by tuning bucket count (m) and bits used to store r. • std. error ~ 1.04/sqrt(m) in HyperLogLog Tuesday, February 11, 14
OR those with 1, read by taking AND. • writing uses boolean OR, that’s a monoid, so we can do this in parallel => lowers latency. Reading also a monoid (AND)! • We can tune false prob by tuning m(bits) and k(hashes), • p~exp(-m/(2n)) for n items, k=0.7m/n in Bloomfilter Tuesday, February 11, 14
hour? 196 hours/week x 52 weeks/ year x 7 years of tweets Users (>10^8) If we make a key for each (user, hour) pair we have 10s of trillions potential keys Tuesday, February 11, 14
ADD those with 1, read by taking MIN. • writing uses numeric ADD, that’s a monoid, so we can do this in parallel => lowers latency. Reading also a monoid (MIN)! • We can tune error: Prob > 1 - delta, error is at most eps * (Total Count). • m = 1/eps, k = log(1/delta) in Count-Min-Sketch Tuesday, February 11, 14
m-dim binary space, read same hashes. Boolean OR Boolean AND HyperLogLog 1-hash into m dimensional real space, read whole space. Numeric MAX Harmonic Sum Count-min-sketch d-hashes onto d non-overlapping m dimensional spaces, read same hashes. Numeric Sum Numeric MIN Tuesday, February 11, 14
always Ordered (bools, reals, integers). •These monoids are all commutative. •The write monoid has: a + b >= a, b •The read monoid has: a + b <= a, b Tuesday, February 11, 14
as Sets, Maps, etc... familiar to programmers => accessibility. • Sampling in complex computations is hard! How to sample correlated events (edges in graphs, communities, etc...) hashing can sidestep but still be on a budget. • Hash-sketches are naturally are Monoids, and thus are highly efficient for map/ reduce or streaming applications. Tuesday, February 11, 14
years old. Lots to do! • There is clearly something general going on here, what is the larger theory than describes all of this? • Sketches can be composed, which allows non-experts to leverage them. • Sketches often have properties amenable to parallelization (Monoids)! Tuesday, February 11, 14