Upgrade to Pro — share decks privately, control downloads, hide ads and more …

What Reading 5 Papers can yield for your Business

αλεx π
March 01, 2014
310

What Reading 5 Papers can yield for your Business

From my Lambda Days talk.

αλεx π

March 01, 2014
Tweet

Transcript

  1. 42

  2. Somehow Somehow: numbers evolve around percentiles clustering evolves around words

    & spam filters regressions are about predicting page views combinations/permutations are not even mentioned
  3. `someone` said “Measure everything” Let’s just say that you can

    efficiently measure locally, merge on stat machines percentiles are cool and everything, but mostly for latency
  4. Count-min sketch Millions of IPs How many requests per IP?

    Millions Object Allocations How many Objects of type N were allocated? Efficient for large amount of “keys” Approximates minimum amount of occurrences
  5. Update Given: K(rows) M (width) table(KxM) ! for i in

    (0..K) h = hash(item, seed) position = h % M table[i][position] += 1 end
  6. Lookup Given: K(rows) M (width) table(KxM) ! for i in

    (0..K) h = hash(item, seed) position = h % M return table[i][position] end
  7. Top-K Hold a CM-sketch Do an update on each event

    occurrence Do a lookup in CM-sketch Add looked up key + value to the top-list If top-list is over capacity, trim the smallest value
  8. Update Given: h={(p1 ,m1 ), . . . , (pB

    ,mB )} ! if p = pi for some i mi = mi + 1 else find min <-> between points merge (pi ,mi ) and (pi+1 ,mi+1 ) replace (pi ,mi ) with merged end
  9. Update Given: h={(p1 ,m1 ), . . . , (pB

    ,mB )} ! if p = pi for some i mi = mi + 1 else mB+1 =p (expanding) if (B + 1 > max-size) find min <-> between points merge (pi ,mi ) and (pi+1 ,mi+1 ) replace (pi ,mi ) with merged end end
  10. Histogram Initially written for streaming decision-trees Non-approximate, will give you

    exact stats Lightweight, since only has to keep N bins Very useful when you don’t know value bounds
  11. Number of distinct items in X To use with any

    of the tricks from `consolidation` section Approximate, although error is known HyperLogLog
  12. Operates on vectors (double) Assign each observation to cluster Take

    Eucledian distance measure, by minimising for each dimension ! Calculate new means to be centroids Repeat until converges Use results to yield a model K-means
  13. Operates on vectors (double) Model = class, variance, cluster mean

    Find the cluster closest to the observation 100% streamable Naive bayes
  14. Minimum amount of memory 100% commutative (Many) more algorithms available

    out there Data will show the way, no golden hammer Conclusions