Using Monoids for Large Scale Business Stats

Talk given at HackersAvenue, October 2017 - Data Engineering.

References for the presentation can be found at https://github.com/ashwanthkumar/large-scale-business-stats-talk

Ashwanth Kumar

October 07, 2017

Transcript

  1. overview
     - Stats for Batch Jobs
     - Stats for Streaming Workloads
     - Generalizing aggregations with Monoids
     - Abel
     - Some cool logos!

  2. stats as map-reduce jobs
     Used Scalding (from Twitter)
     + Simple to express aggregations
     - Have to include intermediary data in the output just for the stats
     - Have to think about writing stats after writing the production code
     - Not updated at least until the next run (not "realtime")

  3. stats for streaming workloads
     Push computed rollups to InfluxDB
     + Realtime
     + Allows arbitrary functions as rollups - code as config
     - Not distributable, since it allows arbitrary functions
     - Stats emission and rollups live in separate places, making them difficult to test and keep in sync

  4. stats for map-reduce jobs
     - Have to include intermediary data in the output just for the stats
     - Have to think about writing stats after writing the production code
     - Not updated at least until the next run (not "realtime")
     stats for streaming workloads
     - Not distributable, since it allows arbitrary functions
     - Stats emission and rollups live in separate places, making them difficult to test and keep in sync

  5. monoids
     An operation is considered a monoid if:
       (x . y) . z = x . (y . z)        (associativity, aka semigroup)
       identity . x = x . identity = x  (identity)

     trait Semigroup[T] { def plus(left: T, right: T): T }
     trait Monoid[T] extends Semigroup[T] { def zero: T }

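     A minimal sketch (not from the deck) of a concrete instance of these traits, using addition on Long:

       // Addition on Long forms a monoid: plus is associative, 0 is the identity.
       object LongSum extends Monoid[Long] {
         def plus(left: Long, right: Long): Long = left + right
         def zero: Long = 0L
       }
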
  6. monoids
     A monoid can also be commutative: x . y = y . x
     The commutative property of monoids is exploited for parallel processing on large datasets.

  7. monoids - count / sum
     sum is associative:
       sum(sum(2, 6), 6) == sum(2, sum(6, 6))
       sum(8, 6) == sum(2, 12)
       14 == 14

  8. monoids - average
     An average of averages is not an average, i.e., avg is not associative:
       avg(avg(2, 6), 6) != avg(2, avg(6, 6))
       avg(4, 6) != avg(2, 6)
       5 != 4

  9. monoids - average
     But average can be made associative if we carry the total & count individually:

     case class Average(total: Double, count: Long) {
       def toAvg: Double = total / count
     }

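     The deck shows only the case class; a hedged sketch of a matching Monoid instance (plus and zero here are my assumption):

       object AverageMonoid extends Monoid[Average] {
         // Merging partial averages adds totals and counts, which is
         // associative (and commutative); toAvg is computed only at the end.
         def plus(left: Average, right: Average): Average =
           Average(left.total + right.total, left.count + right.count)
         def zero: Average = Average(0.0, 0L)
       }
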
  10. parallel aggregations
      [Diagram: a stream of numbers is split into three partitions; each partition is summed independently into Σ_A, Σ_B and Σ_C, and the partial sums are combined: Σ = Σ_A + Σ_B + Σ_C]

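      A minimal sketch of the same idea in Scala (assuming the Monoid trait above, and Scala 2.12-style parallel collections):

        def parallelAggregate[T](data: Seq[T], partitions: Int)(m: Monoid[T]): T =
          data
            .grouped(math.max(1, data.size / partitions)).toSeq // split input into chunks
            .par                                                // one reducer per chunk
            .map(_.foldLeft(m.zero)(m.plus))                    // partial sums: Σ_A, Σ_B, Σ_C ...
            .foldLeft(m.zero)(m.plus)                           // combine: Σ = Σ_A + Σ_B + Σ_C

      Associativity is what makes the per-chunk grouping safe; commutativity additionally lets the partial results arrive in any order.
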
  11. monoids - approximates
      - While sum is accurate, distinct counts in constant memory are not
      - Approximate structures like HyperLogLog can find unique counts in constant memory (with a known error bound)
      - Two or more HLLs can be merged, and the merge is both associative and commutative, so it can be expressed as a monoid

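      The deck later mentions twitter/algebird for HLL; a short sketch of merging sketches with it (the values and the choice of 12 bits are illustrative):

        import com.twitter.algebird.HyperLogLogMonoid

        val hll = new HyperLogLogMonoid(12) // 2^12 registers, roughly 1.6% standard error

        // Build one sketch per partition from the raw values...
        val partA = hll.sum(Seq("u1", "u2", "u3").map(v => hll.create(v.getBytes("UTF-8"))))
        val partB = hll.sum(Seq("u2", "u4").map(v => hll.create(v.getBytes("UTF-8"))))

        // ...then merge them; plus is associative and commutative.
        val merged = hll.plus(partA, partB)
        println(merged.approximateSize.estimate) // ~4 distinct values
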
  12. learnings so far
      - Stats can naturally be expressed as monoids
      - Since monoids are associative (and some are also commutative), we can exploit them for massively parallel processing
      - We need our stats in real time, even if they are approximate for some metrics, as long as the error bounds are known

  13. abel
      - Written in Scala
      - Backed by RocksDB
      - Uses twitter/algebird for HLL
      - Uses Kafka for stats delivery
      - Consumes stats in (near) realtime
      - Exposes aggregations over HTTP
      - Crunches 1M events in less than 15 seconds on 1 machine

  14. abel internals
      Metric = Key * Aggregate (Semigroup)

      case class Metric[T <: Aggregate[T]](key: Key, value: T with Aggregate[T])

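      The Aggregate trait itself is not shown in the deck; a plausible minimal shape (purely an assumption) is an F-bounded semigroup, so each value knows how to merge with its own type:

        // Hypothetical sketch of the trait that Metric's type bound implies.
        trait Aggregate[T <: Aggregate[T]] {
          def plus(other: T): T
        }

        // Illustrative instance: a simple counter.
        case class Count(value: Long) extends Aggregate[Count] {
          def plus(other: Count): Count = Count(value + other.value)
        }
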
  15. abel internals
      Key = Name * Tags * Granularity * Timestamp

      case class Time(time: Long, granularity: Long)
      case class Key(name: String, tags: SortedSet[String], time: Time = Time.Forever)

  16. abel internals
      Let's find the unique count of a UPC occurring per site and across all sites, at the granularities of every hour, every day and overall. That would need 6 metrics per record (as sketched below).

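      Where the 6 comes from: each record fans out to the cross product of 2 tag scopes (this site, all sites) x 3 time granularities (hour, day, forever). Key and Time are from the slide above; the helper, metric name and tag spelling are illustrative:

        import scala.collection.immutable.SortedSet

        val HourMs = 3600 * 1000L
        val DayMs  = 24 * HourMs

        // The UPC itself goes into the HLL value; the key only carries scope + time.
        def keysFor(site: String, ts: Long): Seq[Key] = {
          val scopes = Seq(SortedSet("site:" + site), SortedSet.empty[String]) // per site, all sites
          val times = Seq(
            Time(ts - ts % HourMs, HourMs), // this hour
            Time(ts - ts % DayMs, DayMs),   // this day
            Time.Forever                    // overall
          )
          for (tags <- scopes; time <- times) yield Key("unique-upcs", tags, time) // 2 x 3 = 6
        }
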
  17. distributed abel
      - Peer-to-peer system built using Suuchi
      - Kafka consumer auto-rebalances the partitions across instances
      - Uses the scatter-gather primitive in Suuchi to perform query-time reduction of the metrics before serving them to users (sketched after the next slide)

  18. distributed abel architecture
      [Diagram: a DNS A record for stats.service.ix resolves to three nodes (1.1.1.1, 1.1.1.2, 1.1.1.3) for DNS-based load balancing; incoming metrics such as Count("a", 1L), Unique("a", 1L), Count("b", 1L) and Count("c", 1L) are routed across the nodes, and each node merges its metrics with monoid.plus]

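      A hedged sketch of the gather step (this is not Suuchi's actual API; fetchPartial stands in for a node RPC): every node returns its partial aggregate for a key, and the partials are folded with the monoid:

        import scala.concurrent.{ExecutionContext, Future}
        import ExecutionContext.Implicits.global

        def scatterGather[T](nodes: Seq[String], key: Key)
                            (fetchPartial: (String, Key) => Future[T])
                            (m: Monoid[T]): Future[T] =
          Future.traverse(nodes)(node => fetchPartial(node, key)) // scatter the query
            .map(_.foldLeft(m.zero)(m.plus))                      // gather: merge partials
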
  19. rocksdb
      - Open sourced by Facebook
      - Fast persistent KV store
      - Server workloads
      - Embeddable
      - Optimized for SSDs
      - Fork of LevelDB
      - Modelled after BigTable
      - LSM-tree based SST files
      - Written in C++

  20. rocksdb @indix
      - Serving our API in production for 3+ years
      - Search on hierarchical documents - dynamic fields didn't scale well on Solr
      - Brand / Store / Category counts for a filter
      - Price History Service - more than a billion prices, served online to REST queries

  21. rocksdb @indix
      - Stats (as monoids) storage system - all we wanted was approximate aggregates in real time
      - HTML Archive System - stores ~120TB of URL- and timestamp-indexed HTML pages
      - Real-time scheduler for our crawlers - figures out which 20 URLs to crawl now out of 3+ billion URLs; helps the crawler fetch 20+ million URLs every day

  22. recursive reduction
      - sum / multiplication
      - (sorted) top-K elements
      - operations on a graph, e.g. link reach on the Twitter graph
      The function should be associative and, optionally, commutative (a top-K example follows).

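      A sketch (my own, not from the deck) of a top-K merge that is associative and commutative, and hence safe to reduce recursively:

        case class TopK(k: Int, items: List[Int]) {
          // Keeping the K largest of the combined lists gives the same result
          // regardless of how the merges are grouped or ordered.
          def plus(other: TopK): TopK =
            TopK(k, (items ++ other.items).sorted(Ordering[Int].reverse).take(k))
        }

        TopK(3, List(9, 5, 1)).plus(TopK(3, List(8, 7))) // TopK(3, List(9, 8, 7))
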
  23. EOF