almost every distributed computing problem
‣ MapReduce over your raw data is flexible but slow
‣ Hadoop is not optimized for query latency
‣ To optimize queries, we need a query layer
optimize for?
  • Revenue over time broken down by demographic
  • Top publishers by clicks over the last month
  • Number of unique visitors broken down by any dimension
  • Not dumping the entire dataset
  • Not examining individual events
sourced in Oct. 2012
‣ Growing Community
  • ~40 contributors from many different organizations
‣ Designed for low latency ingestion and aggregation
  • Optimized for the types of queries we were trying to make
2011-01-01T01:01:35Z bieberfever.com google.com Male USA 0 0.65
2011-01-01T01:03:63Z bieberfever.com google.com Male USA 0 0.62
2011-01-01T01:04:51Z bieberfever.com google.com Male USA 1 0.45
...
2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53
revenue
2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70
2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18

2011-01-01T02:00:00Z ultratrimfast.com google.com Male UK 1953 17 17.31
2011-01-01T02:00:00Z bieberfever.com google.com Male UK 3194 170 34.01
‣ Shard data by time
‣ Immutable chunks of data called “segments”
Segment 2011-01-01T01/2011-01-01T02
Segment 2011-01-01T02/2011-01-01T03
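The step from raw events to hourly rows sharded by time can be sketched in a few lines of Python. This is a minimal illustration, not Druid's ingestion code. The column meanings (timestamp, publisher, advertiser, gender, country, click, price) and the mapping to metrics (impressions as an event count, clicks as the sum of the click flag, revenue as the sum of price) are assumptions, since the slide headers are truncated. The function names are hypothetical, and only a few of the sample rows are used.

```python
from collections import defaultdict

# A few raw sample events. Assumed column order:
# (timestamp, publisher, advertiser, gender, country, click, price)
RAW_EVENTS = [
    ("2011-01-01T01:01:35Z", "bieberfever.com", "google.com", "Male", "USA", 0, 0.65),
    ("2011-01-01T01:04:51Z", "bieberfever.com", "google.com", "Male", "USA", 1, 0.45),
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.87),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 0, 0.99),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", "google.com", "Female", "UK", 1, 1.53),
]


def truncate_to_hour(timestamp: str) -> str:
    """Truncate a timestamp like 2011-01-01T01:01:35Z to the start of its hour."""
    return timestamp[:13] + ":00:00Z"


def roll_up(events):
    """Group events by (hour, dimensions) and aggregate the metrics."""
    totals = defaultdict(lambda: [0, 0, 0.0])  # impressions, clicks, revenue
    for ts, publisher, advertiser, gender, country, click, price in events:
        key = (truncate_to_hour(ts), publisher, advertiser, gender, country)
        row = totals[key]
        row[0] += 1        # assumed: one event is one impression
        row[1] += click    # assumed: click is a 0/1 flag
        row[2] += price    # assumed: revenue is the sum of price
    return {k: (imp, clk, round(rev, 2)) for k, (imp, clk, rev) in totals.items()}


def shard_by_hour(rolled):
    """Place each rolled-up row into the segment for its hour."""
    segments = defaultdict(list)
    for key, metrics in rolled.items():
        segments[key[0]].append(key[1:] + metrics)
    return dict(segments)


ROLLED = roll_up(RAW_EVENTS)
HOURLY_SEGMENTS = shard_by_hour(ROLLED)
```

Because each output row depends only on events from its own hour, a segment never needs to change once that hour has been processed.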
Read consistency
‣ One thread scans one segment
‣ Multiple threads can access the same underlying data
‣ Segments are sized so that a scan completes in milliseconds
‣ Simplifies distribution & replication
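The "one thread scans one segment" model can be illustrated with a small Python sketch. It assumes each segment is an immutable tuple of (publisher, clicks) rows keyed by its interval. The function names are hypothetical and this is not Druid's actual query engine.

```python
from concurrent.futures import ThreadPoolExecutor

# Immutable segments keyed by interval; click counts are taken from the sample rows.
CLICK_SEGMENTS = {
    "2011-01-01T01/2011-01-01T02": (("ultratrimfast.com", 25), ("bieberfever.com", 42)),
    "2011-01-01T02/2011-01-01T03": (("ultratrimfast.com", 17), ("bieberfever.com", 170)),
}


def scan_segment(rows):
    """Scan a single segment and return a partial aggregate."""
    partial = {}
    for publisher, clicks in rows:
        partial[publisher] = partial.get(publisher, 0) + clicks
    return partial


def merge_partials(partials):
    """Combine per-segment partial aggregates into one result."""
    merged = {}
    for partial in partials:
        for publisher, clicks in partial.items():
            merged[publisher] = merged.get(publisher, 0) + clicks
    return merged


def clicks_by_publisher(segments, max_workers=4):
    """Scan every segment on its own worker thread, then merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(scan_segment, segments.values()))
    return merge_partials(partials)


CLICK_TOTALS = clicks_by_publisher(CLICK_SEGMENTS)
```

No locks are needed here because the segments are never modified after creation, so concurrent readers always see the same data.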
revenue
2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70
2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18
‣ Scan/load only what you need
‣ Compression!
‣ Indexes!
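The three bullets above can be sketched together in Python: storing each column separately so a query touches only the columns it needs, dictionary-encoding a string column as a simple form of compression, and building an index from each value to its row numbers. The column names are assumptions (the slide header is truncated), the helper names are hypothetical, and the set-based index is a simplified stand-in for a real bitmap index.

```python
# Assumed column names; only "revenue" is visible in the slide header.
COLUMN_NAMES = ("timestamp", "publisher", "advertiser", "gender",
                "country", "impressions", "clicks", "revenue")

ROWS = [
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", "google.com", "Male", "USA", 1800, 25, 15.70),
    ("2011-01-01T01:00:00Z", "bieberfever.com", "google.com", "Male", "USA", 2912, 42, 29.18),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMN_NAMES)}


def dictionary_encode(values):
    """Replace repeated strings with small integer ids (a simple compression)."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in values]


def build_index(values):
    """Map each distinct value to the set of row numbers that contain it."""
    index = {}
    for row_number, value in enumerate(values):
        index.setdefault(value, set()).add(row_number)
    return index


def sum_where(columns, index, dimension_value, metric):
    """Sum one metric column over only the rows the index selects."""
    return sum(columns[metric][i] for i in index.get(dimension_value, ()))


COLUMNS = to_columns(ROWS)
PUBLISHER_DICT, PUBLISHER_IDS = dictionary_encode(COLUMNS["publisher"])
PUBLISHER_INDEX = build_index(COLUMNS["publisher"])
BIEBER_CLICKS = sum_where(COLUMNS, PUBLISHER_INDEX, "bieberfever.com", "clicks")
```

A filtered sum over `clicks` reads only the publisher index and the clicks column; the other six columns are never touched.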
Druid gave us arbitrary data exploration & fast queries
‣ What about data freshness?
  • Batch loading is slow!
  • We need “real-time”
  • Alerts, operational monitoring, etc.
processor: one event at a time
‣ We can already process our data using Hadoop MapReduce
‣ Let’s translate that to streams
‣ “Load” operations stream data from Kafka
‣ “Map” operations are already stream-friendly
‣ “Reduce” operations can be windowed with partitioned state
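The translation from MapReduce to streams can be sketched with plain Python generators. This is only an illustration of the idea and uses neither the Storm nor the Kafka API: `load` stands in for a Kafka consumer, the hourly window and the CRC-based partitioning are assumptions, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2

# (timestamp, publisher, click) taken from the raw sample events.
STREAM_EVENTS = [
    ("2011-01-01T01:01:35Z", "bieberfever.com", 0),
    ("2011-01-01T01:04:51Z", "bieberfever.com", 1),
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", 0),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", 0),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", 1),
]


def load(events):
    """'Load': yield events one at a time (stand-in for a Kafka consumer)."""
    yield from events


def map_events(stream):
    """'Map': a per-event transformation; here, assign each event to its hourly window."""
    for ts, publisher, click in stream:
        yield ts[:13], publisher, click


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Route a key to a partition deterministically."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def windowed_reduce(stream, num_partitions=NUM_PARTITIONS):
    """'Reduce': keep per-partition state keyed by (window, publisher)."""
    state = [defaultdict(int) for _ in range(num_partitions)]
    for window, publisher, click in stream:
        state[partition_for(publisher, num_partitions)][(window, publisher)] += click
    return state


def collect(state):
    """Flatten the partitioned state into one dict for inspection."""
    merged = {}
    for partition in state:
        merged.update(partition)
    return merged


PARTITIONED_STATE = windowed_reduce(map_events(load(STREAM_EVENTS)))
WINDOWED_CLICKS = collect(PARTITIONED_STATE)
```

Each key always lands in the same partition, so the reduce state for a publisher lives in exactly one place and partitions can be processed independently.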
within seconds
‣ Systems are fully decoupled
‣ Brief processing delays during maintenance
  • Because we need to restart Storm topologies
  • But query performance and availability are not affected
‣ Batch re-processing runs for all data older than a few hours
‣ Batch segments replace real-time segments in Druid
‣ Query broker merges results from both systems
“Fixed up,” immutable, historical data: by Hadoop
Realtime data: by Storm & Realtime Druid
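The replace-and-merge behavior described above can be sketched as follows. This is a simplified model, not the Druid broker: segments are plain dicts keyed by hour, the click counts for the real-time side are hypothetical, and the function names are made up for illustration.

```python
# Real-time segments: available within seconds, possibly slightly inaccurate.
REALTIME_SEGMENTS = {
    "2011-01-01T01": {"bieberfever.com": 40, "ultratrimfast.com": 25},  # hypothetical counts
    "2011-01-01T03": {"bieberfever.com": 5},                            # hypothetical counts
}

# Batch segments: "fixed up" by re-processing, only for older intervals.
BATCH_SEGMENTS = {
    "2011-01-01T01": {"bieberfever.com": 42, "ultratrimfast.com": 25},
}


def choose_segments(batch, realtime):
    """For each interval, prefer the batch segment when one exists."""
    chosen = dict(realtime)
    chosen.update(batch)  # a batch segment replaces the real-time one
    return chosen


def query_clicks(batch, realtime):
    """Merge per-publisher clicks across the chosen segments."""
    totals = {}
    for segment in choose_segments(batch, realtime).values():
        for publisher, clicks in segment.items():
            totals[publisher] = totals.get(publisher, 0) + clicks
    return totals


MERGED_CLICKS = query_clicks(BATCH_SEGMENTS, REALTIME_SEGMENTS)
```

Here the hour 01 answer comes from the batch segment and the hour 03 answer from the real-time segment, so a single query sees corrected history and fresh data together.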
‣ Kafka provides fast, reliable event transport
‣ Storm and Hadoop clean and prepare data for Druid
‣ Druid handles queries and manages the serving layer
‣ “Real-time Analytics Data Stack”
‣ …a.k.a. RAD Stack
‣ https://metamarkets.com/2014/building-a-data-pipeline/