Real time Data Pipeline example

Sam Bessalah

April 29, 2014

Transcript

  1. In reality
     • Volume keeps growing
     • Real-life environments are always complicated
     • Privacy, compliance, etc.
     • ETL is a pain, and not always feasible
     • Data is always messy, incoherent, incomplete
     • E.g. dates: “Sat Mar 1 10:12:53 PST”, “2014-03-01 18:12:53 +00:00”, “1393697578”
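The three timestamp formats on the slide illustrate the messiness well. A minimal sketch of normalizing them into one canonical UTC form (the year for the syslog-style string and the PST/PDT offset table are assumptions, since that format carries neither reliably):

```python
from datetime import datetime, timedelta, timezone

def normalize_timestamp(raw: str) -> str:
    """Best-effort normalization of heterogeneous timestamps to UTC ISO 8601."""
    raw = raw.strip()
    # Epoch seconds, e.g. "1393697578"
    if raw.isdigit():
        return datetime.fromtimestamp(int(raw), tz=timezone.utc).isoformat()
    # ISO-like with offset, e.g. "2014-03-01 18:12:53 +00:00"
    try:
        dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z")
        return dt.astimezone(timezone.utc).isoformat()
    except ValueError:
        pass
    # Syslog-style with a timezone abbreviation, e.g. "Sat Mar 1 10:12:53 PST".
    # Abbreviations are ambiguous and the year is missing; mapping a few
    # expected zones and assuming 2014 here — both are assumptions.
    tz_offsets = {"PST": -8, "PDT": -7, "UTC": 0}
    head, _, abbr = raw.rpartition(" ")
    if abbr in tz_offsets:
        dt = datetime.strptime(head + " 2014", "%a %b %d %H:%M:%S %Y")
        dt = dt.replace(tzinfo=timezone(timedelta(hours=tz_offsets[abbr])))
        return dt.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp: {raw!r}")
```

Doing this once, at ingestion, is what makes the later "determine a common format" property achievable.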
  2. No silver bullet, but
     • Tackle the scalability problem upfront
     • Build a resilient, reliable data processing pipeline
     • Enforce auditing, verification, and testing of your system
  3. Problem with this
     • Costly $$$
     • High latency
     • Mostly batch oriented
     • Hard to evolve
  4. Properties of an efficient pipeline
     • Keep data close to the source
     • Work on hot data
     • Avoid sampling; instead summarize or hash
     • Determine a common format, both logical and physical
     • Make access to the data easy for analysis
     • Let the business drive the questions
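"Summarize or hash" instead of sampling can be made concrete with a sketch data structure. Below is a toy K-Minimum-Values estimator (my choice of technique for illustration, not something the deck prescribes): it keeps only the k smallest hash values and estimates the distinct count from the k-th smallest, using bounded memory regardless of stream size:

```python
import hashlib
import heapq

def kmv_distinct_estimate(items, k=256):
    """K-Minimum-Values sketch: keep the k smallest normalized hash
    values ever seen; the k-th smallest drives the distinct-count estimate."""
    heap, in_heap = [], set()   # max-heap via negation; bounded at k entries
    for item in items:
        h = hashlib.sha1(str(item).encode()).digest()
        x = int.from_bytes(h[:8], "big") / 2**64   # map hash into [0, 1)
        if x in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -x)
            in_heap.add(x)
        elif x < -heap[0]:                         # smaller than current k-th smallest
            removed = -heapq.heapreplace(heap, -x)
            in_heap.discard(removed)
            in_heap.add(x)
    if len(heap) < k:
        return len(heap)            # saw fewer than k distinct values: exact count
    return int((k - 1) / -heap[0])  # estimate from the k-th smallest value
```

Unlike a sample, the summary is deterministic for a given stream, and its error (~1/sqrt(k)) is tunable by memory budget.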
  5. KAFKA
     • High-throughput distributed messaging
     • Publish/subscribe model
     • Categorizes messages into topics
     • Persists messages to disk; allows message retention for a specified amount of time
     • Can have multiple producers and consumers
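The properties above can be illustrated with a deliberately tiny in-memory model — this is not Kafka's API, just a sketch of its semantics: topics are append-only logs, and each consumer tracks its own offset, so reading never removes a message and multiple consumers proceed independently:

```python
class ToyBroker:
    """Toy illustration of Kafka's model: a topic is an append-only log;
    consumers hold their own offsets, so reads are non-destructive."""

    def __init__(self):
        self.topics = {}   # topic name -> list of messages (the "log")

    def publish(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

    def consume(self, topic, offset):
        """Return (messages_from_offset, next_offset)."""
        log = self.topics.get(topic, [])
        return log[offset:], len(log)

broker = ToyBroker()
broker.publish("clicks", {"user": "a", "page": "/"})
broker.publish("clicks", {"user": "b", "page": "/buy"})

msgs1, off1 = broker.consume("clicks", 0)   # consumer 1 reads from the start
msgs2, off2 = broker.consume("clicks", 0)   # consumer 2 reads the same log independently
```

In real Kafka the log lives on disk with a configurable retention period, and offsets let a consumer replay history after a crash — the property that makes it a good pipeline backbone.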
  6. STORM
     • Distributed real-time data computation engine
     • Uses a graph of computations called a topology
     • E.g. we can run many topologies to prepare data and run real-time machine learning models
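A Storm topology is a directed graph of a spout (source) feeding bolts (processing steps). A toy single-process sketch of that shape, using Python generators to stand in for streams (the word-count wiring is the classic Storm example, but this code is illustrative, not Storm's API):

```python
def word_spout():
    """Spout: emits a stream of raw sentences."""
    for line in ["real time data", "data pipeline", "real time pipeline"]:
        yield line

def split_bolt(stream):
    """Bolt: splits each incoming sentence into words."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: keeps a running word count (terminal node of the graph)."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire spout -> split -> count, mimicking a topology's DAG
counts = count_bolt(split_bolt(word_spout()))
```

In Storm the same graph runs distributed: each bolt is parallelized across workers and tuples flow continuously, which is what lets a topology feed a real-time model instead of a batch job.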
  7. What breaks at scale
     • Serialisation: an important part of your pipeline. Take your pick: Protocol Buffers, Thrift, Avro. Stay away from schemaless JSON.
     • Compression: Snappy, LZO, …
     • Storage format: look at columnar storage formats like Parquet, easier for OLAP operations.
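The point about schemaless JSON is that nothing rejects a record whose fields have silently drifted. A minimal sketch of the check a schema'd format (Avro/Protobuf/Thrift) gives you for free — the `EVENT_SCHEMA` dict and field names here are hypothetical stand-ins, not any real schema language:

```python
import json

# Hypothetical "event" schema: field name -> required Python type.
# A real pipeline would declare this in Avro/Protobuf/Thrift instead.
EVENT_SCHEMA = {"user_id": int, "page": str, "ts": int}

def validate(record: dict) -> dict:
    """Reject records that drift from the schema — the check that
    schemaless JSON silently skips."""
    for field, ftype in EVENT_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], ftype):
            raise ValueError(f"bad type for {field}: {type(record[field]).__name__}")
    return record

good = validate({"user_id": 42, "page": "/buy", "ts": 1393697578})

# A drifted record: user_id became a string and ts disappeared.
try:
    validate(json.loads('{"user_id": "42", "page": "/buy"}'))
    drift_caught = False
except ValueError:
    drift_caught = True
```

Real schema'd formats add what this sketch lacks: compact binary encoding and controlled schema evolution, so old readers and new writers can coexist across the pipeline.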