On The Social Impedance Mismatch in Data Storage

Data storage affects the way systems are modelled. This slide set gives a short justification of why the data storage and processing layer needs an overhaul to overcome its current limitations.

Martin Scholl

February 27, 2012

Transcript

  1. I have a suspicion: Data Store Software is not social
     Martin Scholl <martin@infinipool.com> @zeit_geist
  2. • Notes
       • are a fact and can be copied.
       • consist of immutable & absolute entities
       • have a fixed beginning and ending
       • music sheet = music essence; “Music’s NoSQL DB”
     • Music
       • Making is a process. You can record but not copy music.
       • flows with the rhythm
       • lives by the interactions
       • is uniquely determined in space and time: the Music’s context
  3. [Diagram: four Webserver boxes, four API-Endpoint boxes, three Graph-DB boxes and one Flock-DB box. Brace labels: “1. Social Interaction: Data + Context” and “2. just Data”.]
  4. • Data Stores
       • store facts.
       • facts are fixed and absolute
       • facts are uniquely determined by key / ID
       • Data Stores are the source of “truth”
       • contain what has happened.
  5. • Notes
       • are a fact and can be copied.
       • consist of immutable & absolute entities
       • have a fixed beginning and ending
       • music sheet = music essence; “Music’s NoSQL DB”
     • Data Stores
       • store facts.
       • facts are fixed and absolute
       • facts are uniquely determined by key / ID
       • Data Stores are the source of “truth”
       • contain what has happened.
  6. Lose Information w/ your fav. Data Store
     • Data in a Data Store gets de-contextualized.
     • You don’t get to know the origin of data but just the fact itself.
     • irrecoverable information loss!
     • There is a severe social impedance mismatch
  7. Lose Information w/ your fav. Data Store
     • Data in a Data Store gets de-contextualized.
     • You don’t get to know the origin of data but just the fact itself.
     • irrecoverable information loss!
     • There is a severe social impedance mismatch
  8. [Diagram: four Webserver boxes, four API-Endpoint boxes, three Graph-DB boxes, one Flock-DB box and three Context-Engine boxes. Brace labels: “Data”, “Data + Context” and “Data + Context Logic”.]
  9. Context Engine Requirements
     • must have a flexible programming model
     • must be scalable and resilient
     • must be able to integrate and process data from high velocity data sources
  10. Nathan Marz’s Storm
      • has a flexible programming model
      • is scalable and resilient
      • integrates and processes data from high velocity data sources
  11. Nathan Marz’s Storm
      • implemented in Clojure + Java
      • was Backtype proprietary
      • OpenSource’d Sep 2011
      • is licensed under the Eclipse Public License
      • http://github.com/nathanmarz/storm
  12. What does Storm do?
      • it’s like M/R but for real-time computation
      • works over streams
      • communicates tuples in a cluster
      [Diagram: topology of one Spout and five Bolts]
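The spout-and-bolt model above can be illustrated with a minimal single-process sketch. It is not Storm's API: the function names (`click_spout`, `filter_bolt`, `count_bolt`) and the `(user, action)` tuple layout are hypothetical, and plain Python generators stand in for a cluster. It only shows the idea of a source emitting tuples into a chain of processing steps.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str]  # (user, action): hypothetical tuple layout


def click_spout(events: Iterable[Event]) -> Iterator[Event]:
    """Spout stand-in: the source of the stream, emits one tuple at a time."""
    for event in events:
        yield event


def filter_bolt(stream: Iterable[Event], action: str) -> Iterator[Event]:
    """Bolt stand-in: passes on only tuples carrying the given action."""
    for user, act in stream:
        if act == action:
            yield (user, act)


def count_bolt(stream: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Bolt stand-in: emits a running count per user after every tuple."""
    counts: Counter = Counter()
    for user, _ in stream:
        counts[user] += 1
        yield (user, counts[user])


events = [("alice", "click"), ("bob", "view"), ("alice", "click"), ("bob", "click")]
# Wire the "topology": spout -> filter bolt -> count bolt.
running = list(count_bolt(filter_bolt(click_spout(events), "click")))
```

Because each step consumes tuples as they arrive, results are available continuously rather than after a batch run, which is the contrast with M/R drawn on the slide.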
  13. What does Storm do?
      • Local Development mode or distributed
      • Starts JVMs (workers)
      • at-least-once message processing guarantee
      • Storm’s contributions: scalability, resiliency and processing guarantee
      [Diagram: topology of one Spout and five Bolts]
  14. Some Use-Cases
      • Analysis on Event-Streams:
        • Filtering, Counting, Aggregation
        • Monitoring, etc.
      • Parallel and Distributed RPC
      • Contextualization
      [Diagram: topology of one Spout and three Bolts]
  15. Message Processing Guarantee
      [Diagram: Spout, Acker and three Bolts. Acker table: ID = 42, V = 40^4. Labels: Tuple(id=40), Tuple(id=4).]
  16. Message Processing Guarantee
      [Diagram: Spout, Acker and three Bolts. Acker table: ID = 42, V = 40 ^ 4. Labels: (id=40), Tuple(id=40), Tuple(id=4).]
  17. Message Processing Guarantee
      [Diagram: Spout, Acker and three Bolts. Acker table: ID = 42, V = 4. Labels: (id=4), Tuple(id=40), Tuple(id=4).]
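One reading of the three "Message Processing Guarantee" diagrams is that the Acker keeps a single value V per root ID and XORs every tuple id into it, once when the tuple is emitted and once when it is acknowledged. V returns to 0 exactly when every emitted tuple has been acked. The sketch below replays the slide's numbers (root ID 42, tuple ids 40 and 4) under that assumption. The `Acker` class and its method names are hypothetical, not Storm's API, and the tiny ids are taken from the slides for readability; Storm's documentation describes random 64-bit ids.

```python
from typing import Dict


class Acker:
    """Tracks one XOR checksum per root tuple (illustrative stand-in)."""

    def __init__(self) -> None:
        self.pending: Dict[int, int] = {}  # root id -> XOR of outstanding tuple ids

    def emit(self, root_id: int, tuple_id: int) -> None:
        """Register a newly emitted tuple by XOR-ing its id into V."""
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id: int, tuple_id: int) -> bool:
        """XOR the id in again; return True once the whole tree is processed."""
        value = self.pending[root_id] ^ tuple_id
        if value == 0:
            del self.pending[root_id]  # nothing outstanding: fully processed
            return True
        self.pending[root_id] = value
        return False


acker = Acker()
acker.emit(42, 40)
acker.emit(42, 4)
value_before = acker.pending[42]      # V = 40 ^ 4
first_done = acker.ack(42, 40)        # tuple 40 acked, tuple 4 still outstanding
value_mid = acker.pending[42]         # V = 4
second_done = acker.ack(42, 4)        # V = 0, root 42 is complete
```

The appeal of this bookkeeping is constant memory per root tuple, no matter how many downstream tuples it fans out into. If V has not reached 0 within a timeout, the spout can replay the root tuple, which is what yields an at-least-once guarantee.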
  18. Resilience
      • a centralized component (Nimbus) coordinates deployment and starts workers
      • Workers run distributed & are supervised
      • Online State is persisted into Zookeeper
      • Every component may fail
      [Diagram: Nimbus, three ZK nodes and two Workers]
  19. Use-Case: Online A/B Testing
      • Contextualization: determine Clique (A | B) online
      • Reconfigure A/B-Test really quickly
      [Diagram labels: Spout, Clickstream, ∑, New Configuration, User, User]
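The slide does not say how the clique is determined, so the following is only one possible sketch: each user is hashed into a stable bucket in [0, 1), and a configurable share decides which buckets belong to A. Reconfiguring the test then amounts to changing one number while the stream keeps flowing. The class `AbAssigner`, its methods and the `share_a` parameter are all hypothetical names introduced for illustration.

```python
import hashlib


class AbAssigner:
    """Assigns users to clique A or B from a stable hash (illustrative only)."""

    def __init__(self, share_a: float) -> None:
        self.share_a = share_a  # fraction of users that should land in A

    def reconfigure(self, share_a: float) -> None:
        """Apply a new configuration without restarting anything."""
        self.share_a = share_a

    def clique(self, user: str) -> str:
        """Map the user to a bucket in [0, 1) and compare it with the share."""
        digest = hashlib.sha256(user.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        return "A" if bucket < self.share_a else "B"


assigner = AbAssigner(share_a=1.0)
before = [assigner.clique(u) for u in ("alice", "bob", "carol")]
assigner.reconfigure(share_a=0.0)
after = [assigner.clique(u) for u in ("alice", "bob", "carol")]
```

Hashing keeps the assignment deterministic per user, so no per-user state has to be stored for the assignment itself.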
  20. Use-Case: Social Graph Update Propagation
      • Send E-Mail to B
      • Update Recommendation Matrix for A (and B)
      [Diagram labels: Spout, ‘A follows B now’, A, A Bolt, B, B Bolt, New Configuration, New ML Model, Send EMail]
  21. Contextualization with Storm
      • Contextualization ✓
      • Store Users’ context in-memory using Bolts
      • Continuously persist state into stable storage
      • Towards real-time context to every request
      [Diagram labels: Spout, Consolidated Event-Stream, User, User, User, Recommender, Trending Stuff / global stats, Anti-Spam]
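The idea of keeping users' context in memory inside a bolt while continuously persisting it could look roughly like the sketch below. It is a single-process illustration, not Storm code: `ContextBolt`, the `persist` callback and the `flush_every` parameter are hypothetical, and a Python list stands in for stable storage.

```python
import json
from typing import Callable, Dict, List


class ContextBolt:
    """Keeps per-user context in memory and snapshots it periodically."""

    def __init__(self, persist: Callable[[str], None], flush_every: int = 2) -> None:
        self.context: Dict[str, List[str]] = {}  # user -> events seen so far
        self.persist = persist                   # writes a snapshot to stable storage
        self.flush_every = flush_every           # snapshot after this many events
        self.seen = 0

    def process(self, user: str, event: str) -> None:
        """Fold one event into the user's context, persisting when due."""
        self.context.setdefault(user, []).append(event)
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.persist(json.dumps(self.context, sort_keys=True))

    def lookup(self, user: str) -> List[str]:
        """Serve the current context for a request without touching storage."""
        return list(self.context.get(user, []))


snapshots: List[str] = []  # stand-in for stable storage
bolt = ContextBolt(persist=snapshots.append, flush_every=2)
for user, event in [("alice", "follows bob"), ("bob", "login"), ("alice", "click")]:
    bolt.process(user, event)
alice_context = bolt.lookup("alice")
```

Requests read from memory, so context is available with every request, while the snapshots bound how much state is lost if the process fails between flushes.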
  22. On Storm
      • Storm is not a silver bullet
      • Rather, Storm is a petri dish for real-time computation and coordination tasks
      • Topology changes require a stop-start cycle
      • There is no Pig Latin / Hive for Storm
      • Advanced topics are added with every release (e.g. Transactional Semantics)
  23. Lessons Learned
      • De-Contextualization is a bad thing.
      • Your data store won’t help you.
      • You have to add some magic to your stack.
      • Storm has the potential to become the Next Big Thing after Hadoop
      • Use Storm to fix the Social Impedance Mismatch Issue
  24. Want to change the world with real-time data? Contact me:
      Martin Scholl <martin@infinipool.com> @zeit_geist
  25.              | Data Stores (DBMS, NoSQL)             | Event Systems (e.g. Storm, S4)
      Model        | Pull                                  | Push
      Queries      | Run Once                              | Run Continuously
      Data         | Historic                              | Live
      Focus        | Retrieval & Storage Format Efficiency | Throughput & Latency
      Dataset Size | 10^9                                  | 10^6
      Domain       | Volume                                | Velocity
  26. A Note on Time
      • Real-Time: milliseconds - seconds
      • Near Real-Time: seconds-minutes
      • Batch: minutes-