
Realtime Analytics using MongoDB, Python, Gevent, and ZeroMQ

With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is a system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of Gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.

rick446

October 03, 2011


Transcript

  1. SourceForge ♥s MongoDB
     - Tried CouchDB: liked the dev model, not so much the performance
     - Migrated consumer-facing pages (summary, browse, download) to MongoDB and it worked great (on MongoDB 0.8, no less!)
     - Built an entirely new tool platform around MongoDB (Allura)
  2. The Problem We're Trying to Solve
     - We have lots of users (good)
     - We have lots of projects (good)
     - We don't know what those users and projects are doing (not so good)
     - We have tons of code in PHP, Perl, and Python (not so good)
  3. Introducing Zarkov 0.0.1
     - Asynchronous TCP server for event logging with gevent (sketch below)
     - Turn OFF "safe" writes, turn OFF Ming validation (or do it in the client)
     - Incrementally calculate aggregate stats based on the event log, using mapreduce with {'out': 'reduce'}
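
     A minimal sketch of such a gevent event-logging server, assuming a
     modern pymongo client. The port, database/collection names, and the
     newline-delimited JSON wire format are illustrative (Zarkov itself
     spoke BSON); w=0 asks for the unacknowledged "unsafe" writes the
     slide mentions.

        import json

        import pymongo
        from gevent.server import StreamServer

        # w=0: unacknowledged ("unsafe") writes, per the slide's advice
        events = pymongo.MongoClient(w=0).zarkov.events

        def handle(sock, address):
            # StreamServer spawns one greenlet per connection
            for line in sock.makefile('r'):
                events.insert_one(json.loads(line))

        StreamServer(('127.0.0.1', 6543), handle).serve_forever()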
  4. Zarkov Architecture
     [architecture diagram: events arrive as BSON over ZeroMQ; a journal greenlet and a commit greenlet, each backed by a write-ahead log, feed MongoDB; aggregation runs as a cron job]
  5. Technologies
     - MongoDB: fast (10k+ inserts/s single-threaded)
     - ZeroMQ: built-in buffering; PUSH/PULL sockets (push never blocks, easy to distribute work; sketch below)
     - BSON: fast Python/C implementation; more types than JSON
     - Gevent: "green threads" for Python
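
     A minimal pyzmq sketch of the PUSH/PULL pattern the ZeroMQ bullet
     refers to; in practice producer and worker live in separate
     processes, and the endpoint is illustrative.

        import zmq

        ctx = zmq.Context()

        # Producer: PUSH fans messages out round-robin to connected PULL
        # sockets, buffering (up to the high-water mark) when workers lag
        push = ctx.socket(zmq.PUSH)
        push.bind('tcp://127.0.0.1:5555')

        # Worker (normally a separate process)
        pull = ctx.socket(zmq.PULL)
        pull.connect('tcp://127.0.0.1:5555')

        push.send(b'{"event": "download"}')
        print(pull.recv())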
  6. "Wow, it's really fast; can it replace..."
     - Download statistics?
     - Google Analytics?
     - Project realtime statistics?
     "Probably, but it'll take some work...."
  7. Moving towards production....
     - MongoDB MapReduce: convenient, but not so fast (example below)
       - Global JS interpreter lock per mongod
       - Lots of writing to temp collections (high lock %)
       - JavaScript without libraries (ick!)
     - Hadoop? Painful to configure, high latency, non-seamless integration with MongoDB
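
     For reference, the incremental pattern slide 3 alluded to: older
     pymongo exposed server-side mapReduce, and the {'reduce': ...}
     output mode folds each run's results into the existing output
     collection instead of replacing it. A sketch; the JS bodies and
     collection names are illustrative, and Collection.map_reduce was
     removed in pymongo 4.

        import pymongo
        from bson.code import Code
        from bson.son import SON

        db = pymongo.MongoClient().zarkov

        map_js = Code("function () { emit(this.project, 1); }")
        reduce_js = Code("function (key, values) { return Array.sum(values); }")

        # The 'reduce' out-mode merges into project_stats rather than
        # overwriting it, which is what makes incremental runs possible
        db.events.map_reduce(map_js, reduce_js,
                             out=SON([('reduce', 'project_stats')]))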
  8. Zarkov's already doing a lot... So we added a lightweight map/reduce framework
     - Write your map/reduce jobs in Python
     - Input/output is MongoDB
     - Intermediate files are local .bson files (sketch below)
     - Use ZeroMQ for job distribution
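
     A sketch of what the local .bson intermediate files might look like
     with pymongo's bson module; the file name and document shape are
     hypothetical.

        import bson

        pairs = [('proj-a', 1), ('proj-a', 1), ('proj-b', 1)]  # fake map output

        # Spill one bucket to disk as a stream of BSON documents...
        with open('job-0007.bson', 'wb') as f:
            for key, value in pairs:
                f.write(bson.encode({'k': key, 'v': value}))

        # ...and stream it back on the reducer side
        with open('job-0007.bson', 'rb') as f:
            for doc in bson.decode_file_iter(f):
                print(doc['k'], doc['v'])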
  9. Quick Map/reduce Refresher

        import itertools
        import operator

        def map_reduce(input_collection, query, output_collection, map, reduce):
            objects = input_collection.find(query)
            map_results = list(map(objects))
            map_results.sort(key=operator.itemgetter(0))
            for key, kv_pairs in itertools.groupby(
                    map_results, operator.itemgetter(0)):
                value = reduce(key, [v for k, v in kv_pairs])
                output_collection.save({"_id": key, "value": value})
  10. Quick Map/reduce Refresher (continued)
      (Repeats the code above with the map call and the per-key reduce calls annotated "Parallel": the steps that can be farmed out to workers. A usage sketch follows.)
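
     To make the refresher concrete, a hypothetical job that counts
     events per type with the function above (collection names are
     illustrative):

        import pymongo

        db = pymongo.MongoClient().zarkov

        # map: turn the input documents into (key, value) pairs
        def map_events(objects):
            for obj in objects:
                yield obj['type'], 1

        # reduce: fold all values for one key into a single value
        def reduce_counts(key, values):
            return sum(values)

        map_reduce(db.events, {}, db.event_counts, map_events, reduce_counts)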
  11. Zarkov Map/Reduce
      - Phases managed by greenlets
      - Map and reduce jobs parceled out to remote workers via zmq PUSH/PULL
      - Adaptive timeout/retry to support dead workers (sketch below)
      - "Sort" is performed by the mapper, generating a fixed # of reduce jobs
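
     One way the timeout/retry could work, sketched with a hypothetical
     job table on the distributor side (names and timeout are
     illustrative; Zarkov's actual bookkeeping may differ):

        import time

        outstanding = {}  # job_id -> (job, deadline)

        def dispatch(push_socket, job_id, job, timeout=30.0):
            push_socket.send_json({'id': job_id, 'job': job})
            outstanding[job_id] = (job, time.time() + timeout)

        def requeue_stragglers(push_socket):
            # A job not acknowledged by its deadline is assumed lost to a
            # dead worker and is pushed again for another worker
            for job_id, (job, deadline) in list(outstanding.items()):
                if time.time() > deadline:
                    dispatch(push_socket, job_id, job)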
  12. Zarkov Web Service
      - We've got the data in; now how do we get it out?
      - Zarkov includes a tiny HTTP server:

        $ curl -d foo='{"c":"sfweb", "b":"date/2011-07-01/", "e":"date/2011-07-04"}' \
              http://localhost:8081/q
        {"foo": {"sflogo": [[1309579200000.0, 12774],
                            [1309665600000.0, 13458],
                            [1309752000000.0, 13967]],
                 "hits": [[1309579200000.0, 69357],
                          [1309665600000.0, 68514],
                          [1309752000000.0, 68494]]}}

      - Values come out tweaked for use in flot
  13. MongoDB Tricks
      - Autoincrement integers are harder than in MySQL, but not impossible
      - Unsafe writes, insert > update

        class IdGen(object):
            @classmethod
            def get_ids(cls, inc=1):
                obj = cls.query.find_and_modify(
                    query={'_id': 0},
                    update={'$inc': dict(inc=inc)},
                    upsert=True,
                    new=True)
                return range(obj.inc - inc, obj.inc)
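
     The same trick in raw pymongo (the slide's version goes through
     Ming), where find_and_modify is now spelled find_one_and_update;
     the counter collection name is illustrative.

        from pymongo import ReturnDocument

        def get_ids(db, inc=1):
            # Atomically bump the counter and read back the new value
            obj = db.idgen.find_one_and_update(
                {'_id': 0},
                {'$inc': {'inc': inc}},
                upsert=True,
                return_document=ReturnDocument.AFTER)
            return range(obj['inc'] - inc, obj['inc'])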
  14. MongoDB Pitfalls
      - Use databases to partition really big data (e.g. events), not collections (sketch below)
      - Avoid JavaScript (mapreduce, group, $where)
      - Indexing is nice, but slows things down; use _id when you can
      - mongorestore is fast, but locks a lot
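
     A sketch of the database-per-partition idea from the first bullet:
     pick the database by time window (names are illustrative).

        import pymongo

        client = pymongo.MongoClient()

        def events_for(month):
            # e.g. '2011-10' -> database zarkov_2011_10
            return client['zarkov_' + month.replace('-', '_')].events

        events_for('2011-10').insert_one({'type': 'download'})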
  15. Future Work
      - Remove SPoF
      - Better way of expressing aggregates: "ZQL"
      - Better web integration: WebSockets/Socket.io
      - Maybe trigger aggs based on event activity?