Large Scale Data Service as a Service (RICON East 2013)

Presented by Brian Akins at RICON East 2013

Turner Broadcasting hosts several large sites that need to serve "data" to millions of clients over HTTP. A couple of years ago, we started building a generic service to solve this and to retire several legacy systems. We will discuss the general architecture, the growing pains, and why we decided to use Riak. We will also share some implementation details and the use of the service for a few large internet events.

About Brian

Brian Akins is Senior Principal Architect at Turner Broadcasting, where he focuses primarily on scaling web applications and the organizations that build and support them. He is an old-school C guy who has fallen in love with Lua. He lives with his wife and four children in the suburbs of Atlanta.

Basho Technologies

May 13, 2013

Transcript

  1. Disclaimer • All opinions stated are those of the presenter and do not necessarily reflect those of Turner or any of its affiliates or partners.
  2. About @bakins • Horrible public speaker • Born and raised in Alabama • Husband and father of four • Senior Principal Architect, Turner Broadcasting • “Large” web site and systems operations • C and Lua hacker • Learning Go, Erlang, and MIG welding
  3. About Turner MPTO • Media Platform Technology & Operations • Turner’s “web people” – CNN, NCAA, NBA, Adultswim, Cartoon Network, Money, iReport, etc • Five engineering groups • One operations group
  4. The Problem • Lots of small, frequently updated “data files” • XML, JSON, CSV, etc • Every site implemented something slightly different
  5. CNN - Election 2008 • election.cnn.com • Simple reverse proxy-cache • One data center devoted to it • “moved” to www.cnn.com after event
  6. data.nba.com • scoreboards, stats, etc • XML, JSON, “.dat” etc • Simple Apache cluster • NFS stale filehandle issues • Fronted with HTTP cache
  7. An Opportunity • NCAA March Madness 2011 • Highest planned traffic event to date • Chance to build a scalable, generic solution
  8. Large Scale Data • “LSD” • Simple Architecture • GET/PUT interface • Publishing system handles multiple data centers
  9. “... as a service” • Managed by Chef • “data bag” - Simple json file • DNS Entry - CNAME to LSD • 15 minutes to get a service scaled for Elections traffic
  10. Why not websockets? • HTTP polling • Websockets were still “bleeding edge” • Browser support • Corporate environments • Proxies and firewalls
  11. Membase • Extremely fast • Every box is a single point of failure • Auto-failover is/was “scary” and error-prone • Data corruption and loss • Thought about professional support, then...
  12. CouchBase • membase front end, couchDB backend • 90% of our writes are updates • Reevaluate our data store choices
  13. Alternatives • Data store performance was not our major concern • we were blinded by the awesome performance of membase, so we overlooked some warts
  14. Alternatives • CouchDB - append-only not a good fit • MongoDB - previous operational issues • Relational Database - Clustering/failover • Redis - roll our own sharding • Homegrown - um, no
  15. Why Riak? • Operational Stability • It just works • Operationally “Simple” • Performance is good enough • Map/Reduce, 2i, and Search - future uses
  16. Meanwhile... • LSD got more complex • openresty - nginx + Lua • rewrite rules, jsonp, etc • Business logic • More of a Lua app now
  17. Riak Implementation • Tried HTTP first • Keys issues - our keys were URIs • decent performance
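
The slide does not say what the key issues were. One plausible source of trouble, assumed here, is that a URI-shaped key contains slashes, which collide with the path separators in Riak's HTTP interface (`/buckets/<bucket>/keys/<key>`). A minimal Python sketch of percent-encoding such a key before building the request path follows; the helper name `riak_http_path` and the bucket name `nba` are hypothetical, and this is not the presenter's code.

```python
from urllib.parse import quote


def riak_http_path(bucket, key):
    """Build a Riak HTTP object path, percent-encoding bucket and key.

    safe="" forces "/" to be encoded as %2F, so a key that is itself a
    URI path cannot be mistaken for extra path segments.
    """
    return "/buckets/%s/keys/%s" % (quote(bucket, safe=""), quote(key, safe=""))


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(riak_http_path("nba", "/scores/today.json"))
```

Any proxy or cache in front of the store would also need to pass the encoded slash through unchanged, which is one more moving part that a binary protocol avoids.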
  18. Riak + PBC + Lua • Twice the performance • no support in nginx • we wrote our own and released it • https://github.com/bakins/lua-resty-riak
  19. Riak Infrastructure • Two independent clusters • Five “large-ish” nodes per cluster • Bucket Per Site • LevelDB • Chef
  20. 2 second cache • “protects” Riak • consistent performance • 10+ times the performance • ngx_lua shared dictionaries • spin lock bottleneck • “sharded” shared dictionaries
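
The idea on this slide, a very short-lived cache split into shards so that no single lock serializes every request, can be sketched in Python. This is an analogy under stated assumptions, not the presenter's ngx_lua implementation: the class name `ShardedTTLCache`, the shard count, and the use of CRC32 to pick a shard are all hypothetical choices made for illustration.

```python
import threading
import time
import zlib


class ShardedTTLCache:
    """Short-lived cache split into independently locked shards."""

    def __init__(self, shards=8, ttl=2.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        # Each shard has its own dict and its own lock.
        self._shards = [({}, threading.Lock()) for _ in range(shards)]

    def _shard(self, key):
        # A stable hash keeps a given key on the same shard every time.
        index = zlib.crc32(key.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def get(self, key, fetch):
        """Return the cached value, calling fetch(key) on a miss or expiry."""
        store, lock = self._shard(key)
        now = self._clock()
        with lock:
            entry = store.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]
        # Fetch outside the lock so a slow backend does not block the shard.
        value = fetch(key)
        with lock:
            store[key] = (now + self._ttl, value)
        return value
```

With a two-second lifetime, the backing store sees at most roughly one read per key every two seconds per cache instance, however many clients poll. Because the fetch happens outside the lock, concurrent misses on the same key can each reach the backend; that trade-off is accepted in this sketch.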
  21. Haproxy • Works well, no need to replicate inside nginx • PBC - TCP load balance. Healthcheck HTTP and port ping • cache made any performance difference negligible here • well instrumented and supported
  22. Recent Events • 2012 Presidential Elections • CNN Homepage Video - Breaking News • 2013 March Madness • 2013 NBA All-Star and Playoffs
  23. The Future • Revisit websockets • Testing Riak multi-datacenter replication • offer “canned” 2i/MR queries • Redis as cache • Riak for other projects
  24. The Verdict • Riak - it just works (mostly) • We “take it for granted” • We’re hiring. Work remote.