Scaling the World's Largest Django App - DjangoCon 2010

David Cramer

September 26, 2011

Transcript

  1. What is DISQUS?
     We are a comment system with an emphasis on connecting communities.
     http://disqus.com/about/ (dis·cuss • dĭ-skŭs')
  2. What is Scale?
     • 17,000 requests/second peak
     • 450,000 websites
     • 15 million profiles
     • 75 million comments
     • 250 million visitors (August 2010)
     [Chart: "Our traffic at a glance", number of visitors over time]
  3. Our Challenges
     • We can’t predict when things will happen
     • Random celebrity gossip
     • Natural disasters
     • Discussions never expire
     • We can’t keep those millions of articles from 2008 in the cache
     • You don’t know in advance (generally) where the traffic will be
     • Especially with dynamic paging, realtime, sorting, personal prefs, etc.
  4. Our Challenges (cont’d)
     • High availability
     • Not a destination site
     • Difficult to schedule maintenance
  5. Server Architecture - Load Balancing
     • Load Balancing
     • Software, HAProxy
     • High performance, intelligent server availability checking
     • Bonus: Nice statistics reporting
     • High Availability
     • heartbeat
     Image Source: http://haproxy.1wt.eu/
  6. Server Architecture
     • ~100 Servers
     • 30% Web Servers (Apache + mod_wsgi)
     • 10% Databases (PostgreSQL)
     • 25% Cache Servers (memcached)
     • 20% Load Balancing / High Availability (HAProxy + heartbeat)
     • 15% Utility Servers (Python scripts)
  7. Server Architecture - Web Servers
     • Apache 2.2
     • mod_wsgi
     • Using `maximum-requests` to plug memory leaks
     • Performance Monitoring
     • Custom middleware (PerformanceLogMiddleware), see the sketch below
     • Ships performance statistics (DB queries, external calls, template rendering, etc.) through syslog
     • Collected and graphed through Ganglia
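     A minimal sketch of the idea only (names and details are illustrative, not DISQUS's actual middleware): time each request, count DB queries, and ship the numbers to syslog.

        import logging
        import time
        from logging.handlers import SysLogHandler

        from django.db import connection

        perf_log = logging.getLogger('perf')
        perf_log.addHandler(SysLogHandler(address='/dev/log'))
        perf_log.setLevel(logging.INFO)

        class PerformanceLogMiddleware(object):
            def process_request(self, request):
                request._perf_start = time.time()
                request._perf_queries = len(connection.queries)

            def process_response(self, request, response):
                if hasattr(request, '_perf_start'):
                    elapsed = time.time() - request._perf_start
                    # connection.queries is only populated when query logging
                    # (DEBUG) is enabled, so treat this as a measurement aid.
                    queries = len(connection.queries) - request._perf_queries
                    perf_log.info('path=%s time=%.3fs queries=%d',
                                  request.path, elapsed, queries)
                return response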
  8. Server Architecture - Database
     • PostgreSQL
     • Slony-I for Replication
     • Trigger-based
     • Read slaves for extra read capacity
     • Failover master database for high availability
  9. Server Architecture - Database
     • Make sure indexes fit in memory and measure I/O
     • High I/O generally means slow queries due to missing indexes or indexes not in buffer cache
     • Log Slow Queries
     • syslog-ng + pgFouine + cron to automate slow query logging
  10. Server Architecture - Database
     • Use connection pooling
     • Django doesn’t do this for you
     • We use pgbouncer (an example configuration follows below)
     • Limits the maximum number of connections your database needs to handle
     • Save on costly opening and tearing down of new database connections
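     A hypothetical settings.py fragment for this setup: point Django at a local pgbouncer instead of at PostgreSQL directly (6432 is pgbouncer's default listen port; the names are illustrative).

        DATABASES = {
            'default': {
                'ENGINE': 'django.db.backends.postgresql_psycopg2',
                'NAME': 'disqus',
                'USER': 'disqus',
                'HOST': '127.0.0.1',  # pgbouncer listens here
                'PORT': '6432',       # pgbouncer's default port
            },
        }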
  11. Partitioning
     • Fairly easy to implement, quick wins
     • Done at the application level
     • Data is replayed by Slony
     • Two methods of data separation
  12. Vertical Partitioning
     Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns.
     http://en.wikipedia.org/wiki/Partition_(database)
     [Diagram: separate partitions for Posts, Users, Forums, Sentry]
  13. Pythonic Joins
        posts = Post.objects.all()[0:25]

        # store users in a dictionary based on primary key
        users = dict(
            (u.pk, u) for u in
            User.objects.filter(pk__in=set(p.user_id for p in posts))
        )

        # map users to their posts
        for p in posts:
            p._user_cache = users.get(p.user_id)
     Allows us to separate datasets
  14. Pythonic Joins (cont’d)
     • Slower than at database level
     • But not enough that you should care
     • Trading performance for scale
     • Allows us to separate data
     • Easy vertical partitioning
     • More efficient caching
     • get_many, object-per-row cache (see the sketch below)
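     A sketch of the object-per-row cache idea with get_many (the key format and helper name are assumptions, not DISQUS code): fetch every user from memcached in one round trip, then hit the database only for the misses.

        from django.core.cache import cache
        from django.contrib.auth.models import User  # any model works here

        def fetch_users(user_ids):
            keys = dict(('user:%d' % uid, uid) for uid in set(user_ids))
            cached = cache.get_many(keys.keys())
            users = dict((keys[k], u) for k, u in cached.items())

            missing = set(user_ids) - set(users)
            if missing:
                for user in User.objects.filter(pk__in=missing):
                    users[user.pk] = user
                    cache.set('user:%d' % user.pk, user, 60 * 60)
            return users

     Combined with the Pythonic join above, this turns a SQL JOIN across partitions into one cache round trip plus, at most, one simple query.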
  15. Designating Masters
     • Alleviates some of the write load on your primary application master
     • Masters exist under specific conditions:
       • application use case
       • partitioned data
     • Database routers make this (fairly) easy
  16. Routing by Application
        class ApplicationRouter(object):
            def db_for_read(self, model, **hints):
                instance = hints.get('instance')
                if not instance:
                    return None
                app_label = instance._meta.app_label
                return get_application_alias(app_label)
  17. Horizontal Partitioning
     Horizontal partitioning (also known as sharding) involves splitting one set of data into different tables.
     http://en.wikipedia.org/wiki/Partition_(database)
     [Diagram: per-forum shards, e.g. Your Blog, CNN, Disqus, Telegraph]
  18. Horizontal Partitions
     • Some forums have very large datasets
     • Partners need high availability
     • Helps scale the write load on the master
     • We rely more on vertical partitions
  19. Routing by Partition
        class ForumPartitionRouter(object):
            def db_for_read(self, model, **hints):
                instance = hints.get('instance')
                if not instance:
                    return None
                forum_id = getattr(instance, 'forum_id', None)
                if not forum_id:
                    return None
                return get_forum_alias(forum_id)

        # Now, making sure hints are available
        forum.post_set.all()

        # What we used to do
        Post.objects.filter(forum=forum)
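     For context, routers like the two above are registered through the DATABASE_ROUTERS setting (Django 1.2+) and consulted in order; the module path here is an assumption:

        DATABASE_ROUTERS = [
            'disqus.db.routers.ForumPartitionRouter',
            'disqus.db.routers.ApplicationRouter',
        ]

     The reason forum.post_set.all() is preferred over Post.objects.filter(forum=forum) is that going through the related manager passes the forum instance to the router as the 'instance' hint, so db_for_read can pick the right shard.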
  20. Optimizing QuerySets
     • We really dislike raw SQL
     • It creates more work when dealing with partitions
     • Built-in cache allows sub-slicing
     • But isn’t always needed
     • We removed this cache
  21. Removing the Cache
     • Django internally caches the results of your QuerySet
     • This adds additional memory overhead
     • Many times you only need to view a result set once
     • So we built SkinnyQuerySet

        # 1 query
        qs = Model.objects.all()[0:100]

        # 0 queries (we don't need this behavior)
        qs = qs[0:10]

        # 1 query
        qs = qs.filter(foo=bar)
  22. Removing the Cache (cont’d)
        class SkinnyQuerySet(QuerySet):
            def __iter__(self):
                if self._result_cache is not None:
                    # __len__ must have been run
                    return iter(self._result_cache)
                has_run = getattr(self, 'has_run', False)
                if has_run:
                    raise QuerySetDoubleIteration("...")
                self.has_run = True
                # We wanted .iterator() as the default
                return self.iterator()
     Optimizing memory usage by removing the cache
     http://gist.github.com/550438
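     One way to hook it in (an assumption on our part, not shown in the deck): a manager whose default queryset is the skinny variant.

        from django.db import models

        class SkinnyManager(models.Manager):
            def get_query_set(self):  # Django 1.2-era manager API
                return SkinnyQuerySet(self.model, using=self._db)

        class Post(models.Model):
            objects = SkinnyManager()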
  23. Atomic Updates
     • Keeps your data consistent
     • save() isn’t thread-safe
     • use update() instead
     • Great for things like counters (see the one-liner below)
     • But should be considered for all write operations
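     The counter case as a concrete one-liner: an F() expression pushes the increment into SQL, so concurrent requests can't clobber each other (the 'likes' field is illustrative).

        from django.db.models import F

        Post.objects.filter(pk=post.pk).update(likes=F('likes') + 1)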
  24. Atomic Updates (cont’d)
     Thread safety is impossible with .save()

        # Request 1
        post = Post(pk=1)
        # a moderator approves
        post.approved = True
        post.save()

        # Request 2
        post = Post(pk=1)
        # the author adjusts their message
        post.message = 'Hello!'
        post.save()
  25. Atomic Updates (cont’d)
     So we need atomic updates

        # Request 1
        post = Post(pk=1)
        # a moderator approves
        Post.objects.filter(pk=post.pk)\
            .update(approved=True)

        # Request 2
        post = Post(pk=1)
        # the author adjusts their message
        Post.objects.filter(pk=post.pk)\
            .update(message='Hello!')
  26. Atomic Updates (cont’d)
        def update(obj, using=None, **kwargs):
            """
            Updates specified attributes on the current instance.
            """
            assert obj.pk, "Instance has not yet been created."
            obj.__class__._base_manager.using(using)\
                .filter(pk=obj.pk)\
                .update(**kwargs)
            for k, v in kwargs.iteritems():
                if isinstance(v, ExpressionNode):
                    # NotImplemented
                    continue
                setattr(obj, k, v)
     A better way to approach updates
     http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py
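     Usage, per the helper above (the database alias in the second call is illustrative):

        update(post, approved=True)
        update(post, message='Hello!', using='forum_db')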
  27. Delayed Signals
     • Queueing low priority tasks
     • even if they’re fast
     • Asynchronous (Delayed) signals
     • very friendly to the developer
     • ...but not as friendly as real signals
  28. Delayed Signals (cont’d)
        from disqus.common.signals import delayed_save

        def my_func(data, sender, created, **kwargs):
            print data['id']

        delayed_save.connect(my_func, sender=Post)
     We send a specific serialized version of the model for delayed signals
     This is all handled through our Queue
  29. Caching
     • Memcached
     • Use pylibmc (newer, libMemcached-based)
     • Ticket #11675 (add pylibmc support)
     • Third party applications: django-newcache, django-pylibmc
  30. Caching (cont’d)
     • libMemcached / pylibmc is configurable with “behaviors”
     • Memcached “single point of failure”
     • Distributed system, but we must take precautions
     • Connection timeout to memcached can stall requests
     • Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches (sketch below)
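     A rough sketch of those behaviors with pylibmc. Behavior key names have shifted between pylibmc releases (for example, "auto_eject_hosts" was later renamed "remove_failed"), so treat the keys below as indicative rather than exact:

        import pylibmc

        client = pylibmc.Client(
            ['10.0.0.1', '10.0.0.2'],
            behaviors={
                'ketama': True,            # consistent hashing (next slides)
                'auto_eject_hosts': True,  # stop talking to dead servers
                'retry_timeout': 2,        # seconds before retrying a dead server
            },
        )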
  31. Caching (cont’d)
     • Default (naive) hashing behavior
     • Modulo the hashed cache key for an index into the server list
     • Removal of a server causes the majority of cache keys to be remapped to new servers

        CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
        key = 'my_cache_key'
        cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
  32. Caching (cont’d)
     • Better approach: consistent hashing (toy example below)
     • libMemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama)
     • Addition / removal of a cache server remaps only (K/n) cache keys (where K = number of keys and n = number of servers)
     Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
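     A toy consistent-hash ring to illustrate the remapping claim (an illustration, not libketama itself): each server is hashed onto a ring at several virtual points, and a key maps to the first server clockwise from its own hash, so adding or removing one server only moves the keys in that server's arc (roughly K/n of them).

        import bisect
        import hashlib

        def _hash(value):
            return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

        class HashRing(object):
            def __init__(self, servers, points=100):
                # (hash, server) pairs, sorted by position on the ring
                self.ring = sorted(
                    (_hash('%s:%d' % (server, i)), server)
                    for server in servers
                    for i in range(points)
                )
                self.hashes = [h for h, _ in self.ring]

            def get_server(self, key):
                idx = bisect.bisect(self.hashes, _hash(key)) % len(self.ring)
                return self.ring[idx][1]

        ring = HashRing(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
        ring.get_server('my_cache_key')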
  33. Caching (cont’d)
     • Thundering herd (stampede) problem
     • Invalidating a heavily accessed cache key causes many clients to refill the cache
     • But everyone refetching from the data store (or reprocessing data) to fill the cache can make things even slower
     • Most times it’s ideal to return the previously invalidated cache value and let a single client refill the cache
     • django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you
     • Filling the cache on invalidation, instead of deleting from it, also helps prevent the thundering herd (sketch below)
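     A minimal sketch of the MintCache-style idea in plain Django cache calls (names and timings are illustrative): keep a soft expiry inside a longer hard TTL; once the soft expiry passes, the first client briefly re-stores the stale value so everyone else keeps getting it, and recomputes alone.

        import time
        from django.core.cache import cache

        def get_or_refill(key, compute, soft_ttl=60, hard_ttl=600, grace=30):
            entry = cache.get(key)
            if entry is not None:
                value, expires_at = entry
                if time.time() < expires_at:
                    return value
                # Stale: let other clients keep reading it while we recompute.
                cache.set(key, (value, time.time() + grace), hard_ttl)
            value = compute()
            cache.set(key, (value, time.time() + soft_ttl), hard_ttl)
            return value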
  34. Transactions
     • TransactionMiddleware got us started, but down the road became a burden
     • For postgresql_psycopg2, there’s a database option, OPTIONS['autocommit']
     • Each query is in its own transaction. This means each request won’t start in a transaction.
     • But sometimes we want transactions (e.g., saving multiple objects and rolling back on error)
  35. Transactions (cont’d)
     • Tips:
       • Use autocommit for read slave databases
       • Isolate slow functions (e.g., external calls, template rendering) from transactions
     • Selective autocommit
       • Most read-only views don’t need to be in transactions
       • Start in autocommit and switch to a transaction on write (see the sketch below)
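     One era-appropriate way to get "autocommit by default, transaction on write" (the view and the Moderation model are hypothetical): skip the global TransactionMiddleware and wrap only the write paths, e.g. with commit_on_success, which commits on success and rolls back on an exception.

        from django.db import transaction

        @transaction.commit_on_success
        def approve_post(request, post_id):
            post = Post.objects.get(pk=post_id)
            Post.objects.filter(pk=post.pk).update(approved=True)
            Moderation.objects.create(post=post, moderator=request.user)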
  36. Scaling the Team
     • Small team of engineers
     • Monthly users / developers = 40m
     • Which means writing tests...
     • ...and having a dead simple workflow
  37. Keeping it Simple
     • A developer can be up and running in a few minutes
     • assuming postgres and other server applications are already installed
     • pip, virtualenv
     • settings.py
  38. Setting Up Local
     1. createdb -E UTF-8 disqus
     2. git clone git://repo
     3. mkvirtualenv disqus
     4. pip install -U -r requirements.txt
     5. ./manage.py syncdb && ./manage.py migrate
  39. Sane Defaults
        # settings.py
        from disqus.conf.settings.default import *

        try:
            from local_settings import *
        except ImportError:
            import sys, traceback
            sys.stderr.write("Can't find 'local_settings.py'\n")
            sys.stderr.write("\nThe exception was:\n\n")
            traceback.print_exc()

        # local_settings.py
        from disqus.conf.settings.dev import *
  40. Continuous Integration
     • Daily deploys with Fabric
     • several times an hour on some days
     • Hudson keeps our builds going
     • combined with Selenium
     • Post-commit hooks for quick testing
     • like Pyflakes
     • Reverting to a previous version is a matter of seconds
  41. Testing
     • It’s not fun breaking things when you’re the new guy
     • Our testing process is fairly heavy
     • 70k (Python) LOC, 73% coverage, 20 min suite
     • Custom Test Runner (unittest)
     • We needed XML, Selenium, Query Counts
     • Database proxies (for read-slave testing)
     • Integration with our Queue
  42. Testing (cont’d)
        # failures yield a dump of queries

        # Query Counts
        def test_read_slave(self):
            Model.objects.using('read_slave').count()
            self.assertQueryCount(1, 'read_slave')

        # Selenium
        def test_button(self):
            self.selenium.click('//a[@class="dsq-button"]')

        # Queue Integration
        class WorkerTest(DisqusTest):
            workers = ['fire_signal']

            def test_delayed_signal(self):
                ...
  43. Bug Tracking
     • Switched from Trac to Redmine
     • We wanted Subtasks
     • Emailing exceptions is a bad idea
     • Even if it’s localhost
     • Previously using django-db-log to aggregate errors to a single point
     • We’ve overhauled db log and are releasing Sentry
  44. Feature Switches
     • We needed a safety in case a feature wasn’t performing well at peak
     • it had to respond without delay, globally, and without writing to disk
     • Allows us to work out of trunk (mostly)
     • Easy to release new features to a portion of your audience
     • Also nice for “Labs” type projects (a sketch of the idea follows below)
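     A sketch of the general idea only, not the DISQUS implementation (which is linked in the references): keep switch state in memcached so a flip takes effect everywhere almost immediately, with no deploy and no disk writes on the web tier; in practice you would back it with a durable store as well.

        from django.core.cache import cache

        def switch_is_active(name, default=False):
            state = cache.get('switch:%s' % name)
            return default if state is None else bool(state)

        def set_switch(name, active):
            cache.set('switch:%s' % name, int(active), 60 * 60)

        # In a view (the switch name is illustrative):
        enable_realtime = switch_is_active('realtime-comments')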
  45. Final Thoughts
     • The language (usually) isn’t your problem
     • We like Django
     • But we maintain local patches
     • Some tickets don’t have enough of a following
     • Patches, like #17, completely change Django...
     • ...arguably in a good way
     • Others don’t have champions
     Ticket #17 describes making the ORM an identity mapper
  46. Housekeeping
     Want to learn from others about performance and scaling problems? Birds of a Feather.
     We’re Hiring! DISQUS is looking for amazing engineers.
     Or play some StarCraft 2?
  47. References
     • django-sentry: http://github.com/dcramer/django-sentry
     • Our Feature Switches: http://cl.ly/2FYt
     • Andy McCurdy’s update(): http://github.com/andymccurdy/django-tips-and-tricks
     • Our PyFlakes Fork: http://github.com/dcramer/pyflakes
     • SkinnyQuerySet: http://gist.github.com/550438
     • django-newcache: http://github.com/ericflo/django-newcache
     • attach_foreignkey (Pythonic Joins): http://gist.github.com/567356