Intro to Caching: For Fun or For Profit

Intro to Caching: For Fun or For Profit
Julia H Grace @jewelia

Hi, I’m Julia
Head of Engineering @ Tindie (tindie.com)

Disclaimer
•  Many of the techniques and topics discussed in this talk were used by Tindie at one point or another over the past year.
•  Tindie has grown substantially and our caching needs have changed.
•  Thus, most of what is discussed will work for you most of the time, but may not work optimally at very large “web scale”.

Everybody Caches
You probably cache already!

Definition: Caching
•  Keeping duplicate data in multiple locations.
•  Why on earth would you want to do this?
•  The beauty of caching is that one of the locations is often faster and easier to access than the others.

Welcome to Your Cache!
Yum yum yum

Analogy: Kitchen Cache
•  Look at your kitchen. It’s a food cache (sort of)!
•  You keep groceries there. But you could just go to the store every time you needed something.
•  After all, the store has everything. So many delicious foods you can’t fit in your kitchen!

Analogy: Kitchen Cache
•  But going to the grocery store every time you need milk for your coffee is inconvenient and inefficient.
•  So how often do you go?
•  How do you balance wanting to buy everything with not having space for all of it?
•  These are real problems of caching in life and in business.

Grocery Store
More space, more items, but less convenient

Analogy: Kitchen Cache
•  Kitchen == fast but space-limited storage (memory)
•  Grocery store == slow storage but more space (database, file system, remote API)
•  It’s easier to go to your kitchen than the store.
•  Just like it’s faster to read from memory than to read from disk.

Cache All Things! (not really)
•  Our gut reaction to caching is usually “So let’s stick everything in memory! It’s so fast!”
•  Memory is expensive (and has other limitations).
•  You could fit the contents of the grocery store in your kitchen, but that would be expensive overkill. The same concept applies in programming.

Simple Examples
That you probably do already

Surprise! It’s cached!
•  You’re probably caching already and you don’t know it.
•  Cache the results of a database query in a variable (using the Django ORM):

    a_user = User.objects.get(id=1)  # one database query
    print(a_user.username)           # read from the variable, no query
    print(a_user.first_name)         # read from the variable, no query

•  We could query the DB every time we need username or first_name, but that would probably be slower than accessing a variable.

Caching to Avoid Computationally Intensive Tasks
•  Store the result of the function in a variable to avoid repeatedly running a complex computation.

    # Don’t ever do this
    def really_hairy_function(a, b, c, d):
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    for m in range(d):
                        pass  # block which runs in O(n^2)

    # Call it once and keep the result in a variable
    a = really_hairy_function(106, 269, 844, 789)
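If the same arguments come up again and again, memoization does the "store it in a variable" step for you. Here is a minimal sketch using the standard library's functools.lru_cache; slow_square and call_count are hypothetical names made up for illustration, not anything from Tindie's code.

```python
import functools

call_count = 0  # counts how many times the real computation runs


@functools.lru_cache(maxsize=None)
def slow_square(n):
    """Hypothetical stand-in for an expensive computation."""
    global call_count
    call_count += 1
    return sum(n for _ in range(n))  # deliberately wasteful: adds n to itself n times


first = slow_square(2000)   # computed the slow way
second = slow_square(2000)  # served from the cache, the function body does not run
```

This only makes sense for functions whose result depends on nothing but their arguments.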

Caching HTTP Requests
•  Tindie used to use the IPInfoDB API to map IP addresses to locations. Example: 98.207.195.205 == California, USA
•  If we see future requests from that IP, we could call the IPInfoDB API again or store the mapping in a cache.
•  Here the cache could be a database, Memcache, or Redis (all are likely faster than calling an external API).
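The lookup-then-store idea can be sketched in a few lines. This is an illustration only: fetch_location is a hypothetical stand-in for the external API call (it does not use the real IPInfoDB client), and a plain dict stands in for Memcache, Redis, or a database table.

```python
lookup_calls = []  # records every call that reaches the "remote" service


def fetch_location(ip):
    """Hypothetical stand-in for a slow external geolocation API call."""
    lookup_calls.append(ip)
    return {"98.207.195.205": "California, USA"}.get(ip, "Unknown")


location_cache = {}  # assumption: a dict standing in for Memcache/Redis/DB


def location_for(ip):
    # Only call the external service on a cache miss.
    if ip not in location_cache:
        location_cache[ip] = fetch_location(ip)
    return location_cache[ip]


a = location_for("98.207.195.205")  # miss: calls the service
b = location_for("98.207.195.205")  # hit: answered from the cache
```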

Caching Libraries
•  There are Python libraries for storing the results of external HTTP requests:
•  CacheControl (https://github.com/ionrock/cachecontrol)

Caching data that changes
•  In previous examples we were caching data that probably didn’t change often.
•  But what about data that does change?
•  Example: Tindie uses the GitHub API to get data about Open Hardware repositories.
•  These repos could change at any time (add followers, accept pull requests, etc.)

Accuracy vs Speed
•  We query GitHub for a lot of data, and we don’t want to update our cached copy if nothing has changed.
•  But we want our version to reflect the most recent version in GitHub. You can’t have your cake and eat it too!
•  Tradeoff: either the data is behind/stale, or you spend computational resources ensuring it’s fresh.
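One common way to pick a point on that tradeoff is a time-to-live (TTL): accept data that is at most N seconds old, and pay for a refresh only after that. A minimal sketch follows; TTLCache, fetch_repo, and the fake clock are hypothetical names used so the example runs without waiting or touching the network.

```python
import time


class TTLCache:
    """Tiny cache whose entries expire after ttl_seconds (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the example does not have to sleep
        self._store = {}     # key -> (expires_at, value)

    def get(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still within the TTL: fast, but possibly stale
        value = compute()    # missing or expired: pay the cost to refresh
        self._store[key] = (now + self.ttl, value)
        return value


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
fetches = []


def fetch_repo():
    """Hypothetical slow fetch; the result changes on every real call."""
    fetches.append(1)
    return {"stars": len(fetches)}


r1 = cache.get("repo", fetch_repo)  # miss
r2 = cache.get("repo", fetch_repo)  # hit, same data as r1
fake_now[0] = 61.0                  # pretend 61 seconds have passed
r3 = cache.get("repo", fetch_repo)  # expired, so it is fetched again
```

A shorter TTL means fresher data and more fetches; a longer one means the opposite.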

Push vs Poll
•  This is a more difficult problem that is often solved by having the service push you notifications instead of polling (polling == querying the service at specific time intervals).
•  For example purposes we’ll poll GitHub.
•  Many APIs don’t support push notifications, so you have to poll.

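GitHub API Example

    import datetime
    from email.utils import format_datetime
    import requests
    from django.conf import settings

    # Only update repos that have been modified in the past 3 weeks.
    # If-Modified-Since must be an HTTP date string, not a timedelta.
    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=21)
    headers = {'Authorization': 'token %s' % settings.GITHUB_TOKEN}
    headers['If-Modified-Since'] = format_datetime(cutoff, usegmt=True)
    r = requests.get("https://api.github.com/repos/%s/%s" % (repo_owner, repo_name), headers=headers)
    if r.status_code == 304:  # Not Modified: keep the cached copy
        break  # as on the slide; assumes this snippet runs inside a loop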
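The 304 handling can be tried out without the network. In this sketch, refresh, repo_cache, and the "tindie/example" key are hypothetical, and fetch is an assumed callable returning a (status_code, data) pair in place of a real HTTP request.

```python
def refresh(cache, key, fetch):
    """Update cache[key] only when the service reports a change.

    fetch is a hypothetical callable returning (status_code, data).
    Returns True if the cached copy was replaced.
    """
    status, data = fetch()
    if status == 304:   # Not Modified: keep the cached copy
        return False
    if status == 200:   # changed: replace the cached copy
        cache[key] = data
        return True
    raise RuntimeError("unexpected status %d" % status)


repo_cache = {"tindie/example": {"watchers": 10}}
unchanged = refresh(repo_cache, "tindie/example", lambda: (304, None))
changed = refresh(repo_cache, "tindie/example", lambda: (200, {"watchers": 12}))
```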

Complex Caching
Let’s get fancy!

Query Caching
•  Tindie’s Python layer queries our database very often.
•  Sometimes we are simultaneously updating values in the database.
•  It doesn’t always make sense to cache query results in a variable, because the value might quickly become incorrect or out of date.

Tindie Product Page
This page requires a lot of queries

Query Cache
•  Insert a cache layer (“query cache”) between your application and your database.
•  Typically Memcache and/or Redis are used for this.
•  Memcache, Redis == in-memory (so they are fast) key-value data stores.
•  Lookup by key in a key-value data store is O(1) on average, so it is very computationally fast.

Query Cache Example
•  The result of every select is cached:

    select * from auth_user where id=1

     id | username          | first_name | last_name | email
    ----+-------------------+------------+-----------+-------------------
      1 | julialovescaching | Julia      | Grace     | [email protected]

•  Key: the select statement. Value: the row it returns.
•  Updates and deletes invalidate the cache (invalidation == removing the value from the cache because it has changed).
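To make the read and invalidate steps concrete, here is a toy query cache built on the standard library's sqlite3 module. It is a sketch, not how any real query cache is implemented: the QueryCache class is hypothetical, a dict stands in for Memcache or Redis, and it throws away the whole cache on any write, which is far cruder than a production query cache would be.

```python
import sqlite3


class QueryCache:
    """Illustrative query cache: SELECT results keyed by SQL text and parameters."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}    # assumption: a dict standing in for Memcache/Redis
        self.db_reads = 0  # how many selects actually reached the database

    def select(self, sql, params=()):
        key = (sql, params)
        if key not in self.cache:
            self.db_reads += 1
            self.cache[key] = self.conn.execute(sql, params).fetchall()
        return self.cache[key]

    def write(self, sql, params=()):
        self.conn.execute(sql, params)
        self.conn.commit()
        self.cache.clear()  # crude invalidation: drop everything on any write


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT)")
qc = QueryCache(conn)
qc.write("INSERT INTO auth_user VALUES (?, ?)", (1, "julialovescaching"))

q = "SELECT username FROM auth_user WHERE id = ?"
before = qc.select(q, (1,))  # miss: reads the database
again = qc.select(q, (1,))   # hit: served from the cache
qc.write("UPDATE auth_user SET username = ? WHERE id = ?", ("julia", 1))
after = qc.select(q, (1,))   # the update invalidated the cache, so this reads the DB
```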

Johnny Cache
•  Johnny Cache is one such query cache for Django
•  http://pythonhosted.org/johnny-cache/
•  Before you use this, ensure you understand cache invalidation and read http://jmoiron.net/blog/is-johnny-cache-for-you/

Template/HTML caching
•  Tindie is built on Python/Django
•  Django templates must be compiled and rendered (typically “fast”, but what if you have hundreds of people on the same page and the content on that page hasn’t changed?)
•  Cache blocks of HTML in Memcache or Redis.

A lot of this page is cached
Especially individual product info

Django Cache Example
•  Django has built-in support for template fragment caching (here 500 is the timeout in seconds and sidebar is the fragment name):

    {% load cache %}
    {% cache 500 sidebar %}
        <html goes here!>
    {% endcache %}

Implications of Caching
•  We did more Memcache reads than DB reads.
•  This worked for us because Memcache reads are cheaper and faster than DB reads or compiling Django templates.
•  But if your Memcache becomes slow then you have a problem.
•  There is no free lunch (or silver bullet).

Too Good to be True?
•  Doesn’t caching almost seem too good to be true?
•  Yes, sometimes it is.
•  How do you decide which data “doesn’t change very often”?
•  How often should you update your cached data? Hours? Days? Weeks?
•  Every answer is wrong (or right! :)

Cache Invalidation
•  Not everything fits in our cache.
•  Sometimes the data we have cached has changed and we have to update the cache (a query cache handles this by detecting updates/deletes).
•  Example: if I update my username, then our cached version would return the wrong username.
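When not everything fits, something has to be thrown out. Least-recently-used (LRU) eviction is one common policy, and it can be sketched with the standard library's OrderedDict. LRUCache here is a hypothetical toy class, not the eviction logic of any particular cache server.

```python
from collections import OrderedDict


class LRUCache:
    """Toy bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.set("milk", 1)
cache.set("eggs", 2)
cache.get("milk")      # milk is now more recently used than eggs
cache.set("bread", 3)  # over capacity: eggs is evicted
```

In kitchen terms: when the fridge is full, the thing you haven't touched in the longest time goes first.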

Cache Invalidation
•  For template caching we invalidate the cache when, for example, a user updates or deletes their product.
•  We don’t want to show stale data (“I updated my product but why isn’t it updated?!”)
•  Alternatively, we could let the cached data expire; after 10 minutes the page would be fresh again.
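Invalidating on update can be as simple as deleting one key. The sketch below is illustrative only: products, render_product, and update_product are hypothetical names, the HTML is built with string formatting rather than a Django template, and a dict stands in for Memcache or Redis.

```python
products = {7: {"name": "LED kit", "price": 12}}  # hypothetical product data
fragment_cache = {}  # assumption: a dict standing in for Memcache/Redis
render_count = 0     # how many times the HTML was actually rendered


def render_product(product_id):
    """Return cached HTML for a product, rendering only on a miss."""
    global render_count
    key = "product_html:%d" % product_id
    if key not in fragment_cache:
        render_count += 1
        p = products[product_id]
        fragment_cache[key] = "<li>%s: $%d</li>" % (p["name"], p["price"])
    return fragment_cache[key]


def update_product(product_id, **changes):
    products[product_id].update(changes)
    # Invalidate right away so the next render reflects the edit.
    fragment_cache.pop("product_html:%d" % product_id, None)


old_html = render_product(7)
render_product(7)            # cache hit, no re-render
update_product(7, price=15)
new_html = render_product(7)  # re-rendered because the key was deleted
```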

Gotchas
•  Almost all of Tindie’s content that doesn’t change very often is cached.
•  How long we cache pieces of data is something we continually tune.
•  We actually got to the point where Memcache was a bottleneck (a story for another PyLadies meetup :)

Thanks!
My info for caching: Julia H Grace @jewelia
