this talk were used by Tindie at one point or another over the past year. • Tindie has grown substantially and our caching needs have changed. • Thus, most of what is discussed will work for you most of the time, but may not work optimally at very large “web scale”.
food cache (sort of)! • You keep groceries there. But you could just go to the store every time you needed something. • After all the store has everything. So many delicious foods you can’t fit in your kitchen!
every time you need milk for your coffee is inconvenient and inefficient. • So how often do you go? • How do you balance wanting to buy everything with not having space for all of it? • These are real problems of caching in life and in business.
storage (memory) • Grocery Store == slow storage but more space (database, file system, remote API) • It’s easier to go to your kitchen than the store. • Just like it’s faster to read from memory than to read from disk.
caching is usually “So let’s stick everything in memory! It’s so fast!” • Memory is expensive (and has other limitations). • You could fit the content of the grocery store in your kitchen but that would be expensive overkill. Same concept applies in programming.
don’t know it. • Cache the results of a database query in a variable (using the Django ORM):
    a_user = User.objects.get(id=1)  # one database query
    print(a_user.username)           # read from the variable, no new query
    print(a_user.first_name)
• We could query the DB again every time we need username or first_name, but that would probably be slower than reading the variable.
of the function in a variable to avoid repeatedly running a complex computation.
    # Don’t ever do this
    def really_hairy_function(a, b, c, d):
        for i in range(a):
            for j in range(b):
                for k in range(c):
                    for m in range(d):
                        pass  # block which runs in O(n^2)

    # Call it once and keep the result in a variable
    a = really_hairy_function(106, 269, 844, 789)
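As a minimal sketch of the same idea, the standard library's `functools.lru_cache` can remember a function's result per set of arguments. The function `slow_square_sum` and the counter `call_count` are hypothetical stand-ins for illustration, not code from the talk.

```python
import functools

call_count = 0  # counts how often the expensive body actually runs


@functools.lru_cache(maxsize=None)
def slow_square_sum(n):
    """Hypothetical stand-in for an expensive computation."""
    global call_count
    call_count += 1
    return sum(i * i for i in range(n))


first = slow_square_sum(1000)   # computed
second = slow_square_sum(1000)  # served from the cache, body not re-run
```

This only makes sense when the function always returns the same result for the same arguments.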
to map IP address to location. Example: 98.207.195.205 == California, USA • If we see future requests from that IP, we could call the IPInfoDB API again or store the mapping in a cache. • Here the cache could be a database, Memcache, or Redis (all are likely faster than calling an external API).
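A rough sketch of caching the IP-to-location mapping might look like this. A plain dict stands in for the database, Memcache, or Redis, and `fetch_location_from_api` is a hypothetical placeholder that simulates the external call rather than using any real API client.

```python
ip_location_cache = {}  # stand-in for a database, Memcache, or Redis
api_calls = []          # records which IPs triggered a (simulated) API call


def fetch_location_from_api(ip):
    """Hypothetical stand-in for a slow external geolocation API call."""
    api_calls.append(ip)
    return {"98.207.195.205": "California, USA"}.get(ip, "Unknown")


def get_location(ip):
    """Return the cached location, calling the API only on a cache miss."""
    if ip not in ip_location_cache:
        ip_location_cache[ip] = fetch_location_from_api(ip)
    return ip_location_cache[ip]


location_first = get_location("98.207.195.205")   # miss: calls the API
location_second = get_location("98.207.195.205")  # hit: no API call
```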
caching data that probably didn’t change often. • But what about data that does change? • Example: Tindie uses the GitHub API to get data about Open Hardware repositories. • These repos could change at any time (add followers, accept pull requests, etc.)
of data, and we don’t want to update our data if nothing has changed (we don’t want to update our cached copy). • But we want our version to reflect the most recent version on GitHub. You can’t have your cake and eat it too! • Tradeoff: either your data is behind/stale, or you spend computational resources ensuring it’s fresh.
that is often solved by having the service push you notifications instead of polling (polling == querying the service at specific time intervals). • For example purposes we’ll poll GitHub. • Many APIs don’t support push notifications so you have to poll.
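Polling at a fixed interval can be sketched as below. The class `PolledValue`, its `fetch` callable, and the injectable `clock` are assumptions made for illustration. In a real application `fetch` would call the remote API.

```python
import time


class PolledValue:
    """Caches a fetched value and re-fetches once `interval` seconds pass."""

    def __init__(self, fetch, interval, clock=time.monotonic):
        self._fetch = fetch
        self._interval = interval
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        is_due = (self._fetched_at is None
                  or now - self._fetched_at >= self._interval)
        if is_due:
            self._value = self._fetch()  # the (slow) call to the service
            self._fetched_at = now
        return self._value
```

A longer interval means fewer calls to the service but staler data, which is the tradeoff described above.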
often. • Sometimes we are simultaneously updating values in the database. • Doesn’t always make sense to cache query results in a variable because it might quickly become incorrect or out of date.
your application and your database. • Typically Memcache and/or Redis are used for this. • Memcache, Redis == in-memory (so they are fast) key-value data stores. • Lookup in a key-value data store is O(1) on average, so it is very computationally fast.
cached:
Key: select * from auth_user where id=1
Value:
 id | username          | first_name | last_name | email
----+-------------------+------------+-----------+-----------------
  1 | julialovescaching | Julia      | Grace     | [email protected]
• Updates, deletes invalidate the cache (invalidation == remove the value from the cache b/c it has changed).
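To illustrate the key/value idea and invalidation, here is a toy query cache over an in-memory SQLite database. The helpers `cached_select` and `write` are hypothetical, and clearing the whole cache on every write is a deliberately coarse assumption. Real query caches usually invalidate more selectively.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO auth_user VALUES (1, 'julialovescaching')")

query_cache = {}  # key: (sql, params); value: list of result rows


def cached_select(sql, params=()):
    """Return cached rows for a query, hitting the database only on a miss."""
    key = (sql, params)
    if key not in query_cache:
        query_cache[key] = conn.execute(sql, params).fetchall()
    return query_cache[key]


def write(sql, params=()):
    """Run an INSERT/UPDATE/DELETE, then invalidate cached reads."""
    conn.execute(sql, params)
    query_cache.clear()  # coarse: drops every cached query
```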
for Django • http://pythonhosted.org/johnny-cache/ • Before you use this, ensure you understand cache invalidation and read http://jmoiron.net/blog/is-johnny-cache-for-you/
templates must be compiled and rendered (typically “fast”, but what if you have hundreds of people on the same page and the content on that page hasn’t changed)? • Cache blocks of HTML in Memcache or Redis.
DB reads. • This worked for us because Memcache reads are cheaper and faster than DB reads or compiling Django templates. • But if your Memcache becomes slow then you have a problem. • There is no free lunch (or silver bullet).
too good to be true? • Yes, sometimes it is. • How do you decide which data “doesn’t change very often”? • How often should you update your cached data? Hours? Days? Weeks? • Every answer is wrong (or right! ☺)
Sometimes the data we have cached has changed and we have to update the cache (query cache handles this by detecting updates/deletes). • Example: if I update my username, then our cached version would return the old, wrong username.
when, for example, a user updates or deletes their product. • Don’t want to show stale data (“I updated my product but why isn’t it updated?!”) • Alternatively, we could let the cached data expire, so that after 10 minutes it would be fresh again.
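The two options, explicit invalidation and letting entries expire, can be sketched together. `TTLCache` is a hypothetical in-process class written for illustration. Memcache and Redis provide expiry themselves, so this only shows the concept.

```python
import time


class TTLCache:
    """Tiny cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None  # miss: caller should reload from the database
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def invalidate(self, key):
        """Remove an entry right away, e.g. after an update or delete."""
        self._store.pop(key, None)
```

Calling `invalidate` when a product changes avoids showing stale data immediately, while the expiry time bounds how stale anything can get if an invalidation is missed.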
very often is cached. • How long we cache pieces of data is something we continually tune. • We actually got to the point where Memcache was a bottleneck (story for another PyLadies meetup ☺)