RailsConf 2016: 5 Years of Scaling Rails to 80K RPS

Video: http://confreaks.tv/videos/railsconf2017-5-years-of-rails-scaling-to-80k-rps

Shopify has taken Rails through some of the world's largest sales: the Super Bowl, celebrity launches, and Black Friday. In this talk, we will go through the evolution of the Shopify infrastructure: from re-architecting and caching in 2012, to sharding in 2013, to reducing the blast radius of every point of failure in 2014, and on to 2016, when we succeeded in running our 325,000+ stores out of multiple datacenters. It'll be a whirlwind tour of the lessons learned scaling one of the world's largest Rails deployments for half a decade.

Simon Hørup Eskildsen

April 27, 2017

Transcript

  1. 377,500+ SHOPS; $29 BILLION+; 1900+ EMPLOYEES; 2 DATACENTRES; RUBY ON RAILS SINCE 2006; 80K PEAK RPS; 40+ DAILY DEPLOYS; 20K-40K+ STEADY RPS
  2. STOREFRONT: HEAVY READS, CACHEABLE, AVAILABILITY, 80% TRAFFIC
     CHECKOUT: HEAVY WRITES, EXTERNALS, CONSISTENCY
     ADMIN: COMPLEX R/W, CONSISTENCY
     API: COMPLEX R/W, CONSISTENCY, FAST COMPUTERS
  3. OPTIMIZATIONS

     OPTIMIZING THE HOT PATHS: Debug logs were printed to identify all the work going into requests
     BACKGROUNDING CHECKOUTS: Payment processing was pushed to background jobs
     INVENTORY OPTIMIZATIONS: MySQL lock contention too high with 1,000s of customers
  4. LOAD-TESTING

     FEEDBACK LOOP: Are we actually improving?
     FULL PRODUCTION INTEGRATION TESTING: Execute full checkout flow, simulate real users.
  5. IDENTITYCACHE

     class Product < ActiveRecord::Base
       include IdentityCache
       cache_index :handle, :unique => true
       cache_index :vendor, :product_type
     end

     product = Product.fetch_by_handle(handle)
     products = Product.fetch_by_vendor_and_product_type(vendor, product_type)
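To illustrate the idea behind a call like `fetch_by_handle`, here is a minimal read-through cache sketch in plain Ruby. It is not IdentityCache's implementation: `ReadThroughCache`, `PRODUCTS`, `misses`, and `expire` are hypothetical names, and plain Hashes stand in for both the cache store and the database.

```ruby
# Conceptual sketch only: a read-through cache keyed by an index value.
# One Hash stands in for the cache store, another for the database.
class ReadThroughCache
  attr_reader :misses

  def initialize(&loader)
    @store = {}
    @loader = loader # called on a cache miss to load from the "database"
    @misses = 0
  end

  # Return the cached value, loading and storing it on a miss.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @misses += 1
    @store[key] = @loader.call(key)
  end

  # Drop a key so the next fetch reloads it (for example, after a write).
  def expire(key)
    @store.delete(key)
  end
end

# Hypothetical data standing in for the products table.
PRODUCTS = { "beautiful-shoe" => { title: "Beautiful Shoe" } }.freeze
cache = ReadThroughCache.new { |handle| PRODUCTS[handle] }

cache.fetch("beautiful-shoe") # miss: loads from PRODUCTS
cache.fetch("beautiful-shoe") # hit: served from the cache
```

The point of the pattern is that repeated reads of a hot record stop reaching the database until a write expires the entry.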
  6. class ProductController < ApplicationController
       around_filter :with_shop

       def show
         @product = @shop.products.find(params[:id])
       end

       private

       def with_shop(&block)
         @shop = Shop.find_by_host(request.host)
         Sharding.with_shard(@shop.shard_id, &block)
       end
     end
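The slide does not show what `Sharding.with_shard` does internally. One plausible shape, offered purely as an assumption, is a block helper that selects a shard for the duration of the block and restores the previous selection afterwards. The `current_shard` reader and the `KEY` constant below are hypothetical.

```ruby
# Hypothetical sketch of a with_shard helper; not Shopify's implementation.
module Sharding
  KEY = :current_shard_id

  # Shard selected in the current execution context, or nil outside any block.
  # Note: Thread.current[] storage is local to the current fiber.
  def self.current_shard
    Thread.current[KEY]
  end

  # Run the block with shard_id selected, restoring the previous
  # selection afterwards even if the block raises.
  def self.with_shard(shard_id)
    previous = Thread.current[KEY]
    Thread.current[KEY] = shard_id
    yield
  ensure
    Thread.current[KEY] = previous
  end
end

result = Sharding.with_shard(3) { "query ran on shard #{Sharding.current_shard}" }
```

Restoring the previous value in `ensure` keeps nested calls and failed requests from leaking a shard selection into later work.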
  7. SHARDING

     DON’T SHARD (WHERE ARE YOU ON THE OPTIMIZATION SPECTRUM?): Sharding is hard; it took us a year!
     ARCHITECTURE DRAWBACKS: Common cases are easy; edge cases can now violate fundamentals. For example, cross-database transactions are now impossible.
     APPLICATION-LEVEL SHARDING: Why did we choose it over a proxy or changing datastores?
  8. [Chart: Availability (70 to 100) against Components (10, 50, 100, 500, 1000), with series labeled 99.95, 99.98, 99.99, and 99.999]
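Assuming the chart plots the availability of a system that needs every one of its components up, and that components fail independently, overall availability is the per-component availability raised to the number of components. The sketch below computes that for the values on the chart's axes; `system_availability` is a hypothetical helper name.

```ruby
# Availability (in percent) of a system that needs all n independent
# components up, each with the same availability (also in percent).
def system_availability(component_percent, n)
  ((component_percent / 100.0)**n) * 100.0
end

[99.95, 99.98, 99.99, 99.999].each do |percent|
  row = [10, 50, 100, 500, 1000].map do |n|
    format("%.2f", system_availability(percent, n))
  end
  puts "#{percent}%: #{row.join('  ')}"
end
```

Under these assumptions, 1000 components at 99.99% each yield roughly 90.5% for the system as a whole, which is the motivation for limiting how far a single failure can spread.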
  9. single component failure should not be able to compromise the performance or availability of the entire system
  10. Resiliency Matrix

                           Checkout               Admin        Storefront
      MySQL Shard          Unavailable            Unavailable  Degraded
      MySQL Master         Available (if cached)  Unavailable  Available
      Kafka                Available              Degraded     Available
      External HTTP API    Degraded               Available    Unavailable
      redis-sessions       Unavailable            Unavailable  Degraded
      logging (disk full)  Unavailable            Unavailable  Unavailable
  11. Simulate network problems with Toxiproxy

      https://github.com/shopify/toxiproxy

      Toxiproxy[:mysql_master].downstream(:latency, latency: 1000).apply do
        Shop.first # this takes at least 1s
      end

      Toxiproxy[/redis/].down do
        session[:user_id] # this will throw an exception
      end
  12. Write a Toxiproxy test for each cell

      # test/integration/resiliency_matrix_test.rb
      def test_section_a_mq_a_down
        Toxiproxy[:message_queue_a].down do
          get '/section_a'
          assert_response :success
        end
      end

      def test_section_b_datastore_b
        Toxiproxy[:datastore_b].down do
          get '/section_b'
          assert_response 500
        end
      end

      # ... and every other cell
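One way to keep "a test for each cell" honest is to hold the matrix as data and derive the list of expected test names from it. The sketch below does this in plain Ruby using the cell values from the Resiliency Matrix slide. The `MATRIX` constant, the `test_name` helper, and the naming scheme are assumptions for illustration, not something shown in the talk.

```ruby
# Resiliency matrix from the slides, as data: dependency => section => state.
MATRIX = {
  "MySQL Shard"         => { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  "MySQL Master"        => { checkout: :available_if_cached, admin: :unavailable, storefront: :available },
  "Kafka"               => { checkout: :available, admin: :degraded, storefront: :available },
  "External HTTP API"   => { checkout: :degraded, admin: :available, storefront: :unavailable },
  "redis-sessions"      => { checkout: :unavailable, admin: :unavailable, storefront: :degraded },
  "logging (disk full)" => { checkout: :unavailable, admin: :unavailable, storefront: :unavailable }
}.freeze

# Hypothetical naming scheme: the test expected to cover one cell.
def test_name(dependency, section)
  slug = dependency.downcase.gsub(/[^a-z0-9]+/, "_").gsub(/\A_+|_+\z/, "")
  "test_#{section}_with_#{slug}_down"
end

# One expected test name per cell of the matrix.
cells = MATRIX.flat_map do |dependency, row|
  row.keys.map { |section| test_name(dependency, section) }
end

puts cells
```

Comparing `cells` against the test methods that actually exist would show which cells still lack coverage.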
  13. Resiliency Maturity Pyramid: No resiliency effort; Testing with mocks; Toxiproxy tests and matrix; Resiliency Patterns; Production Practise Days (Games); Kill nodes; Latency; Application-Specific Fallbacks; Kill DC
  14. ACTIVEFAILOVER: 10-60S FAILOVERS

      1. FAILOVER TRAFFIC: Set flag on load balancers to redirect traffic to new datacenter
      2. READ-ONLY SHOPIFY: Traffic going to new datacenter, but is read-only (no checkouts, changes)
      3. FAILOVER DATABASE: Move the writer for all shards to the new primary datacenter
      4. TRANSFER JOBS: Queued and delayed jobs are transferred to the new primary DC
  15. [Diagram: datacenter 1 and datacenter 2, each containing pods 1 to 6]
  16. [Diagram: two groups of pods 1 to 4, with a "sorting hat" in front of them receiving the request below]

      GET /products/beautiful-shoe HTTP/1.1
      Host: myshop.com
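The diagram suggests that the "sorting hat" looks at an incoming request and decides which pod should serve it. A minimal sketch of that idea, assuming routing is keyed on the `Host` header, might look like the following. The `SHOP_PODS` table, the `X-Pod` annotation, and both helper names are invented for illustration.

```ruby
# Hypothetical sketch of a "sorting hat": map a request's Host header to a pod.
# The table below is invented; a real system would look the shop up somewhere.
SHOP_PODS = { "myshop.com" => 2, "othershop.com" => 4 }.freeze

class UnknownShop < StandardError; end

# Return the pod number that should serve the given host.
def pod_for(host)
  SHOP_PODS.fetch(host.downcase) { raise UnknownShop, "no pod for #{host}" }
end

# Annotate a request (a plain Hash here) with its pod before forwarding it.
def annotate_with_pod(request)
  request.merge("X-Pod" => pod_for(request.fetch("Host")))
end

routed = annotate_with_pod({ "Host" => "myshop.com", "Path" => "/products/beautiful-shoe" })
```

Resolving the pod once, at the edge, means everything behind the router can rely on the annotation instead of repeating the lookup.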
  17. rule 1: any request must be annotated with a pod or shop
      rule 2: any request can only touch one pod
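The two rules can be read as runtime invariants. The sketch below shows one hypothetical way to check them per request: refuse a request that carries no pod, and refuse any data access aimed at a different pod. `PodGuard`, `Violation`, and `touch` are illustrative names, not part of the talk.

```ruby
# Hypothetical sketch of enforcing both rules for a single request.
class PodGuard
  class Violation < StandardError; end

  def initialize(pod)
    # Rule 1: a request must carry a pod (or a shop that resolves to one).
    raise Violation, "request is not annotated with a pod" if pod.nil?

    @pod = pod
  end

  # Rule 2: every data access must target the request's own pod.
  def touch(pod)
    raise Violation, "request for pod #{@pod} touched pod #{pod}" unless pod == @pod

    yield if block_given?
  end
end

guard = PodGuard.new(2)
rows = guard.touch(2) { "rows from pod 2" }
```

Raising at the point of access makes a cross-pod dependency visible in development and tests instead of during a failover.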
  18. if Shitlist.include?(klass)
        super
      else
        error = <<-EOE
          New usage of this API is deprecated. Please come talk
          to the Pods team in #pods and we'll help you out!
        EOE
        raise Shitlist::Error, error
      end
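The fragment above calls `super`, which implies it lives in a method that overrides the deprecated API. A self-contained sketch of that arrangement, under stated assumptions, follows: existing callers on the list keep working and any new caller raises. The `ALLOWED` list, the `Database#cross_pod_query` method, and the `LegacyReport` and `NewFeature` classes are all hypothetical.

```ruby
# Hypothetical sketch: grandfathered callers of a deprecated API keep
# working, while any new caller raises.
module Shitlist
  class Error < StandardError; end

  ALLOWED = ["LegacyReport"].freeze # invented list of existing callers

  def self.include?(klass)
    ALLOWED.include?(klass.name)
  end
end

class Database
  # Stand-in for an API that new code should no longer use.
  def cross_pod_query(klass)
    "rows for #{klass.name}"
  end
end

# Prepended so that `super` reaches the original method.
module ShitlistGuard
  def cross_pod_query(klass)
    if Shitlist.include?(klass)
      super
    else
      raise Shitlist::Error, "New usage of this API is deprecated."
    end
  end
end
Database.prepend(ShitlistGuard)

class LegacyReport; end
class NewFeature; end

legacy_rows = Database.new.cross_pod_query(LegacyReport)
```

Because the list can only shrink, the amount of code depending on the deprecated API goes down over time without blocking a release on a full migration.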