

Building the World's Largest Websites with Consul and Terraform

This is a talk given at Xebicon.nl

A lot of context is missing, but Xebia said they'd be uploading video of the talk shortly. Google to see if it's up; the talk title is the same as the title of this deck. The talk is much more informative, including real-world cases from some of our biggest users.

Mitchell Hashimoto

June 04, 2015

Transcript

  1. RISING DATACENTER COMPLEXITY
     (diagram: a single datacenter, DC, running many VMs, each packed with containers)
  2. RISING DATACENTER COMPLEXITY
     (diagram: multiple datacenters, DC-01 and DC-02, each running VMs and containers)
  3. Questions that Consul Answers
     • Where is the service foo? (ex. Where is the database?)
     • What is the health status of service foo?
     • What is the health status of the machine/node foo?
     • What is the list of all currently running machines?
     • What is the configuration of service foo?
     • Is anyone else currently performing operation foo?
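     The questions above map directly to Consul's DNS and HTTP interfaces. A minimal sketch, assuming a local agent on the default ports and a service registered under the name "foo":

     # Where is service foo? (DNS interface, default port 8600)
     $ dig @127.0.0.1 -p 8600 foo.service.consul

     # What is the health status of service foo? (HTTP API)
     $ curl http://127.0.0.1:8500/v1/health/service/foo

     # What is the list of all currently known nodes?
     $ curl http://127.0.0.1:8500/v1/catalog/nodes

     # What is the configuration of service foo? (key/value store; key path is illustrative)
     $ curl http://127.0.0.1:8500/v1/kv/service/foo/config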
  4. What if I asked you to…
     • create a completely isolated second environment to run an application (staging, QA, dev, etc.)?
     • deploy a complex new application?
     • update an existing complex application?
     • document how our infrastructure is architected?
     • delegate some ops to smaller teams? (Core IT vs. App IT)
  5. SCALABILITY
     • Expectation of high QPS per resource
     • CPU, memory are valuable resources
     • One less server for utility = one more server for serving customers
     • Push vs. pull, a.k.a. edge-triggered changes
  6. RESILIENCY
     • Probability of failure goes up with scale
     • Embrace failure and make it acceptable
     • Constant change at some scale
     • Self-healing systems become much more important (automatic anti-entropy)
     • Central sources of truth become liabilities
  7. DETERMINISM
     • Understand the full effect of a change
     • Predictable (but not necessarily strict) ordering of a change
     • Limiting surprises that can cause downtime
  8. TRADITIONAL SERVICE CONFIGURATION
     Pull-based, long intervals, computationally expensive
     (diagram: a config management server polled by WEB 1, WEB 2, … WEB N at staggered times: 14:00, 14:07, 14:03)
  9. CONSUL-TEMPLATE
     Template example:

     global
       daemon
       maxconn {{key "haproxy/maxconn"}}

     defaults
       mode {{key "haproxy/mode"}}{{range ls "haproxy/timeouts"}}
       timeout {{.Key}} {{.Value}}{{end}}

     listen http-in
       bind *:8000{{range service "release.web"}}
       server {{.Node}} {{.Address}}:{{.Port}}{{end}}
  10. CONSUL-TEMPLATE
     Execute (as a service):

     $ consul-template \
         -consul demo.consul.io \
         -template "haproxy.ctmpl:/etc/haproxy/haproxy.conf:restart haproxy" \
         -dry
  11. STEP BY STEP
     1. Config management puts down configuration template
     2. consul-template runs as a service
     3. Edge triggers config changes, restarts service
  12. ZERO TTL DNS
     • Long-held connections to minimize DNS overhead
     • Zero TTL ensures most up-to-date information
  13. RESILIENCY
     • Low-TTL DNS records
     • Ensures availability even if Consul is unavailable
     • Required for short-held connections, since DNS lookup overhead is too high with zero TTL
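     Both TTL behaviors are controlled by the agent's dns_config block. A minimal sketch, assuming a JSON agent configuration file ("web" is a hypothetical service name):

     {
       "dns_config": {
         "node_ttl": "10s",
         "service_ttl": {
           "*": "0s",
           "web": "5s"
         }
       }
     }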
  14. OPTION #1: CONSUL SETTINGS
     Per-service, stale reads on non-leaders
     (diagram: WEB PROCESS issues a DNS query to the local CONSUL AGENT, answered by the CONSUL LEADER or a CONSUL STANDBY)
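     Stale reads are also a dns_config setting; with them enabled, any Consul server can answer a DNS query instead of only the leader. A minimal sketch, again assuming a JSON agent configuration:

     {
       "dns_config": {
         "allow_stale": true,
         "max_stale": "5s"
       }
     }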
  15. OPTION #2: DNSMASQ + CONSUL
     Global, works if Consul is down
     (diagram: WEB PROCESS issues a DNS query to DNSMASQ, which forwards it to the CONSUL AGENT)
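     The usual wiring is to have dnsmasq forward the .consul domain to the local agent's DNS port. A minimal sketch, assuming the agent listens on the default 127.0.0.1:8600 (the file path is illustrative):

     # /etc/dnsmasq.d/10-consul
     server=/consul/127.0.0.1#8600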
  16. OPTION #3: APPLICATION-LEVEL CACHE
     Works if almost everything is down, strict control over cache times
     (diagram: WEB PROCESS issues a DNS query against an IN-MEM CACHE backed by the CONSUL AGENT)
  17. BEST OPTION?
     The first two options are usually good enough and will buy you a lot of runway.
  18. CONSUL MONITORING
     Removes unhealthy nodes from service discovery layer
     (diagram: WEB 1, WEB 2, … WEB N behind Consul; dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6)
  19. CONSUL MONITORING
     (same diagram: all nodes healthy, dig web.service.consul returns 10.0.1.4, 10.0.1.5, 10.0.1.6)
  20. CONSUL MONITORING
     (diagram: one node becomes unhealthy; dig web.service.consul now returns only 10.0.1.5 and 10.0.1.6)
  21. CONSUL MONITORING
     (diagram: a client resolving host web.service.consul sees only the healthy nodes, 10.0.1.5 and 10.0.1.6)
  22. CONSUL MONITORING
     (same diagram as the previous frame)
  23. CONSUL MONITORING
     (same diagram as the previous frame)
  24. CONSUL MONITORING
     (same diagram as the previous frame)
  25. CONSUL MONITORING
     (diagram: the unhealthy node recovers; dig web.service.consul again returns 10.0.1.4, 10.0.1.5, 10.0.1.6)
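     The removal shown above is driven by health checks attached to each service registration. A minimal sketch of a service definition with a check, assuming a JSON file in the agent's configuration directory (the port, script, and interval are illustrative):

     {
       "service": {
         "name": "web",
         "port": 8000,
         "check": {
           "script": "curl -sf http://localhost:8000/health",
           "interval": "10s"
         }
       }
     }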
  26. DISTRIBUTED LOCKS
     • Building block for distributed systems
     • Complexity hidden from downstream applications, like a mutex stdlib
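     Consul builds these locks on sessions and the key/value store, and the CLI wraps them as "consul lock". A minimal sketch (the KV prefix and script are hypothetical):

     # Hold the lock stored under the KV prefix while the command runs;
     # only one holder across the cluster runs the command at a time.
     $ consul lock locks/deploy ./deploy.sh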
  27. CONSUL EXEC
     Runs script on specific nodes
     (diagram: WEB 1 and WEB 2 run the script, DATABASE does not)

     $ consul exec -service="web" ./script.sh
  28. LARGE SCALE INFRA UPDATE
     • Unexpected inter-dependencies
     • Cross-cloud changes
     • Ordering for minimal disruption
     • Expected time for complete rollout
  29. Terraform Plan
     What are you going to do?

     + digitalocean_droplet.web
         backups:              "" => "<computed>"
         image:                "" => "centos-5-8-x32"
         ipv4_address:         "" => "<computed>"
         ipv4_address_private: "" => "<computed>"
         name:                 "" => "tf-web"
         private_networking:   "" => "<computed>"
         region:               "" => "sfo1"
         size:                 "" => "512mb"
         status:               "" => "<computed>"

     + dnsimple_record.hello
         domain:    "" => "example.com"
         domain_id: "" => "<computed>"
         hostname:  "" => "<computed>"
         name:      "" => "test"
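     The plan is computed before anything changes, which is what gives the determinism from earlier: save the plan, review it, then apply exactly that plan. A minimal sketch:

     # Compute and save the execution plan without touching infrastructure
     $ terraform plan -out=web.tfplan

     # Apply exactly the plan that was reviewed
     $ terraform apply web.tfplan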
  30. OPS DELEGATION
     • "Core" operations teams
     • Application operations teams
     • Eliminate shadow ops
     • Safely make changes without negatively affecting others
     • Share operations knowledge
  31. Modules
     Unit of knowledge sharing

     module "consul" {
       source  = "github.com/hashicorp/consul/terraform/aws"
       servers = 5
     }

     output "consul-address" {
       value = "${module.consul.server_address}"
     }
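     A short usage note: module sources are fetched before planning, so the typical flow (assuming the module source above is reachable) looks like:

     $ terraform get     # download referenced modules
     $ terraform plan
     $ terraform apply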
  32. Remote State
     Unit of resource sharing

     resource "terraform_remote_state" "consul" {
       backend = "atlas"
       config {
         path = "hashicorp/consul-prod"
       }
     }

     output "consul-address" {
       value = "${terraform_remote_state.consul.addr}"
     }
  33. SERVICE COMPOSITION
     • Modern infrastructures are almost always "multi-provider": DNS in CloudFlare, compute in AWS, etc.
     • Infrastructure change requires composing data from multiple services, executing change in multiple services
  34. Service Composition
     Connecting multiple service providers

     resource "aws_instance" "web" {
       # …
     }

     resource "cloudflare_record" "www" {
       domain = "foo.com"
       name   = "www"
       value  = "${aws_instance.web.private_ip}"
       type   = "A"
     }
  35. Logical Resources
     Now you're thinking in graphs

     resource "template_file" "data" {
       filename = "data.tpl"
       vars {
         address = "${var.addr}"
       }
     }

     resource "aws_instance" "web" {
       user_data = "${template_file.data.rendered}"
     }

     resource "cloudflare_record" "www" {
       domain = "foo.com"
       name   = "www"
       value  = "${aws_instance.web.private_ip}"
       type   = "A"
     }
  36. HISTORY OF INFRA CHANGE
     • See who did what, when, and how
     • See what changed recently to diagnose some monitoring event
     • Treat infrastructure as a sort of application
  37. INFRA COLLABORATION
     • Achieve application-like collaboration with infrastructure change
     • Code reviews, safe merges
     • Understanding the effect of infrastructure changes