Berlin 2013 - Session - Reza Spagnolo

Monitorama
September 20, 2013

Transcript

  1. Hey there! Who am I?
     • A student
     • An engineer, for 9 years now
     • Interested in building systems
     • Dev & Ops since the beginning
  2. Namespaces
     There are only two hard things in Computer Science: cache invalidation and naming things.
     -- Phil Karlton
  3. Metrics namespaces
     • Helps your mental model
     • Helps identify things
     • Dimensions: location, versions, etc.
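
One way to picture a metrics namespace is as a dotted path assembled from the dimensions the slide lists. The sketch below is only an illustration: the function name `metric_name`, the ordering of the dimensions, and the extra `env` and `service` dimensions are assumptions, not something the talk prescribes.

```python
import re


def metric_name(*, env, location, service, version, metric):
    """Build a dotted metric path from named dimensions.

    The dimension order (env, location, service, version, metric) is an
    assumption chosen for this example.
    """
    parts = [env, location, service, version, metric]
    cleaned = []
    for part in parts:
        if not part:
            raise ValueError("every dimension needs a non-empty value")
        # Dots separate namespace levels, so replace them (and any other
        # unusual characters) inside a single level with underscores.
        cleaned.append(re.sub(r"[^A-Za-z0-9_-]+", "_", part))
    return ".".join(cleaned)


name = metric_name(
    env="production",
    location="eu-west",
    service="checkout",
    version="1.4.2",
    metric="request latency",
)
print(name)
```

Keeping the version as its own level makes it straightforward to compare the same metric across releases or environments.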
  4. Monitoring based promotion
     Acceptance, Development, Production
     • Production configuration
     • Comparison
     • Log analysis
  5. Miner's canary
     • If a customer lets you know about a problem then you have already failed at least twice
     • The right quantity
     • Filtering – see the right picture
     • Document changes to your baselines
  6. Architecture
     • Single responsibility principle
     • Orchestration or Choreography
     • Dynamic configuration
     • Failover and feedback cycles
     • Rate limiting
     • Integration patterns
  7. Single responsibility principle
     • (Micro-)Services
     • Components
     • Small number of dependencies
     • Predictable failure modes
     • Easier adaptation
     • Expectation on metrics
  8. Orchestration or Choreography
     • Orchestration
       – May be simpler to reason about
       – Coupling with the director
     • Choreography
       – Possibly more flexible
       – Beware of corruption of state
  9. Failover and feedback cycles
     • Automated failover
     • Failover stress
     • Beware of amplifying effects
     • Break cycles
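
A common source of amplification is clients retrying in lockstep after a failover. The slide does not name a specific remedy; the sketch below swaps in capped exponential backoff with random jitter as one possible way to damp such a feedback cycle. The function name `backoff_delays` and all parameter defaults are assumptions made for this example.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    The upper bound doubles with every attempt and is clamped at `cap`;
    the actual delay is drawn uniformly below that bound so that many
    clients do not retry at the same instant.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# A seeded generator keeps the example reproducible.
for attempt, delay in enumerate(backoff_delays(5, rng=random.Random(42))):
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

The cap bounds how long a single client waits, while the jitter spreads the retries of many clients over time.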
  10. Rate limiting
      • Degraded is better than nothing
      • Not only at the top level
      • Component rate limiting
      • Rate limiting should be dynamic
      • Rate limiting can be partitioned
      • Clients should be part of the contract
      • Rate limiting is after all handshaking
      • Handshaking: within the protocol or out of band
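
The "dynamic" and "partitioned" points above can be illustrated with a token bucket per partition key. This is a minimal sketch, not the speaker's implementation: the token-bucket algorithm, the class names `TokenBucket` and `PartitionedLimiter`, and the injectable `clock` parameter are all assumptions chosen for the example.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self):
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def set_rate(self, rate):
        """Dynamic limiting: change the refill rate at runtime."""
        self._refill()  # account for time already elapsed at the old rate
        self.rate = rate

    def allow(self, cost=1.0):
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PartitionedLimiter:
    """Keeps one bucket per partition key (for example, per client)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, key):
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[key] = bucket
        return bucket.allow()
```

A caller that gets `False` back can serve a degraded response instead of failing outright, which is in the spirit of "degraded is better than nothing".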
  11. Integration and component patterns
      • Timeouts
      • Circuit breakers
      • Resource pools
      • Fail fast
      • Queue and retry
      • Application pings and sanity checks
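
As an illustration of how circuit breakers and fail fast fit together, here is a minimal sketch of a breaker that opens after a number of consecutive failures and lets a trial call through once a cool-down has passed. The class name `CircuitBreaker`, the exception `CircuitOpenError`, and the thresholds are assumptions for this example, not taken from the talk.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Fail fast: do not touch the struggling dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, allow this call as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The number of times the breaker opens is itself a useful metric to publish, since it shows which dependency is degrading.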
  12. Additional practices
      • Quarantine
      • Regenerative infrastructure
      • Rollback and monitoring
      • Automation of SOP – Runbook
  13. Automated runbooks and checklists
      • Automate your SOP
      • Respond to failure with a checklist
      • Automate checklists too
      • Helps to avoid cognitive bias and other nasty stuff your brain does
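
An automated checklist can be as small as a list of named checks that always run in the same order and report every result. The sketch below assumes hypothetical names (`Step`, `run_checklist`) and placeholder checks; real checks would query the systems involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    check: Callable[[], bool]


def run_checklist(steps: List[Step]) -> List[Tuple[str, bool]]:
    """Run every step in order and record whether it passed.

    A step that raises counts as failed, and the remaining steps still
    run, so the responder sees the whole picture rather than only the
    first failure.
    """
    results = []
    for step in steps:
        try:
            ok = bool(step.check())
        except Exception:
            ok = False
        results.append((step.name, ok))
    return results


# Placeholder checks standing in for real probes.
checklist = [
    Step("disk space available", lambda: True),
    Step("database reachable", lambda: False),
]
for step_name, ok in run_checklist(checklist):
    print(f"[{'ok' if ok else 'FAIL'}] {step_name}")
```

Because the order and the wording of the steps are fixed in code, the same checklist runs identically under stress as it does in a drill.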
  14. Sources
      • Recovery Oriented Computing papers
      • James Hamilton LISA paper
      • Release It!
      • Scalable Internet Architectures
      • A ton of other great books and papers
  15. The value
      Among the kinds of overhead:
      • The operational one
      • The customers' one
      No matter how sophisticated our monitoring infrastructure is, issues reported by customers are in the end the most important ones, as they affect the customers' experience directly and often reveal unknown bugs.
      Freeing the team as much as possible from overhead of the first type gives more time to focus on the issues of the product itself.