Berlin 2013 - Session - Jeff Weinstein

Monitorama
September 20, 2013

Transcript

  1. How monitoring can improve the rest of the company
     Monitorama EU 2013
     @jeff_weinstein
  2. Monitoring can wildly improve the whole company by sharing data and sharing techniques.
  3. Monitoring Folks; Developers; Business Analysts; Executives & Product; Data Scientists; Data
  4. Apps & Services & Systems; Users; Data; Code & Config; Monitoring
  5. Data Processing
     Apps; Systems; Logs / Events; Metrics; Graphs & Alerts; Apps; 3rd Party; Reports & Queries; ETL; Analytic Systems
     Monitoring: Streaming
     BI: Batch
  6. Data Needs
     Logs; Metrics; Logs; Metrics; Streaming; Batch; Data; Monitoring; BI
  7. Data Tools Stack
     Monitoring
     • Ad hoc
       – sed, grep, awk
       – ES, LogStash, Splunk, …
     • Storage
       – Hosts, Ganglia, OTSDB
       – Central syslog server
     • Visualization/Reporting
       – Graphite, RRDTool, 3rd party
       – Homegrown
     • Alerting/Escalation
       – Nagios, Sensu, PagerDuty, …
     Rest of company
     • Ad hoc
       – Excel, SQL, Hive
       – MapReduce, …
     • Storage
       – Lots o’ databases, Excel
       – Hadoop, RDBMS…
     • Visualization/Reporting
       – Excel, R, Tableau ...
       – Dinosaur apps, …
     • Alerting/Escalation
       – nada
  8. Views
     • Unintelligible generated views
     • Too granular for long term trends
     • Lack of historical
     • Intolerant to anomalies
  9. Team and incentives
     • What team?
     • Change vs. reliability
     • Planning
     • Budget
     • Churn
  10. Good or bad?
      • Specific Tools
      • Decentralized
      • Focus
      • Ownership
      • Lost context
      • Siloed work
      • Data dark
      • Misunderstanding
  11. End to End Data Pipeline
      ✓ Structured logs
      ✓ (Config)
      ✓ Measure once
      ✓ Automatic metrics
      ✓ API
      ✓ Graph tools
      ✓ Glossary
      ✓ Annotations and tags
      ✓ Pipeline
  12. Structured events
      • JSON (or whatever)
      • (optional) config
      • Tags per key
        – Type
        – Tag: latency, funnel, …
        – Description
        – Storage
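A minimal sketch of what the structured-events slide describes: a JSON event plus an optional config that records a type, tag, description, and storage hint for each key. Everything named here (`EVENT_CONFIG`, `make_event`, the two example keys, and the retention strings) is a hypothetical illustration, not something taken from the talk.

```python
import json
import time

# Hypothetical per-key config with the four fields listed on the slide:
# type, tag, description, storage.
EVENT_CONFIG = {
    "checkout_latency_ms": {
        "type": "float",
        "tag": "latency",
        "description": "Time to complete a checkout request, in milliseconds",
        "storage": "90d",
    },
    "signup_step": {
        "type": "str",
        "tag": "funnel",
        "description": "Name of the signup step the user reached",
        "storage": "365d",
    },
}

# Map the config's type names onto Python types accepted for that key.
PY_TYPES = {"float": (int, float), "str": (str,)}


def make_event(name, **fields):
    """Return one JSON log line; reject keys that are undocumented or mistyped."""
    for key, value in fields.items():
        spec = EVENT_CONFIG.get(key)
        if spec is None:
            raise KeyError(f"undocumented key: {key}")
        if not isinstance(value, PY_TYPES[spec["type"]]):
            raise TypeError(f"{key} should be of type {spec['type']}")
    record = {"event": name, "ts": time.time(), **fields}
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    print(make_event("checkout", checkout_latency_ms=12.5))
```

Rejecting undocumented keys at emit time is one possible design choice; it keeps the config and the logs from drifting apart, at the cost of requiring a config change before a new key can be logged.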
  13. Auto: Graphs, Glossary, & Storage
      • Graphs and dashboards
      • * templates
      • Views and stats
      • Glossary
      • Batch analytics
      • Long term storage
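One way the "auto" idea could work, sketched under assumptions: if every logged key carries a tag and a description, a glossary and a per-tag dashboard grouping can be generated from that config instead of maintained by hand. `KEY_CONFIG`, `build_glossary`, and `group_by_tag` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

# Hypothetical per-key metadata, in the shape of the "tags per key" slide.
KEY_CONFIG = {
    "checkout_latency_ms": {
        "type": "float",
        "tag": "latency",
        "description": "Time to complete a checkout request, in milliseconds",
    },
    "search_latency_ms": {
        "type": "float",
        "tag": "latency",
        "description": "Time to answer a search query, in milliseconds",
    },
    "signup_step": {
        "type": "str",
        "tag": "funnel",
        "description": "Name of the signup step the user reached",
    },
}


def build_glossary(config):
    """Return one human-readable glossary line per key, sorted by key name."""
    return [
        f"{key} ({spec['type']}, {spec['tag']}): {spec['description']}"
        for key, spec in sorted(config.items())
    ]


def group_by_tag(config):
    """Group keys by tag, e.g. to build one dashboard per tag."""
    groups = defaultdict(list)
    for key in sorted(config):
        groups[config[key]["tag"]].append(key)
    return dict(groups)


if __name__ == "__main__":
    for line in build_glossary(KEY_CONFIG):
        print(line)
    print(group_by_tag(KEY_CONFIG))
```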
  14. Developers
      • Logging toolkit
      • Data pipeline
      • Pain points
      • Outage causes
      • Deployment practices
      • Escalation playbook
      • Measurement as TDD
      • Monitor staging env
  15. Business Analysts
      • Structured logs
      • Config for ETL
      • Metrics definitions
      • Slices and visualizations
      • Data size and cardinality
      • Outages and delays
      • Flexibility
      • Visualization and tools
  16. Data Scientists
      • Access to (meta)data
      • Query monitoring
      • Statistics and models
      • New data streams
      • Context of data issues
      • What’s in the logs
      • Validate algorithms
      • Teach stats and models!
  17. Product & Executives
      • Curated dashboards
      • Graph/alert tools
      • Learn the business
      • Prioritize alerts by $
      • Incident post mortems
      • Metrics granularity
      • Data driven decisions
      • Recognize and celebrate
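"Prioritize alerts by $" could be sketched as ranking firing alerts by an estimate of money at risk. The `Alert` fields, the dollar figures, and the scoring rule (rate multiplied by time firing) are all assumptions made for illustration; the slides do not say how the estimate should be made.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    dollars_per_minute: float  # hypothetical estimate of revenue at risk
    minutes_firing: float      # how long the alert has been active


def estimated_cost(alert):
    """Assumed scoring rule: revenue at risk accumulated so far."""
    return alert.dollars_per_minute * alert.minutes_firing


def prioritize(alerts):
    """Return alerts ordered from most to least estimated cost."""
    return sorted(alerts, key=estimated_cost, reverse=True)


if __name__ == "__main__":
    firing = [
        Alert("disk-80-percent", dollars_per_minute=0.0, minutes_firing=120),
        Alert("checkout-errors", dollars_per_minute=50.0, minutes_firing=4),
        Alert("search-slow", dollars_per_minute=5.0, minutes_firing=10),
    ]
    for alert in prioritize(firing):
        print(alert.name, estimated_cost(alert))
```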
  18. Icons from The Noun Project: Dmitry Baranovskiy, Benjamin Orlovski, Luis Prado, MikaDo Nguyen, Yarden Gilboa, Javier Cabezas, Icons Pusher, Jeremy Bristol, Blake Thomas, Ritika Khasgiwale, Mayene de Leon, Yorlmar Campos, Sergey Shmid
      @jeff_weinstein
      Thanks! hiring ;)