Infrastructure & System Monitoring using Prometheus

Marco Pas

June 04, 2017

Transcript

  1. Infrastructure & System Monitoring using Prometheus

    Marco Pas, Philips Lighting
    Software geek, hands-on Developer/Architect/DevOps Engineer
    @marcopas
  2. Some stuff about me...

    • Mostly doing cloud-related stuff
      ◦ Java, Groovy, Scala, Spring Boot, IoT, AWS, Terraform, Infrastructure
    • Enjoying the good things
    • "Chef leuke dingen doen" (Dutch, roughly "Chief of Fun Things") == “trying out cool and new stuff”
    • Currently involved in a big IoT project
    • Wannabe chef, movie & Netflix addict
  3. Agenda

    • Monitoring
      ◦ Introducing you to a Scary Movie
    • Prometheus overview (demos)
      ◦ Running Prometheus
      ◦ Gathering host metrics
      ◦ Introducing Grafana
      ◦ Monitoring Docker containers
      ◦ Alerting
      ◦ Instrumenting your own code
      ◦ Service Discovery (Consul) integration
  4. Our scary movie “The Happy Developer”

    • Let's push out features
    • I can demo it, so it works :)
    • It works with 1 user, so it will work with multiple
    • Don’t worry about performance, we will just scale using multiple machines/processes
    • Logging is in place
  5. Logging != Monitoring

    Logging: “recording to diagnose a system”
      127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
    Monitoring: “observation, checking and recording”
      http_requests_total{method="post",code="200"} 1027 1395066363000
  6. Why Monitoring?

    • Know when things go wrong
      ◦ Detection & Alerting
    • Be able to debug and gain insight
    • Detect changes over time and drive technical/business decisions
    • Feed into other systems/processes (e.g. security, automation)
  7. What to monitor?

    [Diagram: IT Network, Operating System, Services, Applications; Capture Monitoring Information (Functional Monitoring, Operational Monitoring); metric data]
  8. Houston, we have a storage problem!

    [Diagram: Storage and many "metric data" items]
    How do we store the massive amount of metrics while keeping them easy to query?
  9. Time Series - Database

    • Time series data is a sequence of data points collected at regular intervals over a period of time (metrics)
      ◦ Examples:
        ▪ Device data
        ▪ Weather data
        ▪ Stock prices
        ▪ Tide measurements
        ▪ Solar flare tracking
    • The data requires aggregation and analysis
    • A Time Series Database for metric data offers:
      ◦ High write performance
      ◦ Data compaction
      ◦ Fast, easy range queries
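To make "fast, easy range queries" and "aggregation" concrete, here is a minimal in-memory sketch. It is not how Prometheus stores data; the class `TimeSeriesSketch` and its methods are hypothetical names chosen for illustration, and a sorted map stands in for a real time series database.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory series: timestamp (ms) -> value, kept sorted by time. */
final class TimeSeriesSketch {
    private final NavigableMap<Long, Double> points = new TreeMap<>();

    /** Append (or overwrite) the data point at the given timestamp. */
    void append(long timestampMs, double value) {
        points.put(timestampMs, value);
    }

    /** Range query: all points with fromMs <= timestamp <= toMs. */
    NavigableMap<Long, Double> range(long fromMs, long toMs) {
        return points.subMap(fromMs, true, toMs, true);
    }

    /** Aggregation: average over a time range, NaN when the range is empty. */
    double average(long fromMs, long toMs) {
        return range(fromMs, toMs).values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        TimeSeriesSketch cpu = new TimeSeriesSketch();
        // One sample every 15 seconds, mirroring a 15s collection interval.
        cpu.append(0L, 10.0);
        cpu.append(15_000L, 20.0);
        cpu.append(30_000L, 30.0);
        cpu.append(45_000L, 40.0);
        System.out.println(cpu.range(15_000L, 30_000L));
        System.out.println(cpu.average(15_000L, 30_000L));
    }
}
```

Keeping the points sorted by timestamp is what makes the range lookup cheap; a real time series database adds compaction and on-disk storage on top of the same idea.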
  10. Time Series - Data format

    A metric name and a set of key-value pairs, also known as labels:
      <metric name>{<label name>=<label value>, ...} value [ timestamp ]
      http_requests_total{method="post",code="200"} 1027 1395066363000
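As a rough illustration of the format above, the following sketch splits one sample line into its metric name, labels, value and optional timestamp. `SampleParser` is a hypothetical helper, not part of any Prometheus library, and it is deliberately simplified: it assumes label values contain no escaped quotes and that the value is something `Double.parseDouble` accepts.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified parser for a single sample line. */
final class SampleParser {
    // <metric name>{<labels>} value [timestamp]
    private static final Pattern LINE = Pattern.compile(
            "^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\\{(.*)\\})?\\s+(\\S+)(?:\\s+(-?\\d+))?$");
    // <label name>="<label value>" (no escaped quotes supported in this sketch)
    private static final Pattern LABEL = Pattern.compile(
            "([a-zA-Z_][a-zA-Z0-9_]*)=\"([^\"]*)\"");

    static final class Sample {
        final String name;
        final Map<String, String> labels;
        final double value;
        final Long timestampMs; // null when the line carries no timestamp

        Sample(String name, Map<String, String> labels, double value, Long timestampMs) {
            this.name = name;
            this.labels = labels;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    static Sample parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a sample line: " + line);
        }
        Map<String, String> labels = new LinkedHashMap<>();
        if (m.group(2) != null) {
            Matcher lm = LABEL.matcher(m.group(2));
            while (lm.find()) {
                labels.put(lm.group(1), lm.group(2));
            }
        }
        double value = Double.parseDouble(m.group(3));
        Long timestampMs = m.group(4) == null ? null : Long.valueOf(m.group(4));
        return new Sample(m.group(1), labels, value, timestampMs);
    }

    public static void main(String[] args) {
        Sample s = parse("http_requests_total{method=\"post\",code=\"200\"} 1027 1395066363000");
        System.out.println(s.name + " " + s.labels + " " + s.value + " " + s.timestampMs);
    }
}
```

Each distinct combination of metric name and label values identifies its own time series, which is why the labels are kept as a map next to the name.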
  11. Prometheus

    Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is now a standalone open source project and maintained independently of any company.
    https://prometheus.io
    Implemented using
  12. Prometheus Components

    • The main Prometheus server which scrapes and stores time series data
    • Client libraries for instrumenting application code
    • A push gateway for supporting short-lived jobs
    • Special-purpose exporters (for HAProxy, StatsD, Graphite, etc.)
    • An alertmanager
    • Various support tools
    • WhiteBox Monitoring instead of probing [aka BlackBox Monitoring]
  13. List of Job Exporters

    • Prometheus managed:
      ◦ JMX
      ◦ Node
      ◦ Graphite
      ◦ Blackbox
      ◦ SNMP
      ◦ HAProxy
      ◦ Consul
      ◦ Memcached
      ◦ AWS Cloudwatch
      ◦ InfluxDB
      ◦ StatsD
      ◦ ...
    • Custom ones:
      ◦ Database
      ◦ Hardware related
      ◦ Messaging systems
      ◦ Storage
      ◦ HTTP
      ◦ APIs
      ◦ Logging
      ◦ …
    https://prometheus.io/docs/instrumenting/exporters/
  14. # file: prometheus.yml

    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
    # some settings intentionally removed!!
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
  15. # file: docker-compose.yml

    version: '2'
    services:
      prometheus:
        image: prom/prometheus:latest # Using official Prometheus container
        volumes:
          - $PWD:/etc/prometheus # Mount local directory used for config + data
        ports:
          - "9090:9090" # Port mapping used for this container host:container
        command:
          - "-config.file=/etc/prometheus/prometheus.yml" # Prometheus configuration
  16. # file: docker-compose.yml

    version: '2'
    services:
      prometheus: # Running Prometheus as Docker container
        image: prom/prometheus:latest # Using official Prometheus container
        volumes:
          - $PWD:/etc/prometheus # Mount local directory used for config + data
        ports:
          - "9090:9090" # Port mapping used for this container host:container
        command:
          - "-config.file=/etc/prometheus/prometheus.yml" # Prometheus configuration
      node-exporter:
        image: prom/node-exporter:latest # Using node exporter as an additional container
        ports:
          - '9100:9100' # Port mapping used for this container host:container
  17. # file: prometheus.yml

    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
    # some settings intentionally removed!!
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'node-exporter'
        static_configs:
          - targets: ['node-exporter:9100']
  18. # file: docker-compose.yml

    version: '2'
    services:
      # some code intentionally removed!!
      grafana:
        image: grafana/grafana:latest # Using official Grafana container
        ports:
          - "3000:3000" # Port mapping used for this container host:container
    You get the idea :)
  19. Alerting Configuration

    • Alert Rules
      ◦ Which conditions do we need to alert on?
    • Alert Manager
      ◦ Where do we need to send the alert?
  20. # file: alert.rules

    ALERT serviceDownAlert
      IF absent(((time() - container_last_seen{name="<service_name>"}) < 5))
      FOR 5s
      LABELS {
        # setting the labels so we can use them in the AlertManager
        severity = "critical",
        service = "backend"
      }
      ANNOTATIONS {
        # information used in the alert event
        SUMMARY = "Container Instance down",
        DESCRIPTION = "Container Instance is down for more than 15 sec."
      }
  21. # file: alert-manager.yml

    global: # Global settings
      smtp_smarthost: 'mailslurper:2500'
      smtp_from: '[email protected]'
      smtp_require_tls: false
    route: # Routing
      receiver: mail # Fallback if there is no match
      routes:
        - match:
            severity: critical # Match on label!
          continue: true # Continue with other receivers if there is a match
          receiver: mail # Determine the receiver
        - match:
            severity: critical
          receiver: slack
  22. # file: alert-manager.yml (continued)

    receivers:
      - name: mail # mail receiver
        email_configs:
          - to: '[email protected]'
      - name: slack # slack receiver
        slack_configs:
          - send_resolved: true
            username: 'AlertManager'
            channel: '#alert'
            api_url: 'THIS IS A VERY SECRET URL :)'
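The effect of `continue: true` in the routing tree of alert-manager.yml can be sketched as follows. This is an illustrative model, not Alertmanager's implementation: `RoutingSketch`, `Route` and `receiversFor` are hypothetical names, only one level of child routes is modelled, and matching is limited to label equality.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical, single-level model of route matching with `continue`. */
final class RoutingSketch {

    static final class Route {
        final Map<String, String> match; // label name -> required value
        final String receiver;
        final boolean continueMatching;  // models `continue: true`

        Route(Map<String, String> match, String receiver, boolean continueMatching) {
            this.match = match;
            this.receiver = receiver;
            this.continueMatching = continueMatching;
        }

        boolean matches(Map<String, String> labels) {
            for (Map.Entry<String, String> e : match.entrySet()) {
                if (!e.getValue().equals(labels.get(e.getKey()))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Receivers that get the alert; the fallback is used when no child route matches. */
    static List<String> receiversFor(Map<String, String> labels, String fallback, List<Route> routes) {
        List<String> receivers = new ArrayList<>();
        for (Route route : routes) {
            if (!route.matches(labels)) {
                continue;
            }
            receivers.add(route.receiver);
            if (!route.continueMatching) {
                break; // without `continue: true`, the first match wins
            }
        }
        if (receivers.isEmpty()) {
            receivers.add(fallback);
        }
        return receivers;
    }

    public static void main(String[] args) {
        Map<String, String> critical = Collections.singletonMap("severity", "critical");
        List<Route> routes = Arrays.asList(
                new Route(critical, "mail", true),
                new Route(critical, "slack", false));
        System.out.println(receiversFor(critical, "mail", routes));
        System.out.println(receiversFor(Collections.singletonMap("severity", "warning"), "mail", routes));
    }
}
```

Under these assumptions a critical alert reaches both the mail and the slack receiver, while any other alert falls back to the top-level mail receiver.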
  23. # file: prometheus.yml

    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "alert.rules"
    # some settings intentionally removed!!
  24. Instrumenting your own code!

    • Counter
      ◦ A cumulative metric that represents a single numerical value that only ever goes up
    • Gauge
      ◦ A single numerical value that can arbitrarily go up and down
    • Histogram
      ◦ Samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values
    • Summary
      ◦ Like a histogram, it samples observations and provides a total count of observations and a sum of all observed values; in addition, it calculates configurable quantiles over a sliding time window
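The differences between the first three metric types can be sketched in plain Java. These classes are hypothetical, simplified models and not the Prometheus client API: they are not thread-safe, carry no labels, and the histogram assumes cumulative buckets (each bucket counts every observation less than or equal to its upper bound).

```java
import java.util.Arrays;

/** Hypothetical, simplified models of three metric types. */
final class MetricTypesSketch {

    /** Counter: a single value that only ever goes up. */
    static final class SimpleCounter {
        private double value;

        void inc() { inc(1); }

        void inc(double amount) {
            if (amount < 0) {
                throw new IllegalArgumentException("a counter cannot decrease");
            }
            value += amount;
        }

        double get() { return value; }
    }

    /** Gauge: a single value that can go up and down. */
    static final class SimpleGauge {
        private double value;

        void set(double newValue) { value = newValue; }
        void inc() { value += 1; }
        void dec() { value -= 1; }
        double get() { return value; }
    }

    /** Histogram: counts observations per bucket and tracks count and sum. */
    static final class SimpleHistogram {
        private final double[] upperBounds; // sorted ascending
        private final long[] bucketCounts;  // cumulative: observations <= bound
        private long count;
        private double sum;

        SimpleHistogram(double... upperBounds) {
            this.upperBounds = upperBounds.clone();
            Arrays.sort(this.upperBounds);
            this.bucketCounts = new long[this.upperBounds.length];
        }

        void observe(double value) {
            for (int i = 0; i < upperBounds.length; i++) {
                if (value <= upperBounds[i]) {
                    bucketCounts[i]++;
                }
            }
            count++;
            sum += value;
        }

        long bucket(int index) { return bucketCounts[index]; }
        long count() { return count; }
        double sum() { return sum; }
    }

    public static void main(String[] args) {
        SimpleCounter requests = new SimpleCounter();
        requests.inc();

        SimpleGauge inFlight = new SimpleGauge();
        inFlight.inc();
        inFlight.dec();

        // Request durations in seconds, with bucket bounds 0.1s, 0.5s and 1s.
        SimpleHistogram durations = new SimpleHistogram(0.1, 0.5, 1.0);
        durations.observe(0.05);
        durations.observe(0.3);
        durations.observe(2.0);
        System.out.println(requests.get() + " " + inFlight.get() + " " + durations.count());
    }
}
```

A summary is left out of the sketch because quantiles over a sliding time window need more bookkeeping than fits in a short example.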
  25. Available Languages

    • Official
      ◦ Go, Java or Scala, Python, Ruby
    • Unofficial
      ◦ Bash, C++, Common Lisp, Elixir, Erlang, Haskell, Lua for Nginx, Lua for Tarantool, .NET / C#, Node.js, PHP, Rust
    // Spring Boot example -> file: build.gradle
    dependencies {
        compile('org.springframework.boot:spring-boot-starter-web')
        testCompile('org.springframework.boot:spring-boot-starter-test')
        compile('io.prometheus:simpleclient_spring_boot:0.0.21') // Add dependency
    }
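  26. Prometheus Client Libraries: Spring Boot Example

    import io.prometheus.client.Counter;
    import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
    import io.prometheus.client.spring.boot.EnableSpringBootMetricsCollector;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    @EnablePrometheusEndpoint
    @EnableSpringBootMetricsCollector
    @RestController
    @SpringBootApplication
    public class DemoApplication {
        public static void main(String[] args) {
            SpringApplication.run(DemoApplication.class, args);
        }

        static final Counter requests = Counter.build() // create metric type counter
            .name("helloworld_requests_total")          // set metric name
            .help("HelloWorld Total requests.")
            .register();                                // register the metric

        @RequestMapping("/helloworld")
        String home() {
            requests.inc(); // increment the counter by 1 (helloworld_requests_total)
            return "Hello World!";
        }
    }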