
Effective Infrastructure Monitoring with Grafana

David
September 20, 2019

In this talk David will show Grafana's advanced features for managing a fleet of Linux hosts. He will also show the most relevant node exporter metrics and how they can be turned into alerts.


Transcript

  1. I’m David. I work on Explore, Prometheus, and Loki at Grafana Labs. Previously: unifying metrics/logs/traces at Kausal, work on WeaveScope. [email protected] / Twitter: @davkals
  2. Monitoring at Grafana Labs: K8s on GKE; Prometheus for metrics; Loki for logs; Jaeger for distributed tracing; monitoring mixins.
  3. Monitoring by alerting (photo by Randy Tarampi on Unsplash). Alerts are part of monitoring mixins and are ideally linked to a runbook.
  4. Node exporter collectors: arp, bcache, bonding, boottime, conntrack, cpu, diskstats, entropy, filesystem, ipvs, loadavg, meminfo, netclass, netstat, nfs, pressure, sockstat, stat, textfile, time, uname, vmstat, xfs, zfs
  5. Node exporter collectors: arp, bcache, bonding, boottime, conntrack, cpu, diskstats, entropy, filesystem, ipvs, loadavg, meminfo, netclass, netstat, nfs, pressure, sockstat, stat, textfile, time, uname, vmstat, xfs, zfs. Kimchi!!! https://gitlab.com/bjk-gitlab
  6. System chart [diagram of the Linux system stack: Applications, System Libraries, System Call Interface; VFS, Sockets, Scheduler; File Systems, TCP/UDP, Virtual Memory; Volume Manager, IP; Block Device Interface, Ethernet; Device Drivers; I/O Bridge, I/O Controller, Network Controller; Disk, Swap, ports; CPU, DRAM]
  7. System chart with node_exporter metrics (node_….): [same diagram, annotated with the collector covering each component] up, vmstat, filesystem, xfs, zfs, drbd, diskstats, mdadm, bcache, conntrack, arp, netclass, netdev, wifi, infiniband, bonding, netstat; hardware: thermal_zone, edac, entropy, hwmon, timex
  8. CPU utilisation: seconds spent on the CPU per second. [system diagram, CPU highlighted]
     (1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100
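     A sketch of keeping this expression as a Prometheus recording rule; the rule name follows node-mixin naming conventions but is illustrative, not from the talk:

         groups:
           - name: node-cpu
             rules:
               # busy fraction = 1 minus the idle share, in percent
               - record: instance:node_cpu_utilisation:percent
                 expr: |
                   (1 - avg without (cpu, mode) (
                     rate(node_cpu_seconds_total{mode="idle"}[1m])
                   )) * 100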
  9. CPU saturation: load average / number of CPUs. [system diagram, CPU highlighted]
     node_load1 / count without (cpu) (count without (mode) (node_cpu_seconds_total))
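     To make the units concrete: a 4-CPU host with a 1-minute load average of 8 has a saturation of 2.0; anything above 1 means runnable tasks are queueing for CPU time. An equivalent, slightly shorter formulation (an illustrative rewrite, not from the slide):

         # one idle-mode series exists per CPU, so counting them
         # yields the number of CPUs
         node_load1
           / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})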
  10. CPU saturation: track waiting time instead of the number of runnable processes. [system diagram, CPU highlighted]
      node_load1 → node_pressure... (needs Linux kernel 4.20+ built with CONFIG_PSI)
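     A sketch of the pressure-based alternative, using the CPU metric exposed by node_exporter's pressure collector:

         # fraction of wall-clock time some tasks spent waiting for a CPU
         rate(node_pressure_cpu_waiting_seconds_total[1m])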
  11. Memory utilisation and “saturation”. [system diagram: Virtual Memory and DRAM highlighted]
      1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      rate(node_vmstat_pgpgin[1m]) + rate(node_vmstat_pgpgout[1m])
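     A minimal alerting sketch on the utilisation expression; alert name, threshold, and duration are illustrative, not from the talk:

         - alert: NodeMemoryHighUtilisation
           expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
           for: 15m
           labels:
             severity: warning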
  12. Disk utilisation and disk IO queue length. [system diagram: I/O Controller, Disk, and Swap highlighted]
      rate(node_disk_io_time_seconds_total[5m])
      iostat avgqu-sz equivalent: rate(node_disk_io_time_weighted_seconds_total[5m])
      Details: https://www.robustperception.io/mapping-iostat-to-the-node-exporters-node_disk_-metrics
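     Scaled by 100, the first expression is the iostat %util equivalent; a sketch:

         # seconds spent doing I/O per second of wall-clock time;
         # 100 means the device was busy the whole time (iostat %util)
         rate(node_disk_io_time_seconds_total[5m]) * 100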
  13. Available disk space and disk space alerts; multiple alerts with varying severity. [system diagram: VFS, File Systems, Volume Manager, and Block Device Interface highlighted]
      1 - (max by (device) (node_filesystem_avail_bytes) / max by (device) (node_filesystem_size_bytes))
      Alert: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.4 and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0 and node_filesystem_readonly == 0, for: 1h, severity: 'warning'
      Details: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
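     Written out as a Prometheus alerting rule, the slide's fragment looks roughly like this; the alert name and annotation are illustrative, and the node-mixin linked above ships a refined multi-severity version:

         - alert: NodeFilesystemFillingUp
           expr: |
             node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.4
               and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
               and node_filesystem_readonly == 0
           for: 1h
           labels:
             severity: warning
           annotations:
             summary: Filesystem is predicted to run out of space within 24 hours.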
  14. Network throughput and drops. [system diagram: Sockets, TCP/UDP, IP, Ethernet, and Network Controller highlighted]
      rate(node_network_receive_bytes_total{device!="lo"}[5m])
      rate(node_network_transmit_bytes_total{device!="lo"}[5m])
      rate(node_network_receive_drop_total{device!="lo"}[5m])
      rate(node_network_transmit_drop_total{device!="lo"}[5m])
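     Drops should normally be zero, so a simple alert sketch (name, threshold, and duration are illustrative):

         - alert: NodeNetworkPacketDrops
           expr: rate(node_network_receive_drop_total{device!="lo"}[5m]) > 0
           for: 10m
           labels:
             severity: warning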
  15. K8s host connectivity issues: conntrack table limits. [system diagram: Sockets, TCP/UDP, IP, and Ethernet highlighted]
      node_nf_conntrack_entries / node_nf_conntrack_entries_limit
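     When the conntrack table fills up, new connections are dropped, so it pays to alert well before the limit. A sketch (threshold illustrative; the node-mixin contains a similar rule):

         - alert: NodeConntrackTableNearlyFull
           expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.75
           for: 10m
           labels:
             severity: warning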
  16. Collector gotchas:
      - Some collectors produce lots of time series (filesystem, firewalls, networking, systemd)
      - Some legacy collectors execute scripts or programs
      - Dropping via relabeling rules is tedious
      - Pro tip: run two node exporters, one with a minimal and one with a full configuration (see the scrape-config sketch below); 10x savings on the number of time series, and low overhead since everything is lazily loaded on scrape
      https://github.com/RichiH Entropy!!
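     A sketch of what the two-exporter setup could look like in prometheus.yml; the job names, ports, and scrape intervals are assumptions for illustration:

         scrape_configs:
           # minimal exporter: only the collectors dashboards and alerts need
           - job_name: node-minimal
             scrape_interval: 15s
             static_configs:
               - targets: ['host1:9101']
           # full exporter: all collectors, scraped far less often
           - job_name: node-full
             scrape_interval: 5m
             static_configs:
               - targets: ['host1:9100']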
  17. Textfile collector:
      - Includes text files from a given directory as part of the scrape
      - Make sure to write files atomically
      - Don’t let node_exporter run scripts as root; use a root crontab to write the output to a text file instead
      Example metrics:
      # INFO Last time Ansible successfully ran
      ansible_last_run_timestamp 1568903175
      # INFO Last time backup successfully ran
      backup_last_run_timestamp 1568903175
      # INFO Track which features are enabled on host
      my_bare_metal_feature_enabled 1
      # INFO Track SSD wearout
      smartmon_media_wearout_indicator 95665
      https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
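     Timestamp metrics like these pair naturally with staleness alerts; a sketch using the slide's Ansible metric (alert name and threshold illustrative):

         - alert: AnsibleRunStale
           expr: time() - ansible_last_run_timestamp > 24 * 60 * 60
           for: 1h
           labels:
             severity: warning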
  18. Fleet overview: cluster vs. node. Template query: label_values(node_exporter_build_info, instance). Node exporter mixin: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
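     The resulting $instance dashboard variable then scopes panel queries, e.g. (an illustrative panel query, not from the slide):

         node_load1{instance=~"$instance"}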
  19. Deploy Loki. Bare metal options: static service discovery, systemd journal support. Example documentation: https://github.com/grafana/loki/blob/master/docs/promtail/examples.md
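     A sketch of a promtail scrape config reading the systemd journal; the label names are assumptions, and the linked examples show canonical versions:

         scrape_configs:
           - job_name: journal
             journal:
               labels:
                 job: systemd-journal
             relabel_configs:
               # expose the journal's systemd unit as a Loki label
               - source_labels: ['__journal__systemd_unit']
                 target_label: 'unit'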