
Effective Infrastructure Monitoring with Grafana

David
September 20, 2019

In this talk David will show Grafana's advanced features for managing a fleet of Linux hosts. He will also show the most relevant node exporter metrics and how they can be turned into alerts.


Transcript

  1. I’m David. I work on Explore, Prometheus, and Loki at Grafana Labs. Previously: unifying metrics/logs/traces at Kausal, work on WeaveScope. [email protected] / Twitter: @davkals
  2. Monitoring at Grafana Labs: K8s on GKE; Prometheus for metrics; Loki for logs; Jaeger for distributed tracing; monitoring mixins.
  3. Monitoring by alerting (photo by Randy Tarampi on Unsplash). Alerts are part of monitoring mixins and are ideally linked to a runbook.
  4. Node exporter collectors: arp, bcache, bonding, boottime, conntrack, cpu, diskstats, entropy, filesystem, ipvs, loadavg, meminfo, netclass, netstat, nfs, pressure, sockstat, stat, textfile, time, uname, vmstat, xfs, zfs
  5. Node exporter collectors: arp, bcache, bonding, boottime, conntrack, cpu, diskstats, entropy, filesystem, ipvs, loadavg, meminfo, netclass, netstat, nfs, pressure, sockstat, stat, textfile, time, uname, vmstat, xfs, zfs. Kimchi!!! https://gitlab.com/bjk-gitlab
  6. System chart [diagram of the Linux system stack: Applications, System Libraries, System Call Interface; VFS, Sockets, Scheduler; File Systems, TCP/UDP, Virtual Memory; Volume Manager, IP; Block Device Interface, Ethernet; Device Drivers; I/O Bridge, I/O Controller, Network Controller; Disk, Swap, ports; CPU, DRAM]
  7. System chart with node_exporter metrics (node_….): [same diagram, annotated with the collector covering each component] up, vmstat, filesystem, xfs, zfs, drbd, diskstats, mdadm, bcache, conntrack, arp, netclass, netdev, wifi, infiniband, bonding, netstat; hardware: thermal_zone, edac, entropy, hwmon, timex
  8. CPU utilisation: seconds spent on the CPU per second. [system diagram, CPU highlighted]
     (1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100
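     A sketch of keeping this expression as a Prometheus recording rule; the rule name follows node-mixin naming conventions but is illustrative, not from the talk:

         groups:
           - name: node-cpu
             rules:
               # busy fraction = 1 minus the idle share, in percent
               - record: instance:node_cpu_utilisation:percent
                 expr: |
                   (1 - avg without (cpu, mode) (
                     rate(node_cpu_seconds_total{mode="idle"}[1m])
                   )) * 100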
  9. CPU saturation: load average / number of CPUs. [system diagram, CPU highlighted]
     node_load1 / count without (cpu) (count without (mode) (node_cpu_seconds_total))
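     To make the units concrete: a 4-CPU host with a 1-minute load average of 8 has a saturation of 2.0; anything above 1 means runnable tasks are queueing for CPU time. An equivalent, slightly shorter formulation (an illustrative rewrite, not from the slide):

         # one idle-mode series exists per CPU, so counting them
         # yields the number of CPUs
         node_load1
           / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})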
  10. CPU saturation: track waiting time instead of the number of runnable processes. [system diagram, CPU highlighted]
      node_load1 → node_pressure... (needs Linux kernel 4.20+ built with CONFIG_PSI)
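     A sketch of the pressure-based alternative, using the CPU metric exposed by node_exporter's pressure collector:

         # fraction of wall-clock time some tasks spent waiting for a CPU
         rate(node_pressure_cpu_waiting_seconds_total[1m])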
  11. Memory utilisation and “saturation”. [system diagram: Virtual Memory and DRAM highlighted]
      1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      rate(node_vmstat_pgpgin[1m]) + rate(node_vmstat_pgpgout[1m])
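     A minimal alerting sketch on the utilisation expression; alert name, threshold, and duration are illustrative, not from the talk:

         - alert: NodeMemoryHighUtilisation
           expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
           for: 15m
           labels:
             severity: warning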
  12. Disk utilisation and disk IO queue length. [system diagram: I/O Controller, Disk, and Swap highlighted]
      rate(node_disk_io_time_seconds_total[5m])
      iostat avgqu-sz equivalent: rate(node_disk_io_time_weighted_seconds_total[5m])
      Details: https://www.robustperception.io/mapping-iostat-to-the-node-exporters-node_disk_-metrics
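     Scaled by 100, the first expression is the iostat %util equivalent; a sketch:

         # seconds spent doing I/O per second of wall-clock time;
         # 100 means the device was busy the whole time (iostat %util)
         rate(node_disk_io_time_seconds_total[5m]) * 100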
  13. Available disk space and disk space alerts; multiple alerts with varying severity. [system diagram: VFS, File Systems, Volume Manager, and Block Device Interface highlighted]
      1 - (max by (device) (node_filesystem_avail_bytes) / max by (device) (node_filesystem_size_bytes))
      Alert: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.4 and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0 and node_filesystem_readonly == 0, for: 1h, severity: 'warning'
      Details: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
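     Written out as a Prometheus alerting rule, the slide's fragment looks roughly like this; the alert name and annotation are illustrative, and the node-mixin linked above ships a refined multi-severity version:

         - alert: NodeFilesystemFillingUp
           expr: |
             node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.4
               and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
               and node_filesystem_readonly == 0
           for: 1h
           labels:
             severity: warning
           annotations:
             summary: Filesystem is predicted to run out of space within 24 hours.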
  14. Network throughput and drops. [system diagram: Sockets, TCP/UDP, IP, Ethernet, and Network Controller highlighted]
      rate(node_network_receive_bytes_total{device!="lo"}[5m])
      rate(node_network_transmit_bytes_total{device!="lo"}[5m])
      rate(node_network_receive_drop_total{device!="lo"}[5m])
      rate(node_network_transmit_drop_total{device!="lo"}[5m])
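     Drops should normally be zero, so a simple alert sketch (name, threshold, and duration are illustrative):

         - alert: NodeNetworkPacketDrops
           expr: rate(node_network_receive_drop_total{device!="lo"}[5m]) > 0
           for: 10m
           labels:
             severity: warning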
  15. K8s host connectivity issues: conntrack table limits. [system diagram: Sockets, TCP/UDP, IP, and Ethernet highlighted]
      node_nf_conntrack_entries / node_nf_conntrack_entries_limit
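     When the conntrack table fills up, new connections are dropped, so it pays to alert well before the limit. A sketch (threshold illustrative; the node-mixin contains a similar rule):

         - alert: NodeConntrackTableNearlyFull
           expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.75
           for: 10m
           labels:
             severity: warning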
  16. Collector gotchas:
      - Some collectors produce lots of time series (filesystem, firewalls, networking, systemd)
      - Some legacy collectors execute scripts or programs
      - Dropping via relabeling rules is tedious
      - Pro tip: run two node exporters, one with a minimal and one with a full configuration (see the scrape-config sketch below); 10x savings on the number of time series, and low overhead since everything is lazily loaded on scrape
      https://github.com/RichiH Entropy!!
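     A sketch of what the two-exporter setup could look like in prometheus.yml; the job names, ports, and scrape intervals are assumptions for illustration:

         scrape_configs:
           # minimal exporter: only the collectors dashboards and alerts need
           - job_name: node-minimal
             scrape_interval: 15s
             static_configs:
               - targets: ['host1:9101']
           # full exporter: all collectors, scraped far less often
           - job_name: node-full
             scrape_interval: 5m
             static_configs:
               - targets: ['host1:9100']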
  17. Textfile collector:
      - Includes text files from a given directory as part of the scrape
      - Make sure to write files atomically
      - Don’t let node_exporter run scripts as root; use a root crontab to write the output to a text file instead
      Example metrics:
      # INFO Last time Ansible successfully ran
      ansible_last_run_timestamp 1568903175
      # INFO Last time backup successfully ran
      backup_last_run_timestamp 1568903175
      # INFO Track which features are enabled on host
      my_bare_metal_feature_enabled 1
      # INFO Track SSD wearout
      smartmon_media_wearout_indicator 95665
      https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
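     Timestamp metrics like these pair naturally with staleness alerts; a sketch using the slide's Ansible metric (alert name and threshold illustrative):

         - alert: AnsibleRunStale
           expr: time() - ansible_last_run_timestamp > 24 * 60 * 60
           for: 1h
           labels:
             severity: warning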
  18. Fleet overview: cluster vs. node. Template query: label_values(node_exporter_build_info, instance). Node exporter mixin: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
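     The resulting $instance dashboard variable then scopes panel queries, e.g. (an illustrative panel query, not from the slide):

         node_load1{instance=~"$instance"}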
  19. Deploy Loki. Bare metal options: static service discovery, systemd journal support. Example documentation: https://github.com/grafana/loki/blob/master/docs/promtail/examples.md
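     A sketch of a promtail scrape config reading the systemd journal; the label names are assumptions, and the linked examples show canonical versions:

         scrape_configs:
           - job_name: journal
             journal:
               labels:
                 job: systemd-journal
             relabel_configs:
               # expose the journal's systemd unit as a Loki label
               - source_labels: ['__journal__systemd_unit']
                 target_label: 'unit'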