In this talk, David will show Grafana's advanced features for managing a fleet of Linux hosts. He will also show relevant metrics from the node exporter and how they can be turned into alerts.
[Figure: Linux performance observability diagram showing kernel subsystems: Scheduler, File Systems, TCP/UDP, Volume Manager, IP, Block Device Interface, Ethernet, Virtual Memory, Device Drivers, I/O Bridge, I/O Controller, Disk, Network Controller, Swap, CPU, DRAM]
CPU utilization:
(1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100
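To turn this into an alert, a minimal Prometheus alerting rule could look like the sketch below; the alert name, the 90% threshold, and the durations are illustrative, not from the talk.

groups:
  - name: node-cpu
    rules:
      - alert: HighCpuUtilization
        expr: '(1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 90'
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization on {{ $labels.instance }} has been above 90% for 15 minutes"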
Load average per CPU:
node_load1 / count without (cpu) (count without (mode) (node_cpu_seconds_total))
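A value above 1 means there are, on average, more runnable tasks than CPUs. As a sketch, the expression can be captured as a recording rule so dashboards and alerts can reuse it; the rule name below is illustrative.

groups:
  - name: node-load
    rules:
      - record: instance:node_load1_per_cpu:ratio
        expr: node_load1 / count without (cpu) (count without (mode) (node_cpu_seconds_total))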
Load and pressure:
node_load1
node_pressure... (PSI metrics; needs Linux kernel 4.20+ and/or CONFIG_PSI)
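Assuming the pressure collector is enabled and the kernel exposes PSI, the fraction of time tasks spent waiting on CPU, I/O, and memory can be graphed roughly as follows (metric names as exposed by the node_exporter pressure collector; check your /metrics output):

rate(node_pressure_cpu_waiting_seconds_total[1m])
rate(node_pressure_io_waiting_seconds_total[1m])
rate(node_pressure_memory_waiting_seconds_total[1m])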
Memory (Virtual Memory, DRAM):
Memory utilization:
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
Paging activity:
rate(node_vmstat_pgpgin[1m]) + rate(node_vmstat_pgpgout[1m])
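A sketch of an alert on low available memory, placed inside a rule group like the CPU example above; the 10% threshold and the duration are illustrative.

- alert: LowMemoryAvailable
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
  for: 15m
  labels:
    severity: warning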
Disk (I/O Controller, Disk, Swap):
Disk busy time:
rate(node_disk_io_time_seconds_total[5m])
iostat avgqu-sz equivalent:
rate(node_disk_io_time_weighted_seconds_total[5m])
Details: https://www.robustperception.io/mapping-iostat-to-the-node-exporters-node_disk_-metrics
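Per-device read and write throughput is usually graphed alongside the busy-time panels; a sketch using the standard node_exporter diskstats metrics:

rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])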
File systems (VFS, File Systems, Volume Manager, Block Device Interface):
Filesystem usage:
1 - (max by (device) (node_filesystem_avail_bytes) / max by (device) (node_filesystem_size_bytes))
Alerting on filesystems (the node-mixin ships these at varying severity):
node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.4
and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
and node_filesystem_readonly == 0
for: 1h
severity: 'warning'
Details: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
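Written out as a Prometheus alerting rule, the expression above is roughly the following sketch; the real node-mixin rule adds filesystem/mountpoint selectors, annotations, and a stricter critical-severity variant.

- alert: NodeFilesystemSpaceFillingUp
  expr: |
    node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.4
      and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
      and node_filesystem_readonly == 0
  for: 1h
  labels:
    severity: warning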
...firewalls, networking, systemd)
- Some legacy collectors execute scripts or programs
- Dropping metrics via relabeling rules is tedious
- Pro tip: run two node exporters, one with a minimal and one with a full configuration (sketched below)
- 10x savings on the number of time series
- Low overhead, since everything is lazily loaded on scrape
https://github.com/RichiH
Entropy!!
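A sketch of the two-exporter setup; the ports and the chosen collectors are illustrative, and the flag names should be double-checked against node_exporter --help for your version.

# Minimal instance: only explicitly enabled collectors
node_exporter --web.listen-address=:9100 \
  --collector.disable-defaults \
  --collector.cpu --collector.loadavg --collector.meminfo \
  --collector.filesystem --collector.netdev

# Full instance: default collector set, scraped by a separate, less frequent job
node_exporter --web.listen-address=:9101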
...to be part of the scrape
- Make sure to write atomically (see the sketch below)
- Don't let node_exporter run scripts as root; use a crontab running as root to write the output to a text file instead

# INFO Last time Ansible successfully ran
ansible_last_run_timestamp 1568903175
# INFO Last time backup successfully ran
backup_last_run_timestamp 1568903175
# INFO Track which features are enabled on a host
my_bare_metal_feature_enabled 1
# INFO Track SSD wearout
smartmon_media_wearout_indicator 95665

https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
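A sketch of writing a textfile metric atomically from a root crontab; the metric name and directory are illustrative, and the directory must match the --collector.textfile.directory flag node_exporter is started with.

#!/bin/sh
# write-backup-metric.sh: called from root's crontab after the backup job
DIR=/var/lib/node_exporter/textfile   # assumed path, must match --collector.textfile.directory
echo "backup_last_run_timestamp $(date +%s)" > "$DIR/backup.prom.$$"
mv "$DIR/backup.prom.$$" "$DIR/backup.prom"   # rename is atomic on the same filesystem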