rrreeeyyy
November 21, 2017
Practical Introduction to Prometheus (Prometheus 実践入門) #hbstudy 79 / introduction-to-prometheus-practice
I gave a talk about Prometheus at #hbstudy 79.
Transcript
Practical Introduction to Prometheus
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy )

Agenda
• About Prometheus
• Making Prometheus redundant
• Scaling strategy for Prometheus
• Data retention period in Prometheus
• About Alertmanager
• Making Alertmanager redundant
• About Exporters
• Exporters that look useful for real-world monitoring
• Managing rule files

About Prometheus
• Prometheus is an open-source monitoring tool
• The latest version at the time of the talk is 2.0.0 (released 11/8)
• It is inspired by Borgmon, a monitoring tool that exists inside Google
• Borgmon is described in detail in chapter 10 of the SRE book
• It has the following characteristics
  • Pull-based architecture
  • A reasonably high-performance time series database
  • Programmable time series processing with PromQL
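
Why choose Prometheus?
• It can cope with storing metrics at high resolution
• The pull-based architecture lets you operate it with a comparatively simple setup
• Service discovery is well supported
• PromQL is highly expressive and lets you compute a wide range of statistics
• It has joined the CNCF and integrates with Kubernetes and similar tools
• Being able to integrate with tools regarded as de facto standards is important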

Configuring Prometheus
• Installation is basically just placing a binary
• You simply list the monitoring targets under scrape_configs
• As a rule, use *_sd_config so that targets being added or removed are handled automatically
• When no matching service discovery exists, file_sd_config may work as a substitute
  • Write a file in the specified format and it is picked up without a reload
• Build dashboards with Grafana as a rule
  • Set the datasource to Prometheus and describe what to draw in PromQL
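The file_sd_config bullet can be sketched as follows. This is a minimal Python sketch, not part of the talk: the function name, file path, and label names are hypothetical. It writes one target group in the JSON list format that file-based service discovery reads, and renames the file into place so that a reader never sees a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, labels):
    """Write one target group in the JSON format read by file_sd_config.

    `path`, `targets`, and `labels` are illustrative; the file must be
    matched by a `files:` pattern in the Prometheus configuration.
    """
    groups = [{"targets": sorted(targets), "labels": dict(labels)}]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it into
    # place so a reader never observes a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp_path, path)
    return groups
```

A cron job or deploy hook could call `write_file_sd("/etc/prometheus/targets/node.json", hosts, {"role": "app"})` whenever the host list changes; the path here is only an example.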

Configuration example: settings for EC2 instances
- job_name: 'node'
  ec2_sd_configs:
    - region: ap-northeast-1
      port: 9100
  relabel_configs:
    - source_labels: [__meta_ec2_instance_state]
      regex: ^running$      # keep only running instances
      action: keep
    - source_labels: [__meta_ec2_tag_Role]
      regex: ^(app|db)$     # keep only instances whose Role tag is app or db
      action: keep
    # Specifying target_label makes the configured label usable
    # as a filter condition in PromQL.
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance
    - source_labels: [__meta_ec2_tag_Role]
      target_label: role
    - source_labels: [__meta_ec2_tag_Status]
      target_label: status
    - source_labels: [__meta_ec2_instance_type]
      target_label: instance_type
    - source_labels: [__meta_ec2_availability_zone]
      target_label: availability_zone
    - source_labels: [__meta_ec2_vpc_id]
      target_label: vpc_id

Making Prometheus redundant
• Redundancy for Prometheus simply means running two servers [1]
• Because it is pull-based, just running two of them gives you redundancy
• The data on the two servers can be offset by at most one scrape_interval
  • In practice this is rarely a problem
• In practice, put Nginx in front so that when one server goes down the other one is used
  • Point the Grafana datasource used for drawing graphs at the Nginx host
[1] https://github.com/prometheus/prometheus/issues/1500

Scaling strategy for Prometheus
• A single set should comfortably handle metrics up to the order of a million
• When they grow beyond that, prepare one set of Prometheus per DC or failure domain [2]
• With multiple Prometheus servers you can use federation
  • Scrape the /federate endpoint of the lower-level Prometheus
  • In most cases, aggregate the data with Record on the lower-level Prometheus first, then federate
• Alternatively, split the datasources that Grafana refers to
• For example, CloudFlare aggregates data per colocation and federates it [3]
[2] https://www.robustperception.io/scaling-and-federating-prometheus/
[3] https://promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf
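To make the /federate bullet concrete, here is a small Python sketch (not from the talk) that builds the URL an upper-level Prometheus would scrape. The function name, host name, and matcher are hypothetical; the `match[]` query parameter selects which series the lower-level server returns. In a real setup the same values would go into a scrape job's `metrics_path` and `params` rather than a hand-built URL.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build the URL an upper-level Prometheus would scrape on a lower one.

    Each matcher becomes one `match[]` query parameter selecting the series
    to pull, e.g. only aggregated series produced by recording rules.
    """
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"
```

Passing the resulting URL to `curl` is a quick way to check which series a lower-level server would hand over before wiring up the scrape job.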

Digression: About Record
• Prometheus lets you define what are called recording rules [4]
• A recording rule runs a defined PromQL expression at a fixed interval
• The result can be stored as a time series under a different name
• Used for sampling time series data and for aggregation when federating
• The execution interval is determined by the rule's interval or by evaluation_interval
• Values defined with Record can also be used in alert rules
[4] https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/

Digression: Record/Alert example
groups:
- name: mysql.rules
  rules:
  - record: mysql_slave_lag_seconds
    expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay
  - alert: MySQLReplicationLag
    expr: (mysql_slave_lag_seconds > 30) and ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0)
    for: 1m
    labels:
      severity: critical
    annotations:
      description: The mysql slave replication has fallen behind and is not recovering
      summary: MySQL slave replication is lagging

Data retention period in Prometheus
• Prometheus is not well suited to storing time series data for long periods [5]
  • This is an architectural constraint accepted in order to achieve fast query processing
  • The default retention period for time series data is 15 days
• The recommended approach is to store data in a separate storage, called long-term storage [6]
  • In practice it is just protocol buffer data arriving over HTTP
  • Implementations exist that use InfluxDB, S3, or Chronix as remote storage
  • Configured with remote_read and remote_write in the Prometheus configuration
• Alternatively, federate into a Prometheus with a longer storage.tsdb.retention
[5] http://techlife.cookpad.com/entry/timeseries-database-001
[6] https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations

About Alertmanager
• The component that receives and handles alerts from Prometheus [7]
  • It can group, route, and deduplicate alerts
  • Searching alerts and silencing notifications are possible from the WebUI or the amtool command
• It actually works without Prometheus [8]
  • Prometheus just POSTs JSON to the /api/v1/alerts endpoint
[7] https://prometheus.io/docs/alerting/alertmanager/
[8] https://prometheus.io/docs/alerting/clients/
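Since the slide notes that alerts are just JSON POSTed to /api/v1/alerts, any client can send them. The following is a minimal Python sketch under that assumption, not the talk's own code: the helper names and label values are hypothetical, and the endpoint path is the one named on the slide, which may differ on newer Alertmanager releases.

```python
import json
from urllib import request


def build_alerts(alertname, instance, severity, summary):
    """Return the JSON body: a list of alerts with labels and annotations."""
    return [{
        "labels": {
            "alertname": alertname,
            "instance": instance,
            "severity": severity,
        },
        "annotations": {"summary": summary},
    }]


def post_alerts(alertmanager_url, alerts):
    """POST the alerts to the endpoint named on the slide (assumed path)."""
    req = request.Request(
        alertmanager_url.rstrip("/") + "/api/v1/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=5) as resp:
        return resp.status
```

For example, `post_alerts("http://localhost:9093", build_alerts("InstanceDown", "app-001", "critical", "Instance app-001 down"))` would submit one alert to a locally running Alertmanager.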
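
Configuring Alertmanager
• Installation is basically just placing a binary
• Describe the alert notification rules, deduplication rules, and so on
• The alert rules themselves are defined on the Prometheus side
  • The various labels defined in Prometheus rules are available
  • Notification targets are basically decided from label values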

Example Prometheus alert rule configuration
groups:
- name: linux
  rules:
  - alert: InstanceDown   # Alert name; commonly used for grouping
    expr: up == 0         # The PromQL expression that acts as the alert threshold
    for: 1m               # Sent to Alertmanager if it persists for 1 minute or more
    labels:               # These values are available on the Alertmanager side
      severity: CRITICAL
    annotations:          # Annotations are used when notifying via Slack and the like
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
      summary: Instance {{ $labels.instance }} down
  - alert: CPUUtilization
    expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60
    for: 1m
    labels:
      severity: CRITICAL
    annotations:
      description: '{{ $labels.instance }} has had high CPU usage for more than 1 minute.'
      summary: Instance {{ $labels.instance }} cpu utilization is high
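
global:
  resolve_timeout: 5m
route:
  receiver: 'sre-pagerduty'            # default receiver; added here as an assumption because the root route requires one
  group_by: ['alertname', 'instance']  # labels by which notifications to a receiver are grouped
  group_wait: 30s                      # seconds to wait for the initial deduplication
  group_interval: 5m                   # interval at which notifications are sent for a group
                                       # first wait 30 seconds and notify, then notify every 5 minutes if there are new alerts
  repeat_interval: 1h                  # time until a notification is re-sent (every 1h as long as the alert is not resolved)
  routes:                              # alert routing settings
  - match_re:                          # routing can be written against the labels set in rules
      service: ^sre$
    receiver: 'sre-pagerduty'
receivers:                             # settings for what receives the alerts
- name: 'sre-pagerduty'                # webhook, email, pagerduty, etc. can be used
  pagerduty_configs:
  - service_key: xxxxxxxxxxxxxxxxxxxxxxxx
inhibit_rules:                         # alert suppression settings
- source_match:                        # if a critical alert with the same alert name and
    severity: 'critical'               # instance name already exists,
  target_match:                        # the warning alert is treated as merged into it
    severity: 'warning'
  equal: ['alertname', 'instance']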
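The inhibit_rules section can be read as a small matching procedure. The following Python sketch illustrates that logic as described in the config comments; it is not Alertmanager's actual implementation, and the function name and the alert representation (plain dicts of labels) are assumptions made for the example.

```python
def is_inhibited(target, active_alerts, source_match, target_match, equal):
    """Return True if `target` should be muted by some active source alert.

    Alerts are plain dicts of label name -> value. This mirrors the rule in
    the config above; it is not Alertmanager's actual implementation.
    """
    def matches(alert, matchers):
        return all(alert.get(k) == v for k, v in matchers.items())

    # The rule only applies to alerts that match target_match.
    if not matches(target, target_match):
        return False
    # Muted if any source alert matches source_match and agrees with the
    # target on every label listed in `equal`.
    return any(
        matches(source, source_match)
        and all(source.get(label) == target.get(label) for label in equal)
        for source in active_alerts
    )
```

With the rule from the config, a warning for a given alertname and instance is muted only while a critical alert with the same alertname and instance is active. Label values are compared exactly, so `CRITICAL` and `critical` would not match each other.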

Making Alertmanager redundant
• Alertmanager redundancy is available through the -mesh options
• Basically, on every node, specify -mesh.peer multiple times, including the node itself
  • ex.) alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002
  • 001 and 002 start communicating on TCP port 6783
• List the two Alertmanagers in the targets of the alerting section of the Prometheus configuration
• Internally, weaveworks/mesh [9] is used to implement the redundancy
  • It uses a gossip protocol (membership) and satisfies AP in terms of CAP
  • In cases such as a network partition, alerts are sent in duplicate
[9] https://github.com/weaveworks/mesh

Configuration example: Prometheus settings for working with Alertmanager
alerting:
  alertmanagers:
  - ec2_sd_configs:            # service discovery also works for
    - region: ap-northeast-1   # the Alertmanagers themselves
      port: 9093
    relabel_configs:
    - source_labels: [__meta_ec2_instance_state]
      regex: ^running$         # running instances
      action: keep
    - source_labels: [__meta_ec2_tag_Role]
      regex: ^alertmanager$    # instances whose Role tag is alertmanager
      action: keep

About Exporters
• The servers that Prometheus pulls from are called exporters
• There are various exporters depending on the use case and the metrics you want [10]
  • node_exporter: standard Linux metrics
  • mysqld_exporter: standard MySQL metrics
  • nginx_exporter: nginx_status metrics
  • mtail: tails logs and converts them into metrics
  • snmp_exporter: converts SNMP values into metrics
[10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations

Writing your own exporter
• Providing a simple HTTP endpoint is all it takes to be an exporter [11]
  • An endpoint that emits 'metrics_name value\n' is enough
  • Application-specific metrics and the like can be collected easily
  • As a rule, expose raw values on the exporter side and aggregate on the Prometheus side
• There is also a protocol buffer format
• If something does not exist you have to build it, but there is no language restriction and the format is simple, so it is not hard
  • In practice we have built one that outputs AWS metrics with API Gateway + Lambda
[11] https://prometheus.io/docs/instrumenting/exposition_formats/
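As an illustration of how little an exporter needs, here is a minimal Python sketch using only the standard library. It is not the exporter built in the talk: the metric names, the `collect` function, and the port are hypothetical. It serves the plain-text format described above, one `name value` line per metric, on `/metrics`.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: value} as one 'name value' line per metric."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def collect():
    """Hypothetical collector; replace with application-specific values."""
    return {"myapp_jobs_queued": 3, "myapp_workers_busy": 1}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; a scrape job pointed at that port would then pick up the two example metrics. Following the slide's advice, the collector should return raw values and leave aggregation to PromQL.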

What Prometheus's own metrics actually look like
$ curl localhost:9090/metrics
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.9729e-05
go_gc_duration_seconds{quantile="0.25"} 9.75e-05
go_gc_duration_seconds{quantile="0.5"} 0.000117034
go_gc_duration_seconds{quantile="0.75"} 0.000157237
go_gc_duration_seconds{quantile="1"} 0.0067897
go_gc_duration_seconds_sum 10.408703235
go_gc_duration_seconds_count 33117
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 54

Digression: Exporter port explosion
• As the Prometheus wiki [10] shows, each exporter uses one port
  • Putting several exporters on one instance uses a lot of ports
  • Configuring firewalls such as security groups for them is tedious
  • There is little need to expose that many ports to Prometheus
• Solve this with something like rrreeeyyy/exporter_proxy [12]
  • Use one specific port and distinguish exporters through metrics_path on the Prometheus side
[10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations
[12] https://github.com/rrreeeyyy/exporter_proxy

About PromQL
• The query language used to process time series data in Prometheus
  • It feels difficult until you get used to it, but once you do it is highly expressive and convenient
  • For getting started, the official documentation and, personally, the DigitalOcean tutorials were good [13] [14]
• Alerting is also done with PromQL
  • You can write alert rules on statistically processed results
  • There are caveats, such as preferring rate() over irate() when alerting
[13] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1
[14] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2

PromQL for calculating CPU usage
• Per-host CPU usage collected by node_exporter can be written as follows [15]
  • Subtract the idle value from 100% and take the average grouped by instance
  • This is because node_cpu holds one value per CPU core
  • For an alert rule, change irate to rate and append a threshold such as > 60
100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
[15] https://www.robustperception.io/understanding-machine-cpu-usage/
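The arithmetic behind the query can be worked through by hand. The Python sketch below is an illustration only, with a hypothetical function name and input shape: it takes per-core cumulative idle seconds at two scrapes, computes each core's idle rate, averages across cores, and subtracts from 100%. Counter resets, which rate() and irate() handle in Prometheus, are ignored here.

```python
def cpu_usage_percent(idle_before, idle_after, elapsed_seconds):
    """Compute host CPU usage (%) from per-core idle-time counters.

    `idle_before` and `idle_after` map a core name to its cumulative idle
    seconds at two scrapes taken `elapsed_seconds` apart.
    """
    # Per-core idle fraction: idle seconds gained per second of wall time.
    idle_rates = [
        (idle_after[core] - idle_before[core]) / elapsed_seconds
        for core in idle_before
    ]
    # node_cpu has one series per core, so average them to get one host value.
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100
```

For instance, if over 60 seconds one core was idle for 45 seconds and another for 15, the average idle fraction is 0.5 and the host usage comes out at 50%.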

Alert rule configuration for disk usage alerts [16]
- name: node.rules
  rules:
  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: page
• Because linear regression functions such as predict_linear are available, you can alert on disks whose free space is predicted to fall to 0 or below in 4 hours
[16] https://www.robustperception.io/reduce-noise-from-disk-space-alerts/
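To show what the linear regression in this rule does, here is a small Python sketch of least-squares extrapolation. It is an illustration of the idea, not Prometheus's implementation: the function signature is an assumption, and the prediction is taken relative to the last sample rather than the rule evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation of (timestamp, value) samples.

    Returns the predicted value `seconds_ahead` seconds after the last
    sample. Requires at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept
```

With free space falling from 1000 to 600 units over one hour, the fitted line keeps falling at the same pace, so the value predicted four hours later is negative and an expression of the form `... < 0` would fire.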
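
Digression: Managing rule files
• Alert rules have to be managed on the Prometheus side
  • Rule files are written in simple YAML
  • Coming from Zabbix and similar tools, the feature set feels lacking
  • I would like to have Role, Template, and Macro...
• Honestly, I would rather like to know how those of you using it manage them
  • A WebUI (something like Promgen?) seems to be the leading option at present
  • I also understand the view that you should avoid tricks and keep it simple
  • Writing them in jsonnet, like ksonnet does for Kubernetes, also seems like a possible idea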
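
Summary
• Explained what to consider when introducing Prometheus into production
  • Redundancy, scaling strategy, data retention period, and so on
• Explained what to consider when introducing Alertmanager into production
  • Redundancy, actual configuration, and so on
• Explained writing your own exporter and PromQL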