rrreeeyyy
November 21, 2017
A Practical Introduction to Prometheus (Prometheus 実践入門) #hbstudy 79 / introduction-to-prometheus-practice
I gave a talk about Prometheus at #hbstudy 79.
Transcript
A Practical Introduction to Prometheus
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy )

Agenda
• About Prometheus
• Making Prometheus redundant
• Scaling strategy for Prometheus
• Data retention in Prometheus
• About Alertmanager
• Making Alertmanager redundant
• About exporters
• Exporters that look useful for real-world monitoring
• Managing rule files

About Prometheus
• Prometheus is an open-source monitoring tool
• The latest version at the time of this talk is 2.0.0 (released 11/8)
• It is inspired by Borgmon, a monitoring tool used inside Google
  • Borgmon is described in detail in chapter 10 of the SRE book
• Its main characteristics are:
  • A pull-based architecture
  • A reasonably high-performance time-series database
  • Programmable time-series processing with PromQL

Why choose Prometheus?
• It can handle storing metrics at high resolution
• The pull-based architecture lets you run it with a comparatively simple setup
• Service discovery is well supported
• PromQL is expressive, so you can compute many kinds of statistics
• It has joined the CNCF and integrates with Kubernetes and similar tools
  • Being able to integrate with de facto standard tools matters

Configuring Prometheus
• Installation is basically just placing a binary
• You list what to monitor under scrape_configs
  • As a rule, use *_sd_config so that hosts can be added and removed without edits
  • When no matching service discovery exists, file_sd_config may be a workable substitute
    • Write a file in the specified format and it is picked up without a reload
• Build dashboards with Grafana as a rule
  • Set the datasource to Prometheus and write what to plot in PromQL

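The file-based discovery bullet above can be sketched as follows. This is a minimal illustration, assuming a JSON target file that Prometheus watches through file_sd; the function name, the file path and the example labels are all hypothetical. The file is written to a temporary name and then renamed, so that a half-written file is never read.

```python
import json
import os
import tempfile


def write_file_sd_targets(path, targets, labels):
    """Atomically write one target group in the file_sd JSON format."""
    groups = [{"targets": list(targets), "labels": dict(labels)}]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it,
    # so a reader never observes a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.replace(tmp_path, path)
    return groups
```

For example, `write_file_sd_targets("targets.json", ["10.0.0.1:9100"], {"role": "app"})` would describe one node_exporter target with a `role` label.
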
Configuration example: settings for EC2 instances
    - job_name: 'node'
      ec2_sd_configs:
        - region: ap-northeast-1
          port: 9100
      relabel_configs:
        - source_labels: [__meta_ec2_instance_state]
          regex: ^running$      # keep only running instances
          action: keep
        - source_labels: [__meta_ec2_tag_Role]
          regex: ^(app|db)$     # keep only instances whose Role tag is app or db
          action: keep
        # Setting target_label makes the resulting label
        # usable as a filter condition in PromQL.
        - source_labels: [__meta_ec2_tag_Name]
          target_label: instance
        - source_labels: [__meta_ec2_tag_Role]
          target_label: role
        - source_labels: [__meta_ec2_tag_Status]
          target_label: status
        - source_labels: [__meta_ec2_instance_type]
          target_label: instance_type
        - source_labels: [__meta_ec2_availability_zone]
          target_label: availability_zone
        - source_labels: [__meta_ec2_vpc_id]
          target_label: vpc_id

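To make the effect of the keep and target_label rules above concrete, here is a simplified sketch of relabeling in Python. It is not Prometheus's implementation: it handles only the `keep` action and the default `replace` action with the default regex, and the `relabel` function and rule dictionaries are illustrative assumptions.

```python
import re


def relabel(meta, rules):
    """Apply simplified relabel rules to one discovered target.

    Returns the resulting labels, or None if a keep rule dropped the target.
    """
    labels = dict(meta)
    for rule in rules:
        # Source label values are joined with ';' (the default separator).
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        action = rule.get("action", "replace")
        # Relabel regexes are anchored at both ends, hence fullmatch.
        matched = re.fullmatch(rule.get("regex", "(.*)"), value)
        if action == "keep":
            if matched is None:
                return None
        elif action == "replace" and matched is not None:
            # Simplification: with the default regex and replacement,
            # the whole joined value is copied to the target label.
            labels[rule["target_label"]] = value
    # Labels starting with "__" are dropped once relabeling is done.
    return {k: v for k, v in labels.items() if not k.startswith("__")}
```

With the first four rules of the example, a running instance tagged `Role=app` and `Name=app-001` ends up with `instance="app-001"` and `role="app"`, while a stopped instance is dropped.
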
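Making Prometheus redundant
• Redundancy for Prometheus simply means running two servers [1]
  • Because it is pull-based, running two of them is all it takes
  • The data on the two servers can differ by the offset between their scrapes, at most one scrape_interval
  • In practice this is rarely a problem
• In practice, put Nginx in front so that if one server goes down the other is used
  • Point the Grafana datasource used for graphs at the Nginx host
[1] https://github.com/prometheus/prometheus/issues/1500
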
Scaling strategy for Prometheus
• A single set should comfortably handle up to a few million metrics
• When you outgrow that, run one set of Prometheus per datacenter or failure domain [2]
• With multiple Prometheus servers you can use federation
  • A higher-level server scrapes the /federate endpoint of the lower-level Prometheus servers
  • In most cases, aggregate data with recording rules on the lower-level servers and federate the aggregates
  • Alternatively, split the datasources that Grafana reads from
  • CloudFlare, for example, aggregates data per colocation site and federates it [3]
[2] https://www.robustperception.io/scaling-and-federating-prometheus/
[3] https://promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf

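As a small illustration of scraping /federate, the sketch below builds the URL that a higher-level Prometheus would request, with one `match[]` parameter per selector. The host name and the selectors are assumptions. In a real setup the same thing is expressed in a scrape config with `metrics_path: /federate` and `params`.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build the /federate URL for the given series selectors."""
    # doseq=True emits one match[] parameter per selector.
    query = urlencode({"match[]": list(selectors)}, doseq=True)
    return f"{base_url.rstrip('/')}/federate?{query}"
```

A call such as `federate_url("http://prometheus-dc1:9090", ['{__name__=~"job:.*"}'])` would select only series whose names start with `job:`, a common naming pattern for recording-rule outputs.
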
Digression: recording rules
• Prometheus lets you define recording rules [4]
• A recording rule runs a PromQL expression you define at a fixed interval
  • The result can be stored as a time series under a different name
  • They are used for downsampling time series and for aggregation before federation
  • The evaluation interval is set by the rule group's interval or by evaluation_interval
• Values defined by a record can also be used in alert rules
[4] https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/

Digression: example of record and alert rules
    groups:
    - name: mysql.rules
      rules:
      - record: mysql_slave_lag_seconds
        expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay
      - alert: MySQLReplicationLag
        expr: (mysql_slave_lag_seconds > 30) and ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0)
        for: 1m
        labels:
          severity: critical
        annotations:
          description: The mysql slave replication has fallen behind and is not recovering
          summary: MySQL slave replication is lagging

Data retention in Prometheus
• Prometheus is not well suited to storing time-series data for long periods [5]
  • This is an architectural constraint that comes with fast query processing
  • The default retention period for time-series data is 15 days
• The recommended approach is to store data in separate long-term storage [6]
  • In practice, data is simply sent over HTTP as protocol buffers
  • Implementations exist that use InfluxDB, S3 or Chronix as remote storage
  • It is configured with remote_read and remote_write in the Prometheus configuration
• Alternatively, federate into a Prometheus with a longer storage.tsdb.retention
[5] http://techlife.cookpad.com/entry/timeseries-database-001
[6] https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations

About Alertmanager
• It receives alerts from Prometheus and handles them [7]
  • It can group, route and deduplicate alerts
  • Searching alerts and silencing notifications is possible from the web UI or the amtool command
• It actually works without Prometheus as well [8]
  • Clients simply POST JSON to the /api/v1/alerts endpoint
[7] https://prometheus.io/docs/alerting/alertmanager/
[8] https://prometheus.io/docs/alerting/clients/

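Since any client can send alerts by POSTing JSON, a minimal sketch of such a client might look like this. The field names follow the documented client format (`labels`, `annotations`, `startsAt`); the function names and the Alertmanager URL are assumptions, and the endpoint path is the one named on the slide, so check which API version your Alertmanager serves.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert_payload(labels, annotations, starts_at=None):
    """Return the JSON body for a single alert as a string."""
    starts_at = starts_at or datetime.now(timezone.utc)
    alert = {
        "labels": dict(labels),
        "annotations": dict(annotations),
        "startsAt": starts_at.isoformat(),
    }
    # The API accepts a JSON array, so several alerts can share one POST.
    return json.dumps([alert])


def post_alerts(alertmanager_url, body):
    """POST the payload; alertmanager_url is an assumed base URL."""
    request = urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v1/alerts",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

The labels are what Alertmanager groups and routes on, so a hand-written client should send at least an `alertname` label.
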
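Configuring Alertmanager
• Installation is basically just placing a binary
• You describe notification rules, deduplication rules and so on
• The alert rules themselves are defined on the Prometheus side
  • The labels defined in Prometheus rules are available in Alertmanager
  • As a rule, the notification destination is decided from label values
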
Example alert rule configuration for Prometheus
    groups:
    - name: linux
      rules:
      - alert: InstanceDown     # the alert name; commonly used for grouping
        expr: up == 0           # the PromQL expression that acts as the alert threshold
        for: 1m                 # sent to Alertmanager once it has held for 1 minute or more
        labels:                 # these values are available on the Alertmanager side
          severity: CRITICAL
        annotations:            # annotations are used when notifying Slack and the like
          description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
          summary: Instance {{ $labels.instance }} down
      - alert: CPUUtilization
        expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60
        for: 1m
        labels:
          severity: CRITICAL
        annotations:
          description: '{{ $labels.instance }} has had high CPU usage for more than 1 minute.'
          summary: Instance {{ $labels.instance }} cpu utilization is high

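    global:
      resolve_timeout: 5m
    route:
      receiver: 'sre-pagerduty'             # default receiver for alerts that match no child route
      group_by: ['alertname', 'instance']   # labels by which alerts are grouped per notification
      group_wait: 30s       # how long to wait so the first notification can be deduplicated
      group_interval: 5m    # interval between notifications for a group
      # Wait 30 seconds before the first notification, then notify
      # every 5 minutes if there are new alerts.
      repeat_interval: 1h   # time until a re-send (if unresolved, notify every 1h even with no changes)
      routes:               # alert routing settings
      - match_re:           # routing can be written against labels set in the rules
          service: ^sre$
        receiver: 'sre-pagerduty'
    receivers:              # who receives the alerts
    - name: 'sre-pagerduty' # webhook, email, pagerduty and others are available
      pagerduty_configs:
      - service_key: xxxxxxxxxxxxxxxxxxxxxxxx
    inhibit_rules:          # settings for suppressing redundant alerts
    # If a critical alert with the same alert name and instance already
    # exists, the warning alert is folded into it.
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'instance']
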
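The inhibit rule at the end of this configuration can be read as a small predicate. The sketch below is a simplified model of it rather than Alertmanager's code: alerts are plain label dictionaries, only exact-match matchers are handled, and the function name is hypothetical.

```python
def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by some other alert in `firing`."""

    def matches(alert, matchers):
        return all(alert.get(key) == value for key, value in matchers.items())

    # Only alerts matching target_match can be muted by this rule.
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # The source and target must agree on every label listed in `equal`.
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False
```

With the rule from the configuration, a warning alert for `app-001` is muted while a critical alert with the same `alertname` and `instance` is firing, and a warning for `app-002` is not.
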
Making Alertmanager redundant
• Alertmanager can be made redundant with the -mesh options
  • As a rule, pass -mesh.peer multiple times on every node, including the node itself
  • ex.) alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002
  • 001 and 002 start communicating over TCP port 6783
  • List both Alertmanagers under targets in the alerting section of the Prometheus configuration
• Internally, weaveworks/mesh [9] is used to provide the redundancy
  • It uses a gossip protocol (membership) and satisfies AP in CAP terms
  • In cases such as a network partition, alerts are delivered in duplicate
[9] https://github.com/weaveworks/mesh

Configuration example: Prometheus settings for working with Alertmanager
    alerting:
      alertmanagers:
      - ec2_sd_configs:               # the Alertmanagers themselves can
        - region: ap-northeast-1      # also be found by service discovery
          port: 9093
        relabel_configs:
        - source_labels: [__meta_ec2_instance_state]
          regex: ^running$            # only running instances
          action: keep
        - source_labels: [__meta_ec2_tag_Role]
          regex: ^alertmanager$       # only instances whose Role tag is alertmanager
          action: keep

About exporters
• The server that Prometheus pulls from is called an exporter
• There are many exporters for different purposes and metrics [10]
  • node_exporter: standard Linux metrics
  • mysqld_exporter: standard MySQL metrics
  • nginx_exporter: metrics from nginx_status
  • mtail: tails logs and converts them into metrics
  • snmp_exporter: converts SNMP values into metrics
[10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations

Writing your own exporter
• Providing a simple HTTP endpoint is enough to make an exporter [11]
  • Any endpoint that emits 'metrics_name value\n' will do
  • Application-specific metrics can be collected easily this way
  • As a rule, emit raw values from the exporter and aggregate on the Prometheus side
  • A protocol buffer format also exists
• If an exporter does not exist you have to write it, but there is no language restriction and the format is simple, so it is not hard
  • For example, I have built one that outputs AWS metrics with API Gateway + Lambda
[11] https://prometheus.io/docs/instrumenting/exposition_formats/

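To show how little is needed, here is a minimal exporter sketch using only the Python standard library. It serves `name value` lines on /metrics; the metric names, their values and the port are hypothetical, and every metric is declared as a gauge for simplicity.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: value} as text-format lines, one sample per line."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical application metrics; a real exporter would read live values.
    metrics = {"myapp_queue_length": 3, "myapp_workers_busy": 2}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.metrics).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at port 9999 would be enough to try it. Following the advice above, the handler exposes raw values and leaves rates and averages to PromQL.
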
Looking at the metrics of Prometheus itself
    $ curl localhost:9090/metrics
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 5.9729e-05
    go_gc_duration_seconds{quantile="0.25"} 9.75e-05
    go_gc_duration_seconds{quantile="0.5"} 0.000117034
    go_gc_duration_seconds{quantile="0.75"} 0.000157237
    go_gc_duration_seconds{quantile="1"} 0.0067897
    go_gc_duration_seconds_sum 10.408703235
    go_gc_duration_seconds_count 33117
    # HELP go_goroutines Number of goroutines that currently exist.
    # TYPE go_goroutines gauge
    go_goroutines 54

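The output above is simple enough to read back with a few lines of code. The parser below is a rough sketch for illustration only: it skips comment lines, ignores optional timestamps, and does not unescape label values. For anything serious, the official client libraries would be the safer choice.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```
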
Digression: the exporter port explosion problem
• As the Prometheus wiki [10] shows, each exporter uses one port
  • Running several exporters on one instance uses up many ports
  • Configuring firewalls such as security groups for them is also tedious
  • There is little need to expose that many ports to Prometheus anyway
• This can be solved with something like rrreeeyyy/exporter_proxy [12]
  • Expose one specific port and use metrics_path on the Prometheus side to tell the exporters apart
[10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations
[12] https://github.com/rrreeeyyy/exporter_proxy

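The idea of telling exporters apart by path on a single port can be sketched as a lookup table. This is only an illustration of the routing idea, not exporter_proxy's actual configuration or code: the path names are hypothetical, while the backend ports are the default ports of node_exporter and mysqld_exporter.

```python
# Hypothetical path-to-exporter table for a single public port.
ROUTES = {
    "/node_exporter/metrics": "http://localhost:9100/metrics",
    "/mysqld_exporter/metrics": "http://localhost:9104/metrics",
}


def resolve_backend(path, routes=ROUTES):
    """Return the local exporter URL for a request path, or None if unknown."""
    return routes.get(path)
```

On the Prometheus side, each scrape job would then set `metrics_path` to one of these paths while all jobs target the same proxy port.
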
About PromQL
• The query language used to process time-series data in Prometheus
  • It feels difficult until you get used to it, but after that it is expressive and convenient
  • For getting started, the official documentation and, in my opinion, the DigitalOcean tutorials were good [13] [14]
• Alerting is also done with PromQL
  • You can write alert rules on statistically processed results
  • There are caveats, such as preferring rate() over irate() for alerting
[13] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1
[14] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2

PromQL for computing CPU utilization
• Per-host CPU utilization collected with node_exporter can be written as follows [15]
  • Subtract the idle value from 100% and average per instance
  • This is because node_cpu holds one value per CPU core
  • For an alert rule, change irate to rate and append a threshold such as > 60
    100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
[15] https://www.robustperception.io/understanding-machine-cpu-usage/

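The arithmetic behind this query can be checked by hand. In the sketch below, each list element stands for the idle rate of one core (a fraction between 0 and 1, as `rate` or `irate` over the idle counter would give); the function name and the sample numbers are assumptions.

```python
def cpu_usage_percent(idle_rate_per_core):
    """Mirror 100 - avg(idle rate) * 100 for a single instance."""
    avg_idle = sum(idle_rate_per_core) / len(idle_rate_per_core)
    return 100 - avg_idle * 100
```

For a two-core host whose cores are idle 50% and 100% of the time, the average idle fraction is 0.75, so the utilization comes out at 25%.
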
Alert rule for disk usage [16]
    - name: node.rules
      rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
        for: 5m
        labels:
          severity: page
• Linear regression functions such as predict_linear are available, so you can alert on filesystems whose free space is predicted to drop to 0 or below within 4 hours
[16] https://www.robustperception.io/reduce-noise-from-disk-space-alerts/

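To see what the rule above computes, here is a sketch of a least-squares prediction in plain Python. It is a simplified stand-in for `predict_linear`, not Prometheus's implementation: it extrapolates from the last sample's timestamp, whereas Prometheus extrapolates from the evaluation time, and it assumes the timestamps are not all identical.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a line through (timestamp, value) samples by least squares and
    return the value predicted `seconds_ahead` after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept
```

If free space falls by 100 units every 10 minutes, the prediction four hours ahead goes negative, which is the condition the `DiskWillFillIn4Hours` rule alerts on.
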
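Digression: managing rule files
• Alert rules have to be managed in Prometheus
  • Rule files are written in plain YAML
• Coming from something like Zabbix, the feature set feels lacking
  • I would like roles, templates and macros...
• Honestly, I would rather hear how those of you using it manage your rules
  • A web UI (Promgen, perhaps?) seems to be the leading option at the moment
  • I also understand the view that you should keep it simple and avoid tricks
  • Writing rules in jsonnet, as ksonnet does for Kubernetes, also seems like an option

Summary
• I explained what to consider when bringing Prometheus into production
  • Redundancy, scaling strategy, data retention and so on
• I explained what to consider when bringing Alertmanager into production
  • Redundancy, actual configuration and so on
• I explained how to write your own exporter and how to use PromQL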