Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Get Instrumented: How Prometheus Can Unify Your...
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Hynek Schlawack
May 31, 2016
Programming
11k
4
Share
Get Instrumented: How Prometheus Can Unify Your Metrics
Hynek Schlawack
May 31, 2016
More Decks by Hynek Schlawack
See All by Hynek Schlawack
Python’s True Superpower
hynek
0
250
Design Pressure
hynek
0
1.8k
Subclassing, Composition, Python, and You
hynek
3
540
Classy Abstractions @ Python Web Conf
hynek
0
240
On the Meaning of Version Numbers
hynek
0
410
Maintaining a Python Project When It’s Not Your Job
hynek
1
2.5k
How to Write Deployment-friendly Applications
hynek
0
2.6k
Solid Snakes or: How to Take 5 Weeks of Vacation
hynek
2
5.9k
Beyond grep – PyCon JP
hynek
1
3.7k
Other Decks in Programming
See All in Programming
Signal Forms: Beyond the Basics @ngBaguette 2026 in Paris
manfredsteyer
PRO
0
150
色即是空、空即是色、データサイエンス
kamoneggi
1
200
Moments When Things Go Wrong
aurimas
3
120
TypeScriptだけでAIエージェントを作る フロント・エージェント・インフラのフルスタック実践
har1101
6
1.1k
Oxlintのカスタムルールの現況
syumai
5
770
LLM Plugin for Node-REDの利用方法と開発について
404background
0
130
oxlintはeslint/typescript-eslintを置き換えられるのか
shomafujita
2
270
Oxlintはいかにしてtsgolintのlint ruleを呼び出しているのか
syumai
2
960
Skillは並べた。動かなかった。契約で繋いだ。— 65個のSkillから、自走する開発サイクルへ
junholee
0
770
ふつうのFeature Flag実践入門
irof
6
3.2k
生成AI時代にこそ効くGo | Why Go Works in the Age of Generative AI
mom0tomo
8
2.9k
ビジネスモデルから紐解く、AI+型駆動開発
hirokiomote
2
3.4k
Featured
See All Featured
How Software Deployment tools have changed in the past 20 years
geshan
0
34k
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation
inesmontani
PRO
3
2.2k
The Hidden Cost of Media on the Web [PixelPalooza 2025]
tammyeverts
2
310
Faster Mobile Websites
deanohume
310
31k
Claude Code のすすめ
schroneko
67
220k
Context Engineering - Making Every Token Count
addyosmani
9
920
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
410
Between Models and Reality
mayunak
4
310
How to train your dragon (web standard)
notwaldorf
97
6.6k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
55k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
17k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
2k
Transcript
Hynek Schlawack Get Instrumented How Prometheus Can Unify Your Metrics
Goals
Goals
Goals
Goals
Goals
Service Level
Service Level Indicator
Service Level Indicator Objective
Service Level Indicator Objective (Agreement)
Metrics
Metrics avg latency 0.3 0.5 0.8 1.1 2.6
Metrics 12:00 12:01 12:02 12:03 12:04 avg latency 0.3 0.5
0.8 1.1 2.6
Metrics 12:00 12:01 12:02 12:03 12:04 avg latency 0.3 0.5
0.8 1.1 2.6 server load 0.3 1.0 2.3 3.5 5.2
None
Instrument
Instrument
Instrument
Instrument
Instrument
None
None
Metric Types
Metric Types ❖counter
Metric Types ❖counter ❖gauge
Metric Types ❖counter ❖gauge ❖summary
Metric Types ❖counter ❖gauge ❖summary ❖histogram
Metric Types ❖counter ❖gauge ❖summary ❖histogram ❖ buckets (1s, 0.5s,
0.25, …)
Averages
❖ avg(request time) ≠ avg(UX) Averages
❖ avg(request time) ≠ avg(UX) ❖ avg({1, 1, 1, 1,
10}) = 2.8 Averages
❖ avg(request time) ≠ avg(UX) ❖ avg({1, 1, 1, 1,
10}) = 2.8 Averages
❖ avg(request time) ≠ avg(UX) ❖ avg({1, 1, 1, 1,
10}) = 2.8 Averages
❖ avg(request time) ≠ avg(UX) ❖ avg({1, 1, 1, 1,
10}) = 2.8 ❖ median({1, 1, 1, 1, 10}) = 1 Averages
❖ avg(request time) ≠ avg(UX) ❖ avg({1, 1, 1, 1,
10}) = 2.8 ❖ median({1, 1, 1, 1, 10}) = 1 Averages
❖ avg(request time) ≠ avg(UX) ❖ avg({1, 1, 1, 1,
10}) = 2.8 ❖ median({1, 1, 1, 1, 10}) = 1 ❖ median({1, 1, 100_000}) = 1 Averages
Percentiles
Percentiles nth percentile P of a data set = P
≥ n% of values
None
50th percentile = 1 ms
50th percentile = 1 ms 50% of requests done by
1 ms
Percentiles
Percentiles P {1, 1, 100_000} 50th 1
Percentiles P {1, 1, 100_000} 50th 1 95th 90_000
None
None
None
Naming
Naming backend1_app_http_reqs_msgs_post backend1_app_http_reqs_msgs_get …
Naming backend1_app_http_reqs_msgs_post backend1_app_http_reqs_msgs_get … app_http_reqs_total
Naming backend1_app_http_reqs_msgs_post backend1_app_http_reqs_msgs_get … app_http_reqs_total
Naming backend1_app_http_reqs_msgs_post backend1_app_http_reqs_msgs_get … app_http_reqs_total
Naming backend1_app_http_reqs_msgs_post backend1_app_http_reqs_msgs_get … app_http_reqs_total{meth="POST", path="/msgs", backend="1"} app_http_reqs_total{meth="GET", path="/msgs", backend="1"}
… app_http_reqs_total
None
None
1. resolution = scraping interval
1. resolution = scraping interval 2. missing scrapes = less
resolution
Pull: Problems ❖ short lived jobs
None
Pull: Problems ❖ short lived jobs ❖ target discovery
Configuration scrape_configs: - job_name: 'prometheus' target_groups: - targets: - 'localhost:9090'
Configuration scrape_configs: - job_name: 'prometheus' target_groups: - targets: - 'localhost:9090'
Configuration scrape_configs: - job_name: 'prometheus' target_groups: - targets: - 'localhost:9090'
Configuration scrape_configs: - job_name: 'prometheus' target_groups: - targets: - 'localhost:9090'
{instance="localhost:9090",job="prometheus"}
None
Pull: Problems ❖ target discovery ❖ short lived jobs ❖
Heroku/NATed systems
Pull: Advantages
Pull: Advantages ❖ multiple Prometheis easy
Pull: Advantages ❖ multiple Prometheis easy ❖ outage detection
Pull: Advantages ❖ multiple Prometheis easy ❖ outage detection ❖
predictable, no self-DoS
Pull: Advantages ❖ multiple Prometheis easy ❖ outage detection ❖
predictable, no self-DoS ❖ easy to instrument 3rd parties
Metrics Format # HELP req_seconds Time spent \ processing a
request in seconds. # TYPE req_seconds histogram req_seconds_count 390.0 req_seconds_sum 177.0319407
Metrics Format # HELP req_seconds Time spent \ processing a
request in seconds. # TYPE req_seconds histogram req_seconds_count 390.0 req_seconds_sum 177.0319407
Metrics Format # HELP req_seconds Time spent \ processing a
request in seconds. # TYPE req_seconds histogram req_seconds_count 390.0 req_seconds_sum 177.0319407
Metrics Format # HELP req_seconds Time spent \ processing a
request in seconds. # TYPE req_seconds histogram req_seconds_count 390.0 req_seconds_sum 177.0319407
Metrics Format # HELP req_seconds Time spent \ processing a
request in seconds. # TYPE req_seconds histogram req_seconds_count 390.0 req_seconds_sum 177.0319407
Percentiles req_seconds_bucket{le="0.05"} 0.0 req_seconds_bucket{le="0.25"} 1.0 req_seconds_bucket{le="0.5"} 273.0 req_seconds_bucket{le="0.75"} 369.0 req_seconds_bucket{le="1.0"}
388.0 req_seconds_bucket{le="2.0"} 390.0 req_seconds_bucket{le="+Inf"} 390.0
Percentiles req_seconds_bucket{le="0.05"} 0.0 req_seconds_bucket{le="0.25"} 1.0 req_seconds_bucket{le="0.5"} 273.0 req_seconds_bucket{le="0.75"} 369.0 req_seconds_bucket{le="1.0"}
388.0 req_seconds_bucket{le="2.0"} 390.0 req_seconds_bucket{le="+Inf"} 390.0
Percentiles req_seconds_bucket{le="0.05"} 0.0 req_seconds_bucket{le="0.25"} 1.0 req_seconds_bucket{le="0.5"} 273.0 req_seconds_bucket{le="0.75"} 369.0 req_seconds_bucket{le="1.0"}
388.0 req_seconds_bucket{le="2.0"} 390.0 req_seconds_bucket{le="+Inf"} 390.0
None
Aggregation
Aggregation sum( rate( req_seconds_count[1m] ) )
Aggregation sum( rate( req_seconds_count[1m] ) )
Aggregation sum( rate( req_seconds_count[1m] ) )
Aggregation sum( rate( req_seconds_count[1m] ) )
Aggregation sum( rate( req_seconds_count{dc="west"}[1m] ) )
Aggregation sum( rate( req_seconds_count[1m] ) ) by (dc)
Percentiles histogram_quantile( 0.9, rate( req_seconds_bucket[10m] ))
Percentiles histogram_quantile( 0.9, rate( req_seconds_bucket[10m] ))
Percentiles histogram_quantile( 0.9, rate( req_seconds_bucket[10m] ))
Percentiles histogram_quantile( 0.9, rate( req_seconds_bucket[10m] ))
Percentiles histogram_quantile( 0.9, rate( req_seconds_bucket[10m] ))
None
None
Internal
Internal ❖ great for ad-hoc
Internal ❖ great for ad-hoc ❖ 1 expr per graph
Internal ❖ great for ad-hoc ❖ 1 expr per graph
❖ templating
PromDash
PromDash ❖ best integration
PromDash ❖ best integration ❖ former official
PromDash ❖ best integration ❖ former official ❖ now deprecated
❖ don’t bother
Grafana
Grafana ❖ pretty & powerful
Grafana ❖ pretty & powerful ❖ many integrations
Grafana ❖ pretty & powerful ❖ many integrations ❖ mix
and match!
Grafana ❖ pretty & powerful ❖ many integrations ❖ mix
and match! ❖ use this!
None
Alerts & Scrying
Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <
0 FOR 5m
Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <
0 FOR 5m
Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <
0 FOR 5m
Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <
0 FOR 5m
Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <
0 FOR 5m
Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <
0 FOR 5m
None
None
None
Environment
None
Apache nginx Django PostgreSQL MySQL MongoDB CouchDB redis Varnish etcd
Kubernetes Consul collectd HAProxy statsd graphite InfluxDB SNMP
Apache nginx Django PostgreSQL MySQL MongoDB CouchDB redis Varnish etcd
Kubernetes Consul collectd HAProxy statsd graphite InfluxDB SNMP
node_exporter
node_exporter cAdvisor
System Insight
System Insight ❖ load
System Insight ❖ load ❖ procs
System Insight ❖ load ❖ procs ❖ memory
System Insight ❖ load ❖ procs ❖ memory ❖ network
System Insight ❖ load ❖ procs ❖ memory ❖ network
❖ disk
System Insight ❖ load ❖ procs ❖ memory ❖ network
❖ disk ❖ I/O
mtail
mtail ❖ follow (log) files
mtail ❖ follow (log) files ❖ extract metrics using regex
mtail ❖ follow (log) files ❖ extract metrics using regex
❖ can be better than direct
Moar
Moar ❖ Edges: web servers/HAProxy
Moar ❖ Edges: web servers/HAProxy ❖ black box
Moar ❖ Edges: web servers/HAProxy ❖ black box ❖ databases
Moar ❖ Edges: web servers/HAProxy ❖ black box ❖ databases
❖ network
So Far
So Far ❖ system stats
So Far ❖ system stats ❖ outside look
So Far ❖ system stats ❖ outside look ❖ 3rd
party components
Code
cat-or.not
cat-or.not ❖ HTTP service
cat-or.not ❖ HTTP service ❖ upload picture
cat-or.not ❖ HTTP service ❖ upload picture ❖ meow!/nope meow!
from flask import Flask, g, request from cat_or_not import is_cat
app = Flask(__name__) @app.route("/analyze", methods=["POST"]) def analyze(): g.auth.check(request) return ("meow!" if is_cat(request.files["pic"]) else "nope!") if __name__ == "__main__": app.run()
from flask import Flask, g, request from cat_or_not import is_cat
app = Flask(__name__) @app.route("/analyze", methods=["POST"]) def analyze(): g.auth.check(request) return ("meow!" if is_cat(request.files["pic"]) else "nope!") if __name__ == "__main__": app.run()
from flask import Flask, g, request from cat_or_not import is_cat
app = Flask(__name__) @app.route("/analyze", methods=["POST"]) def analyze(): g.auth.check(request) return ("meow!" if is_cat(request.files["pic"]) else "nope!") if __name__ == "__main__": app.run()
pip install prometheus_client
from prometheus_client import \ start_http_server # … if __name__ ==
"__main__": start_http_server(8000) app.run()
process_virtual_memory_bytes 156393472.0 process_resident_memory_bytes 20480000.0 process_start_time_seconds 1460214325.21 process_cpu_seconds_total 0.169999999998 process_open_fds 8.0
process_max_fds 1024.0
process_virtual_memory_bytes 156393472.0 process_resident_memory_bytes 20480000.0 process_start_time_seconds 1460214325.21 process_cpu_seconds_total 0.169999999998 process_open_fds 8.0
process_max_fds 1024.0
process_virtual_memory_bytes 156393472.0 process_resident_memory_bytes 20480000.0 process_start_time_seconds 1460214325.21 process_cpu_seconds_total 0.169999999998 process_open_fds 8.0
process_max_fds 1024.0
process_virtual_memory_bytes 156393472.0 process_resident_memory_bytes 20480000.0 process_start_time_seconds 1460214325.21 process_cpu_seconds_total 0.169999999998 process_open_fds 8.0
process_max_fds 1024.0
process_virtual_memory_bytes 156393472.0 process_resident_memory_bytes 20480000.0 process_start_time_seconds 1460214325.21 process_cpu_seconds_total 0.169999999998 process_open_fds 8.0
process_max_fds 1024.0
process_virtual_memory_bytes 156393472.0 process_resident_memory_bytes 20480000.0 process_start_time_seconds 1460214325.21 process_cpu_seconds_total 0.169999999998 process_open_fds 8.0
process_max_fds 1024.0
None
from prometheus_client import \ Histogram, Gauge REQUEST_TIME = Histogram( "cat_or_not_request_seconds",
"Time spent in HTTP requests.")
from prometheus_client import \ Histogram, Gauge REQUEST_TIME = Histogram( "cat_or_not_request_seconds",
"Time spent in HTTP requests.") ANALYZE_TIME = Histogram( "cat_or_not_analyze_seconds", "Time spent analyzing pictures.")
from prometheus_client import \ Histogram, Gauge REQUEST_TIME = Histogram( "cat_or_not_request_seconds",
"Time spent in HTTP requests.") ANALYZE_TIME = Histogram( "cat_or_not_analyze_seconds", "Time spent analyzing pictures.") IN_PROGRESS = Gauge( "cat_or_not_in_progress_requests", "Number of requests in progress.")
@IN_PROGRESS.track_inprogress() @REQUEST_TIME.time() @app.route("/analyze", methods=["POST"]) def analyze(): g.auth.check(request) with ANALYZE_TIME.time(): result
= is_cat( request.files["pic"].stream) return "meow!" if result else "nope!"
@IN_PROGRESS.track_inprogress() @REQUEST_TIME.time() @app.route("/analyze", methods=["POST"]) def analyze(): g.auth.check(request) with ANALYZE_TIME.time(): result
= is_cat( request.files["pic"].stream) return "meow!" if result else "nope!"
AUTH_TIME = Histogram("auth_seconds", "Time spent authenticating.") AUTH_ERRS = Counter("auth_errors_total", "Errors
while authing.") AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total", "Wrong credentials.") class Auth: # ... @AUTH_TIME.time() def auth(self, request): while True: try: return self._auth(request) except WrongCredsError: AUTH_WRONG_CREDS.inc() raise except Exception: AUTH_ERRS.inc()
AUTH_TIME = Histogram("auth_seconds", "Time spent authenticating.") AUTH_ERRS = Counter("auth_errors_total", "Errors
while authing.") AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total", "Wrong credentials.") class Auth: # ... @AUTH_TIME.time() def auth(self, request): while True: try: return self._auth(request) except WrongCredsError: AUTH_WRONG_CREDS.inc() raise except Exception: AUTH_ERRS.inc()
AUTH_TIME = Histogram("auth_seconds", "Time spent authenticating.") AUTH_ERRS = Counter("auth_errors_total", "Errors
while authing.") AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total", "Wrong credentials.") class Auth: # ... @AUTH_TIME.time() def auth(self, request): while True: try: return self._auth(request) except WrongCredsError: AUTH_WRONG_CREDS.inc() raise except Exception: AUTH_ERRS.inc()
AUTH_TIME = Histogram("auth_seconds", "Time spent authenticating.") AUTH_ERRS = Counter("auth_errors_total", "Errors
while authing.") AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total", "Wrong credentials.") class Auth: # ... @AUTH_TIME.time() def auth(self, request): while True: try: return self._auth(request) except WrongCredsError: AUTH_WRONG_CREDS.inc() raise except Exception: AUTH_ERRS.inc()
@app.route("/analyze", methods=["POST"]) def analyze(): g.auth.check(request) with ANALYZE_TIME.time(): result = is_cat(
request.files["pic"].stream) return "meow!" if result else "nope!"
pip install prometheus_async
Wrapper from prometheus_async.aio import time @time(REQUEST_TIME) async def view(request): #
...
Goodies
Goodies ❖ aiohttp-based metrics export
Goodies ❖ aiohttp-based metrics export ❖ also in thread!
Goodies ❖ aiohttp-based metrics export ❖ also in thread! ❖
Consul Agent integration
Wrap Up
Wrap Up
Wrap Up ✓
Wrap Up ✓ ✓
Wrap Up ✓ ✓ ✓
ox.cx/p @hynek vrmd.de