Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Prometheus - A Whirlwind Tour
Search
Cindy Sridharan
May 10, 2017
Technology
11
3.5k
Prometheus - A Whirlwind Tour
A presentation on Prometheus at OSCON 2017.
Cindy Sridharan
May 10, 2017
Tweet
Share
More Decks by Cindy Sridharan
See All by Cindy Sridharan
Unmasking netpoll.go
copyconstructor
4
2.4k
Monitoring in the time of Cloud Native
copyconstructor
4
380
The Python Deployment Albatross - PyTennessee 2017
copyconstructor
1
480
Prometheus at Google NYC Tech Talks Nov 2016
copyconstructor
10
2.3k
Other Decks in Technology
See All in Technology
Platform EngineeringがあればSREはいらない!? 新時代のSREに求められる役割とは
mshibuya
2
4.1k
あなたの興味は信頼性?それとも生産性? SREとしてのキャリアに悩むみなさまに伝えたい選択肢
jacopen
6
3.5k
【Λ(らむだ)】アップデート機能振り返りΛ編 / PADjp20250127
lambda
0
120
Grid表示のレイアウトで Flow layoutsを使う
cffyoha
1
150
アクセシブルなマークアップの上に成り立つユーザーファーストなドロップダウンメニューの実装 / 20250127_cloudsign_User1st_FE
bengo4com
2
1.2k
FastConnect の冗長性
ocise
1
9.3k
もし今からGraphQLを採用するなら
kazukihayase
9
4.3k
エンジニアとしてプロダクトマネジメントに向き合った1年半
sansantech
PRO
0
110
顧客の声を集めて活かすリクルートPdMのVoC活用事例を徹底解剖!〜プロデザ!〜
recruitengineers
PRO
0
210
Amazon Aurora バージョンアップについて、改めて理解する ~バージョンアップ手法と文字コードへの影響~
smt7174
1
260
[JAWS-UG栃木]地方だからできたクラウドネイティブ事例大公開! / jawsug_tochigi_tachibana
biatunky
0
150
生成AIを活用した機能を、顧客に提供するまでに乗り越えた『4つの壁』
toshiblues
1
210
Featured
See All Featured
How STYLIGHT went responsive
nonsquared
96
5.3k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
20
2.4k
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
33
2k
Building Your Own Lightsaber
phodgson
104
6.2k
Typedesign – Prime Four
hannesfritz
40
2.5k
The MySQL Ecosystem @ GitHub 2015
samlambert
250
12k
Designing Experiences People Love
moore
139
23k
The Cost Of JavaScript in 2023
addyosmani
47
7.3k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
6
520
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
175
51k
A Tale of Four Properties
chriscoyier
157
23k
Imperfection Machines: The Place of Print at Facebook
scottboms
267
13k
Transcript
Prometheus A Whirlwind Tour Cindy Sridharan Oscon 2017 Austin, Texas
@copyconstruct @copyconstruct @copyconstruct
The Future?
None
None
None
OBSERVABILITY > TESTING
Things testing cannot detect
elasticity of the production environment
unpredictability of inputs
the vagaries of upstream and downstream dependencies
Cloud native architectures need best in class observability
None
We cannot understand software unless we observe it
Debugging must be viewed as the process by which systems
are understood and improved, not merely as the process by which bugs are made to go away! - Bryan Cantrill
OBSERVABILITY must also be viewed as the process by which
systems are understood and improved, not merely as the process by which bugs are made to go away!
OBSERVABILITY cannot be an afterthought
Instrumentation should be a requirement for a PR to be
merged
OBSERVABILITY needs to be a part of system design and
development
But … what even is “observability” ?
There are three pillars that make up a modern Observability
stack
Logging Tracing Metrics
All three are examples of whitebox “monitoring”
WHITEBOX Observability data gathered from the internals of the target
system Is capable of providing warning about a problem before it occurs BLACKBOX Observes external functionality as observed by an end user of the system Helps detect when a problem is ongoing and contributing to external symptoms
None
Blackbox methods test your Service Level Objectives
None
Whitebox methods monitor your Service Level Agreements
None
Different systems have different blackbox monitoring and whitebox instrumentation requirements
given their agreed upon SLO and SLA
Where does Prometheus fit in here?
None
None
Prometheus
Whitebox monitoring toolkit and a TSDB for metrics
Monitoring Toolkit
Client Instrumentation Metrics Ingestion Metrics Processing and Storage Querying and
Visualization Analysis Alerting
Client instrumentation
What even is a “metric”?
A set of numbers that give information about a particular
process or activity
Metrics are usually measured over intervals of time — in other words,
a time series
None
What metrics to collect?
The Four Golden Signals Proposed by the SRE book
Latency Traffic Errors Saturation Proposed by the SRE book
USE method by Brendan Gregg
Utilization average time the resource is busy servicing work Saturation
degree to which resource has extra work which it can't service, often queued Errors count of error events B R E N D A N G R E G G
RED method by Tom Wilkie
How busy is my service? R equest rate Are there
any errors in my service E rror rate What is the latency in my service D uration of requests T O M W I L K I E
None
Prometheus has stateful client libraries in all major languages
Server is agnostic to the type of metric
The Prometheus client libraries support four types of metrics
Counters Gauges Histogram Summary
“Target” discovery happens via service discovery
None
Metrics ingestion
None
Pull over HTTP
Does Pull scale?
Prometheus isn’t an event based system or Nagios that spawns
a subprocess while “pulling”
Pull lowers risk of DDoSing your monitoring system
Pull based systems monitor if a service is down (if
a scrape fails) as a part of gathering metrics
None
None
With statsd type of systems, the application sends a UDP
message for every event it observes
Monitoring traffic increases proportionally to user traffic or whatever traffic
is generating monitoring data
Prometheus clients aggregate metrics in memory which is scraped by
the Prometheus server upon regular intervals
None
If you want to push, there’s a PUSHGATEWAY for short
lived jobs
EXPORTERS
Exporters help in exporting existing metrics from third-party systems as
Prometheus metrics.
JMX SNMP HAProxy MySQL Blackbox cAdvisor (Node) system metrics
S T O R A G E
Single node, no clustering
For HA, run 2 identical Prometheus servers
None
In Prometheus, a time series has an ID and a
sample
None
An ID is a combination of both the metric name
and the labels associated
A sample is a combination of a millisecond precision timestamp
and a float64 value
Requirements of *any* TSDB? Effective queries Effective writes
Write optimized Requires parallel queries and aggregation for diverse query
patterns during read time
None
None
None
None
Write pattern is horizontal A TSDB ingests potentially several time
series from every target at specific intervals of time
None
None
None
None
Reads are random We read not entire rows or columns
but sparse matrices
Read optimized Write data in such a way that it
is closely aligned for reads
None
None
The time series are stored in a one file per
time series format on disk
None
Incoming time series are stored in chunks in memory Chunks
are flushed to disk when they are full
None
Incomplete chunks are checkpointed to disk so as to be
able to recover after a crash
None
All data required to evaluate a PromQL expression needs to
be in memory This data is also cached aggressively for future queries.
None
None
None
None
Prometheus supports two types of rules which may be configured
and then evaluated at regular intervals - Recording rules and Alerting rules.
Same chunk eviction policy applies while evaluating for Alerting and
Recording Rules
RECORDING RULES Recording rules allow you to precompute frequently needed
or computationally expensive expressions and save their result as a new set of time series
RECORDING RULES Querying the precomputed result will then often be
much faster than executing the original expression every time it is needed
RECORDING RULES Come in handy while creating dashboards where the
same expression is evaluated every time a dashboard is refreshed
ALERTING RULES Allow defining alert conditions based on PromQL expressions
and to send notifications about firing alerts to an external service.
Drawbacks of V2 storage
Single file per time series
High resource utilization because of time-series churn
Checkpointing to disk can be longer than acceptable
Deletion of stale time-series is prohibitively expensive
SQOF a ka Single Query of Failure
None
None
None
None
None
None
None
FEDERATION
Federation allows a Prometheus server to scrape selected time series
from another Prometheus server
None
CROSS-SERVICE FEDERATION
A Prometheus server of one service is configured to scrape
selected data from another service's Prometheus server to enable alerting and queries against both datasets within a single server
None
HIERARCHICAL FEDERATION
The federation topology resembles a tree, with higher level Prometheus
servers collecting aggregated time series data from a larger number of subordinated servers
None
REMOTE STORAGE
None
None
None
Weave Cortex (DynamoDB + S3) Chronix (Solr) Vulcan (Kafka +
Cassandra)
VISUALIZATION
None
ANALYSIS
PromQL one of the defining features of Prometheus
Labels > Hierarchy
stats . timers . accounts . ios . http .
post . authenticate . response_time . upper_95
{ resource=accounts, method=post, protocol=http, user_agent=ios, endpoint=/authenticate, name=response_time, }
Better exploration because of dimensional queries
PromQL rate(api_http_requests_total [5m] ) SQL SELECT job, instance, method, status,
path, rate(value, 5m) FROM api_http_requests_total
ALERTING
No automatic anomaly detection
ALERT <alert name> IF <expression> [ FOR <duration> ] [
LABELS <label set> ] [ ANNOTATIONS <label set> ]
None
ALERT ConsulRaftPeersLow IF consul_raft_peers < 5 FOR 1m LABELS {severity="page”,
team=“infra”} ANNOTATIONS {description="consul raft peer count low: {{$value}}", summary="consul raft peer count low: {{$value}}"}
ALERT QueueCritical IF sum (broker_q{svc_pref="prod"}) > 5000 FOR 10m LABELS
{severity="page", team=”product"} ANNOTATIONS {description="service: {{$labels.service}} instance: {{$labels.instance}} queue length: {{$value}} for too long", summary="service: {{$labels.service}} instance: {{$labels.instance}} queue length: {{$value}} for too long"}
ALERTMANAGER
Deduplication Grouping Routing Suppression of Alerts
None
CASE STUDY
None
None
None
24 employees 8 engineers
Requirements for a monitoring system?
Ease of Use
Ease of Operation
Cost Effective!
None
None
Cost Effective “at scale”
Scale?
imgix
imgix
imgix Our last outage when we were both shedding load
and serving up errors
None
CONCLUSION
None
None
Our stack is C, Lua, Go, Python
Fantastic official Go and Python clients
Custom LuaJIT client for counters, gauges and histograms
None
None
Single statically linked Go binary
No clustering No dependency on Zookeeper et al.
~2 years of Prometheus use in production
None
Only “cost” has been SSD upgrades on boxes
None
Let’s not answer that last question!
Thank You! @copyconstruct