MOM! My algorithms SUCK
Abe Stanway
September 19, 2013
Given at Monitorama.eu 2013 in Berlin.
http://vimeo.com/75183236
Transcript
@abestanway MOM! my algorithms SUCK
i know how to fix monitoring once and for all.
a real human physically staring at a single metric 24/7
that human will then alert a sleeping engineer when her
metric does something weird
Boom. Perfect Monitoring™.
this works because humans are excellent visual pattern matchers*
*there are, of course, many advanced statistical applications where signal cannot be determined from noise just by looking at the data.
can we teach software to be as good at simple
anomaly detection as humans are?
let’s explore.
anomalies = not “normal”
humans can tell what “normal” is by just looking at
a timeseries.
the human definition: “if a datapoint is not within reasonable bounds, more or less, of what usually happens, it’s an anomaly”
there are real statistics that describe what we mentally approximate
“what usually happens” → the mean
“more or less” → the standard deviation
“reasonable bounds” → 3σ
so, in math speak, a metric is anomalous if its latest datapoint lies more than three standard deviations from the mean
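that rule fits in a few lines of Python. a toy sketch, not Skyline’s actual code; the function name, threshold, and example series are made up for illustration:

```python
import statistics

def is_anomalous(series, threshold=3.0):
    """Flag the latest datapoint if it sits more than `threshold`
    standard deviations from the mean of the preceding history.
    (A toy version of the naive 3-sigma rule.)"""
    history, latest = series[:-1], series[-1]
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return False  # a perfectly flat series can't deviate
    return abs(latest - statistics.fmean(history)) > threshold * stdev

# a steady metric with one wild final datapoint
print(is_anomalous([10, 11, 9, 10, 12, 10, 11, 50]))  # → True
```

note the mean and standard deviation are computed over the history only, so a wild final datapoint can’t inflate the very bounds it’s being tested against.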
we have essentially derived statistical process control.
pioneered in the 1920s. heavily used in industrial engineering for
quality control on assembly lines.
traditional control charts: specification limits
grounded in exchangeability: past = future
needs to be stationary
produced by independent random variables, with well-defined expected values
this allows for statistical inference
in other words, you need good lookin’ timeseries for this
to work.
normal distribution: a more concise definition of good lookin’
[bell curve centered on μ: 34.1% of values in each band within 1σ, 13.6% between 1σ and 2σ, 2.1% between 2σ and 3σ]
if you’ve got a normal distribution, chances are you’ve got
an exchangeable, stationary series produced by independent random variables
99.7% fall under 3σ
[bell curve again, with the tails beyond 3σ highlighted] if your datapoint falls out in those tails, it’s an anomaly.
when only 0.3% lie beyond 3σ...
...you get a high signal to noise ratio...
...where “signal” indicates a fundamental state change, as opposed to a random, improbable variation.
a fundamental state change in the process means a different
probability distribution function that describes the process
anomaly detection: determining when probability distribution function shifts have occurred, as early as possible.
[chart: the mean shifts from μ to μ₁, a new PDF that describes a new process]
drilling holes → snapped drill bit
sawing boards → teeth missing on table saw
forging steel → steel, like, melted
processes with well-planned expected values that only suffer small,
random deviances when working properly...
...and massive “deviances”, aka, probability function shifts, when working improperly.
the bad news:
server infrastructures aren’t like assembly lines
systems are active participants in their own design
processes don’t have well-defined expected values
they aren’t produced by genuinely independent random variables.
large variance does not necessarily indicate poor quality
they have seasonality
skewed distributions! less than 99.73% of all values lie within 3σ, so breaching 3σ is not necessarily bad [skewed distribution: the 3σ bound falls inside the possibly-normal range]
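a quick stdlib-only demo of the skew problem, on synthetic exponential data (the sample size and rate parameter are arbitrary choices for this sketch):

```python
import random

random.seed(42)
# exponential samples are right-skewed: many small values, a long tail
data = [random.expovariate(1.0) for _ in range(100_000)]

mean = sum(data) / len(data)
stdev = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

# fraction of values within 3 standard deviations of the mean
within = sum(abs(x - mean) <= 3 * stdev for x in data) / len(data)
print(f"{within:.2%} of values lie within 3 sigma")  # ~98%, not 99.73%
```

on a skewed series like this, roughly 2% of perfectly healthy datapoints breach 3σ, which is an order of magnitude more false positives than the normal-distribution math promises.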
the dirty secret: using SPC-based algorithms results in lots and
lots of false positives, and probably lots of false negatives as well
no way to retroactively find the false negatives short of
combing with human eyes!
how do we combat this?* *warning! ideas!
we could always use custom fit models...
...after all, as long as the *errors* from the model
are normally distributed, we can use 3σ
Parameters are cool! [chart: a pretty decent forecast based on an artisanal handcrafted model]
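one way to sketch the idea: fit a stand-in hand-crafted model (here, a least-squares trend line) and apply 3σ to the residuals instead of the raw values. the helper name, threshold, and data are all hypothetical:

```python
import statistics

def residual_anomaly(series, threshold=3.0):
    """Fit a least-squares line as a stand-in for a hand-crafted model,
    then apply the 3-sigma rule to the residuals rather than the raw
    values. If the model captures the trend, the residuals are much
    closer to normally distributed, so 3-sigma is meaningful again.
    (Hypothetical helper, not Skyline's implementation.)"""
    n = len(series)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(series)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, series))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, series)]
    stdev = statistics.pstdev(residuals[:-1])
    return stdev > 0 and abs(residuals[-1]) > threshold * stdev

# a steadily growing metric: the raw 3-sigma rule would be confused by
# the trend itself, but the residual check fires only on the genuine jump
print(residual_anomaly([i + 0.1 for i in range(30)] + [60]))  # → True
```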
but fitting models is hard, even by hand.
possible to implement a class of ML algorithms that determine
models based on distribution of errors, using Q-Q plots
Q-Q plots can also be used to determine if the
PDF has changed, although hard to do with limited sample size
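the numbers behind a Q-Q plot can be sketched in stdlib Python: correlate sorted residuals against theoretical normal quantiles, and a score near 1.0 means the plotted points would hug the diagonal. a hypothetical helper, not how any real library implements it:

```python
import random
from statistics import NormalDist, fmean, pstdev

def qq_normality_score(residuals):
    """Correlate sorted residuals against theoretical normal quantiles
    (the numbers a Q-Q plot draws). A score near 1.0 means the residuals
    look plausibly normal, so the 3-sigma rule can be trusted."""
    n = len(residuals)
    xs = sorted(residuals)
    # theoretical quantiles at plotting positions (i + 0.5) / n
    qs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mx, mq = fmean(xs), fmean(qs)
    cov = sum((a - mx) * (b - mq) for a, b in zip(xs, qs))
    return cov / (pstdev(xs) * pstdev(qs) * n)

random.seed(1)
normal_r = [random.gauss(0, 1) for _ in range(500)]       # well-behaved errors
skewed_r = [random.expovariate(1.0) for _ in range(500)]  # skewed errors
print(qq_normality_score(normal_r) > qq_normality_score(skewed_r))  # → True
```

the same score, tracked over time, is one crude way to notice that the PDF itself has changed, though as the slide says, it gets shaky with a limited sample size.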
consensus: throw lots of different models at a series, hope it all shakes out.
[yes] [yes] [no] [no] [yes] [yes] = anomaly!
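the vote might look like this, with three toy detectors standing in for real algorithms (all names and thresholds here are hypothetical, not Skyline’s actual suite):

```python
def consensus(detectors, series, min_votes=None):
    """Flag an anomaly only when a majority of detectors agree --
    the 'throw lots of models at it' approach."""
    votes = [d(series) for d in detectors]
    needed = min_votes if min_votes is not None else len(detectors) // 2 + 1
    return sum(votes) >= needed

# three toy detectors, each with a different notion of "weird"
latest_vs_mean = lambda s: abs(s[-1] - sum(s) / len(s)) > 10
latest_vs_prev = lambda s: abs(s[-1] - s[-2]) > 10
latest_vs_max = lambda s: s[-1] > max(s[:-1])

detectors = [latest_vs_mean, latest_vs_prev, latest_vs_max]
print(consensus(detectors, [10, 11, 9, 10, 12, 10, 11, 50]))  # → True
print(consensus(detectors, [10, 11, 9, 10, 12, 10, 11, 10]))  # → False
```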
of course, if your models are all SPC-based, this doesn’t
really get you anywhere
use exponentially weighted moving averages to adapt faster
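a minimal EWMA sketch showing why it adapts faster than a plain mean after a level shift (the smoothing factor is an arbitrary choice for illustration, not a recommended default):

```python
def ewma(series, alpha=0.3):
    """Exponentially weighted moving average: recent points count more,
    so the baseline adapts to level shifts far faster than a plain mean."""
    avg = series[0]
    for x in series[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

# after a level shift, the EWMA is already near the new level while
# the plain mean is still dragged down by the old history
shifted = [10] * 50 + [40] * 10
print(round(ewma(shifted), 1), round(sum(shifted) / len(shifted), 1))  # → 39.2 15.0
```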
fourier transforms to detect seasonality
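a naive DFT sketch of that idea, stdlib only (real code would reach for numpy.fft and handle noise and trend); the 24-sample cycle stands in for daily seasonality in hourly data:

```python
import cmath, math

def dominant_period(series):
    """Naive discrete Fourier transform: find the frequency bin with
    the most power and report its period in samples."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # drop the DC component
    best_k, best_power = 1, 0.0
    for k in range(1, n // 2):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k  # period, in samples

# a cycle repeating every 24 samples
series = [math.sin(2 * math.pi * t / 24) for t in range(240)]
print(dominant_period(series))  # → 24.0
```

once the seasonal period is known, the detector can compare a datapoint against the same phase of previous cycles instead of the raw recent history.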
second order anomalies: is the series “anomalously anomalous”?
...this is all very hard.
so, we can either change what we expect of monitoring...
...and treat it as a way of building noisy situational
awareness, not absolute directives (alerts)...
...or we can change what we expect out of engineering...
...and construct strict specifications and expected values of all metrics.
neither are going to happen.
so we have to crack this algorithm nut.
...ugh. @abestanway