Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Berlin 2013 - Kale Workshop - Abe Stanway
Search
Monitorama
September 20, 2013
0
440
Berlin 2013 - Kale Workshop - Abe Stanway
Monitorama
September 20, 2013
Tweet
Share
More Decks by Monitorama
See All by Monitorama
Monitorama PDX 2017 - Ian Bennett
monitorama
1
580
PDX 2017 - Pedro Andrade
monitorama
0
730
PDX 2017 - Roy Rapoport
monitorama
4
940
PDX 2017 - Julia Evans
monitorama
0
460
Berlin 2013 - Session - Brad Lhotsky
monitorama
5
700
Berlin 2013 - Session - Alex Petrov
monitorama
6
680
Berlin 2013 - Session - Jeff Weinstein
monitorama
2
620
Berlin 2013 - Session - Oliver Hankeln
monitorama
1
530
Berlin 2013 - Session - David Goodlad
monitorama
0
450
Featured
See All Featured
BBQ
matthewcrist
89
9.7k
Code Reviewing Like a Champion
maltzj
524
40k
The Power of CSS Pseudo Elements
geoffreycrofte
77
5.8k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
123
52k
Scaling GitHub
holman
459
140k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
34
3k
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
26k
Building Applications with DynamoDB
mza
95
6.4k
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
181
53k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.4k
Designing Experiences People Love
moore
142
24k
Transcript
WELCOME TO BROOKLYN: A WORKSHOP ON KALE Abe Stanway @abestanway
Disclaimer: still in beta
Kale is composed of two sister services: Skyline and Oculus
SKYLINE
SKYLINE
Q). How do you analyze a timeseries for anomalies in
real time?
A). Lots of HTTP requests to Graphite’s API!
Q). How do you analyze a quarter million timeseries for
anomalies in real time?
Skyline!
Real time?
Kinda.
StatsD Ten second resolution
Ganglia One minute resolution
~ 10s ( ~ 1min Best case:
( Takes about 70 seconds with our throughput.
( Still faster than you would have discovered it otherwise.
Memory > Disk
None
Q). How do you get a quarter million timeseries into
Redis on time?
STREAM THAT SHIT!
Graphite’s relay agent original graphite backup graphite
Graphite’s relay agent original graphite backup graphite [statsd.numStats, [1365603422, 82345]]
pickles [statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]
Graphite’s relay agent original graphite skyline [statsd.numStats, [1365603422, 82345]] pickles
[statsd.numStats, [1365603432, 80611]] [statsd.numStats, [1365603412, 73421]]
We import from Ganglia too.
Storing timeseries
Minimize I/O Minimize memory
redis.append() - Strings - Constant time - One operation per
update
JSON?
“[1358711400, 51],” => get statsD.numStats ----------------------------
“[1358711400, 51], => get statsD.numStats ---------------------------- [1358711410, 23],”
“[1358711400, 51], => get statsD.numStats ---------------------------- [1358711410, 23], [1358711420, 45],”
OVER HALF CPU time spent decoding JSON
[1,2]
[ 1 , 2 ] Stuff we care about Extra
bullshit
MESSAGEPACK
MESSAGEPACK A binary-based serialization protocol
\x93\x01\x02 Array size (16 or 32 bit big endian integer)
Things we care about
\x93\x01\x02 Array size (16 or 32 bit big endian integer)
Things we care about \x93\x02\x03
CUT IN HALF Run Time + Memory Used
ROOMBA.PY CLEANS THE DATA
“Wait...you wrote this in Python?”
Great statistics libraries Not fun for parallelism
Simple map/reduce design The Analyzer
Assign Redis keys to each process Process decodes and analyzes
The Analyzer
Anomalous metrics written as JSON setInterval() retrieves from front end
The Analyzer
None
What does it mean to be anomalous?
Consensus model
[yes] [yes] [no] [no] [yes] [yes] = anomaly!
Helps correct model mismatches
Implement everything you can get your hands on
Basic algorithm: “A metric is anomalous if its latest datapoint
is over three standard deviations above its moving average.”
Histogram binning
Take some data
Find most recent datapoint value is 40
Make a histogram
Check which bin contains most recent data
Check which bin contains most recent data latest value is
40, tiny bin size, so...anomaly!
Ordinary least squares
Take some data
Fit a regression line
Find residuals
Three sigma winner!
Median absolute deviation
Median absolute deviation (calculate residuals with respect to median instead
of regression line)
Exponentially weighted moving average
Instead of:
Add a decay factor!
These algorithms aren’t good enough.
A robust set of algorithms is the current focus of
this project.
Q). How do you analyze a quarter million timeseries for
correlations?
OCULUS
Image comparison is expensive and slow
“[[975, 1365528530], [643, 1365528540], [750, 1365528550], [992, 1365528560], [580, 1365528570],
[586, 1365528580], [649, 1365528590], [548, 1365528600], [901, 1365528610], [633, 1365528620]]” Use raw timeseries instead of raw graphs
Euclidian Distance
Dynamic Time Warping (helps with phase shifts)
We’ve solved it!
O(N2)
O(N2) x 250k
Too slow!
No need to run DTW on all 250k.
Discard obviously dissimilar metrics.
“975 643 643 750 992 992 992 580” “sharpdecrement flat
increment sharpincrement flat flat shapdecrement” Shape Description Alphabet
“975 643 643 750 992 992 992 580” “sharpdecrement flat
increment sharpincrement flat flat shapdecrement” Shape Description Alphabet “24 4 4 11 25 25 25 0 1” (normalization step)
None
Search for shape description fingerprint in Elasticsearch
Run DTW on results as final polish
O(N2) on ~10k metrics
Still too slow.
Fast DTW - O(N) similar strategy - coarse, then refine
Elasticsearch Details Phrase search for first pass scores across shape
description fingerprints
Elasticsearch Details Phrase search for first pass scores across shape
description fingerprints Custom FastDTW and euclidian distance plugins to score across the remaining filtered timeseries
Elasticsearch Structure { :id => “statsd.numStats”, :fingerprint => “sdec inc
sinc sdec”, :values => "10 1 2 15 4" }
First pass query :match => { :fingerprint => { :query
=> “sdec inc sinc sdec inc”, :type => "phrase", :slop => 20 } } shape description fingerprint
Refinement query {:custom_score => { :query => <first_pass_query>, :script =>
"oculus_dtw", :params => { :query_value => “10 20 20 10 30”, :query_field => "values.untouched", }, } raw timeseries
Skyline Elasticsearch Resque Sinatra Ganglia Graphite StatsD KALE Flask
Populating Elasticsearch
ES Index resque workers
Too slow to update and search
New Index Last Index Webapp
Sinatra frontend Queries ES Renders results
Happy monitoring. @abestanway github.com/etsy/skyline github.com/etsy/oculus