Monitoring on a budget

a few animated gifs with the Twelfth Doctor (0 cats)

C J Silverio
vp of engineering, @ceejbot

let's talk npm by the numbers

205 million packages Tuesday
10K requests/sec

npm is 25 people
4 of us run the registry

when the company was formed
5 people total

you outsource many services when you're tiny

you pull them back in-house when you succeed

success is sometimes a catastrophe

npm's scale: runaway success
npm's staff: wouldn't this be neat

mission: know this on a budget

2 questions:
is the registry up?
how well is it performing?

is the registry up?
monitoring

how well is it performing?
metrics

monitoring

monitoring == pull
ask questions that you know the right answers for

Is this host up?
Is this cert about to expire?
Is the DB replication keeping up?

if you get the wrong answer somebody gets paged
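
A check of this kind can be sketched as a function that asks one question and maps the answer to a Nagios plugin exit code (0 OK, 1 WARNING, 2 CRITICAL). This is a minimal sketch, not npm's actual check: the function name, the thresholds, and the dates are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Nagios plugin exit codes (the standard plugin convention).
OK, WARNING, CRITICAL = 0, 1, 2

def check_cert_expiry(not_after, now, warn_days=30, crit_days=7):
    """Return (exit_code, message) for a certificate expiring at not_after.

    warn_days and crit_days are illustrative thresholds, not npm's.
    """
    days_left = (not_after - now).days
    if days_left < crit_days:
        return CRITICAL, f"CRITICAL: cert expires in {days_left} days"
    if days_left < warn_days:
        return WARNING, f"WARNING: cert expires in {days_left} days"
    return OK, f"OK: cert expires in {days_left} days"

# Arbitrary example date; a real check would read the expiry from the cert.
now = datetime(2015, 6, 1, tzinfo=timezone.utc)
code, message = check_cert_expiry(now + timedelta(days=5), now)
print(code, message)
```

A real plugin would print the message and exit with the code, and the CRITICAL answer is the one that pages somebody.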

nagios
state of the art in free

It's okay. We never look at it. It just triggers PagerDuty.

nagios’s virtues: reliability & custom checks

goal: never page anybody

self-healing checks
automate the fix if you can!
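
One way to read "self-healing" is: run the check, try the automated fix once, and page a human only if the check still fails. The sketch below assumes exactly that reading; run_with_self_heal, check, restart, and the pages list are hypothetical names, not part of nagios or of npm's tooling.

```python
def run_with_self_heal(check, fix, page):
    """Run check(); on failure try fix() once, and page only if it still fails."""
    if check():
        return "ok"
    fix()
    if check():
        return "healed"
    page("check still failing after automated fix")
    return "paged"

# A fake service that is down until something restarts it.
state = {"up": False}
pages = []

def check():
    return state["up"]

def restart():
    state["up"] = True

result = run_with_self_heal(check, restart, pages.append)
print(result, pages)
```

Here the restart works, so nobody is paged, which is the goal on the previous slide.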

monitoring == unit tests
a ratchet for continuous improvement

external monitoring
ping services

you must monitor
but that's just the start

monitoring tells you what
it doesn't tell you why

metrics

Q: What's a metric?
A: A name + a value + a time.

counter: it happened N times
gauge: it's Y-sized right now
rate: it's happening N times/second
timing: it took X milliseconds
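
The definition above fits in a few lines. This is a sketch only: the Metric class, the rate_per_second helper, and every metric name below are made up for illustration and are not taken from npm's registry.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Metric:
    """A metric as the slide defines it: a name + a value + a time."""
    name: str
    value: float
    time: float = field(default_factory=time.time)  # seconds since the epoch

def rate_per_second(count, window_seconds):
    """Turn a count observed over a window into a rate."""
    return count / window_seconds

# Hypothetical metric names, one per kind.
counter = Metric("registry.requests", 1)        # it happened once more
gauge = Metric("registry.heap_mb", 412.5)       # it's this big right now
timing = Metric("registry.response_ms", 38.0)   # it took this long
rate = Metric("registry.requests_per_sec", rate_per_second(600, 60))

for m in (counter, gauge, timing, rate):
    print(m.name, m.value)
```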

metrics == push
the app gives you numbers

emit from a service
store in timeseries db
query & graph

the usual stack
statsd ➜ graphite ➜ grafana

statsd uses UDP

Q: Why not send metrics over UDP?
A: You care about receiving them.
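
For reference, a statsd datagram is a plain-text line of the form name:value|type, where the type is c (counter), g (gauge), or ms (timing). The sketch below builds such a payload and shows a fire-and-forget UDP send; the helper names and metric names are hypothetical, and the host and port assume a statsd daemon on its default port 8125.

```python
import socket

# statsd type suffixes: c = counter, g = gauge, ms = timing.
def format_statsd(name, value, kind):
    """Build one statsd datagram payload: <name>:<value>|<type>."""
    if kind not in ("c", "g", "ms"):
        raise ValueError(f"unsupported statsd type: {kind}")
    return f"{name}:{value}|{kind}".encode("ascii")

def send_udp(payload, host="127.0.0.1", port=8125):
    """Fire and forget: sendto() returns once the datagram is handed to the
    OS, with no acknowledgement that anything received it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))

print(format_statsd("registry.requests", 1, "c"))
print(format_statsd("registry.response_ms", 38, "ms"))
```

Calling send_udp(format_statsd("registry.requests", 1, "c")) hands the datagram to the OS with no acknowledgement from any receiver, so a dropped metric is simply gone. That is the trade-off the slide is pointing at.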

just try to install graphite

for-pay/SaaS services exist
but I can't afford them

monitoring 400 processes right now
12+ GB of log data a day

interlude: when should you pay?

convert the £$€ cost into engineer hours/month

pay when it's cheaper than investing an engineer
(be honest about the cost)
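
The conversion is simple arithmetic. In this sketch every number is invented for illustration: a hypothetical service at 3000 per month and a fully loaded engineer cost of 100 per hour, in whatever currency applies.

```python
def cost_in_engineer_hours(monthly_price, loaded_hourly_rate):
    """Express a monthly bill as the engineer hours it would buy."""
    return monthly_price / loaded_hourly_rate

def should_pay(monthly_price, loaded_hourly_rate, in_house_hours_per_month):
    """Pay when the bill costs fewer engineer hours than running it yourself."""
    bill_hours = cost_in_engineer_hours(monthly_price, loaded_hourly_rate)
    return bill_hours < in_house_hours_per_month

# Hypothetical figures: 3000/month service, 100/hour loaded engineer cost.
print(cost_in_engineer_hours(3000, 100))
print(should_pay(3000, 100, in_house_hours_per_month=40))
print(should_pay(3000, 100, in_house_hours_per_month=10))
```

The honest part is the in-house estimate: it should count building, upgrading, and being on call for the thing, not only the first install.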

numbat was born
“How hard can it be?” I said.

https://github.com/numbat-metrics
numbat-powered metrics

npm’s stack
numbat ➜ influxdb ➜ grafana

so easy to emit a metric that we just do it any time something interesting happens

4000 metrics/sec from the registry

metrics ➜ alerts

Server handling expected traffic?
Latency higher than normal?
Error rate higher than usual?
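
Each of these questions compares a current metric against what is normal. A minimal sketch for the error-rate question, assuming a fixed baseline and a made-up tolerance of 2x; the function names and all the numbers are hypothetical, and the talk does not say how npm sets its thresholds.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0

def should_alert(current, baseline, tolerance=2.0):
    """Alert when the current rate exceeds the baseline by the tolerance factor."""
    return current > baseline * tolerance

baseline = error_rate(50, 10_000)    # what "usual" looks like in this example
current = error_rate(180, 10_000)    # what the last window looked like

print(should_alert(current, baseline))
```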

metrics comprise a data stream
send the stream to more than one place!
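
Sending the stream to more than one place can be as small as a loop over consumers. This is a sketch under assumptions: fan_out and the sinks below are hypothetical stand-ins (a storage writer, an alerting consumer, and one that is broken), not numbat's API.

```python
def fan_out(metric, sinks):
    """Deliver one metric to every sink; a failing sink must not block the rest."""
    failures = 0
    for sink in sinks:
        try:
            sink(metric)
        except Exception:
            failures += 1
    return failures

stored = []   # stands in for a time-series database writer
watched = []  # stands in for an alerting or anomaly-detection consumer

def broken_sink(metric):
    raise RuntimeError("sink unavailable")

metric = {"name": "registry.requests", "value": 1, "time": 1433116800}
failures = fan_out(metric, [stored.append, broken_sink, watched.append])
print(len(stored), len(watched), failures)
```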

anomaly detection
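
The talk does not name a detection technique. As one simple stand-in, the sketch below flags a value that sits more than a few standard deviations from recent history (a z-score test); the function name, the threshold of 3, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev

def is_anomaly(history, value, threshold=3.0):
    """Flag value if it is more than threshold standard deviations from the
    mean of history. Needs at least two history points to say anything."""
    if len(history) < 2:
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical requests/sec samples from a quiet period.
history = [100, 102, 98, 101, 99]
print(is_anomaly(history, 150))
print(is_anomaly(history, 101))
```

A plain z-score ignores daily and weekly traffic cycles, so a production detector would need a baseline that accounts for them.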

recap time!

your web apps are backed by something

what's it up to? how do you know?

get data on what your services are up to

what: monitoring
yes/no questions

why: metrics
data changing over time

next: anomaly detection
predictions & trends

automate
don't require humans

npm install -g npm@latest
@ceejbot on all the things
npm loves you
