Metrics and Monitoring

Metrics & Monitoring Sleman, January 9th 2016

“You cannot manage what you do not measure” - W. Edwards Deming

Metrics are indicators that show what is on track and what needs to change before it is too late.

Metrics Categories
• Business performance
• Usage
• Health

What product guys want to see

What backend guys want to see

What mobile guys want to see

What we DON’T WANT to see

Monitoring Core Principles
1. Identify as many important topics as possible
2. Identify all topics as early as possible
3. Generate as few alarms as possible
4. Do it with as little work as possible
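One way to keep alarms few (principle 3) is to suppress repeats of the same alert for a cooldown period. The sketch below is only an illustration: `AlertThrottle`, `should_send`, and the 300-second default are hypothetical names and values, not something taken from the deck or from any particular tool.

```python
class AlertThrottle:
    """Suppress repeated alerts for the same key during a cooldown window."""

    def __init__(self, cooldown_seconds=300):
        # Hypothetical default: at most one alert per key every 5 minutes.
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> timestamp of the last alert sent

    def should_send(self, key, now):
        """Return True if an alert for `key` may be sent at time `now` (seconds)."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window: stay quiet
        self._last_sent[key] = now
        return True


throttle = AlertThrottle(cooldown_seconds=300)
first = throttle.should_send("db-timeout", now=0)     # first alert goes out
repeat = throttle.should_send("db-timeout", now=100)  # repeat is suppressed
later = throttle.should_send("db-timeout", now=300)   # cooldown has elapsed
```

Passing `now` in explicitly keeps the sketch easy to test; a real setup would probably read the clock itself.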

Tools
• Alert: alarms that wake us up if something happens
• Graph: summarizes all the trends. We are visual beings.
• Logs: the base source of truth, containing all the details

DEVELOPMENT ENV

Development Practices
1. Naive Implementation
2. Measure Implementation
3. Optimize (if needed)
• Measure Everything
• Logging is Cheap
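The "naive, measure, optimize" loop can be sketched with a small timing decorator that logs how long each call takes. This is a minimal example using only the Python standard library; the names `measured` and `naive_sum` and the logger name `"metrics"` are made up for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("metrics")  # hypothetical logger name


def measured(func):
    """Log the duration of every call to `func`, in milliseconds."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failures are measured too.
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.2f ms", func.__name__, elapsed_ms)
    return wrapper


@measured
def naive_sum(numbers):
    # Step 1: the naive implementation. Optimize only if the measurements say so.
    total = 0
    for n in numbers:
        total += n
    return total
```

With logging configured at INFO level, every call to `naive_sum` leaves a timing line in the logs, which gives step 3 something concrete to work from.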

Continuous Integration Pipeline

Insights
• Potential thresholds
• Consequences of failures
• A filtered list of important resources to monitor
• Where to improve or optimize

PRODUCTION ENV

Where to set our eyes?
Layer 0: Application
• Active connections
• Slow processing
• Throughput
• Warning, Error, Fatal logs, etc.
Layer 1: Process
• Changes in process status, e.g. terminated, stopped, restarted
• Uptime
• Consumed resources, etc.

Where to set our eyes?
Layer 2: Server
• Number of running processes
• System resources (CPU, network, IO, memory, etc.)
• Hardware health, etc.
Layer 3: Hosting
• Latency
• Availability
• Maintenance schedules, etc.

Where to set our eyes?
Layer 4: External Dependencies
• API changes
• SSL-APNS certificate renewals
• Policy changes, etc.
Layer 5: USERS!!
• Behaviours
• Crashes
• Device types
• Social media OAuth logins
• Successful responses *
• Sessions, etc.

Four Monitoring Steps
• Monitor potential bad things
• Monitor actual bad things
• Monitor good things
• Tune and Improve

Monitor potential bad things
• Identify the resource
• Understand the threshold value and its consequences
• Set an alert before the threshold is reached
• Daily active users reached 70% of the PubNub threshold
• Increased social login failures in 30 minutes
• Increased timeouts in 30 minutes
• Increase in HTTP status codes >= 400
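"Set an alert before the threshold is reached" can be sketched as a check that fires at a fraction of the hard limit, like the 70% example above. The function name `usage_alert` and the sample numbers are assumptions for illustration; the actual limit depends on the plan or resource being watched.

```python
def usage_alert(current, limit, warn_fraction=0.7):
    """Return a warning string once usage reaches warn_fraction of limit, else None."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = current / limit
    if ratio >= warn_fraction:
        return f"WARNING: usage at {ratio:.0%} of limit ({current}/{limit})"
    return None


# Hypothetical numbers: 8,000 daily active users against a limit of 10,000.
message = usage_alert(current=8000, limit=10000)
quiet = usage_alert(current=5000, limit=10000)
```

Here `message` is a warning because 80% is past the 70% mark, while `quiet` is `None`. The warning arrives while there is still headroom to react.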

Monitor actual bad things
• INEVITABLE
• Identify the resource
• Understand the effects of the failure
• Ensure an alert is triggered
• Ensure all sources of truth exist
• Application server restarted
• Fatal errors or exceptions happened in apps *
• Mobile apps crashed
• Chats aren’t delivered
• Twilio failed to send SMS
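Since logs are the source of truth, one simple way to make sure an alert is triggered is to scan them for severe entries. The sketch below assumes a made-up log format of `<timestamp> <LEVEL> <message>`; real application logs will likely differ, and `scan_logs` is a hypothetical helper.

```python
from collections import Counter

TRACKED_LEVELS = {"WARNING", "ERROR", "FATAL"}
ALERT_LEVELS = {"FATAL"}  # assumption: only FATAL entries should wake someone up


def scan_logs(lines):
    """Count tracked severities and collect the lines that should raise an alert."""
    counts = Counter()
    alerts = []
    for line in lines:
        # Assumed format: "<timestamp> <LEVEL> <message>"
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip lines that do not match the assumed format
        level = parts[1]
        if level in TRACKED_LEVELS:
            counts[level] += 1
        if level in ALERT_LEVELS:
            alerts.append(line)
    return counts, alerts


sample = [
    "2016-01-09T10:00:00 INFO server started",
    "2016-01-09T10:05:00 ERROR payment gateway timeout",
    "2016-01-09T10:06:00 FATAL unhandled exception in chat worker",
]
counts, alerts = scan_logs(sample)
```

For the sample lines, `counts` holds one ERROR and one FATAL, and `alerts` contains only the FATAL line.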

Monitor Good Things (before they turn into disaster)
• Identify the resource
• Set an alert when changes happen
• It’s BETTER to alert on sudden drops/spikes than on gradual changes / thresholds being reached *
• Stores created every hour
• Transactions created every hour
• Successful payments every hour
• Chats delivered every hour, etc.
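Comparing against sudden drops or spikes can be sketched by checking the latest hourly count against the average of the preceding hours. The function `sudden_change` and the 50% tolerance are illustrative assumptions, not values from the deck; a sensible tolerance depends on how noisy each metric is.

```python
from statistics import mean


def sudden_change(hourly_counts, tolerance=0.5):
    """Return 'drop', 'spike', or None for the latest hour vs. the earlier average."""
    if len(hourly_counts) < 2:
        return None  # not enough history to compare against
    *history, latest = hourly_counts
    baseline = mean(history)
    if baseline == 0:
        return "spike" if latest > 0 else None
    change = (latest - baseline) / baseline  # relative change vs. baseline
    if change <= -tolerance:
        return "drop"
    if change >= tolerance:
        return "spike"
    return None


# Hypothetical "transactions created every hour" series.
status = sudden_change([100, 110, 90, 30])  # latest hour is 70% below average
```

In this example `status` is `"drop"`: nothing crossed a fixed threshold, but the good metric fell off sharply, which is the signal worth alerting on.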

Tune and Improve
• Add metrics as part of our retrospectives
• Ask our teams whether any metrics need to be added / changed
• Add / remove logs if necessary
• Remove noisy alerts
• Pay close attention to our tools *

“A metric will tell you that something is happening, while an analysis will tell you why something is happening.” - Vince Law

Thank you :)

Source: • Scalyr.com • Fabric.io • Newrelic.com

Kuncara Adi Nugraha
