Process • Active connections • Slow processing • Throughput • Warning, Error, Fatal logs, etc • Changes in process status e.g. terminated, stopped, restarted • Uptime • Consumed resources, etc
threshold value and consequences • Set alert before the threshold reached • Daily active users reached 70% of PubNub threshold • Increased social login failure in 30 minutes • Increased timeouts in 30 minutes • Increased >= 400 HTTP Codes
Identify resource • Set alert when changes happened • It’s BETTER to compare to sudden drops/spike rather than gradual changes / threshold reached * • Stores created every hour • Transactions created every hour • Successful payments every hour • Chat delivered every hour, etc
retrospectives • Asks our teams if any metrics need to be added / changed • Add / remove logs if necessary • Remove noise alerts • Pay close attention to our tools *