Agent
• Time series data (metrics and events)
• Processing nearly a trillion data points per day
• Intelligent Alerting
• We’re hiring! (www.datadoghq.com/careers/)
Datadog Overview
OFTEN DON'T HAVE OBVIOUS, CLEAN-CUT SOLUTIONS, SO IT'S USEFUL TO CULTIVATE YOUR TROUBLESHOOTING SKILLS, NO MATTER WHAT ROLE YOU WORK IN.” Internal Datadog Developer Guide
(and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously … Joyent Postmortem http://bit.ly/joyent-post JOYENT US-EAST-1 POST-MORTEM 2014
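The remediation quoted above (stricter input validation, so that one request cannot reboot every server at once) can be sketched as a guard in front of a reboot tool. This is a hypothetical illustration, not Joyent's tooling: the name `validate_reboot_request`, the `max_fraction` threshold, and the rule that control plane servers need an explicit opt-in are all assumptions made for the example.

```python
from typing import Iterable, Set


class RebootRejected(ValueError):
    """Raised when a reboot request is too broad to be considered safe."""


def validate_reboot_request(
    targets: Iterable[str],
    all_servers: Set[str],
    control_plane: Set[str],
    max_fraction: float = 0.25,        # assumed limit, not from the postmortem
    allow_control_plane: bool = False,  # assumed explicit opt-in
) -> Set[str]:
    """Return the validated target set, or raise RebootRejected."""
    requested = set(targets)

    # An empty target list must never silently mean "everything".
    if not requested:
        raise RebootRejected("no targets given; refusing to default to all servers")

    # Reject names that are not part of the known fleet.
    unknown = requested - all_servers
    if unknown:
        raise RebootRejected(f"unknown servers: {sorted(unknown)}")

    # Cap how much of the fleet a single request may touch.
    if len(requested) > max_fraction * len(all_servers):
        raise RebootRejected(
            f"{len(requested)} of {len(all_servers)} servers requested; "
            f"limit is {max_fraction:.0%} per request"
        )

    # Control plane servers require a deliberate, separate decision.
    if requested & control_plane and not allow_control_plane:
        raise RebootRejected("request includes control plane servers")

    return requested
```

The point of the sketch is that the dangerous request fails by default, and the operator has to state the broader intent explicitly.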
happened here at a high level -- think of it as the abstract of a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?
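One way to make sure every summary answers the questions above is to treat them as required fields. The following is a minimal sketch, not a Datadog template: the class name `PostmortemSummary` and its field names are assumptions chosen to mirror the questions.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PostmortemSummary:
    """High-level summary section of a postmortem (hypothetical structure)."""

    what_happened: str = ""                 # the "abstract" of the incident
    customer_impact: str = ""               # impact on customers
    severity: str = ""                      # severity of the outage
    components_affected: List[str] = field(default_factory=list)
    resolution: str = ""                    # what ultimately resolved it

    def missing_fields(self) -> List[str]:
        """Names of the questions that have not been answered yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Usage: a draft is ready for review only when nothing is missing.
draft = PostmortemSummary(what_happened="(summary text)", severity="(level)")
print(draft.missing_fields())
```

A review checklist or CI step could refuse to publish a postmortem while `missing_fields()` is non-empty.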
We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
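The detection questions above reduce to a few timestamps and the gaps between them. Here is a small sketch of that arithmetic, assuming three recorded times; the class `DetectionTimeline` and its field names are hypothetical and not part of any Datadog API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class DetectionTimeline:
    """Key timestamps for the detection section (hypothetical structure)."""

    started_at: datetime                    # when the metric first showed the problem
    declared_at: datetime                   # when an outage was declared
    alerted_at: Optional[datetime] = None   # when a monitor fired; None if none did

    def time_to_declare(self) -> timedelta:
        """How long it took to declare an outage."""
        return self.declared_at - self.started_at

    def time_to_alert(self) -> Optional[timedelta]:
        """How long the monitor took to fire, or None if no monitor fired."""
        if self.alerted_at is None:
            return None
        return self.alerted_at - self.started_at

    def detected_by_monitor(self) -> bool:
        """True if a monitor fired no later than the outage was declared."""
        return self.alerted_at is not None and self.alerted_at <= self.declared_at


# Usage with invented timestamps, purely for illustration.
timeline = DetectionTimeline(
    started_at=datetime(2020, 1, 1, 12, 0),
    declared_at=datetime(2020, 1, 1, 12, 20),
    alerted_at=datetime(2020, 1, 1, 12, 5),
)
print(timeline.time_to_declare(), timeline.detected_by_monitor())
```

If `detected_by_monitor()` is false, the follow-up action is usually a new or tighter monitor on the metric that showed the outage.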