From Velocity 2015, I bring you... Crafting performance alerting tools!
The performance team at Etsy has recently developed a number of new alerting tools to help us discover and dig into performance regressions across the site. We built these new tools on top of existing technology (Nagios, Nagios Herald, and Graphite) to bring added context to site slowdowns and help us fix regressions more quickly.
These new alerts change the conversation with our coworkers. Before, we’d ask people about regressions long after they occurred. Now, we discover regressions almost immediately, and are able to share context such as graphs and recent site changes as we work with other teams to track down and fix regressions.
In this talk, Allison McKnight, performance engineer at Etsy, will cover:
- How we created alerts for backend performance slowdowns
- How we iterated on adding context to those alerts, including experiment ramp-ups, the week-over-week state of our most popular and slowest pages, and better graphs
- How we built a dashboard for these alerts that highlights what’s currently an issue, allowing users to play with settings to dig into what is affected by the regression and to compare related pages
- How we built an IRC command to help us do this work alongside our daily chatter, which helps us ask our teammates in real time for more context
- How good tools end up being contagious around a company
- The future of our alerts: alerting on performance wins and native app metrics, and automatically including other teams in alerts about their pages
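
To give a flavor of the backend slowdown alerts described above, here is a minimal sketch of a week-over-week regression check. It is not the code from the talk: the function name `check_regression`, the use of medians, and the 1.2x and 1.5x thresholds are all assumptions made for illustration. The only convention borrowed from real tooling is the Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
from statistics import median

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_regression(current_ms, last_week_ms, warn_ratio=1.2, crit_ratio=1.5):
    """Compare a page's current timings with the same window a week ago.

    Hypothetical helper: the name, the median comparison, and the default
    thresholds are illustrative assumptions, not Etsy's actual alert logic.
    Returns a (status_code, message) pair in the style of a Nagios check.
    """
    if not current_ms or not last_week_ms:
        raise ValueError("both series need at least one data point")

    now = median(current_ms)
    before = median(last_week_ms)
    if before <= 0:
        raise ValueError("last week's median must be positive")

    # A ratio above 1.0 means the page is slower than it was a week ago.
    ratio = now / before
    if ratio >= crit_ratio:
        status, label = CRITICAL, "CRITICAL"
    elif ratio >= warn_ratio:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"

    message = (
        f"{label} - median {now:.0f} ms vs {before:.0f} ms "
        f"a week ago ({ratio:.2f}x)"
    )
    return status, message
```

In a real setup the two series would presumably be pulled from Graphite, and the message is where extra context (graphs, recent site changes) could be attached before the alert reaches a person.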
Open-source tools mentioned in this talk:
github.com/etsy/logster
github.com/etsy/nagios_tools
github.com/etsy/nagios-herald