• 13:50 - Planned: A co-host of the Chaos Experiment announces a routine of chaos in the #chaos-eng Slack channel.
• During the Chaos Friday, 14:00 - Planned: One of the co-hosts runs the prepared commands to trigger the failure in the PROD environment. The note taker records the time.
• During the Chaos Friday, 14:15 - Planned: All seems to be well. The team collects evidence of the failure, confirms the hypothesis, and prepares the recovery.
• Real Chaos Friday, 14:30 - Not planned: The PROD environment should return to a steady state, but something is wrong.
• Real Chaos Friday, 14:50 - Not planned: Despite the team's experience, the failure is in a new component and there is no documentation available!
• After the Chaos Friday, 16:05 - Planned: An automated recovery procedure brings the region with chaos back online.
• After 90 min: The team finds a document with commands to restart the new service. They cross their fingers and, fortunately, it works!
www.yurynino.dev
fictional.
Things that went well: teamwork, use of communication channels, and chaos engineering isolation.
Things that did not go well: lack of knowledge about the automated recovery script, dev team unavailable, and lack of documentation!
Action item priority:
• Meeting between the Dev and SRE teams
• Document the new service
PostMortem Time
cannot depend on the knowledge of individuals passed verbally to new members.
• Documentation helps developers communicate with each other.
• Documents help future developers understand and maintain the code.
• Good documentation helps you learn from your mistakes.
Documentation is important because …
If the concepts are not documented, they will have to be painfully relearned through trial and error, as in the previous story!
for running services and product documentation.
• Writers should provide consulting to assess, assist, and address documentation and information management needs.
• Writers should evaluate and improve documentation tools to provide the best solutions for Chaos Engineering.
Technical Writers
documentation that lays the foundation for such teams to scale up and take a principled approach to managing new and unfamiliar services.
the team.
Team Charter
• How the team operates
• Vision statement
• Short description of top services
• Key principles and values
• Links to the team site and docs
is ready for production,
Production Readiness Review
• Architecture and Dependencies
• Capacity Planning
• Failure Modes
• Processes and Automation
• External Dependencies
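The review areas above can be tracked as a simple checklist. The following is a minimal Python sketch, not a prescribed process: the `PRR_AREAS` list mirrors the slide, while the function names `missing_areas` and `is_ready`, and the rule that every area must be reviewed, are assumptions made for illustration.

```python
# Review areas taken from the slide; everything else here is hypothetical.
PRR_AREAS = [
    "Architecture and Dependencies",
    "Capacity Planning",
    "Failure Modes",
    "Processes and Automation",
    "External Dependencies",
]


def missing_areas(reviewed):
    """Return the PRR areas not yet marked as reviewed, in slide order."""
    return [area for area in PRR_AREAS if area not in reviewed]


def is_ready(reviewed):
    """Assumed rule: a service is ready only when no area is left unreviewed."""
    return not missing_areas(reviewed)
```

A team could call `missing_areas({"Capacity Planning"})` before a launch to see which areas still need a reviewer.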
• What to improve for next time
• Lessons learned
• Action items
• Action item priority
A postmortem is an analysis conducted after a system failure.
Postmortems
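One way to make sure every postmortem covers the same ground is to generate an empty template from the section names. This is only a sketch in Python: the section list contains just the items visible on the slide, and the function name `postmortem_skeleton` and the text layout are hypothetical.

```python
# Section names come from the slide; the function and layout are hypothetical.
POSTMORTEM_SECTIONS = [
    "What to improve for next time",
    "Lessons learned",
    "Action items",
    "Action item priority",
]


def postmortem_skeleton(title):
    """Build a plain-text postmortem template with one empty section per heading."""
    lines = [f"Postmortem: {title}", ""]
    for section in POSTMORTEM_SECTIONS:
        lines.append(section)
        lines.append("  (to be filled in)")
        lines.append("")
    return "\n".join(lines)
```

Calling `print(postmortem_skeleton("Chaos Friday"))` yields a document the team can fill in right after the experiment, while the details are still fresh.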
solution will function.
Reliability Reports
• Indicator name
• Collection method
• Assessment/formula/scale criteria
• Targets and performance thresholds
• Source of data
• Data frequency
• Data entry
• Expiry/revision date
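The fields listed above map naturally onto a small record type. The Python sketch below is one possible shape, under stated assumptions: the field names and types, the class name `ReliabilityIndicator`, and the "higher is better" threshold rule are all illustrative choices, not something the slide specifies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ReliabilityIndicator:
    # One field per item on the slide; the types are assumptions.
    name: str
    collection_method: str
    assessment_criteria: str
    target: float
    threshold: float
    data_source: str
    data_frequency: str
    data_entry: str
    revision_date: date

    def is_expired(self, today: date) -> bool:
        """True once the indicator is past its expiry/revision date."""
        return today > self.revision_date

    def meets_threshold(self, measured: float) -> bool:
        """Assumes higher is better, e.g. an availability percentage."""
        return measured >= self.threshold
```

Keeping the expiry/revision date on the record itself makes it easy to flag indicators that need to be reviewed before the next report.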