delegate to another member
- Clarify the sharing channel
- LINE has an internal Slack channel for outage sharing
- Information spreads to stakeholders
- The sharing template helps clear communication

[Outage notice]
• Outage level:
• Outage product:
• Detection time:
• Issues:
• Cause:
• Services affected:
• Status:
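As an illustration only, the [Outage notice] template could be filled in programmatically so that every notice posted to the sharing channel has the same fields in the same order. This is a minimal sketch, not LINE's actual tooling: the `OutageNotice` class, the `render_notice` function, and the default values for unknown fields are all hypothetical.

```python
from dataclasses import dataclass

# Field labels copied from the [Outage notice] template, in template order.
NOTICE_FIELDS = [
    ("level", "Outage level"),
    ("product", "Outage product"),
    ("detection_time", "Detection time"),
    ("issues", "Issues"),
    ("cause", "Cause"),
    ("services_affected", "Services affected"),
    ("status", "Status"),
]


@dataclass
class OutageNotice:
    """Hypothetical container for one outage notice."""
    level: str
    product: str
    detection_time: str
    issues: str
    # Assumed defaults: early notices often go out before these are known.
    cause: str = "Under investigation"
    services_affected: str = "Unknown"
    status: str = "Investigating"


def render_notice(notice: OutageNotice) -> str:
    """Render the notice as plain text, one template field per line."""
    lines = ["[Outage notice]"]
    for attr, label in NOTICE_FIELDS:
        lines.append(f"• {label}: {getattr(notice, attr)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_notice(OutageNotice(
        level="Medium",
        product="Example product",
        detection_time="17:55",
        issues="Elevated error rate",
    )))
```

Keeping the field order in one list means a new template field only has to be added in one place.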
(Within 1 working day)
- The outage report is created from the template and standardized.
- All corrective and preventive measures are registered as issue tickets.
- The report is distributed by email to related product members.

[Summary]
• Product / Region
• Level / Seriousness
• Coverage (%) / Reliability (%)
• Occurred time / Detected time / Resolved time
• Brief description

[Detail]
• Services affected by the outage
• Cause / Timeline and resolution

[Corrective and preventive measures]
• Preventing future outages
• Improving outage detection
• Improving outage handling
5 working days
- Invite all Platform Server members
  - Mandatory: members of teams and services that were directly affected by the outage
  - Non-mandatory: all Platform Server members
- Explain the outage report and get feedback from various viewpoints.
- Update and finalize the report after the retrospective.
developer writes tech spec documents (describing intention and design). The document is reviewed by a peer developer.

Code Review
All code changes require peer review. Code cannot be released without approval from peer developers.

Testing
Recently, it has become hard to merge code without unit tests. Many teams run end-to-end tests. (This is an outcome of preventive measures.)
servers have many monitoring conditions (an outcome of outage corrective measures). Some developers have to respond to alarms.
- On-call duty is a rotating responsibility among developers.
- We prevent complex outages through early detection and reaction.
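A rotating on-call duty can be sketched as a simple round-robin schedule. The example below is an assumption for illustration, not the team's actual scheduling method: the function name `on_call_for`, the weekly shift length, and the developer names are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, developers: list[str], rotation_start: date,
                shift_days: int = 7) -> str:
    """Return the developer on call on `day` under a round-robin rotation.

    Assumes fixed-length shifts (weekly by default) starting on
    `rotation_start`, with developers taking turns in list order.
    """
    if not developers:
        raise ValueError("rotation needs at least one developer")
    if day < rotation_start:
        raise ValueError("day is before the rotation start")
    # Number of complete shifts that have passed since the rotation began.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return developers[shifts_elapsed % len(developers)]


if __name__ == "__main__":
    team = ["alice", "bob", "carol"]  # hypothetical names
    print(on_call_for(date(2024, 1, 10), team, rotation_start=date(2024, 1, 1)))
```

A real rotation would also need to handle swaps, holidays, and escalation, which this sketch leaves out.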
: Medium
Region : All
Coverage : 0.01%
Reliability : 99%
Occurred At : 4th Aug 17:50:35
Detected At : 4th Aug 17:63:00
Resolved At : 4th Aug 18:32:09

Brief Description
Due to a proxy configuration error, request control did not work properly during the Channel Gateway restart. As a result, user requests were delivered as normal before the server had started, and GC occurred frequently during server start-up because slow start was not working…
check from proxy server to backends

Improving outage detection
• Add health metrics

Improving outage handling
• Improve the rollback and service-out steps in the release guide

Rule
• Every action item should be executable
• Every action item should be registered as a ticket in the issue tracking system
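The second rule lends itself to an automated check before a report is finalized. The following is a minimal sketch under stated assumptions: the `ActionItem` class, the `unregistered_items` function, and the ticket ID format are hypothetical and not tied to any particular issue tracker. Whether an item is "executable" still needs human judgment and is not checked here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    """Hypothetical record for one corrective or preventive measure."""
    category: str                      # e.g. "Improving outage detection"
    description: str
    ticket_id: Optional[str] = None    # ID in the issue tracker, if registered


def unregistered_items(items: list[ActionItem]) -> list[ActionItem]:
    """Return the action items that have no ticket ID (missing or blank)."""
    return [
        item for item in items
        if not (item.ticket_id and item.ticket_id.strip())
    ]


if __name__ == "__main__":
    items = [
        ActionItem("Improving outage detection", "Add health metrics",
                   ticket_id="EXAMPLE-1"),  # made-up ticket ID
        ActionItem("Improving outage handling",
                   "Improve rollback and service-out steps in the release guide"),
    ]
    for item in unregistered_items(items):
        print(f"No ticket registered: {item.description}")
```

Running a check like this before the retrospective makes it easy to see which measures still need a ticket.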