failure
Network upgrades
Rack failures
Core switch failures
Connectivity issues
Flaky DNS
Misconfigured machines
Bugs
Corrupt or unavailable backups

Cultural Issues
Lack of knowledge sharing
Lack of knowledge handover
Lack of on-call training
Lack of chaos engineering
Lack of an incident management program
Lack of documentation and playbooks
Lack of alerts and pages
Lack of effective alerting thresholds
Lack of backup strategy