Drift Happens!
3 Kubernetes Drift Scenarios & How to Overcome Them

Komodor

April 24, 2025

Transcript

  1. Agenda
    • Housekeeping & Introductions
    • Why Drift Happens
    • The Impact of Drift on Actual Environments
    • Best Practices and Strategies
    • Q and A
  2. Housekeeping
    • Yes, this webinar is recorded
    • Use the Q+A section to ask questions
    • ~45 minutes
  3. A Quick Poll!
    • How does your team primarily detect potential configuration drift in Kubernetes today?
  4. A Quick Poll!
    A) Manual Checks – comparing manifests, kubectl diff, regular reviews.
    B) Reactively – usually only discovered when investigating an incident or failure.
    C) Using built-in features of GitOps tools (like Argo CD, Flux).
    D) We don't have a specific or consistent process for detecting drift.
  5. Common Drift Concerns
    • “We can’t track who changed what across our clusters”
    • “Configuration drift between clusters is a constant problem”
    • “Our GitOps workflow breaks down when changes that were meant for DEV end up in PROD”
  6. Why Does Drift Happen?
    K8s Estate Increases
    • More clusters, more services: more issues and headaches
    Manual Changes & Control
    • Break-glass mechanisms are important but can be debilitating
    Deployment Issues
    • Large-scale, complex Kubernetes environments can suffer from inconsistent deployments “drifting” from baseline configurations
  7. 01 Configuration Drift Across Environments
    • Inconsistent Behavior in a Service: A service deployed across two regions, Prod EU and Prod US, runs smoothly in EU.
    • The Culprit - Inconsistent Memory Limits: Due to a misconfiguration during deployment.
    • The Cost - 1 Hour of Troubleshooting: Took the team an hour to identify the issue at hand.
  8. 02 Managing a Large K8s Fleet
    • Degraded Cluster Performance: Managing hundreds of services across multiple clusters.
    • The Culprit - Outdated Container Image: An incomplete deployment process left the cluster with an outdated image.
    • The Cost - 4 Hours of Analysis: Multiple team members spent hours trying to detect the root cause of performance issues.
  9. 03 GitOps Workflow Service Reliability Issues
    • Pod Crashes for a Critical Service: Started with a new feature rollout.
    • The Culprit - Liveness Probes Incorrectly Configured: A container image with non-prod configurations was deployed due to GitOps workflows.
    • The Cost - 1 Full Day to Recover: Took the developer and an escalated SRE engineer to identify and remediate.
  10. Understanding the Full Impact of Drift
    Performance and Stability Issues
    • Degraded service performance
    • Increased failure rates and downtime
    • Longer troubleshooting time due to hard-to-detect configuration discrepancies
    Security Issues
    • Vulnerabilities from outdated or misconfigured services
    Cost and Inefficiency Issues
    • Services running misaligned configurations can impact cloud costs
  11. Recommendations and Techniques
    Drift happens; your ability to detect and react defines your resilience. Here are key strategies to proactively manage and reduce the risk of drift:
    • Set Guardrails where Possible: Use policies and automation to limit risky manual changes and enforce best practices.
    • Move towards GitOps: Use Git as the single source of truth for configurations. GitOps ensures visibility, consistency, and accountability across environments.
    • Automate Everything: Proactively catch misconfigurations with automated alerts and self-healing mechanisms to reduce MTTR.
    • Integrate Drift into Troubleshooting: Treat drift checks as a default part of incident response; it can dramatically speed up root cause identification.
  12. Winning the Battle Against K8s Drift!
    Immediately identify root cause, and quickly resolve it.
    Intuitive and user-friendly view with detailed insights.
    Compare versions and resources on your Helm charts.
    Easy to Use Visual Experience
    Easily edit the desired state, and enforce best practices with all resource types.
    Diff-only mode for changes in multiple services.
    Accelerate Troubleshooting & Recovery
    Detect Discrepancies
    Keep service configurations uniform across complex K8s environments.
    Flag deviations as reliability risks and standardize configs across the fleet.
    Automate Drift Detection
    Automatically detect and remediate.
    Connect to GitOps tooling to maintain a consistent source of truth.
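
The three scenarios in this deck (inconsistent memory limits, an outdated container image, misconfigured liveness probes) each come down to a field that differs between two copies of a workload's configuration. As a minimal sketch, and not a description of how Komodor or any GitOps tool does it, the Python below flattens two container specs into dotted paths and reports every path whose value differs. The function names `flatten` and `find_drift`, the `MISSING` marker, and all sample values (image tags, memory sizes, probe delays) are invented for illustration; in practice the dictionaries would be loaded from manifests or from the live cluster.

```python
from typing import Any, Dict, Iterator, Tuple

# Marker used when a path exists on one side only (hypothetical convention).
MISSING = "<missing>"


def flatten(obj: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, leaf_value) pairs for nested dicts and lists."""
    if isinstance(obj, dict):
        for key in sorted(obj):
            yield from flatten(obj[key], f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from flatten(item, f"{prefix}[{index}]")
    else:
        yield prefix, obj


def find_drift(baseline: dict, actual: dict) -> Dict[str, Tuple[Any, Any]]:
    """Return {path: (baseline_value, actual_value)} for every differing leaf."""
    base = dict(flatten(baseline))
    act = dict(flatten(actual))
    drift = {}
    for path in sorted(base.keys() | act.keys()):
        left = base.get(path, MISSING)
        right = act.get(path, MISSING)
        if left != right:
            drift[path] = (left, right)
    return drift


if __name__ == "__main__":
    # Illustrative specs only; the values are assumptions, not from the webinar.
    prod_eu = {
        "containers": [{
            "name": "api",
            "image": "example/api:1.4.2",
            "resources": {"limits": {"memory": "512Mi"}},
            "livenessProbe": {"initialDelaySeconds": 10},
        }]
    }
    prod_us = {
        "containers": [{
            "name": "api",
            "image": "example/api:1.4.1",
            "resources": {"limits": {"memory": "256Mi"}},
            "livenessProbe": {"initialDelaySeconds": 10},
        }]
    }
    for path, (eu_value, us_value) in find_drift(prod_eu, prod_us).items():
        print(f"{path}: EU={eu_value} US={us_value}")
```

Comparing flattened leaf paths keeps the check generic: the same function surfaces a memory limit, an image tag, or a probe setting without knowing in advance which field drifted. Note that list items are matched by position, so reordered containers would show up as drift in this sketch.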