Kubernetes Health Management with Komodor

Komodor
March 06, 2025

Transcript

  1. Danielle Inbar 👋

    PM Director @ Komodor. Previously at Snyk, Spot.io, Verint, Motorola, and Dell EMC. Mother of Arbel 👦, Yuval 👧 and Charlie 🐶
  2. “The more complicated Kubernetes gets, the more complicated it gets to do anything at all, until eventually we can’t move, we can’t add new features or solve problems.”

    Tim Hockin, Kubernetes Co-Founder, Google. KubeCon NA 2023 Keynote Speech
  3. Kubernetes Creates Complexity

    Diagram labels: Applications, Infrastructure, Workloads, Kubernetes Cluster, Configuration, Addons, Infra, Pods, Deployments, Jobs, DaemonSets, StatefulSets, Networking, Storage, Access Control, Configs & Secrets, Nodes, Distributions & Versions, Cloud Providers, Core Components, Storage, Packages, Cert & Secrets, AI/MLOps & Batch Processing, Streaming, VCS & CI/CD, Network Policies & IaC, Autoscalers, Service Mesh, API Gateway, DNS, Pod Scaling, Node Scaling
  4. Kubernetes is a Complete Ecosystem

    While pods and deployments are the tip of the iceberg, there are dozens of components that are required to run Kubernetes effectively at scale.
  5. The Challenges of Kubernetes Management

    Operational Toil: Managing a production-grade Kubernetes stack at scale imposes significant toil, including handling numerous resources.
    Complex Correlation: Teams must manage tasks like autoscaling, network policies, and service meshes (e.g., Istio), consistently applying best practices and optimizing configurations to ensure seamless service availability across the cluster.
    Risk of Downtime: Misconfigurations or delays with critical components like storage solutions, networking plugins, and other core services can lead to cluster-wide downtime or degraded performance.
    Lack of Visibility and Proactiveness: Not all K8s elements have standards for visibility, error tracking, and alerting. As a result, many issues go undetected and teams fall back on reactive firefighting.
    Impact on Productivity: Managing K8s clusters is time-consuming and diverts focus from strategic initiatives to the tactical work of ‘keeping the head above the water’.
  6. Example: Typical Cert-Manager issue investigation flow

    Developer:
    Alert: The ‘Checkout’ web service is down
    Inspecting the K8s deployment - all pods are failing
    Inspecting the logs - failed to connect to the DB
    Inspecting the DB pods - everything is up and running
    Inspecting the DB logs - no issues
    Inspecting the DB connections - dropped to 0
    Escalating to the ops team
    1 hr of downtime
  7. Example: Typical Cert-Manager issue investigation flow

    Developer:
    Alert: Checkout service is down
    Inspecting the K8s deployment. Result: all pods are failing
    Inspecting the logs. Result: failed to connect to the DB
    Inspecting the DB pods. Result: everything is up and running
    Inspecting the DB logs. Result: no issues
    Inspecting the DB connections. Result: dropped to 0
    Escalating to the ops team
    DevOps:
    Inspecting everything again
    Inspecting network policies. Result: no issues
    Inspecting Certificate. Result: Certificate Expired
    Inspecting Cert manager (failed to renew). Result: Issuer Problem
    Inspecting Cert manager Issuer. Result: DNS Problem
    Fixing the DNS problem
    2 hrs of downtime
  8. Example: Typical Cert-Manager issue investigation flow

    Developer:
    Alert: Checkout service is down
    Inspecting the K8s deployment. Result: all pods are failing
    Inspecting the logs. Result: failed to connect to the DB
    Inspecting the DB pods. Result: everything is up and running
    Inspecting the DB logs. Result: no issues
    Inspecting the DB connections. Result: dropped to 0
    Escalating to the ops team
    DevOps:
    Inspecting everything again
    Inspecting network policies. Result: no issues
    Inspecting Certificate. Result: Certificate Expired
    Inspecting Cert manager (failed to renew). Result: Issuer Problem
    Inspecting Cert manager Issuer. Result: DNS Problem
    Fixing the DNS problem
    2 hrs of downtime
    A seemingly simple issue, yet it requires 2 teams to be involved for ~2 hours and over 10 investigation steps to get to the root cause.
  9. Kubernetes Health Management is Holistic

    Diagram labels: Applications, CRDs, Workloads, K8s Resources, Infrastructure, Configurations, General K8s Knowledge, System History, Live K8s Data, Potential Risks, Real-time Issues, Violations
  10. Continuous Kubernetes Health Management

    DETECT: Realtime issues and ongoing reliability risks
    PRIORITIZE: Understand where to focus your attention based on impact
    INVESTIGATE: AI-driven root cause analysis and guided investigation
    REMEDIATE: Hundreds of automated playbooks for Kubernetes issues
    OPTIMIZE: Proactively prevent future issues and reduce costs
  11. Komodor Simplifies & Automates Kubernetes Management at Scale

    Enterprise Kubernetes Management Platform
    Operations & User Management: Multi-cluster/cloud/hybrid; Native support for CRDs, workloads, infra, & addons; Direct actions & Audit trail; Custom workspaces; K8s access, RBAC, SSO; JiT kubectl access
    Cost Optimization: Right-sizing suggestions; Cost allocation visibility; Autopilot; Smart cost/performance balancing
    Health & Reliability Management: AI-powered RCA; Realtime issues; Reliability risks; 100s of automatic remediation playbooks; Drift detection; Standards violations
    Audience: DevOps Engineers, Management, Data Engineers, SREs, Developers, Platform Engineers
  12. How Does it Work? Technical Deep-Dive

    Diagram labels: Ecosystem Integrations, Komodor Agents, Enrichment Engine, Operations & User Management, Health & Reliability Management, Cost Optimization, Data Engineers, DevOps Engineers, Executives, Developers, Platform Team, Workspaces, Kubernetes Brain, Klaudia AI
  13. Why Komodor? We Are Your Kubernetes Partner!

    From migration to Day 2 Operations, Komodor ensures that every cloud-native initiative is successful & ROI-positive.
    67% Reduction in MTTR
    72% Faster Dev Onboarding to K8s
    41% Increase in Productivity
    57% Reduction in K8s Costs
    48% Faster Migration to K8s
    64% Fewer Tickets to DevOps & SREs
    36% Improved Velocity
  14. Komodor Enables K8s For the Entire Enterprise

    Platform Engineers: 360° visibility into multi-cloud/hybrid; Improved reliability & resiliency; Centralized access control & governance; Fewer escalations & no TicketOps
    Developers: Reduced cognitive load; Increased developer velocity; Less time spent on troubleshooting; Increased ownership & accountability
    Data Engineers: Support KubeFlow, Airflow, Argo Workflows; Increased productivity & efficiency; No longer a bottleneck for MLOps; Focusing on data science & research
    The users were happy immediately. And every time we show a screenshot, people always say, “Wow. What tool is that and how do I get into it?” It has been nothing but a delight for all of our engineers from day one.
    Don’t overthink it! It was so easy to onboard, and definitely worth it in the long run.
    Michael Keith, Senior SRE
  16. How Does it Work? Technical Deep-Dive

    Diagram labels: Komodor Agents, Enrichment Engine, CRDs, AddOns, Operators, Ecosystem Integrations
  17. How Does it Work? Technical Deep-Dive

    Diagram labels: Enrichment Engine, Operations & User Management, Health & Reliability Management, Cost Optimization, Workspa
  18. How Does it Work? Technical Deep-Dive

    Diagram labels: Operations & User Management, Health & Reliability Management, Cost Optimization, Workspace Builder, Platform Team, Komodor Alerts
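
As a rough illustration of the Cert-Manager investigation flow in slides 6 to 8, the sketch below models the inspected components as a small dependency graph and follows unhealthy dependencies from the alerting service down to the root cause. Everything in it is an assumption made for illustration: the `Component` type, the component names, the dependency edges, and `trace_root_cause` are hypothetical and are not part of Komodor or cert-manager. The health findings are copied from the slides rather than read from a cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One node in a hypothetical dependency graph."""
    name: str
    healthy: bool
    finding: str
    depends_on: list = field(default_factory=list)


# Hypothetical data mirroring the findings on the slides;
# nothing here is queried from a real cluster.
COMPONENTS = {
    c.name: c
    for c in [
        Component("checkout", False, "all pods are failing", ["db-connections"]),
        Component("db-connections", False, "dropped to 0",
                  ["db-pods", "network-policies", "certificate"]),
        Component("db-pods", True, "everything is up and running"),
        Component("network-policies", True, "no issues"),
        Component("certificate", False, "certificate expired", ["cert-manager"]),
        Component("cert-manager", False, "failed to renew", ["issuer"]),
        Component("issuer", False, "issuer problem", ["dns"]),
        Component("dns", False, "DNS problem"),
    ]
}


def trace_root_cause(start, components):
    """Return the chain of unhealthy components from `start` downwards.

    At each step the walk moves to the first unhealthy dependency, so the
    last element of the returned list is an unhealthy component that has no
    unhealthy dependencies of its own. Returns [] if `start` is healthy.
    """
    path = []
    visited = set()
    current = start
    while current is not None and current not in visited:
        visited.add(current)  # guards against cycles in the graph
        node = components[current]
        if node.healthy:
            break
        path.append(current)
        current = next(
            (dep for dep in node.depends_on if not components[dep].healthy),
            None,
        )
    return path


if __name__ == "__main__":
    chain = trace_root_cause("checkout", COMPONENTS)
    for name in chain:
        print(f"{name}: {COMPONENTS[name].finding}")
    print(f"Root cause candidate: {chain[-1]}")
```

Starting from `checkout`, the walk passes through the DB connections, the certificate, cert-manager, and the issuer before stopping at DNS, the same chain the two teams uncovered by hand. The walk is deliberately greedy (first unhealthy dependency only); a real graph with several unhealthy branches would need a fuller traversal.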