Kubernetes Health Management with Komodor

Kubernetes Health Management
Kubernetes For Humans
Danielle Inbar, Director of Product Management
March 2025

Agenda
• Introductions
• Why Traditional o11y Isn’t Enough?
• What is K8s Health?
• The Komodor Approach
• Demo Time

Danielle Inbar 👋
• PM Director @ Komodor
• Prev. Snyk, Spot.io, Verint, Motorola, Dell EMC
• Mother of Arbel 👦, Yuval 👧 and Charlie 🐶

Is Kubernetes Observability Broken?

“The more complicated Kubernetes gets, the more complicated it gets to do anything at all, until eventually we can’t move, we can’t add new features or solve problems.”
Tim Hockin, Kubernetes Co-Founder, Google
KubeCon NA 2023 Keynote Speech

Once Upon a Time
• Applications: APM
• Infrastructure: Infra Monitoring

Kubernetes Creates Complexity
Between Applications and Infrastructure now sit:
• Workloads: Pods, Deployments, Jobs, DaemonSets, StatefulSets
• Kubernetes Cluster Configuration: Networking, Storage, Access Control, Configs & Secrets
• Infra: Nodes, Distributions & Versions, Cloud Providers, Core Components
• Addons: Storage, Packages, Cert & Secrets, AI/MLOps & Batch Processing, Streaming, VCS & CI/CD, Network Policies & IaC, Autoscalers, Service Mesh, API Gateway, DNS
• Also labeled on the slide: Network Policies, IaC, Node Scaling, Pod Scaling

Constantly Expanding Scale Hybrid, multicluster, multicloud, edge…

Overwhelming Amounts of Data

What is Kubernetes Health Management?

Kubernetes is a Complete Ecosystem
Pods and deployments are only the tip of the iceberg: running Kubernetes effectively at scale requires dozens of other components.

The Challenges of Kubernetes Management
• Operational Toil: Managing a production-grade Kubernetes stack at scale imposes significant toil, including handling numerous resources.
• Complex Correlation: Teams must manage tasks like autoscaling, network policies, and service meshes (e.g., Istio), consistently applying best practices and optimizing configurations to ensure seamless service availability across the cluster.
• Risk of Downtime: Misconfigurations or delays with critical components like storage solutions, networking plugins, and other core services can lead to cluster-wide downtime or degraded performance.
• Lack of Visibility and Proactiveness: Not all K8s elements have standards for visibility, error tracking, and alerting. As a result, many issues go undetected and teams fall back on reactive firefighting.
• Impact on Productivity: Managing K8s clusters is time-consuming and diverts focus from strategic initiatives to the tactical work of ‘keeping the head above the water’.

Example: Typical Cert-Manager issue investigation flow
Developer
• Alert: The ‘Checkout’ web service is down
• Inspect K8s deployment - all pods are failing
• Inspecting the logs - Failed to connect to the DB
• Inspecting the DB pods - Everything is up and running
• Inspecting the DB logs - no issues
• Inspecting the DB connections - dropped to 0
• Escalating to the ops team
1 hr of downtime


Example: Typical Cert-Manager issue investigation flow
Developer
• Alert: Checkout service is down
• Inspect K8s deployment. Result: all pods are failing
• Inspecting the logs. Result: Failed to connect to the DB
• Inspecting the DB pods. Result: Everything is up and running
• Inspecting the DB logs. Result: no issues
• Inspecting the DB connections. Result: dropped to 0
• Escalating to the ops team
DevOps
• Inspecting everything again
• Inspecting network policies. Result: no issues
• Inspecting Certificate. Result: Certificate Expired
• Inspecting Cert manager (failed to renew). Result: Issuer Problem
• Inspecting Cert manager Issuer. Result: DNS Problem
• Fixing the DNS problem
2 hrs of downtime
A seemingly simple issue, yet it takes 2 teams, ~2 hours, and over 10 investigation steps to get to the root cause.
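
The expired certificate in this flow is the kind of risk that can be flagged before it causes an outage. The following is a minimal Python sketch of that idea, not Komodor's implementation. It assumes certificate objects have already been fetched as plain dictionaries shaped like cert-manager Certificate resources, with an RFC 3339 `status.notAfter` timestamp. The sample names, the 14-day warning window, and the `certificate_risks` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_timestamp(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as '2025-03-09T08:00:00Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def certificate_risks(certs, now, warn_within=timedelta(days=14)):
    """Return (name, finding) pairs for expired or soon-to-expire certificates.

    `certs` is a list of dicts with `metadata.name` and `status.notAfter`.
    The 14-day default window is an arbitrary choice for illustration.
    """
    findings = []
    for cert in certs:
        name = cert["metadata"]["name"]
        not_after = parse_k8s_timestamp(cert["status"]["notAfter"])
        if not_after <= now:
            findings.append((name, "expired"))
        elif not_after - now <= warn_within:
            findings.append((name, "expiring soon"))
    return findings


# Hypothetical sample data; in practice this would come from the cluster API.
SAMPLE_CERTS = [
    {"metadata": {"name": "checkout-db-tls"}, "status": {"notAfter": "2025-03-09T08:00:00Z"}},
    {"metadata": {"name": "api-gateway-tls"}, "status": {"notAfter": "2025-03-20T00:00:00Z"}},
    {"metadata": {"name": "frontend-tls"}, "status": {"notAfter": "2025-06-01T00:00:00Z"}},
]
NOW = datetime(2025, 3, 10, tzinfo=timezone.utc)

if __name__ == "__main__":
    for name, finding in certificate_risks(SAMPLE_CERTS, NOW):
        print(f"{name}: {finding}")
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test against fixed sample data.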

Kubernetes Health Management is Holistic
• Applications, CRDs, Workloads, K8s Resources, Infrastructure, Configurations
• General K8s Knowledge, System History, Live K8s Data
• Potential Risks, Real-time Issues, Violations

The Komodor Approach

Continuous Kubernetes Health Management
• DETECT: Realtime issues and ongoing reliability risks
• PRIORITIZE: Understand where to focus your attention based on impact
• INVESTIGATE: AI-driven root cause analysis and guided investigation
• REMEDIATE: Hundreds of automated playbooks for Kubernetes issues
• OPTIMIZE: Proactively prevent future issues and reduce costs
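
To make "based on impact" concrete, here is a small Python sketch of one possible ranking scheme. It is purely illustrative and does not describe how Komodor scores issues: the `Issue` type, the severity weights, and the formula (severity weight times number of affected workloads) are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical weights; any real system would tune or replace these.
SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}


@dataclass(frozen=True)
class Issue:
    title: str
    severity: str            # one of the SEVERITY_WEIGHT keys
    affected_workloads: int  # how many workloads the issue touches


def impact(issue: Issue) -> int:
    """Illustrative impact score: severity weight times blast radius."""
    return SEVERITY_WEIGHT[issue.severity] * issue.affected_workloads


def prioritize(issues):
    """Return issues ordered from highest to lowest impact score."""
    return sorted(issues, key=impact, reverse=True)


# Hypothetical sample issues.
ISSUES = [
    Issue("Deprecated API in one CronJob", "info", 1),
    Issue("Expired TLS certificate", "critical", 4),
    Issue("Node memory pressure", "warning", 3),
]

if __name__ == "__main__":
    for issue in prioritize(ISSUES):
        print(impact(issue), issue.title)
```

A multiplicative score is only one option; the point is that an explicit, inspectable ranking tells a team where to look first.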

Komodor Simplifies & Automates Kubernetes Management at Scale
Enterprise Kubernetes Management Platform
• Operations & User Management: Multi-cluster/cloud/hybrid; Native support for CRDs, workloads, infra, & addons; Direct actions & Audit trail; Custom workspaces; K8s access, RBAC, SSO; JiT kubectl access
• Cost Optimization: Right-sizing suggestions; Cost allocation visibility; Autopilot; Smart cost/performance balancing
• Health & Reliability Management: AI-powered RCA; Realtime issues; Reliability risks; 100s of automatic remediation playbooks; Drift detection; Standards violations
Users: DevOps Engineers, Management, Data Engineers, SREs, Developers, Platform Engineers

How Does it Work? Technical Deep-Dive
• Ecosystem Integrations
• Komodor Agents
• Enrichment Engine
• Operations & User Management, Health & Reliability Management, Cost Optimization
• Users: Data Engineers, DevOps Engineers, Executives, Developers, Platform Team
• Workspaces
• Kubernetes Brain
• Klaudia AI

Health Score

Demo Time

Thank You

Appendix (if needed)

Why Komodor? We Are Your Kubernetes Partner!
From migration to Day 2 Operations, Komodor ensures that every cloud-native initiative is successful & ROI-positive
• 67% Reduction in MTTR
• 72% Faster Dev Onboarding to K8s
• 41% Increase in Productivity
• 57% Reduction in K8s Costs
• 48% Faster Migration to K8s
• 64% Fewer Tickets to DevOps & SREs
• 36% Improved Velocity

Komodor Enables K8s For the Entire Enterprise
Platform Engineers
• 360° visibility into multi-cloud/hybrid
• Improved reliability & resiliency
• Centralized access control & governance
• Fewer escalations & no TicketOps
Developers
• Reduced cognitive load
• Increased developer velocity
• Less time spent on troubleshooting
• Increased ownership & accountability
Data Engineers
• Support KubeFlow, Airflow, Argo Workflows
• Increased productivity & efficiency
• No longer a bottleneck for MLOps
• Focusing on data science & research
“The users were happy immediately. And every time we show a screenshot, people always say, ‘Wow. What tool is that and how do I get into it?’ It has been nothing but a delight for all of our engineers from day one. Don’t overthink it! It was so easy to onboard, and definitely worth it in the long run.”
Michael Keith, Senior SRE

How Does it Work? Technical Deep-Dive
• Komodor Agents
• Enrichment Engine
• CRDs, AddOns, Operators
• Ecosystem Integrations


How Does it Work? Technical Deep-Dive
• Operations & User Management, Health & Reliability Management, Cost Optimization
• Workspace Builder
• Platform Team
• Komodor Alerts

Thank You
