Reliability-Driven Kubernetes Fleet Management

Komodor

June 27, 2024

Transcript

  1. Agenda

    Intro • Understanding the Complexity of K8s Fleets • Cluster Groups • Best Practices • Fleet Management with Komodor • Demo Time!
  2. Hi, My Name Is Itiel Shwartz 👋

    • The CTO and Co-Founder of Komodor • A big believer in dev empowerment and moving fast! • Backend Developer turned DevOps • Worked at eBay, Forter and Rookout (first developer) • K8s fanboy 😃
  3. What is a Kubernetes Fleet?

    • The number of K8s clusters per organization is growing every year • In 2019 only 10% of organizations had 50+ clusters; in 2024, running hundreds of K8s clusters is no longer considered an outlier • The inherent complexities of managing a single K8s cluster (lifecycle, monitoring, maintenance, troubleshooting, etc.) are multiplied across every cluster in the fleet • The sheer scale of these multi-cluster deployments poses a new and unique set of challenges, heightened by the popularity of K8s on Edge • The concept of Fleet Management emerged to address these challenges.
  4. What is Kubernetes Fleet Management?

    “A fleet provides a way to logically group and normalize Kubernetes clusters, helping you uplevel management from individual clusters to entire groups of clusters.” - GKE Enterprise
    Efficiency • Security • Standardization
  5. Example: Fleet Management as a Platform

    [Diagram: a platform team serving business units BU 1, BU 2 and BU 3, whose clusters (A, B, C) run across AWS west-1 clusters, AWS east-2 clusters and on-prem clusters.]
  6. What’s So Hard About Fleet Management?

    Cluster Lifecycle: Upgrades & maintenance • Infrastructure resiliency
    Cost & Resource Utilization: Keeping costs low across multi-cluster, multi-cloud, on-prem & hybrid environments • Efficient resource utilization on Edge nodes
    Reliability & Resiliency: Resolving issues across different environments & AZs • Root cause analysis (RCA) is endless • Knowledge gaps between Dev & Ops create bottlenecks
    Access Management: RBAC for cluster access • Just-in-time (JiT) access • Edge locations
    Governance & Standardization: Enforcing standards across the fleet • Policy enforcement • Security compliance
    Cross-Cluster Visibility: Hard to correlate issues across clusters • Deviations in service performance
  7. The Human Aspect of Fleet Management

    • Every persona in the organization has a different mindset and approach • Different teams have different requirements and KPIs • Different permissions and access are required per persona or per use case (JiT) • Knowledge and skills gaps (K8s has a steep learning curve)
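The per-persona, per-use-case access idea above can be sketched as standing roles per persona plus time-bound just-in-time (JiT) grants scoped to a group of clusters. This is a minimal, hypothetical model: the persona names, role names (`view`, `exec`, `admin`), the `Grant` record and the `is_allowed` helper are all illustrative assumptions, not Kubernetes RBAC objects or any product's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical standing roles per persona; real role names depend on your RBAC setup.
PERSONA_ROLES = {
    "developer": {"view"},
    "sre": {"view", "exec"},
    "platform-admin": {"view", "exec", "admin"},
}


@dataclass(frozen=True)
class Grant:
    """A time-bound (JiT) elevation of one user to one role on one cluster group."""
    user: str
    role: str
    cluster_group: str
    expires_at: datetime


def is_allowed(persona, user, role, cluster_group, grants, now=None):
    """Standing persona roles apply fleet-wide in this simplified model;
    anything else needs an unexpired JiT grant scoped to the cluster group."""
    if now is None:
        now = datetime.now(timezone.utc)
    if role in PERSONA_ROLES.get(persona, set()):
        return True
    return any(
        g.user == user
        and g.role == role
        and g.cluster_group == cluster_group
        and g.expires_at > now
        for g in grants
    )


now = datetime(2024, 6, 27, 12, 0, tzinfo=timezone.utc)
grants = [Grant("dana", "exec", "prod-eu", expires_at=now + timedelta(hours=1))]

print(is_allowed("developer", "dana", "exec", "prod-eu", grants, now))  # True: JiT grant
print(is_allowed("developer", "dana", "exec", "prod-us", grants, now))  # False: other cluster group
print(is_allowed("developer", "dana", "exec", "prod-eu", grants, now + timedelta(hours=2)))  # False: expired
```

Expiry is checked at decision time rather than by deleting grants, so an elevation lapses on its own even if nobody remembers to revoke it.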
  8. How to Start Thinking in Cluster Attributes?

    By Region 👉 Region 1, Region 2, ..., Region N
    By Environment 👉 Production, Staging, Development
    By Cloud Provider 👉 AWS, Azure, Google Cloud
    By Namespace 👉 NS: frontend, NS: backend, NS: auth
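Thinking in cluster attributes amounts to tagging each cluster with a few properties and deriving groups from them. The sketch below assumes a hypothetical `Cluster` record with `region`, `environment`, `provider` and `namespaces` fields; the cluster names and values are made up. Namespaces are handled separately because one cluster can host several, so it can belong to several namespace groups at once.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """A hypothetical cluster record; attribute names are illustrative."""
    name: str
    region: str
    environment: str
    provider: str
    namespaces: tuple = ()


def group_by(clusters, attribute):
    """Return {attribute value: [cluster names]} for a single-valued attribute."""
    groups = defaultdict(list)
    for cluster in clusters:
        groups[getattr(cluster, attribute)].append(cluster.name)
    return dict(groups)


def group_by_namespace(clusters):
    """A cluster hosting several namespaces appears in several groups."""
    groups = defaultdict(list)
    for cluster in clusters:
        for ns in cluster.namespaces:
            groups[ns].append(cluster.name)
    return dict(groups)


def select(clusters, **attrs):
    """Names of clusters matching every given attribute, e.g. environment='production'."""
    return [c.name for c in clusters
            if all(getattr(c, key) == value for key, value in attrs.items())]


fleet = [
    Cluster("prod-eu-1", "eu-west", "production", "aws", ("frontend", "auth")),
    Cluster("stg-us-1", "us-east", "staging", "azure", ("backend",)),
    Cluster("prod-us-2", "us-east", "production", "gcp", ("frontend", "backend")),
]

print(group_by(fleet, "environment"))
# {'production': ['prod-eu-1', 'prod-us-2'], 'staging': ['stg-us-1']}
print(select(fleet, environment="production", region="us-east"))
# ['prod-us-2']
```

Combining attributes in `select` is what turns flat tags into cluster groups such as "production clusters in us-east", which is the level at which fleet-wide actions are usually targeted.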
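  9. Fleet Management For K8s on Edge

    [Diagram: a hierarchy from HQ to regions (Europe, North America), then countries or sub-regions (Germany, France, West-US, East-US), then cities (Berlin, Munich, Paris, Los Angeles, San Francisco), then numbered edge locations (e.g., Location 1 through Location 42, Location 1 through Location 92).]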
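An edge fleet organized as HQ, region, country, city and location forms a tree, so status can be rolled up level by level instead of inspecting every location. This is a minimal sketch under stated assumptions: the `Node` type, the `rollup` and `locations` helpers, and all location counts are hypothetical and chosen only for illustration; leaf nodes stand for individual edge locations.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of a hypothetical edge hierarchy (HQ, region, country, city, location).
    Leaf nodes represent individual edge locations."""
    name: str
    children: list = field(default_factory=list)
    healthy: bool = True  # only read on leaf nodes


def rollup(node):
    """Return (total_locations, unhealthy_locations) for the subtree under node."""
    if not node.children:
        return 1, (0 if node.healthy else 1)
    total = unhealthy = 0
    for child in node.children:
        child_total, child_unhealthy = rollup(child)
        total += child_total
        unhealthy += child_unhealthy
    return total, unhealthy


def locations(city, count, unhealthy=()):
    """Build `count` numbered leaf locations; numbers listed in `unhealthy` are marked unhealthy."""
    return [Node(f"{city} Location {i}", healthy=i not in unhealthy) for i in range(1, count + 1)]


# Illustrative data only: the location counts are made up.
hq = Node("HQ", [
    Node("Europe", [
        Node("Germany", [Node("Berlin", locations("Berlin", 3, unhealthy={2})),
                         Node("Munich", locations("Munich", 2))]),
        Node("France", [Node("Paris", locations("Paris", 2))]),
    ]),
    Node("North America", [
        Node("West-US", [Node("Los Angeles", locations("Los Angeles", 4)),
                         Node("San Francisco", locations("San Francisco", 1, unhealthy={1}))]),
        Node("East-US", locations("East-US", 2)),
    ]),
])

print(rollup(hq))              # (14, 2)
print(rollup(hq.children[0]))  # Europe: (7, 1)
```

The same traversal works for any per-location signal (version, cost, open alerts), which is why a hierarchy is a natural grouping for hundreds of edge sites.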
  10. Fleet Management Best Practices

    1. Leverage Infrastructure as Code (IaC)
    2. Implement GitOps
    3. Use best-of-breed monitoring
    4. Consolidate all clusters into a unified single pane of glass
    5. Build or buy a dedicated Fleet Management solution
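With GitOps, Git holds the desired state and a controller (Argo CD and Flux are common examples) compares it with what actually runs in each cluster. The sketch below is a simplified, hypothetical stand-in for that comparison across a fleet: it does not use either tool's API, and the resource names, spec fields and cluster names are made up. Each cluster's live state is modeled as a `{resource_name: spec}` map.

```python
import copy


def diff_cluster(desired, live):
    """Compare desired vs. live {resource_name: spec} maps for one cluster."""
    return {
        "create": sorted(desired.keys() - live.keys()),
        "update": sorted(n for n in desired.keys() & live.keys() if desired[n] != live[n]),
        "delete": sorted(live.keys() - desired.keys()),
    }


def fleet_drift(desired, live_by_cluster):
    """Report only the clusters whose live state has drifted from the desired state."""
    report = {}
    for cluster, live in live_by_cluster.items():
        changes = diff_cluster(desired, live)
        if any(changes.values()):
            report[cluster] = changes
    return report


# Desired state, as it would be read from the Git repository (illustrative data).
desired = {
    "deployment/frontend": {"image": "frontend:1.4", "replicas": 3},
    "deployment/auth": {"image": "auth:2.0", "replicas": 2},
}
live_by_cluster = {
    "prod-eu-1": copy.deepcopy(desired),  # in sync
    "prod-us-2": {
        "deployment/frontend": {"image": "frontend:1.3", "replicas": 3},  # stale image
        "deployment/debug": {"image": "busybox:1.36", "replicas": 1},     # manual change
    },
}

print(fleet_drift(desired, live_by_cluster))
# {'prod-us-2': {'create': ['deployment/auth'], 'update': ['deployment/frontend'], 'delete': ['deployment/debug']}}
```

Because every cluster is diffed against one source of truth, a drift report like this doubles as the input for reconciliation and as a cross-cluster visibility signal.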
  11. Golden Tip: Shift-Left Ops

    • Abstract complexity: expose functionality and bubble up relevant data in the right context (i.e., simplify K8s and reduce the cognitive load on non-experts) • Automate away toil in a way that prevents human error • Template services, deployments, etc. (i.e., enforce governance and standardization) • Empower developers and other stakeholders to own K8s (i.e., manage their workloads on K8s without having to learn K8s)
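Templating services is one way to enforce standards while hiding K8s details: developers supply a few inputs and the platform fills in the rest and rejects policy violations. In practice this role is often played by Helm charts, Kustomize bases or policy engines; the sketch below is only a hypothetical illustration. The `render_deployment` helper, the default resource values and both guardrails (no untagged or `latest` images, at least 2 replicas in production) are assumptions, while the manifest fields follow the standard `apps/v1` Deployment shape.

```python
import copy

# Hypothetical platform defaults; real values depend on your workloads and policies.
DEFAULT_RESOURCES = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "512Mi"},
}


def render_deployment(name, image, team, environment, replicas=2):
    """Render an apps/v1 Deployment from a few developer-facing inputs,
    applying platform guardrails before anything reaches a cluster."""
    tag_part = image.rsplit("/", 1)[-1]  # ignore any registry host:port prefix
    if ":" not in tag_part or tag_part.endswith(":latest"):
        raise ValueError("pin the image to an explicit tag other than 'latest'")
    if environment == "production" and replicas < 2:
        raise ValueError("production services need at least 2 replicas")

    labels = {"app": name, "team": team, "environment": environment}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "resources": copy.deepcopy(DEFAULT_RESOURCES),
                    }],
                },
            },
        },
    }


manifest = render_deployment("checkout", "registry.example.com/checkout:1.4.2",
                             team="payments", environment="production", replicas=3)
print(manifest["spec"]["template"]["spec"]["containers"][0]["resources"]["limits"])
# {'cpu': '500m', 'memory': '512Mi'}
```

The developer never writes selectors, labels or resource blocks, so every service rendered this way carries the same ownership labels and limits across the fleet.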