From Metal To Apps: LinkedIn’s Kubernetes-based Compute Platform

(Presented at KubeCon + CloudNativeCon Europe 2025 https://kccnceu2025.sched.com/event/1txGQ)

What does it take to design a Kubernetes-based fleet management stack that bridges the gap between bare-metal servers in data centers and a platform capable of hosting thousands of microservices, large-scale stateful applications, and a GPU fleet running AI workloads?

At LinkedIn, we use Kubernetes as a foundational primitive in our compute platform. We run thousands of microservices, manage large stateful applications with our custom scheduler, and operate a large fleet of GPUs, all while performing regular maintenance on the bare-metal hosts with no downtime or manual intervention.

In this talk, we'll cover how we architected and built an API-driven, Kubernetes-based compute stack comprising a large-scale microservices platform, a workload-agnostic stateful scheduler, and a multi-tenant ML/batch jobs platform. We'll share insights on scaling Kubernetes for diverse workloads while maintaining tenant isolation, resilience, flexibility, and ease of use for developers.

Speakers:
Ahmet Alp Balkan, Sr. Staff Software Engineer, LinkedIn
Ronak Nathani, Sr. Staff Software Engineer, LinkedIn

Ahmet Alp Balkan

April 03, 2025

Transcript

  2. From metal to apps: LinkedIn's Kubernetes-based compute platform
     Ahmet Alp Balkan (@ahmetb), Ronak Nathani (@ronaknathani)

  3. About us
     Ahmet (Seattle): first KubeCon 2016, Seattle; # of KubeCons: 7; hobbies: gardening, kubectl plugins
     Ronak (Toronto): first KubeCon 2022, Detroit; # of KubeCons: 3; hobbies: racket sports, podcasting

  4. What is LinkedIn's scale?
     1B+ members, 500,000+ servers, 3,000+ services, 1.5M+ containers, 50,000+ deploys/day
     Everything on bare metal, across multiple datacenters

  5. Architecture diagram: the next-gen compute platform (Kubernetes-as-a-service), in three layers
     • Infrastructure as a Service (IaaS): Datacenter Inventory Manager, Compute Broker (machine
       allocator), Host Health Monitoring & Remediation, Maintenance Orchestrator
     • Kubernetes Cluster Management: Cluster/Pool/Node Lifecycle Controllers, Kubernetes clusters
       with pools (pool-A, pool-B, pool-C)
     • Workload Platform Layer: Multi-cluster workload routing, Stateless Workloads Platform,
       Stateful Workloads Platform, Jobs Platform
  6. (Same architecture diagram, repeated.)

  7. Datacenter/machine layer
     • Inventory Manager manages our datacenter inventory and machine properties.
     • Compute Broker (machine allocation API)
       ◦ A declarative gRPC API to manage machine pools and add/remove capacity
         ▪ Pools have heterogeneous (but interchangeable) hardware
         ▪ Each pool specifies a "node profile" (minimum machine type + configuration)
       ◦ Source of truth for machine maintenance operations
     • Host health monitoring & remediation
       ◦ No humans in the loop to detect unhealthy hosts and remediate or replace them.
     • Maintenance orchestrator to ramp node upgrades gradually across the fleet.

  8. Node maintenance zones
     • Datacenters are striped into 20 maintenance zones (MZs) to perform rolling software updates
       on the fleet (OS, kernel settings, kubelet, …): upgrade MZ1, then MZ2, … through MZ20.
       ◦ An MZ is not a physical fault domain like an AZ.
     • Compute pools span multiple MZs (each pool has nodes from every MZ, balanced).
     • Kubernetes clusters are still a fault domain due to cluster-wide configs/policies
       ◦ CustomResourceDefinition, MutatingWebhookConfiguration, ClusterRole, …

  9. Coordinated node maintenance
     Disruptions coordinate transferring control of the machine from Kubernetes to the maintenance
     actor (and back).
     1. Planned: kubelet/OS upgrades, switch upgrades, hardware decommissioning
     2. Unplanned: host health remediation
     (Sequence diagram, roughly: the machine disruptor creates a disruption in the Compute Broker;
     the cluster manager watches for it, cordons and drains the node, then approves the disruption;
     the disruptor polls for approval and performs maintenance; the disruption is removed and the
     node is uncordoned. See the sketch below.)

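     The real interface is the in-house Compute Broker gRPC API; as a rough illustration of the
     handshake, rendered KRM-style for readability (the resource name and every field below are
     assumptions, not LinkedIn's actual schema):

       # Hypothetical "Disruption" object a maintenance actor might create.
       apiVersion: computebroker.example.linkedin.com/v1   # made-up group/version
       kind: Disruption
       metadata:
         name: os-upgrade-node-1234
       spec:
         node: node-1234
         type: Planned              # Planned (OS/kubelet/switch upgrade, decomm) or Unplanned (remediation)
         reason: kubelet-os-upgrade
       status:
         # Set once the cluster manager has cordoned and drained the node;
         # the maintenance actor polls for this before touching the machine.
         approved: true
         state: MaintenanceInProgress
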
  10. Cluster organization and scale
     • No Kubernetes distro
       ◦ OSS Kubernetes configured with an in-house setup (no kubeadm or Cluster API)
       ◦ Works better with our machine provisioning; we also customize the apiserver/etcd setup.
     • Large clusters with ~5k nodes (planning to push further)
       ◦ Helps reduce hardware fragmentation across clusters, allows in-place growth
       ◦ Clusters are multi-tenant with mixed workload types (stateless + stateful + batch + …)
     • Kubelet upgrades happen as part of OS maintenance.
     • Centralized "hub" clusters manage workload routing and the workload clusters
       ◦ Each app gets a separate Namespace, routed to a specific cluster.

  11. KRM-style APIs for pool management
     • Custom resources (CRDs) and controllers to manage pools and clusters, or coordinate node
       maintenance activities.
     • Pools/clusters are declared on the hub cluster (managed via GitOps) and reconciled
       asynchronously by in-house controllers, which call the Compute Broker gRPC API.
       ◦ Adjusting capacity in a pool is as simple as a field update (see the sketch below):

       KubernetesPool CR:
         spec:
           poolTemplate:
             capacity: 1200
             nodeProfile: gpu
             nodeConfig: kubelet-1.25-gpu
             nodeLabels: {...}
             requiredDaemonSets: [{...}]
             …
         status: …

       ComputeBrokerPool CR:
         spec:
           capacity: 1200
           nodeProfile: gpu
         status: …

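     A minimal sketch of such a field update, following the fields shown above (the apiVersion and
     object name are assumptions; the exact schema is LinkedIn-internal):

       # In the GitOps repo, growing the pool is a one-line change to the declared capacity;
       # the in-house controllers reconcile the difference with the Compute Broker.
       apiVersion: compute.example.linkedin.com/v1   # hypothetical group/version
       kind: KubernetesPool
       metadata:
         name: pool-c-gpu                            # hypothetical pool name
       spec:
         poolTemplate:
           capacity: 1500        # was 1200; bumping this requests 300 more machines
           nodeProfile: gpu
           nodeConfig: kubelet-1.25-gpu
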
  12. How we scale Kubernetes
     API server is a shared resource
     • Restrict access via RBAC
     • Use API Priority and Fairness (APF); see the sketch below
     etcd is a shared resource
     • First bottleneck to hit when scaling beyond 5,000 nodes
     • Increased the storage limit from 8G → 16G (planning for 32G on SSDs)
     • In-house etcd backup/restore system as the DR strategy
     Controller scalability
     • Many controllers watching/caching Pods (memory-bound)
     • Controller sharding isn't a solved problem yet

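     The deck doesn't show the APF objects; as a minimal sketch of how one platform controller's
     traffic could be put into its own priority level (the service account, namespace, and all
     numbers below are assumptions, not LinkedIn's configuration):

       # A dedicated priority level so one noisy client can't starve the apiserver.
       apiVersion: flowcontrol.apiserver.k8s.io/v1
       kind: PriorityLevelConfiguration
       metadata:
         name: platform-controllers
       spec:
         type: Limited
         limited:
           nominalConcurrencyShares: 30      # illustrative share of apiserver concurrency
           limitResponse:
             type: Queue
             queuing:
               queues: 64
               handSize: 6
               queueLengthLimit: 50
       ---
       # Routes requests from a (hypothetical) in-house controller into that priority level.
       apiVersion: flowcontrol.apiserver.k8s.io/v1
       kind: FlowSchema
       metadata:
         name: platform-controllers
       spec:
         priorityLevelConfiguration:
           name: platform-controllers
         matchingPrecedence: 1000
         distinguisherMethod:
           type: ByUser
         rules:
           - subjects:
               - kind: ServiceAccount
                 serviceAccount:
                   name: lideployment-controller   # assumed name
                   namespace: platform-system      # assumed namespace
             resourceRules:
               - apiGroups: ["*"]
                 resources: ["*"]
                 verbs: ["*"]
                 clusterScope: true
                 namespaces: ["*"]
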
  13. (Same architecture diagram as slide 5.)

  14. How we use Kubernetes (diagram)
     • Stateless and Stateful: app developers go through the Deployment Orchestration Service,
       which drives LiDeployment and LiStatefulSet resources.
     • Jobs: data scientists and ML engineers use Flyte/Airflow pipeline orchestration with Spark
       and ML infra, scheduled through regional job quotas/queues and the Volcano scheduler.
     Everything runs on the LinkedIn Kubernetes Service.
  15. (Same diagram, repeated.)

  16. Migration principles and progress
     We are migrating all services to Kubernetes.
     Principles
     • No downtime to the live site
     • Centrally driven, fully automated with no app owner involvement (for stateless)
     • Challenge legacy requirements while reducing tech debt
     Progress
     • More than halfway through our stateless migration
     • Some stateful apps running in production on Kubernetes

  17. Internal Service Infrastructure | Cloud Kubernetes | LinkedIn Kubernetes
     PKI               | cert-manager                                   | Current: internal PKI; future: cert-manager (w/ custom CA/approver)
     Service Discovery | Kubernetes Services + coredns (cluster-scoped) | In-house (regional), based on xDS
     Monitoring        | Prometheus                                     | In-house (moving to OTel)
     CNI               | Many options (cluster-scoped)                  | Current: host network; future: IPvLAN (global)
     Network Policy    | CNI-provided (cluster-scoped)                  | In-house (global)
     Config & Secrets  | ConfigMap/Secret (cluster-scoped)              | In-house (regional)

  18. (Same comparison table as slide 17.)
     Kubernetes primarily orchestrates pods for our stateless and stateful workloads. We don't use
     several Kubernetes features that only work within the cluster boundary, and we heavily
     leverage the flexibility Kubernetes offers to extend it for our needs.

  19. Stateful on Kubernetes
     LinkedIn has many in-house data systems (Kafka, Pinot, Samza, Ambry, …)
     • Data is stored on local SSDs (not network-attached/block storage).
     • Evicting a pod is not straightforward and requires coordination.
       ◦ PDBs/StatefulSets don't work here; pods run different shards.
     One generic stateful workload operator manages the stateful pod lifecycle:
     • The operator coordinates with the "shard managers" of each stateful system
     • A custom protocol between the operator ⇔ shard manager

       LiStatefulSet CR:
         spec:
           application: kafka
           acmEndpoint: <endpoint>
           …

     Watch our KubeCon NA 2024 talk and read the LinkedIn Engineering blog to learn more.

  20. Stateless on Kubernetes
     A ~10-line LiDeployment CR expands into a CloneSet and Pods of 500+ lines:

       LiDeployment CR (~10 lines):
         spec:
           application: feed
           version:
             app: 3.1.126
             config: 1.1.10
           replicas: 1000
           resources:
             cpu: 24
             memory: 48G
           canary:
             configuration: …
         status:
           conditions:
             - type: Ready
               …
           stable:
             ready: 990
           …

       CloneSet (<500+ lines):
         spec:
           podTemplateSpec: …
           volumeClaimTemplates: …

       Pod (<500+ lines):
         spec:
           # infra
           initContainers: …
           containers: …

  21. Manifest authoring (diagram)
     The app developer writes a small LiDeployment CR:

       spec:
         application: feed
         resources:
           cpu: 24
           memory: 48G
         …

     A Helm chart is published to the Helm repo as part of CI (see the sketch below).

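     The deck doesn't show what the published chart contains; as a rough illustration (file names,
     values, and the template below are assumptions, not LinkedIn's actual chart), CI could render
     the LiDeployment from per-app values like this:

       # values.yaml -- filled in from the app's repo/CI metadata
       application: feed
       version:
         app: 3.1.126
         config: 1.1.10
       replicas: 1000
       resources:
         cpu: 24
         memory: 48G

       # templates/lideployment.yaml -- renders the CR the platform consumes
       apiVersion: apps.example.linkedin.com/v1   # hypothetical group/version
       kind: LiDeployment
       metadata:
         name: {{ .Values.application }}
       spec:
         application: {{ .Values.application }}
         version:
           app: {{ .Values.version.app | quote }}
           config: {{ .Values.version.config | quote }}
         replicas: {{ .Values.replicas }}
         resources:
           cpu: {{ .Values.resources.cpu }}
           memory: {{ .Values.resources.memory }}
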
  22. Namespace onboarding (diagram)
     The app developer onboards an app (app: <app>, tenant: stateless, nodeProfile: {SKU, config})
     through the Deployment Orchestration Service. A namespace controller on the hub cluster works
     with the Regional Authorization Service and the workload clusters 1…N.
  23. (Same diagram.) AuthZ rules are translated to Kubernetes RBAC.
  24. (Same diagram.) Namespaces are routed to a workload cluster based on capacity and pool
     availability matching the nodeProfile. The created objects carry the tenancy labels (a fuller
     sketch follows this entry):

       Namespace:
         metadata:
           labels:
             tenant: stateless
             app: <app>

       RoleBinding:
         metadata:
           labels:
             tenant: stateless
             app: <app>

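     As a hedged sketch of what the hub namespace controller might create on the chosen workload
     cluster (the app name, label/annotation keys, group, and role names are assumptions for
     illustration):

       apiVersion: v1
       kind: Namespace
       metadata:
         name: feed                       # one namespace per app
         labels:
           tenant: stateless
           app: feed
         annotations:
           compute.example.linkedin.com/node-profile: general-24c   # assumed routing hint
       ---
       # AuthZ rules from the Regional Authorization Service, translated into RBAC.
       apiVersion: rbac.authorization.k8s.io/v1
       kind: RoleBinding
       metadata:
         name: feed-deployers
         namespace: feed
         labels:
           tenant: stateless
           app: feed
       subjects:
         - kind: Group
           apiGroup: rbac.authorization.k8s.io
           name: feed-owners              # assumed group name
       roleRef:
         apiGroup: rbac.authorization.k8s.io
         kind: ClusterRole
         name: app-deployer               # assumed role granting deploy permissions
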
  25. App owner workflow (diagram)
     The app developer requests a deploy (app: <app>, version: 1.1.2) through the deployment
     orchestration service, which applies manifests from the Helm repo to the cluster apiserver;
     LinkedIn controllers and the kubelets then run the workload. Supporting systems in the
     diagram: PKI, service discovery, configs & secrets, authZ, and o11y. Logs & events are shipped
     via Kafka to Azure Data Explorer.

  26. A note on ArgoCD
     End users don't see ArgoCD, but we use it heavily. We configure ArgoCD to manage only one
     cluster.
     • It served us well when the scale was smaller.
       ◦ We're replacing it with our own GitOps engine for app deployments.
       ◦ We'll continue using it to deploy Kubernetes addons, policy objects, etc.
     • Deployments got slower with growing cluster size and replica counts.
       ◦ As the number of objects in the cluster grew, Application syncs slowed down.
       ◦ As the number of replicas in Applications grew, health status syncs slowed down.

  27. Failures and categorization
     We need to distinguish between infra and app failures to reduce support load.
     • ProgressDeadlineSeconds identifies rollout failures.
     • status.conditions reflect the source of failures and their category (see the sketch below).

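     A hedged sketch of how such a status might look on a LiDeployment (the condition types,
     reasons, and the infra/app split below are made up for illustration, not LinkedIn's schema):

       status:
         conditions:
           - type: Ready
             status: "False"
             reason: ProgressDeadlineExceeded      # rollout did not finish in time
             message: "3/1000 replicas failed readiness probes"
           - type: AppFailure                      # hypothetical: points the app owner at their code/config
             status: "True"
             reason: CrashLoopBackOff
           - type: InfraFailure                    # hypothetical: routes the issue to the platform team
             status: "False"
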
  28. UX
     An internal kubectl plugin exposes only internal custom resources and pods.
     • Automatically figures out the cluster/namespace for an app.
     • Custom troubleshooting subcommands.
     An internal UI for browsing/troubleshooting workloads
     • Data from all clusters is watched and aggregated into centralized storage to power this UI
       with near-real-time information.

  29. API guardrails
     Delete protections: if "kubectl delete" of something can cause an outage, prevent it
     (see the sketch below).
     • All user-facing custom resources
     • Namespaces that have resources in them
     • CRDs that have CRs
     Other accident preventions
     • Scaling down by more than X% in one shot is forbidden
     • Upper bound on allowed maxSurge or canary percentages

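     The deck doesn't say how these guardrails are implemented; one way to express a delete
     protection in stock Kubernetes is a ValidatingAdmissionPolicy (the protection label key below
     is an assumption):

       # Blocks DELETE of any Namespace carrying a (hypothetical) protection label.
       apiVersion: admissionregistration.k8s.io/v1
       kind: ValidatingAdmissionPolicy
       metadata:
         name: protect-namespaces
       spec:
         failurePolicy: Fail
         matchConstraints:
           resourceRules:
             - apiGroups: [""]
               apiVersions: ["v1"]
               operations: ["DELETE"]
               resources: ["namespaces"]
         validations:
           - expression: >-
               !has(oldObject.metadata.labels) ||
               !('guardrails.example.linkedin.com/delete-protected' in oldObject.metadata.labels)
             message: "This namespace is delete-protected; remove the protection label first."
       ---
       apiVersion: admissionregistration.k8s.io/v1
       kind: ValidatingAdmissionPolicyBinding
       metadata:
         name: protect-namespaces
       spec:
         policyName: protect-namespaces
         validationActions: ["Deny"]
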
  30. What's next
     Workload federation
     • Aligning clusters with maintenance zones, tolerating single-cluster failure
     • Customers growing in place hit the safe scaling limits of a cluster
     • Helps with machine types fragmented across different clusters
     Better resource isolation
     • CPU pinning to address the noisy neighbor problem (see the sketch below)
     IPv6 pod IPs with a flat network spanning multiple regions
     • Using the ipvlan CNI
     Kubeception
     • Run the Kubernetes control plane itself as pods in another cluster
     • Makes cluster creation and management easier at scale
     • Stack components of different clusters on the same node

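     The deck doesn't detail the mechanism; the stock Kubernetes lever for CPU pinning is the
     kubelet's static CPU manager policy, e.g. in the KubeletConfiguration (values are illustrative,
     not LinkedIn's settings):

       # Static CPU manager policy gives Guaranteed pods with integer CPU requests
       # exclusive cores, mitigating noisy neighbors.
       apiVersion: kubelet.config.k8s.io/v1beta1
       kind: KubeletConfiguration
       cpuManagerPolicy: static
       reservedSystemCPUs: "0-1"        # keep a couple of cores for system daemons (illustrative)
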
  31. Migration lessons
     • Start early and make incremental progress.
       ◦ …there will be a long tail.
     • Figure out which tech debt to solve now vs. later.
     • Be intentional about what features to use in Kubernetes.
     • Don't give raw Kubernetes to your customers.
       ◦ Invest in building abstractions.
     • Invest in guardrails to prevent user errors.
     • Develop good user guides for self-serve troubleshooting.

  32. Migration challenges
     • Generating container images
       ◦ App owners don't write Dockerfiles; they're all auto-generated for them.
     • Thousands of microservices to migrate
       ◦ …without involving application owners.
     • Deployment failure categorization
       ◦ Surfacing Kubernetes-specific failure points to non-Kubernetes-savvy app owners
     • The debugging UX for customers had to change.

  33. Failures and categorization
     Shift left on validating user inputs/manifests. We need to distinguish between infra and app
     failures for app deployments.
     • ProgressDeadlineSeconds identifies rollout failures.
     • status.conditions reflect the source of failures and their category.