Scaling Kubernetes controllers (Seattle Kubernetes Meetup)

Presented Tim Ebert's thesis on controller scalability, along with some of my own intro and thoughts, at the Seattle Kubernetes Meetup on Jan 24, 2023.

Ahmet Alp Balkan

January 24, 2023

Transcript

  1. Intro to Controllers in Kubernetes

     Examples: 1. kube-controller-manager 2. kube-scheduler? 3. kubelet?? 4. kube-apiserver??? …
     (Diagram: the controller loop: 1. watch the Kubernetes API, 2. do something in the external world.)
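
     To make the loop concrete, here is a minimal sketch (mine, not from the slides) of the watch-then-act pattern using client-go; watching Pods and merely printing events are placeholder choices.

     package main

     import (
         "context"
         "fmt"

         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
         "k8s.io/client-go/kubernetes"
         "k8s.io/client-go/tools/clientcmd"
     )

     func main() {
         // Connect to the Kubernetes API (assumes a local kubeconfig).
         cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
         if err != nil {
             panic(err)
         }
         client := kubernetes.NewForConfigOrDie(cfg)

         // 1. watch: open a watch on Pods in all namespaces.
         w, err := client.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{})
         if err != nil {
             panic(err)
         }
         // 2. do something: react to each change (here, just print it).
         for ev := range w.ResultChan() {
             fmt.Printf("event=%s object=%T\n", ev.Type, ev.Object)
         }
     }
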
  2. Informer machinery

     1. LIST+WATCH to be aware of tracked Kubernetes objects.
     2. Cache the encountered objects in-memory (reduce live API requests).
     3. Notify the controller of new/updated objects (+existing objects periodically).
     (Diagram: the informer (client-go pkgs) does watch and periodic resync (list) against the k8s API, maintains a local cache, and notifies the controller process, whose handler gets objects from the cache.)
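
     A minimal sketch of this machinery with client-go's shared informers (my example; the event handlers just print Pod names):

     package main

     import (
         "fmt"
         "time"

         corev1 "k8s.io/api/core/v1"
         "k8s.io/client-go/informers"
         "k8s.io/client-go/kubernetes"
         "k8s.io/client-go/tools/cache"
         "k8s.io/client-go/tools/clientcmd"
     )

     func main() {
         cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
         if err != nil {
             panic(err)
         }
         client := kubernetes.NewForConfigOrDie(cfg)

         // Shared informer factory: LIST+WATCH Pods, keep them in a local cache,
         // and do a periodic resync every 10 minutes.
         factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
         podInformer := factory.Core().V1().Pods().Informer()

         // Notify the controller of new/updated objects.
         podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
             AddFunc:    func(obj interface{}) { fmt.Println("added:", obj.(*corev1.Pod).Name) },
             UpdateFunc: func(_, newObj interface{}) { fmt.Println("updated:", newObj.(*corev1.Pod).Name) },
         })

         stop := make(chan struct{})
         defer close(stop)
         factory.Start(stop)

         // After this, reads are served from the local cache, not the live API.
         factory.WaitForCacheSync(stop)
         select {}
     }
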
  3. Fault tolerance

     • Run N replicas.
     • Elect a leader.
     • Only the leader does work.
     • If the leader fails, another replica takes over. (active/passive)
     e.g. kube-controller-manager, kube-scheduler, cert-manager, …
     (Diagram: three Pod replicas, each running the controller, and a Lease with leader: replica2.)

  4. Lease API in Kubernetes

     Typically used to elect leaders in Kubernetes.

     apiVersion: coordination.k8s.io/v1
     kind: Lease
     spec:
       holderIdentity: my-replica-1
       leaseDurationSeconds: 10
       renewTime: "2022-11-30T18:04:27.912073Z"
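
     A sketch of how a controller replica could use such a Lease via client-go's leaderelection package (the lease name, namespace, and identity below are placeholders I made up):

     package main

     import (
         "context"
         "log"
         "os"
         "time"

         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
         "k8s.io/client-go/kubernetes"
         "k8s.io/client-go/tools/clientcmd"
         "k8s.io/client-go/tools/leaderelection"
         "k8s.io/client-go/tools/leaderelection/resourcelock"
     )

     func main() {
         cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
         if err != nil {
             log.Fatal(err)
         }
         client := kubernetes.NewForConfigOrDie(cfg)

         id := os.Getenv("POD_NAME")
         if id == "" {
             id = "my-replica-1"
         }

         // The Lease object backing the election (placeholder name/namespace).
         lock := &resourcelock.LeaseLock{
             LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "default"},
             Client:     client.CoordinationV1(),
             LockConfig: resourcelock.ResourceLockConfig{Identity: id},
         }

         leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
             Lock:          lock,
             LeaseDuration: 10 * time.Second, // matches leaseDurationSeconds above
             RenewDeadline: 7 * time.Second,
             RetryPeriod:   2 * time.Second,
             Callbacks: leaderelection.LeaderCallbacks{
                 OnStartedLeading: func(ctx context.Context) {
                     log.Println("I am the leader; starting to do work")
                     // Only the leader runs the reconcile loops.
                     <-ctx.Done()
                 },
                 OnStoppedLeading: func() { log.Println("lost leadership; stopping work") },
             },
         })
     }
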
  5. Scalability: Skeptical?

     • The supported limit is 5K nodes, but providers are supporting 25K+ nodes.
     • The Kubernetes Job controller in 1.26 now supports 100K pods.
     • The kcp project aims to push the Kubernetes API server beyond its current limits (more storage, more watches, etc.).
     • CRD sprawl, multitenancy, …

  6. Scalability: Throughput

     If it takes t time to reconcile N objects, how long does it take to reconcile 1000*N objects?
     What if only the leader is allowed to do work?
     Where is it throttled? (CPU, etcd, network…)
     (Diagram: three controller Pod replicas and a workqueue.)
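
     As an aside (mine, not from the slides): inside the single leader, throughput is roughly bounded by how many workers drain its workqueue. A minimal client-go workqueue sketch:

     package main

     import (
         "fmt"
         "sync"
         "time"

         "k8s.io/client-go/util/workqueue"
     )

     func main() {
         // Rate-limited workqueue, as used by typical controllers.
         queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

         // Enqueue some object keys (normally done by informer event handlers).
         for i := 0; i < 10; i++ {
             queue.Add(fmt.Sprintf("default/pod-%d", i))
         }

         // Throughput is bounded by the number of workers draining the queue
         // inside the single leader process.
         const workers = 2
         var wg sync.WaitGroup
         for w := 0; w < workers; w++ {
             wg.Add(1)
             go func() {
                 defer wg.Done()
                 for {
                     key, shutdown := queue.Get()
                     if shutdown {
                         return
                     }
                     time.Sleep(100 * time.Millisecond) // pretend to reconcile
                     fmt.Println("reconciled", key)
                     queue.Done(key)
                 }
             }()
         }

         time.Sleep(2 * time.Second)
         queue.ShutDown()
         wg.Wait()
     }
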
  7. Scalability: Memory

     • Active replicas all maintain a LIST+WATCH into a local cache.
     • How much memory do you think it takes to store 100,000 pods?
     • What about during a periodic resync (full LIST)?
     • How much memory are you willing to throw at your controller?
     (Diagram: three Pod replicas, each with a controller and a local cache, all doing LIST+WATCH against kube-apiserver.)
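
     A rough back-of-envelope (my numbers, not from the slides): if a decoded Pod takes on the order of 5 to 10 KB in memory, then 100,000 pods is roughly 0.5 to 1 GB per replica just for the cache, before counting the temporary spike of holding a full LIST response during resync, and every active replica pays that cost independently.
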
  8. What do we need for horizontal controller scalability?

     • Use existing controller development libraries (e.g. client-go, controller-runtime)
     • Membership and failure detection for controller replicas
     • Preventing concurrent handling of an object

  9. High-level Architecture

     What if we exploited the fact that you can create watches with label selectors?
     (Diagram: a sharder does list+watch on ALL Pods and labels them; controller replicaA/replicaB/replicaC each list+watch only the Pods with label=A, label=B, label=C against kube-apiserver.)
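
     A sketch of such a label-filtered LIST+WATCH with client-go (the `shard=replicaA` key/value is a placeholder):

     package main

     import (
         "fmt"
         "time"

         corev1 "k8s.io/api/core/v1"
         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
         "k8s.io/client-go/informers"
         "k8s.io/client-go/kubernetes"
         "k8s.io/client-go/tools/cache"
         "k8s.io/client-go/tools/clientcmd"
     )

     func main() {
         cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
         if err != nil {
             panic(err)
         }
         client := kubernetes.NewForConfigOrDie(cfg)

         // This replica only LIST+WATCHes Pods assigned to its shard,
         // so it only caches and reconciles a subset of the objects.
         factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
             informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
                 opts.LabelSelector = "shard=replicaA"
             }))

         podInformer := factory.Core().V1().Pods().Informer()
         podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
             AddFunc: func(obj interface{}) {
                 fmt.Println("assigned to me:", obj.(*corev1.Pod).Name)
             },
         })

         stop := make(chan struct{})
         defer close(stop)
         factory.Start(stop)
         factory.WaitForCacheSync(stop)
         select {}
     }
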
  10. High-level Architecture

     What if we exploited the fact that you can create watches with label selectors?
     (Same diagram, annotated with the open problems: how to discover members? the sharder is a single point of failure and still a bottleneck; how to reassign the work of dead replicas?)

  11. Object Partitioning

     Consistent hash ring with virtual nodes representing controller replicas.
     • hash(apiGroup_ns_name)
     • find the spot on the ring
     • assign the object to the controller replica by labeling the object on the Kubernetes API:

       metadata:
         labels:
           shard: controller-a73e7b

     (Diagram: hash ring with virtual nodes A, A’, A’’, B, B’, B’’, C, C’, C’’.)
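
     A small Go sketch of such a ring (my own illustration; the hash function and virtual-node count are arbitrary choices):

     package main

     import (
         "fmt"
         "hash/fnv"
         "sort"
     )

     // Ring is a consistent hash ring with virtual nodes per replica.
     type Ring struct {
         points  []uint32          // sorted hashes of virtual nodes
         replica map[uint32]string // virtual node hash -> replica name
     }

     func hashKey(s string) uint32 {
         h := fnv.New32a()
         h.Write([]byte(s))
         return h.Sum32()
     }

     func NewRing(replicas []string, virtualNodes int) *Ring {
         r := &Ring{replica: map[uint32]string{}}
         for _, name := range replicas {
             for i := 0; i < virtualNodes; i++ {
                 h := hashKey(fmt.Sprintf("%s-%d", name, i)) // e.g. A, A', A''
                 r.points = append(r.points, h)
                 r.replica[h] = name
             }
         }
         sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
         return r
     }

     // Assign finds the first virtual node clockwise from the object's hash.
     func (r *Ring) Assign(apiGroup, namespace, name string) string {
         h := hashKey(apiGroup + "_" + namespace + "_" + name)
         i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
         if i == len(r.points) {
             i = 0 // wrap around
         }
         return r.replica[r.points[i]]
     }

     func main() {
         ring := NewRing([]string{"controller-a73e7b", "controller-b91c2d", "controller-c05f8e"}, 3)
         // The sharder would set this value as the object's `shard` label.
         fmt.Println(ring.Assign("", "default", "my-pod"))
     }
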
  12. Scaling the sharder

     Active/passive: the active sharder caches only the metadata portion of objects.
     (Diagram: one sharder replica is active and two are standby; a Lease `sharder-leader` records the current leader; the active sharder list+watches ALL Pods and labels them, e.g. shard=replicaA, and its local cache holds only object `metadata`.)
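
     One way to keep only metadata in that cache is client-go's metadata-only client and informer; a sketch (my choice of mechanism, not necessarily what the thesis uses):

     package main

     import (
         "fmt"
         "time"

         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
         "k8s.io/apimachinery/pkg/runtime/schema"
         "k8s.io/client-go/metadata"
         "k8s.io/client-go/metadata/metadatainformer"
         "k8s.io/client-go/tools/cache"
         "k8s.io/client-go/tools/clientcmd"
     )

     func main() {
         cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
         if err != nil {
             panic(err)
         }
         // Metadata-only client: the API server returns PartialObjectMetadata,
         // so the local cache never holds full Pod specs/statuses.
         mdClient := metadata.NewForConfigOrDie(cfg)
         factory := metadatainformer.NewSharedInformerFactory(mdClient, 10*time.Minute)

         pods := schema.GroupVersionResource{Version: "v1", Resource: "pods"}
         informer := factory.ForResource(pods).Informer()
         informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
             AddFunc: func(obj interface{}) {
                 m := obj.(*metav1.PartialObjectMetadata)
                 fmt.Println("metadata only:", m.Namespace+"/"+m.Name, m.Labels)
             },
         })

         stop := make(chan struct{})
         defer close(stop)
         factory.Start(stop)
         factory.WaitForCacheSync(stop)
         select {}
     }
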
  13. Membership discovery

     How do we learn about which controller replicas are up or down?
     (Diagram: controller replicaA/replicaB/replicaC each renew their own Lease (holder: replicaA/B/C); the sharder watches these Leases and treats a replica as unhealthy if its Lease was not renewed in the past 2 x leaseDurationSeconds.)
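
     A sketch of that freshness check (my illustration of the rule stated on the slide):

     package main

     import (
         "fmt"
         "time"

         coordinationv1 "k8s.io/api/coordination/v1"
     )

     // isHealthy implements the rule from the slide: a replica is unhealthy if its
     // Lease has not been renewed within the past 2 x leaseDurationSeconds.
     func isHealthy(lease *coordinationv1.Lease, now time.Time) bool {
         if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
             return false
         }
         maxAge := 2 * time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second
         return now.Sub(lease.Spec.RenewTime.Time) <= maxAge
     }

     func main() {
         // In the sharder, these Leases would come from a watch on the Lease API.
         var lease coordinationv1.Lease
         dur := int32(10)
         lease.Spec.LeaseDurationSeconds = &dur
         // lease.Spec.RenewTime would normally be set by the controller replica.
         fmt.Println(isHealthy(&lease, time.Now())) // false: never renewed
     }
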
  15. Reassignment/rebalancing

     The sharder keeps the hash ring up to date (replicas die, new ones are added). Objects must be reassigned to their destination, and we need to ensure the old replica stops reconciling the object.
     • Step 1: sharder adds the label `drain: true` on the object
     • Step 2: controller sees the `drain` label, removes the `shard` label
     • Step 3: sharder sees the object now has no `shard` label
     • Step 4: sharder calculates the replica and sets the `shard` label
     (Diagram: hash ring with virtual nodes, now including a new replica D.)
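
     A sketch of the controller side of step 2, assuming controller-runtime and the `drain`/`shard` label keys from the slide (the surrounding reconciler wiring is omitted):

     package shard

     import (
         "context"

         corev1 "k8s.io/api/core/v1"
         "sigs.k8s.io/controller-runtime/pkg/client"
     )

     const (
         shardLabel = "shard"
         drainLabel = "drain"
     )

     // handleDrain implements step 2 from the slide: when the sharder has set the
     // `drain` label, the controller removes the `shard` label and stops reconciling
     // the object so the sharder can reassign it (steps 3 and 4).
     // It returns true if the object is being drained and reconciliation should stop.
     func handleDrain(ctx context.Context, c client.Client, pod *corev1.Pod) (bool, error) {
         if _, draining := pod.Labels[drainLabel]; !draining {
             return false, nil // not being drained; reconcile as usual
         }
         updated := pod.DeepCopy()
         delete(updated.Labels, shardLabel)
         // (I assume the sharder clears the `drain` label when it reassigns the object.)
         if err := c.Patch(ctx, updated, client.MergeFrom(pod)); err != nil {
             return true, err
         }
         return true, nil
     }
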
  16. Results?

     With N=3 replicas, memory usage is only 11% less (on the active sharder).
     My theory: the controller-runtime shared informer cache is still carrying the entire object (not “just metadata”). Needs more debugging.
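
     For context, a sketch (mine, not the thesis code) of how controller-runtime can be asked to watch and cache only metadata via `builder.OnlyMetadata`, with the reconciler fetching `PartialObjectMetadata` instead of the full Pod:

     package main

     import (
         "context"

         corev1 "k8s.io/api/core/v1"
         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
         ctrl "sigs.k8s.io/controller-runtime"
         "sigs.k8s.io/controller-runtime/pkg/builder"
         "sigs.k8s.io/controller-runtime/pkg/client"
     )

     type reconciler struct{ client.Client }

     func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
         // Fetch only metadata from the cache; spec/status never enter memory.
         pod := &metav1.PartialObjectMetadata{}
         pod.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Pod"))
         if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
             return ctrl.Result{}, client.IgnoreNotFound(err)
         }
         // e.g. inspect pod.Labels["shard"] here.
         return ctrl.Result{}, nil
     }

     func main() {
         mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
         if err != nil {
             panic(err)
         }
         // OnlyMetadata tells the cache to store PartialObjectMetadata for Pods.
         if err := ctrl.NewControllerManagedBy(mgr).
             For(&corev1.Pod{}, builder.OnlyMetadata).
             Complete(&reconciler{mgr.GetClient()}); err != nil {
             panic(err)
         }
         if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
             panic(err)
         }
     }
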
  17. More ideas?

     What if we weren’t limited to the existing controller/informer machinery? We could use various pub/sub models that assign reconciliations to controllers on the fly.
     (Diagram: dispatchers 1..N watch the k8s API (watch-only, not cached), each handling 1/N of the objects and dispatching updates to connected clients via consistent hashing; controllers sit behind a LB and establish long polling to watch object changes, without caching locally.)