objects.
2. Cache the encountered objects in-memory (reduce live API requests)
3. Notify the controller of new/updated objects (+ existing objects periodically)
[Diagram: a controller process contains an informer (client-go pkgs) with a local cache and a handler; the informer watches the k8s API, does a periodic resync (list), and notifies the handler, which gets objects from the local cache.]
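A minimal client-go sketch of this informer flow (the kubeconfig path, the resync period, and the choice of Pods are illustrative; error handling is kept short):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path chosen for illustration).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory: LIST+WATCH once, cache objects locally,
	// and re-deliver cached objects every 10 minutes (periodic resync).
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	// Notify the handler of new/updated objects.
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("added:", pod.Namespace+"/"+pod.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			pod := newObj.(*corev1.Pod)
			fmt.Println("updated:", pod.Namespace+"/"+pod.Name)
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}
```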
• Only the leader does work.
• If the leader fails, another replica takes over (active/passive), e.g. kube-controller-manager, kube-scheduler, cert-manager, …
[Diagram: three controller Pods (replica1, replica2, replica3) compete for a single Lease; leader: replica2.]
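A hedged sketch of Lease-based leader election with client-go's leaderelection package; the lease name/namespace and the timing values are assumptions for illustration:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Identity of this replica, e.g. the pod hostname.
	id, _ := os.Hostname()

	// A Lease object acts as the lock; all replicas compete for it.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader, starting controllers")
				// only the leader does work: start workers here
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership, stopping")
			},
		},
	})
}
```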
25K+ nodes
• The Kubernetes Job controller in 1.26 now supports 100K pods.
• The kcp project aims to push the Kubernetes API server beyond its current limits (more storage, more watches, etc.)
• CRD sprawl, multitenancy, …
objects, how long does it take to reconcile 1000*N objects? What if only the leader is allowed to do work? Where is it throttled? (CPU, etcd, network, …)
[Diagram: three controller Pods (replica1, replica2, replica3), but only one controller actively works off the workqueue.]
local cache.
• How much memory do you think it takes to store 100,000 Pods?
• What about during a periodic resync (full LIST)?
• How much memory are you willing to throw at your controller?
[Diagram: three controller Pods, each with its own controller and local cache, all doing a LIST+WATCH against kube-apiserver.]
existing controller development libraries (e.g. client-go, controller-runtime)
• Membership and failure detection for controller replicas
• Preventing concurrent handling of an object
can create watches with label selectors?
[Diagram: a sharder list+watches ALL Pods on kube-apiserver and labels them; controller replicaA/replicaB/replicaC each list+watch only Pods with label=A/B/C.]
Open questions: how to discover members? how to reassign the work of dead replicas? The sharder is a single point of failure and still a bottleneck.
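A sketch of the per-replica filtered watch, assuming a plain `shard` label key and a hard-coded value for this replica; a real implementation would derive the value from the replica's identity:

```go
package main

import (
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Each replica watches only the objects carrying its own shard label, so the
	// API server filters server-side and the local cache holds roughly 1/N objects.
	// The label key/value ("shard=controller-a") is an assumption for illustration.
	factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "shard=controller-a"
		}),
	)
	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer // add event handlers here, then factory.Start(stopCh)
}
```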
replicas.
[Diagram: consistent hash ring with virtual nodes A, A’, A’’, B, B’, B’’, C, C’, C’’.]
• hash(apiGroup_ns_name)
• find the spot on the ring
• assign the object to that controller replica by labeling the object on the Kubernetes API:
  metadata:
    labels:
      shard: controller-a73e7b
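A minimal consistent-hash ring sketch; the FNV hash, 100 virtual nodes per replica, and the replica names are arbitrary choices, not necessarily what an actual sharder uses:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring with virtual nodes per replica.
type ring struct {
	hashes   []uint32          // sorted virtual node positions
	replicas map[uint32]string // virtual node position -> replica name
}

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(members []string, vnodes int) *ring {
	r := &ring{replicas: map[uint32]string{}}
	for _, m := range members {
		for i := 0; i < vnodes; i++ {
			h := hash(fmt.Sprintf("%s-%d", m, i))
			r.hashes = append(r.hashes, h)
			r.replicas[h] = m
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// assign hashes apiGroup_ns_name and finds the first virtual node clockwise.
func (r *ring) assign(apiGroup, namespace, name string) string {
	h := hash(apiGroup + "_" + namespace + "_" + name)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.replicas[r.hashes[i]]
}

func main() {
	r := newRing([]string{"controller-a73e7b", "controller-b41f09", "controller-c9d2e1"}, 100)
	// The sharder would then set metadata.labels["shard"] to the returned replica name.
	fmt.Println(r.assign("apps", "default", "my-deployment"))
}
```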
are up or down?
[Diagram: each controller replica (A, B, C) renews its own Lease (holder: replicaA/B/C); the sharder watches these Leases and considers a replica unhealthy if its Lease was not renewed in the past 2 x leaseDurationSeconds.]
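A sketch of the health check the sharder could apply to each membership Lease, following the 2 x leaseDurationSeconds rule above:

```go
package main

import (
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isHealthy reports whether a replica's membership Lease was renewed recently
// enough: a replica is considered unhealthy if its Lease was not renewed within
// the past 2 x leaseDurationSeconds.
func isHealthy(lease *coordinationv1.Lease, now time.Time) bool {
	if lease.Spec.RenewTime == nil || lease.Spec.LeaseDurationSeconds == nil {
		return false
	}
	deadline := lease.Spec.RenewTime.Add(
		2 * time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second)
	return now.Before(deadline)
}

func main() {
	dur := int32(15)
	lease := &coordinationv1.Lease{Spec: coordinationv1.LeaseSpec{
		RenewTime:            &metav1.MicroTime{Time: time.Now().Add(-40 * time.Second)},
		LeaseDurationSeconds: &dur,
	}}
	fmt.Println(isHealthy(lease, time.Now())) // false: last renewal was more than 30s ago
}
```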
die, new ones added). Objects must be reassigned to their new destination, and we need to ensure the old replica stops reconciling the object.
• Step 1: the sharder adds the label `drain: true` to the object.
• Step 2: the controller sees the `drain` label and removes the `shard` label (see the sketch below).
• Step 3: the sharder sees the object now has no `shard` label.
• Step 4: the sharder calculates the new replica and sets the `shard` label.
[Diagram: consistent hash ring with virtual nodes A–C’’ and a new member D joining.]
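A sketch of step 2, the controller side of the handover, assuming plain `shard`/`drain` label keys and a controller-runtime client; an actual implementation may use differently named (namespaced) label keys and handle conflicts differently:

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Label keys are assumptions for illustration.
const (
	shardLabel = "shard"
	drainLabel = "drain"
)

// handleDrain checks whether the object is being drained. If so, it removes
// the shard and drain labels so the sharder (watching for objects without a
// shard label) can reassign it, and reports that reconciliation must be skipped.
func handleDrain(ctx context.Context, c client.Client, pod *corev1.Pod) (drained bool, err error) {
	if _, ok := pod.Labels[drainLabel]; !ok {
		return false, nil // not being drained, reconcile as usual
	}
	patch := client.MergeFrom(pod.DeepCopy())
	delete(pod.Labels, shardLabel)
	delete(pod.Labels, drainLabel)
	if err := c.Patch(ctx, pod, patch); err != nil {
		return false, err
	}
	return true, nil // hand the object back to the sharder, do not reconcile it
}
```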
(on the active sharder) My theory: controller-runtime shared informer cache is still carrying the entire object (not “just metadata”). Needs more debugging.
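For comparison, this is how metadata-only watches would be requested from controller-runtime via builder.OnlyMetadata; a sketch of what to check while debugging, not necessarily the setup used here:

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// setupMetadataOnlyWatch requests metadata-only watches, so the shared informer
// cache stores PartialObjectMetadata instead of full Pod objects. Whether the
// cache in the sharder really behaves this way is exactly what needs debugging.
func setupMetadataOnlyWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}, builder.OnlyMetadata).
		Complete(r)
}
```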
controller/informer machinery? We could use various pub/sub models that assign reconciliations to controllers on the fly.
[Diagram: dispatcher1…dispatcherN watch the k8s API behind a LB; they are watch-only (not cached), each handles 1/N of the objects, and they dispatch updates to connected clients via consistent hashing. Controllers establish long-polling connections to watch object changes and do not cache locally.]
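A toy sketch of the dispatch idea, with buffered channels standing in for long-polling client connections and a plain modulo hash standing in for the consistent-hash ring:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// event is a minimal stand-in for a watch event received from the k8s API
// (a real dispatcher would read these from an uncached watch stream).
type event struct {
	Namespace, Name string
}

// dispatcher forwards events to connected controller clients without caching
// anything locally; the client is picked by hashing the object key.
type dispatcher struct {
	clients []chan event // one channel per connected client
}

func (d *dispatcher) dispatch(ev event) {
	h := fnv.New32a()
	h.Write([]byte(ev.Namespace + "/" + ev.Name))
	d.clients[int(h.Sum32())%len(d.clients)] <- ev
}

func main() {
	d := &dispatcher{clients: make([]chan event, 3)}
	for i := range d.clients {
		d.clients[i] = make(chan event, 1)
	}
	d.dispatch(event{Namespace: "default", Name: "my-pod"})
	for i, c := range d.clients {
		select {
		case ev := <-c:
			fmt.Printf("client %d received %s/%s\n", i, ev.Namespace, ev.Name)
		default:
		}
	}
}
```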