to cloud in 2019
• Standing on the shoulders of open source:
  • Kubernetes
  • Operator Framework
  • CNI, CSI, CRI, Device Plugin …
  • Prometheus
  • containerd (Alibaba Pouch distribution)
  • runC + Kata Containers
  • DevOps framework from ACK (ACK = Alibaba Container Service for Kubernetes)
  • and much more …
“Rich Container”: Monolithic Java System
• PID 1 is systemd
• ALL in ONE container (the “Rich Container”), each piece upgraded independently:
  • app, sshd, log, monitoring, cache, VIP, DNS, proxy, agent, start/stop scripts …
• Traditional operating workflow:
  • Start container -> SSH into container -> Start the app
• Log files & user data are scattered all over the container
• In-house orchestration & scheduling system
live upgrading by:
  1. Copying the WAR/data into a shared volume
  2. Triggering the app container to re-load the WAR/data
• Rely on “In-Place Upgrade” to roll out sidecar containers (see the sketch below)
• Not out-of-the-box in Kubernetes; explained later
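A minimal sketch of the shared-volume pattern above, assuming a plain Kubernetes Pod; all names, images, and paths are hypothetical:

```yaml
# Shared-volume live upgrade: the sidecar drops a new WAR into a volume it
# shares with the app container, and the app re-loads it in place, so the
# pod is never torn down. Names, images, and paths are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: java-app
spec:
  containers:
  - name: app
    image: registry.example.com/java-app:v1
    volumeMounts:
    - name: war-volume
      mountPath: /home/admin/deploy   # app watches this dir and re-loads new WARs
  - name: deployer
    image: registry.example.com/deployer:v1
    volumeMounts:
    - name: war-volume
      mountPath: /shared              # deployer copies the new WAR here
  volumes:
  - name: war-volume
    emptyDir: {}                      # shared between the two containers
```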
• A SidecarSet CRD:
  • Describes all sidecars that need to be operated
• A SidecarOperator:
  1. Injects sidecar containers into selected Pods
  2. Upgrades sidecar containers following the rollout policy when the SidecarSet is updated
  3. Deletes sidecar containers when the SidecarSet is deleted
(Diagram: the controller injects, upgrades, and deletes sidecars across Pods via an admission hook)
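A hedged sketch of what a SidecarSet object might look like, modeled on the shape this API later took in OpenKruise; the name, labels, and image are hypothetical:

```yaml
# Hypothetical SidecarSet: the admission hook injects the listed containers
# into every Pod matching the selector; the controller later upgrades or
# deletes them as the SidecarSet changes.
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: log-agent-sidecarset
spec:
  selector:               # which Pods receive the sidecar
    matchLabels:
      app: java-app
  containers:             # ordinary container specs, injected at Pod creation
  - name: log-agent
    image: registry.example.com/log-agent:v1
```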
+ controllers that operate applications at web scale
• Pluggable, repeatable, and Kubernetes-native (Declarative API + Controller Pattern)
• 100% Open Source (very soon!)
(Diagram: kube-apiserver serving Deployment, StatefulSet, InplaceSet, and BroadcastJob controllers, each managing its own Pods)
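Because each workload type is a CRD plus a controller, the suite plugs into an unmodified kube-apiserver. A hedged sketch of such a registration, using the CRD API of that era (apiextensions.k8s.io/v1beta1); the group and version are assumptions:

```yaml
# Hypothetical CRD registration for one of the custom workloads: once applied,
# kube-apiserver serves BroadcastJob objects declaratively, and the matching
# controller reconciles them just like the built-in workloads.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: broadcastjobs.apps.kruise.io
spec:
  group: apps.kruise.io
  version: v1alpha1
  scope: Namespaced
  names:
    kind: BroadcastJob
    plural: broadcastjobs
    singular: broadcastjob
```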
web-scale cluster
• We prefer In-Place Upgrade, because reshuffling thousands of pods across the cluster brings:
  • Topology changes, image re-warming, unexpected overhead, resource-allocation churn …
• Generally, we ❤ StatefulSet, but:
  • StatefulSet still tears down pods during a rolling upgrade
  • Fewer rollout strategies than Deployment

|                      | Deployment        | StatefulSet        | InplaceSet                   |
|----------------------|-------------------|--------------------|------------------------------|
| Ordering             | No                | Yes                | Yes                          |
| Naming               | Random            | Ordered            | Ordered                      |
| PVC reserve          | No                | Yes                | Yes                          |
| Retry on other nodes | No                | No                 | Yes                          |
| Rollout policy       | Rolling, Recreate | Rolling, On-delete | Rolling, On-delete, In-place |
| Pause/Resume         | Yes               | No                 | Yes                          |
| Partition            | No                | Yes                | Yes                          |
| Max unavailable      | Not yet           | Yes                | Yes                          |
| Pre/Post update hook | No                | No                 | Yes                          |

InplaceSet = an in-place “StatefulSet” with more rollout strategies (sketched below)
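InplaceSet itself was not public at this point, so the manifest below is a hypothetical sketch; the field shapes follow the in-place update strategy that later shipped in OpenKruise's Advanced StatefulSet:

```yaml
# Hypothetical in-place rollout: only the container image changes, so the
# controller patches running pods instead of recreating them; partition and
# maxUnavailable mirror the rollout knobs from the table above.
apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet             # OpenKruise "Advanced StatefulSet"
metadata:
  name: java-app
spec:
  replicas: 100
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
      - name: app
        image: registry.example.com/java-app:v2   # bumping this triggers the rollout
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible   # restart containers in place when only the image changed
      partition: 80                        # pods with ordinal < 80 stay on the old version
      maxUnavailable: 20%
```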
and Job
• Run pods on all machines exactly once
• Use cases: software upgrades, node validators, node labelers, etc. (see the sketch below)
• and tons of other use cases in issue #36601: https://github.com/kubernetes/kubernetes/issues/36601
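A hedged sketch of a BroadcastJob for the node-labeler use case, following the shape of the later open-sourced OpenKruise API; the name and image are hypothetical:

```yaml
# Hypothetical BroadcastJob: the controller runs the pod template exactly once
# on every eligible node, then marks the job complete.
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: node-labeler
spec:
  template:
    spec:
      containers:
      - name: labeler
        image: registry.example.com/node-labeler:v1
      restartPolicy: OnFailure
  completionPolicy:
    type: Always      # job completes once every node has run the pod to success
```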
More than 10k nodes
• More than 300k pods
• Non-goal: total containers & pods per node
• Scalability boundary of upstream K8s (v1.14):
  • No more than 5k nodes
  • No more than 150k total pods
  • No more than 300k total containers
  • No more than 100 pods per node
• Question: how do we discover scalability issues in a 10k-node cluster?
Pods
• cmd/kubemark/hollow-node.go
• Taint and drain nodes for the perf test, then run it
• Typical test cases in a 10k-node cluster:
  • Start-up time while scaling pods
  • Time to create and delete pods
  • Pod listing RT (response time)
  • Failure counts
(Diagram: a perf master driving hollow kubelets against the cluster master)
• How to run?

```shell
curl -X POST -H "Content-Type: application/json" \
  "https://k8s-performance-toolkit.alibaba-inc.com/api/kubemark/test" \
  -d '{"test_focus":"\\[Feature:Performance\\]","test_skip":"handle","node_count":10000,"pods_per_node":30}'
```
not block concurrent read transactions: etcd-io/etcd#9296
• Fully allow concurrent large reads: etcd-io/etcd#9384
• Improve index compaction blocking by using a copy-on-write clone, avoiding holding the lock while traversing the entire index: etcd-io/etcd#9511
• Improve lease expire/revoke operation performance: etcd-io/etcd#9418
• Use a segregated hash map to boost freelist allocate and release performance: etcd-io/bbolt#141
• Add backend batch limit/interval fields: etcd-io/etcd#10283
• Benchmark:
  • 100 clients, 1 million random key-value pairs, 5000 QPS
  • Completion time: ~200 s
  • Latency: 99.9% within 97.6 ms
2 releases lag behind upstream
• No API changes:
  • annotations, aggregator, CRD, etc.
• Respect the K8s philosophy:
  • Declarative API & Controller Pattern
• Leverage K8s interfaces & extensibility:
  • CNI, CSI, admission hook, initializer, extender, etc.
• Honor kubelet & CRI

“fork”:
• Lock down a specific K8s release and never upgrade
• In-house/modified K8s APIs; hidden/wrapped K8s APIs
• Bypass the K8s core workflow
• Bypass K8s interfaces (CSI, CNI, CRI)
• Replace kubelet with some other agent
• …

One more thing: set up a small upstream team across your org; it’s fun, and it’s rewarding.