
Kubernetes Walk Through from Technical View

A walk-through of Kubernetes architecture and core concepts. This is the presentation I gave at the k8s workshop at the Alibaba main campus.


Lei (Harry) Zhang

June 28, 2017

Transcript

  1. Kubernetes • Created by the Google Borg/Omega team • Hosted

    by the CNCF (Linux Foundation) • Container orchestration, scheduling and management • One of the most popular open source projects in the world
  2. Architecture (diagram: etcd and api-server at the center; the controller-manager

    runs the ControlLoop reconciling the Desired World of API objects (pod, replica, namespace, service, job, deployment, volume, petset, …) with the Real World; the scheduler assigns pods; each Node runs a kubelet SyncLoop and a proxy on the shared network)
  3. Example (diagram: 3.1 the scheduler detects a new container

    through the api-server; 3.2 it binds the container to a node, persisted in etcd)
  4. Example (diagram: 4.1 the kubelet detects the bind

    operation through the api-server; 4.2 it starts the container on this machine)
  5. Takeaways • Independent control loops • loosely coupled •

    high performance • easy to customize and extend • “Watch” object changes • Decide next step based on state change • level driven (state), not edge driven (event)
  6. Co-scheduling • Two containers: • App: generate log files •

    LogCollector: read and redirect logs to storage • Request MEM: • App: 1G • LogCollector: 0.5G • Available MEM: • Node_A: 1.25G • Node_B: 2G • What happens if App is scheduled to Node_A first?
  7. Pod • Deeply coupled containers • Atomic scheduling/placement unit •

    Shared namespace • network, IPC etc • Shared volume • Process group in container cloud
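    A minimal sketch of such a Pod, reusing the App + LogCollector pairing from slide 6 (image names are placeholders, not from the deck): two containers share the network namespace and a volume.

      apiVersion: v1
      kind: Pod
      metadata:
        name: app-with-log-collector
      spec:
        containers:
        - name: app
          image: example.com/my-app:1.0          # placeholder image
          volumeMounts:
          - name: logs
            mountPath: /var/log/app              # the app writes log files here
        - name: log-collector
          image: example.com/log-collector:1.0   # placeholder image
          volumeMounts:
          - name: logs
            mountPath: /logs                     # the collector reads the same files
        volumes:
        - name: logs
          emptyDir: {}                           # shared volume, lives as long as the Pod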
  8. Why co-scheduling? • It’s about using container in right way:

    • Lesson learnt from Borg: “workloads tend to have tight relationship”
  9. Copy Files from One to Another? • Wrong! (diagram: a Master

    Pod in which kube-apiserver, kube-scheduler and controller-manager share /etc/kubernetes/ssl)
  10. Connect to Peer Container thru IP? • Wrong! (diagram: a Master

    Pod in which kube-apiserver, kube-scheduler and controller-manager share one network namespace)
  11. So this is Pod • Design pattern in container world

    • decoupling • reuse & refactoring • Describe more real-world workloads by container • e.g. ML • Parameter server and trainer in same Pod
  12. Resource Model • Compressible resources • Hold no state •

    Can be taken away very quickly • “Merely” cause slowness when revoked • e.g. CPU • Non-compressible resources • Hold state • Are slower to be taken away • Can fail to be revoked • e.g. Memory, disk space • Kubernetes (and Docker) can only handle CPU & memory; they don’t handle things like memory bandwidth, disk time, cache, network bandwidth, ... (yet)
  13. Resource Model • Request: amount of a resource allowed to

    be used, with a strong guarantee of availability • CPU (seconds/second), RAM (bytes) • Scheduler will not over-commit requests • Limit: max amount of a resource that can be used, regardless of guarantees • scheduler ignores limits • Mapping to Docker • --cpu-shares=requests.cpu • --cpu-quota=limits.cpu • --cpu-period=100ms • --memory=limits.memory
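    A worked example of the Docker mapping above (values are illustrative, not from the deck):

      requests.cpu  = 250m   ->  --cpu-shares=256       # 1 core == 1024 shares; 0.25 * 1024
      limits.cpu    = 500m   ->  --cpu-quota=50000      # 50% of --cpu-period=100000 µs (100ms)
      limits.memory = 512Mi  ->  --memory=536870912     # bytes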
  14. QoS Tiers and Eviction • Guaranteed • limits set

    for all resources in all containers • limits == requests (if set) • Won’t be killed until they exceed their limits • or if the system is under memory pressure and there are no lower priority containers that can be killed • Burstable • requests set for one or more resources, one or more containers • limits (if set) != requests • killed under memory pressure once they exceed their requests and no Best-Effort pods remain • Best-Effort • requests and limits are not set for any resource, any container • First to get killed if the system runs out of memory
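    A sketch of the resources stanza for each tier (values are illustrative):

      # Guaranteed: limits == requests for every resource, every container
      resources:
        requests: {cpu: 500m, memory: 256Mi}
        limits:   {cpu: 500m, memory: 256Mi}

      # Burstable: requests set, limits higher (or unset)
      resources:
        requests: {cpu: 250m, memory: 128Mi}
        limits:   {cpu: 500m, memory: 512Mi}

      # Best-Effort: no requests and no limits on any container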
  15. Scheduler • Predicates • NoDiskConflict • NoVolumeZoneConflict • PodFitsResources •

    PodFitsHostPorts • MatchNodeSelector • MaxEBSVolumeCount • MaxGCEPDVolumeCount • CheckNodeMemoryPressure • eviction, QoS tiers • CheckNodeDiskPressure • Priorities • LeastRequestedPriority • BalancedResourceAllocation • SelectorSpreadPriority • CalculateAntiAffinityPriority • ImageLocalityPriority • NodeAffinityPriority • Design tips: • watch and sync podQueue • schedule based on cached info • optimistically bind • predicates are parallelized across nodes • priorities are parallelized across functions in a Map-Reduce way
  16. Deployment • Replicas with control • Bring up a Replica

    Set and Pods. • Check the status of a Deployment. • Update that Deployment (e.g. new image, labels). • Rollback to an earlier Deployment revision. • Pause and resume a Deployment.
  17. Create • ReplicaSet • Next generation of ReplicationController • --record:

    record the command in the annotations of ‘nginx-deployment’
  18. Check • DESIRED: .spec.replicas • CURRENT: .status.replicas • UP-TO-DATE: contains

    the latest pod template • AVAILABLE: pod status is ready (running)
  19. Update • kubectl set image • will change the container image

    • kubectl edit • open an editor and modify your deployment yaml • RollingUpdateStrategy • 1 max unavailable • 1 max surge • can also be a percentage • Does not kill old Pods until a sufficient number of new Pods have come up • Does not create new Pods until a sufficient number of old Pods have been killed
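    The strategy above as it would appear in a Deployment spec (a minimal sketch):

      spec:
        strategy:
          type: RollingUpdate
          rollingUpdate:
            maxUnavailable: 1   # at most 1 Pod below the desired count during the update
            maxSurge: 1         # at most 1 Pod above the desired count (a percentage also works)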
  20. Update Process • The update process is coordinated by the Deployment

    Controller • Create: created Replica Set (nginx-deployment-2035384211) and scaled it up to 3 replicas directly • Update: • created a new Replica Set (nginx-deployment-1564180365) and scaled it up to 1 • scaled down the old Replica Set to 2 • continued scaling up and down the new and the old Replica Sets, with the same rolling update strategy • Finally, 3 available replicas in the new Replica Set, and the old Replica Set is scaled down to 0
  21. Pausing & Resuming (Canary) • Tips • blue-green deployment: duplicated

    infrastructure • canary release: share same infrastructure • rollback resumed deployment is WIP • old way: kubectl rolling-update rc-1 rc-2
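    The pause/resume flow maps to these kubectl commands (the deployment name is assumed from the earlier slides):

      $ kubectl rollout pause deployment/nginx-deployment
      # push the canary change, e.g. kubectl set image ..., and verify it
      $ kubectl rollout resume deployment/nginx-deployment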
  22. DaemonSet • Spread daemon pod to every node • DaemonSet

    Controller • bypass default scheduler • even on unschedulable nodes • e.g. bootstrap
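    A minimal DaemonSet sketch (the image is a placeholder; in the 2017-era API the apiVersion was extensions/v1beta1, apps/v1 in current releases):

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: node-agent
      spec:
        selector:
          matchLabels: {app: node-agent}
        template:
          metadata:
            labels: {app: node-agent}
          spec:
            containers:
            - name: agent
              image: example.com/node-agent:1.0   # placeholder; one copy runs on every node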
  23. Horizontal Pod Autoscaling • Tips • Scale out/in • TriggeredScaleUp

    (GCE, AWS, will add more) • Support for custom metrics
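    For the CPU-based case, an autoscaler can be attached with a single command (thresholds are illustrative):

      $ kubectl autoscale deployment nginx-deployment --min=2 --max=10 --cpu-percent=80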
  24. Custom Metrics • Endpoint (location to collect metrics from) •

    Name of metric • Type (Counter, Gauge, ...) • Data Type (int, float) • Units (kbps, seconds, count) • Polling Frequency • Regexps (regular expressions to specify which metrics to collect and how to parse them) • The metric definition is added to the pod as a ConfigMap volume (diagram: Prometheus scraping Nginx)
  25. ConfigMap • Decouple configuration from image • configuration is a

    runtime attribute • Can be consumed by pods thru: • env • volumes
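    A minimal sketch of both consumption paths (names and values are illustrative):

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: app-config
      data:
        LOG_LEVEL: debug

      # in the Pod spec, as env:
      env:
      - name: LOG_LEVEL
        valueFrom:
          configMapKeyRef: {name: app-config, key: LOG_LEVEL}
      # or as a volume (each key becomes a file):
      volumes:
      - name: config
        configMap: {name: app-config}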
  26. Secret • Tip: credentials for accessing the k8s API are

    automatically added to your pods as a secret
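    The injected service-account credentials sit at a well-known path inside every container (pod name is a placeholder):

      $ kubectl exec <pod> -- ls /var/run/secrets/kubernetes.io/serviceaccount
      ca.crt  namespace  token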
  27. Downward API • Get these inside your pod as ENV

    or volume • The pod’s name • The pod’s namespace • The pod’s IP • A container’s cpu limit • A container’s cpu request • A container’s memory limit • A container’s memory request
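    A sketch of the env form (the container name "app" is an assumption):

      env:
      - name: POD_NAME
        valueFrom:
          fieldRef: {fieldPath: metadata.name}
      - name: POD_IP
        valueFrom:
          fieldRef: {fieldPath: status.podIP}
      - name: CPU_LIMIT
        valueFrom:
          resourceFieldRef: {containerName: app, resource: limits.cpu}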
  28. Service • The unified portal of replica Pods • Portal

    IP:Port • External load balancer • GCE • AWS • HAproxy • Nginx • OpenStack LB
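    A minimal Service sketch, with port numbers matching the iptables dump on the next slide (selector labels are illustrative):

      apiVersion: v1
      kind: Service
      metadata:
        name: my-service
      spec:
        selector:
          app: my-app        # must match the labels on the replica Pods
        ports:
        - port: 8001         # the portal (cluster IP) port
          targetPort: 80     # the container port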
  29. Service Implementation Tip: the ipvs solution works in NAT mode,

    which is equivalent to this iptables approach
    $ iptables-save | grep my-service
    -A KUBE-SERVICES -d 10.0.0.116/32 -p tcp -m comment --comment "default/my-service: cluster IP" -m tcp --dport 8001 -j KUBE-SVC-KEAUNL7HVWWSEZA6
    -A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" -m statistic --mode random -j KUBE-SEP-6XXFWO3KTRMPKCHZ
    -A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" -m statistic --mode random -j KUBE-SEP-57KPRZ3JQVENLNBRZ
    -A KUBE-SEP-6XXFWO3KTRMPKCHZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.2:80
    -A KUBE-SEP-57KPRZ3JQVENLNBRZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.3:80
  30. Publishing Services • Use Service.Type=NodePort • <node_ip>:<node_port> • External IP

    • IPs route to one or more cluster nodes (e.g. floating IP) • Use external LoadBalancer • Require support from IaaS (GCE, AWS, OpenStack) • Deploy a service-loadbalancer (e.g. HAproxy) • Official guide: https://github.com/kubernetes/contrib/tree/master/service-loadbalancer
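    The NodePort variant as a spec fragment (port values are illustrative):

      spec:
        type: NodePort
        ports:
        - port: 8001
          targetPort: 80
          nodePort: 30080    # must fall in the node-port range (default 30000-32767)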
  31. Ingress • The next generation external Service load balancer •

    Deployed as a Pod on a dedicated Node (with external network) • Implementation • Nginx, HAproxy, GCE L7 • External access for service • SSL support for service • … (diagram: http://foo.bar.com/foo resolves to <IP_of_Ingress_node> and is routed to service s1)
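    A sketch of the rule in the diagram, in the 2017-era API (current releases use networking.k8s.io/v1 with a different backend schema; the service port is an assumption):

      apiVersion: extensions/v1beta1
      kind: Ingress
      metadata:
        name: foo-ingress
      spec:
        rules:
        - host: foo.bar.com
          http:
            paths:
            - path: /foo
              backend:
                serviceName: s1     # the service from the diagram
                servicePort: 80     # assumed port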
  32. StatefulSet: “clustered applications” • Ordinal index • startup/teardown ordering •

    Stable hostname • Stable storage • linked to the ordinal & hostname • Databases like MySQL or PostgreSQL • single instance attached to a persistent volume at any time • Clustered software like Zookeeper, Etcd, Elasticsearch, Cassandra • stable membership • Update StatefulSet: • Scale: create/delete pods one by one • Scale in: will not delete old persistent volumes
  33. StatefulSet Example (diagram: pods cassandra-0 and cassandra-1 with volume 0

    and volume 1, addressable as cassandra-0.cassandra.default.svc.cluster.local and cassandra-1.cassandra.default.svc.cluster.local)
    $ kubectl patch petset cassandra -p '{"spec":{"replicas":10}}'
  34. One Pod One IP • Network sharing is important for

    affiliated containers • Not all containers need an independent network • The network implementation for a pod is exactly the same as for a single container (diagram: containers A and B join the Pod infra container’s namespace via --net=container:pause; /proc/{pid}/ns/net -> net:[4026532483])
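    The same trick with plain Docker (image tags are illustrative):

      $ docker run -d --name pause gcr.io/google_containers/pause-amd64:3.0
      $ docker run -d --net=container:pause example.com/app:1.0       # A joins pause's netns
      $ docker run -d --net=container:pause example.com/sidecar:1.0   # B joins the same netns
      # A and B now share one IP and reach each other over localhost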
  35. Kubernetes uses CNI • CNI plugin • e.g. Calico, Flannel

    etc • The kubelet cni flags: • --network-plugin=cni • --network-plugin-dir=/etc/cni/net.d • CNI is very simple 1. Kubelet creates a network namespace for the Pod 2. Kubelet invokes the CNI plugin to configure the NS (interface name, IP, MAC, gateway, bridge name …) 3. The infra container in the Pod joins this network namespace
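    A minimal CNI config as the kubelet would find it in /etc/cni/net.d (file name and subnet are illustrative, using the standard bridge and host-local plugins):

      # /etc/cni/net.d/10-mynet.conf
      {
        "cniVersion": "0.3.1",
        "name": "mynet",
        "type": "bridge",
        "bridge": "cni0",
        "isGateway": true,
        "ipam": {
          "type": "host-local",
          "subnet": "10.244.0.0/16"
        }
      }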
  36. Tips • host < calico(bgp) < calico(ipip) = flannel(vxlan) =

    docker(vxlan) < flannel(udp) < weave(udp) • Test graph comes from: http://cmgs.me/life/docker-network-cloud • Network model: • Calico: pure Layer-3 solution • Flannel: VxLAN or UDP channel • Weave: VxLAN or UDP channel • Docker Overlay Network: VxLAN
  37. Persistent Volumes • -v host_path:container_path • 1. Attach

    networked storage to the host path (mounted at host_path) • 2. Mount the host path as a container volume (bind mount container_path to host_path) • 3. Independent volume control loop
  38. Officially Supported PVs • GCEPersistentDisk • AWSElasticBlockStore • AzureFile •

    FC (Fibre Channel) • NFS • iSCSI • RBD (Ceph Block Device) • CephFS • Cinder (OpenStack block storage) • Glusterfs • VsphereVolume • HostPath (single node testing only) • 20+ in total • Write your own volume plugin: FlexVolume 1. Implement 10 methods 2. Put the binary/shell in the plugin directory • example: LVM as a k8s volume
  39. Production ENV Volume Model • (diagram: Pods reference PersistentVolumeClaims

    via mountPath; claims bind to PersistentVolumes backed by host paths and networked storage) • Key point: separation of responsibilities
  40. PV & PVC • System Admin: • $ kubectl create

    -f nfs-pv.yaml • create a volume with access mode, capacity, recycling mode • Dev: • $ kubectl create -f pv-claim.yaml • request a volume with access mode, resource, selector • $ kubectl create -f pod.yaml
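    Sketches of the two yaml files named above (server address, capacity and access mode are illustrative):

      # nfs-pv.yaml (system admin)
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: nfs-pv
      spec:
        capacity: {storage: 5Gi}
        accessModes: [ReadWriteMany]
        persistentVolumeReclaimPolicy: Recycle
        nfs: {server: 10.0.0.5, path: /exports}

      # pv-claim.yaml (dev)
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: nfs-claim
      spec:
        accessModes: [ReadWriteMany]
        resources:
          requests: {storage: 5Gi}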
  41. More … • GC • Health check • Container lifecycle

    hook • Jobs (batch) • Pod affinity and binding • Dynamic provisioning • Rescheduling • CronJob • Logging and monitoring • Network policy • Federation • Container capabilities • Resource quotas • Security context • Security policies • GPU scheduling
  42. Summary • Q: Where do all these control plane ideas

    come from? • A: Kubernetes = “Borg” + “Container” • Kubernetes is a methodology for using containers, based on 10+ years of experience at Google • “Don’t cross the river by feeling for the stones” (no need to relearn everything by trial and error) • Kubernetes is a container-centric DevOps/workload orchestration system • Not a “CI/CD”- or “micro-service”-focused container cloud
  43. Growing Adopters • Public Cloud • AWS • Microsoft Azure

    (acquired Deis) • Google Cloud • Tencent Cloud • Baidu AI • Alibaba Cloud • Enterprise Users • Data source: Kubernetes Leadership Summit (with CN adopters)