Networking is important: most of Kubernetes centers on network concepts. Our job is to make sure your applications can communicate:
• With each other
• With the world outside your cluster
• Only where you want
You already know TCP/IP, but containers bring new concepts:
• Namespaces
• Virtual interfaces
• IP forwarding
• Underlays
• Overlays
• iptables
• NAT
It's enough to make your head spin.
Background: the API server. Everything goes through the API:
• No private APIs
• No "system only" calls
REST: defined in terms of "resources" (nouns, aka "objects") and methods (verbs).
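As a hedged illustration of the resource/verb model, here is how the Service object used later in this deck maps onto REST-style verbs through kubectl (the file name is made up for the example):

kubectl get service store-be        # GET the resource
kubectl apply -f store-be.yaml      # create or update it from a file (POST/PUT)
kubectl delete service store-be     # DELETE the resource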
Labels and annotations attach to any API resource.
Labels: identification
• Allow users to define how to group resources
• Examples: app name, tier (frontend/backend), stage (dev/test/prod)
Annotations: data that "rides along" with objects
• Third-party or internal state that isn't part of an object's schema
Example labels: app: store, role: fe, stage: prod
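A hedged sketch of attaching labels and an annotation with kubectl; the pod name and the annotation key are made up for illustration:

kubectl label pod store-be-1 app=store role=be stage=prod
kubectl annotate pod store-be-1 example.com/build-info='not part of the schema, just rides along'
kubectl get pods -l app=store,role=be    # group/select resources by label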
Kubernetes networking is different from the out-of-the-box model Docker offers:
• No machine-private IPs
• No port-mapping
Pod IPs are accessible from other pods, regardless of which VM they are on.
Built on Linux "network namespaces" (aka "netns") and virtual interfaces.
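To make "netns" and virtual interfaces concrete, here is a minimal hand-rolled sketch of the Linux primitives involved (not the exact steps Kubernetes or your network plugin performs; the names and the pod IP are illustrative):

sudo ip netns add pod1                                    # a private network namespace for a "pod"
sudo ip link add veth-host type veth peer name veth-pod   # a virtual interface pair
sudo ip link set veth-pod netns pod1                      # one end lives inside the namespace
sudo ip netns exec pod1 ip addr add 10.11.8.67/24 dev veth-pod
sudo ip netns exec pod1 ip link set veth-pod up
sudo ip link set veth-host up                             # host end, typically plugged into a bridge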
Kubernetes doesn't care HOW you get there, but pod-to-pod reachability is a requirement:
• L2, L3, or overlay
Assign a CIDR (IP block) to each VM.
On GCP: teach the network how to route packets, using every trick it needs.
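A hedged way to see the per-VM CIDR assignment on a running cluster (the field comes from the Node API; the column layout is just for readability):

kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR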
All VMs are created as "routers":
• --can-ip-forward
• Disables anti-spoof protection for this VM
Add one GCP static route for each VM:
• gcloud compute routes create vm2 --destination-range=x.y.z.0/24 --next-hop-instance=vm2
The GCP network does the rest.
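Putting the two pieces together for a single VM, a hedged sketch (the zone and the pod CIDR are illustrative; "vm2" follows the slide's example):

gcloud compute instances create vm2 --can-ip-forward --zone=us-central1-b
gcloud compute routes create vm2 --destination-range=10.11.2.0/24 --next-hop-instance=vm2 --next-hop-instance-zone=us-central1-b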
Don't rely on a pod IP. A real cluster changes over time:
• Scale-up and scale-down events
• Rolling updates
• Pods crash or hang
• VMs reboot
The pod addresses you want to talk to can change without warning.
A Service is a group of backends (usually pods).
Services provide a stable VIP.
The VIP automatically routes to backend pods:
• Implementations can vary
• We will examine the default implementation
The set of pods "behind" a service can change.
Clients only need the VIP, which doesn't change.
The VIP can be defaulted or assigned.
The ‘selector’ field chooses which pods to balance across.

kind: Service
apiVersion: v1
metadata:
  name: store-be
spec:
  selector:
    app: store
    role: be
  ports:
  - name: http
    port: 80
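A hedged sketch of checking what sits behind the Service once it exists (the output shapes are illustrative):

kubectl get service store-be     # shows the stable ClusterIP (the VIP)
kubectl get endpoints store-be   # shows the pod IP:port pairs currently backing the VIP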
[Diagram: pods with various labels (app: store or db, role: be or fe) and pod IPs (10.11.8.67, 10.11.5.3, 10.11.0.9, 10.7.1.18, 10.4.1.11); only the pods labeled app: store, role: be are selected as backends of the service.]
The VIP is implemented by ‘kube-proxy’ - a pod running on each VM:
• Not actually a proxy
• Not in the data path
Kube-proxy is a controller - it watches the API for services and programs rules like:

if dest.ip == svc1.ip && dest.port == svc1.port {
  pick one of the backends at random
  rewrite the destination IP
}
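For flavor, a simplified, hedged sketch of the kind of iptables NAT rules that pseudocode turns into. The chain names, the VIP, and the backend IPs here are illustrative; the real kube-proxy rules use generated chain names (KUBE-SVC-..., KUBE-SEP-...):

iptables -t nat -N STORE-BE-SVC          # per-service chain
iptables -t nat -N STORE-BE-SEP1         # per-endpoint chains
iptables -t nat -N STORE-BE-SEP2
iptables -t nat -A PREROUTING -d 10.15.240.10/32 -p tcp --dport 80 -j STORE-BE-SVC       # match VIP:port
iptables -t nat -A STORE-BE-SVC -m statistic --mode random --probability 0.5 -j STORE-BE-SEP1   # pick a backend at random
iptables -t nat -A STORE-BE-SVC -j STORE-BE-SEP2
iptables -t nat -A STORE-BE-SEP1 -p tcp -j DNAT --to-destination 10.11.8.67:80           # rewrite destination IP
iptables -t nat -A STORE-BE-SEP2 -p tcp -j DNAT --to-destination 10.11.5.3:80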
A bit more on iptables:
• Anything else gets dropped
Pod IPs != VM IPs.
When in doubt, add some more iptables:
• MASQUERADE, aka SNAT
• Applies to any packet with a destination *outside* of 10.0.0.0/8
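A hedged sketch of that rule, assuming 10.0.0.0/8 is the cluster range as on the slide:

iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -j MASQUERADE   # SNAT anything leaving the cluster range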
A backend is chosen randomly from all pods.
Good:
• Well balanced, in practice
Bad:
• Can cause an extra network hop
• Hides the client IP from the user's backend
Users wanted to make the trade-off themselves.
The ‘OnlyLocal’ annotation: always choose a pod on the same node.
• Preserves client IP
• Risks imbalance

kind: Service
apiVersion: v1
metadata:
  name: store-be
  annotations:
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: LoadBalancer
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443
[Diagram: an external LB spreading traffic across VMs, with iptables on each node; with OnlyLocal, per-pod traffic shares can be uneven (e.g. 25% vs 50%).]
In practice Kubernetes spreads pods across nodes:
• If pods >> nodes: OK
• If nodes >> pods: OK
• If pods ~= nodes: risk of imbalance
NodePort: maps a port on every VM to the service port.
Exactly the same data path as the LB case.

kind: Service
apiVersion: v1
metadata:
  name: store-be
spec:
  type: NodePort
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443
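A hedged sketch of using it: look up the allocated node port, then hit any VM on that port (NODE_IP is a placeholder for an address the client can reach):

NODE_PORT=$(kubectl get service store-be -o jsonpath='{.spec.ports[0].nodePort}')
curl -k "https://${NODE_IP}:${NODE_PORT}/"   # NODE_IP: any VM's reachable address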
OnlyLocal with NodePort: always choose a pod on the same node.
• Risks imbalance
• Removes the 2nd hop

kind: Service
apiVersion: v1
metadata:
  name: store-be
  annotations:
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: NodePort
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443
Open source developers and Google engineers continue to improve and simplify the system.
Google NEXT '18 will have more "ins" and "outs" for network traffic.
Watch this space.