Networking is important: most of Kubernetes centers on network concepts. Our job is to make sure your applications can communicate:
• With each other
• With the world outside your cluster
• Only where you want
You already know TCP/IP, but containers bring new concepts:
• Namespaces
• Virtual interfaces
• IP forwarding
• Underlays
• Overlays
• iptables
• NAT
It's enough to make your head spin.
Background: the API server. Everything goes through the API:
• No private APIs
• No "system only" calls
REST: defined in terms of "resources" (nouns, aka "objects") and methods (verbs).
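As a hedged illustration of the resource/verb model, here is how the Service object used later in this deck maps onto REST-style verbs through kubectl (the file name is made up for the example):

kubectl get service store-be        # GET the resource
kubectl apply -f store-be.yaml      # create or update it from a file (POST/PUT)
kubectl delete service store-be     # DELETE the resource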
Labels and annotations attach to any API resource.
Labels: identification
• Allow users to define how to group resources
• Examples: app name, tier (frontend/backend), stage (dev/test/prod)
Annotations: data that "rides along" with objects
• Third-party or internal state that isn't part of an object's schema
Example labels: app: store, role: fe, stage: prod
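A hedged sketch of attaching labels and an annotation with kubectl; the pod name and the annotation key are made up for illustration:

kubectl label pod store-be-1 app=store role=be stage=prod
kubectl annotate pod store-be-1 example.com/build-info='not part of the schema, just rides along'
kubectl get pods -l app=store,role=be    # group/select resources by label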
Kubernetes networking is different from the out-of-the-box model Docker offers:
• No machine-private IPs
• No port-mapping
Pod IPs are accessible from other pods, regardless of which VM they are on.
Built on Linux "network namespaces" (aka "netns") and virtual interfaces.
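To make "netns" and virtual interfaces concrete, here is a minimal hand-rolled sketch of the Linux primitives involved (not the exact steps Kubernetes or your network plugin performs; the names and the pod IP are illustrative):

sudo ip netns add pod1                                    # a private network namespace for a "pod"
sudo ip link add veth-host type veth peer name veth-pod   # a virtual interface pair
sudo ip link set veth-pod netns pod1                      # one end lives inside the namespace
sudo ip netns exec pod1 ip addr add 10.11.8.67/24 dev veth-pod
sudo ip netns exec pod1 ip link set veth-pod up
sudo ip link set veth-host up                             # host end, typically plugged into a bridge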
Kubernetes doesn't care HOW you get there, but pod-to-pod reachability is a requirement:
• L2, L3, or overlay
Assign a CIDR (IP block) to each VM.
On GCP: teach the network how to route packets, using every trick it needs.
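A hedged way to see the per-VM CIDR assignment on a running cluster (the field comes from the Node API; the column layout is just for readability):

kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR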
All VMs are created as "routers":
• --can-ip-forward
• Disables anti-spoof protection for this VM
Add one GCP static route for each VM:
• gcloud compute routes create vm2 --destination-range=x.y.z.0/24 --next-hop-instance=vm2
The GCP network does the rest.
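Putting the two pieces together for a single VM, a hedged sketch (the zone and the pod CIDR are illustrative; "vm2" follows the slide's example):

gcloud compute instances create vm2 --can-ip-forward --zone=us-central1-b
gcloud compute routes create vm2 --destination-range=10.11.2.0/24 --next-hop-instance=vm2 --next-hop-instance-zone=us-central1-b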
Don't rely on a pod IP. A real cluster changes over time:
• Scale-up and scale-down events
• Rolling updates
• Pods crash or hang
• VMs reboot
The pod addresses you want to talk to can change without warning.
A Service is a group of backends (usually pods).
Services provide a stable VIP.
The VIP automatically routes to backend pods:
• Implementations can vary
• We will examine the default implementation
The set of pods "behind" a service can change.
Clients only need the VIP, which doesn't change.
The VIP can be defaulted or assigned.
The ‘selector’ field chooses which pods to balance across.

kind: Service
apiVersion: v1
metadata:
  name: store-be
spec:
  selector:
    app: store
    role: be
  ports:
  - name: http
    port: 80
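A hedged sketch of checking what sits behind the Service once it exists (the output shapes are illustrative):

kubectl get service store-be     # shows the stable ClusterIP (the VIP)
kubectl get endpoints store-be   # shows the pod IP:port pairs currently backing the VIP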
[Diagram: pods with various labels (app: store or db, role: be or fe) and pod IPs (10.11.8.67, 10.11.5.3, 10.11.0.9, 10.7.1.18, 10.4.1.11); only the pods labeled app: store, role: be are selected as backends of the service.]
The VIP is implemented by ‘kube-proxy’ - a pod running on each VM:
• Not actually a proxy
• Not in the data path
Kube-proxy is a controller - it watches the API for services and programs rules like:

if dest.ip == svc1.ip && dest.port == svc1.port {
  pick one of the backends at random
  rewrite the destination IP
}
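For flavor, a simplified, hedged sketch of the kind of iptables NAT rules that pseudocode turns into. The chain names, the VIP, and the backend IPs here are illustrative; the real kube-proxy rules use generated chain names (KUBE-SVC-..., KUBE-SEP-...):

iptables -t nat -N STORE-BE-SVC          # per-service chain
iptables -t nat -N STORE-BE-SEP1         # per-endpoint chains
iptables -t nat -N STORE-BE-SEP2
iptables -t nat -A PREROUTING -d 10.15.240.10/32 -p tcp --dport 80 -j STORE-BE-SVC       # match VIP:port
iptables -t nat -A STORE-BE-SVC -m statistic --mode random --probability 0.5 -j STORE-BE-SEP1   # pick a backend at random
iptables -t nat -A STORE-BE-SVC -j STORE-BE-SEP2
iptables -t nat -A STORE-BE-SEP1 -p tcp -j DNAT --to-destination 10.11.8.67:80           # rewrite destination IP
iptables -t nat -A STORE-BE-SEP2 -p tcp -j DNAT --to-destination 10.11.5.3:80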
A bit more on iptables:
• Anything else gets dropped
Pod IPs != VM IPs.
When in doubt, add some more iptables:
• MASQUERADE, aka SNAT
• Applies to any packet with a destination *outside* of 10.0.0.0/8
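A hedged sketch of that rule, assuming 10.0.0.0/8 is the cluster range as on the slide:

iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -j MASQUERADE   # SNAT anything leaving the cluster range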
A backend is chosen randomly from all pods.
Good:
• Well balanced, in practice
Bad:
• Can cause an extra network hop
• Hides the client IP from the user's backend
Users wanted to make the trade-off themselves.
The ‘OnlyLocal’ annotation: always choose a pod on the same node.
• Preserves client IP
• Risks imbalance

kind: Service
apiVersion: v1
metadata:
  name: store-be
  annotations:
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: LoadBalancer
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443
[Diagram: an external LB spreading traffic across VMs, with iptables on each node; with OnlyLocal, per-pod traffic shares can be uneven (e.g. 25% vs 50%).]
In practice Kubernetes spreads pods across nodes:
• If pods >> nodes: OK
• If nodes >> pods: OK
• If pods ~= nodes: risk of imbalance
NodePort: maps a port on every VM to the service port.
Exactly the same data path as the LB case.

kind: Service
apiVersion: v1
metadata:
  name: store-be
spec:
  type: NodePort
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443
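A hedged sketch of using it: look up the allocated node port, then hit any VM on that port (NODE_IP is a placeholder for an address the client can reach):

NODE_PORT=$(kubectl get service store-be -o jsonpath='{.spec.ports[0].nodePort}')
curl -k "https://${NODE_IP}:${NODE_PORT}/"   # NODE_IP: any VM's reachable address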
OnlyLocal with NodePort: always choose a pod on the same node.
• Risks imbalance
• Removes the 2nd hop

kind: Service
apiVersion: v1
metadata:
  name: store-be
  annotations:
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: NodePort
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443
Open source developers and Google engineers continue to improve and simplify the system.
Google NEXT '18 will have more "ins" and "outs" for network traffic.
Watch this space.