
Wait! Can Your Pod Survive a Restart?



Aya (Igarashi) Ozawa

April 04, 2025


Transcript

  1. Who am I?

     Aya Ozawa (GitHub: @Ladicle), Member of Technical Staff at CloudNatix.
     github.com/llmariner/llmariner: an extensible generative AI platform on Kubernetes with OpenAI-compatible APIs.
  2. Why Restartability Matters - MAXIMIZING THE BENEFITS OF KUBERNETES AUTOMATION

     Treat Pods as cattle, not pets: a restartable workload is what lets Kubernetes deliver its automation, such as self-healing, rolling updates, and autoscaling.
  3. Restart Scenarios - DIFFERENT LEVELS, DIFFERENT BEHAVIOR

     Restarts happen at different levels, and each behaves differently:
     • Container self-healing (kubelet): a container exit or a startup/liveness probe failure restarts the container inside the same Pod.
     • ReplicaSet self-healing (controller manager): a delete/evict call, preemption, or soft/hard eviction, issued by API clients, the scheduler, or the kubelet, removes the Pod and a new Pod is created to replace it.
     • Deployment rolling upgrade (controller manager): a rollout creates a new Pod (v2) first and then deletes the old Pod (v1).
  4. Simple Workload Restart Flow - DEPLOYMENT POD DELETE→RECREATE UNDER DEFAULT SETTINGS

     Deleting a Pod (e.g. with kubectl) sends the container its STOPSIGNAL (SIGTERM by default). If the process is still running when terminationGracePeriodSeconds (default: 30s) expires, it is killed with SIGKILL. A replacement Pod is then created, and its container is started and marked Ready.
     In this demo the app does no signal handling and runs as PID 1:

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: signal-demo
     spec:
       selector:
         matchLabels:
           app: signal-demo
       template:
         metadata:
           labels:
             app: signal-demo
         spec:
           terminationGracePeriodSeconds: 30
           containers:
           - name: app
             image: ladicle/pod-restart-demo
             args:
             - "signal"
             - "--graceful-shutdown=false"
  5. Minimizing Shutdown Impact - HANDLE SIGTERM FOR A FAST, GRACEFUL EXIT

     No signal handling 🙅: the container keeps running until the grace period expires and wastes resources, and a long grace period makes it worse.
     Immediate termination 🙅: risky if your app still needs to run shutdown tasks.
     Instead, handle the STOPSIGNAL (SIGTERM by default), run your shutdown tasks, and exit before SIGKILL is sent at the end of terminationGracePeriodSeconds (default: 30s).

     package main

     import (
         "os"
         "os/signal"
         "syscall"
     )

     func main() {
         // ...
         sigCh := make(chan os.Signal, 1)
         signal.Notify(sigCh, syscall.SIGTERM)
         <-sigCh
         // Attempt graceful shutdown tasks, then exit.
     }
  6. Minimizing Shutdown Impact - ALTERNATIVE SIGNAL HANDLING OPTIONS FOR THIRD-PARTY APPS

     • Dockerfile STOPSIGNAL 👉: set a custom signal for the container to terminate (default: SIGTERM). The kubelet sends that signal, e.g. SIGQUIT, and the app performs its graceful shutdown within the grace period.
     • Container lifecycle hook PreStop: called immediately before a container is terminated (next slide).

     FROM gcr.io/distroless/static-debian12:nonroot
     STOPSIGNAL SIGQUIT
     COPY --from=builder /out/demo /bin/
     ENTRYPOINT ["/bin/demo"]
  7. Minimizing Shutdown Impact - ALTERNATIVE SIGNAL HANDLING OPTIONS FOR THIRD-PARTY APPS

     PreStop hook 👉: trigger the app's graceful shutdown from the hook, for example by sending a custom signal or calling a shutdown endpoint. The hook runs first; the STOPSIGNAL is sent after it completes.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: prestop-demo
     spec:
       # ...
       template:
         spec:
           containers:
           - # ...
             lifecycle:
               preStop:
                 exec:
                   command: ["/bin/bash", "-c", "kill -SIGUSR1 1"]

     # HTTP variant: call a shutdown endpoint instead of sending a signal.
     lifecycle:
       preStop:
         httpGet:
           path: /quit
           port: 8080
           scheme: HTTP

     # Planned feature 👉
     lifecycle:
       preStop:
         stopSignal: SIGUSR1
  8. Graceful Shutdown with Sidecars - CONTROLLING CONTAINER SHUTDOWN ORDER IN KUBERNETES

     What are sidecars?
     • Enabled by default in Kubernetes v1.29+.
     • Sidecars are secondary containers that extend the primary apps; they are declared as init containers with restartPolicy: Always.
     After the Pod is deleted, the primary containers shut down first (PreStop hooks and STOPSIGNAL as usual); the sidecar containers are then shut down in reverse order of their definition ⚠, and SIGKILL is sent if the grace period expires.

     # ...
     initContainers:
     - name: sidecar-1
       restartPolicy: Always
       image: pod-restart-demo
       args:
       - "signal"
       - "sidecar-1"
  9. HTTP Server Restart Flow - RESTARTING THE DEPLOYMENT POD WITH A SERVICE

     While the old Pod is receiving traffic, deleting it removes the Pod from the Service routing table and sends the STOPSIGNAL, and the app shuts down gracefully. The new Pod is created, its container starts and is marked Ready, and traffic is routed to the new Pod.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: server-demo
     spec:
       # ...
       template:
         spec:
           containers:
           - name: app-0
             image: ladicle/pod-restart-demo
             args:
             - "server"
             - "--graceful-shutdown"
             ports:
             - name: http
               containerPort: 8080
               protocol: TCP
     ---
     apiVersion: v1
     kind: Service
     metadata:
       name: server-demo
     spec:
       selector:
         app: server-demo
       ports:
       - protocol: TCP
         port: 8080
         targetPort: http
  10. Minimizing Traffic Downtime - DELAY SHUTDOWN UNTIL TRAFFIC STOPS

     Removing the Pod from the routing table and delivering the STOPSIGNAL happen in parallel, so shutting down immediately risks dropping requests that are still being routed to the Pod 🙅. Delay the shutdown, using preStop sleep or application logic, until traffic has actually stopped, then shut down gracefully.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: server-demo
     spec:
       # ...
       template:
         spec:
           containers:
           - # ...
             lifecycle:
               preStop:
                 sleep:
                   seconds: 3

     package main
     // ...
     func run(ctx context.Context, srv *http.Server) error {
         // ...
         <-ctx.Done() // signal context cancelled (SIGTERM received)
         log.Info("Shutting down server...")
         // Stop listening for new requests and wait for in-flight requests
         // to finish. Use a fresh context here, since ctx is already cancelled.
         if err := srv.Shutdown(context.Background()); err != nil {
             return err
         }
         log.Info("Shutdown has finished")
         return nil
     }
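     To make the application-logic variant concrete, here is a minimal, self-contained sketch assuming a plain net/http server on :8080; the 3-second delay mirrors the preStop sleep above, and the names are illustrative rather than the demo image's actual code:

     package main

     import (
         "context"
         "errors"
         "log"
         "net/http"
         "os/signal"
         "syscall"
         "time"
     )

     func main() {
         // Cancelled when the Pod receives SIGTERM (the default STOPSIGNAL).
         ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
         defer stop()

         srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}
         go func() {
             if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
                 log.Fatalf("listen: %v", err)
             }
         }()

         <-ctx.Done() // SIGTERM received: the Pod is being removed from routing.

         // Keep serving briefly so requests still routed to the old endpoint
         // are not dropped (application-level equivalent of a preStop sleep).
         time.Sleep(3 * time.Second)

         // Stop accepting new connections and drain in-flight requests,
         // bounded so we exit well before terminationGracePeriodSeconds.
         shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
         defer cancel()
         if err := srv.Shutdown(shutdownCtx); err != nil {
             log.Printf("graceful shutdown failed: %v", err)
         }
         log.Println("shutdown complete")
     }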
  11. Minimizing Traffic Downtime - AVOID SERVING TRAFFIC BEFORE YOUR APP IS READY

     Without probes, the container is marked Ready as soon as it is created, and traffic is routed to the new Pod before the app can actually serve it, risking dropped requests 🙅. With probes, the Pod is marked Ready only after the readiness probe succeeds, so traffic is routed only once the app is actually ready.

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: server-demo
     spec:
       # ...
       template:
         spec:
           containers:
           - # ...
             startupProbe:
               httpGet:
                 path: /startz
                 port: http
               periodSeconds: 1
               failureThreshold: 5
             readinessProbe:
               httpGet:
                 path: /readyz
                 port: http
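     As a rough illustration of what the server side of those probes might look like (the /startz and /readyz paths come from the manifest above, but the handler logic, the atomic flag, and the simulated startup delay are assumptions, not the demo's actual code):

     package main

     import (
         "net/http"
         "sync/atomic"
         "time"
     )

     func main() {
         var ready atomic.Bool

         // /startz and /readyz back the startupProbe and readinessProbe above.
         http.HandleFunc("/startz", func(w http.ResponseWriter, r *http.Request) {
             if !ready.Load() {
                 http.Error(w, "starting", http.StatusServiceUnavailable)
                 return
             }
             w.WriteHeader(http.StatusOK)
         })
         http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
             if !ready.Load() {
                 http.Error(w, "not ready", http.StatusServiceUnavailable)
                 return
             }
             w.WriteHeader(http.StatusOK)
         })

         // Simulate slow initialization (e.g. warming caches, opening connections);
         // the probes start succeeding only after it completes.
         go func() {
             time.Sleep(2 * time.Second)
             ready.Store(true)
         }()

         _ = http.ListenAndServe(":8080", nil)
     }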
  12. Minimizing Traffic Downtime - HOW STARTUP, READINESS, AND LIVENESS PROBES DIFFER IN BEHAVIOR AND IMPACT

     • Startup probe: runs while the container is starting; the container is restarted on failure.
     • Readiness probe: the Pod is removed from the Service endpoints on failure; traffic is routed only after the probe returns true and the Pod is marked Ready.
     • Liveness probe: the container is restarted on failure.
  13. Demo: HTTP Server Restart - SCENARIOS WITH AND WITHOUT DOWNTIME

     No preStop sleep & no probes 🙅 (the server takes 2 seconds to start):

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: server-demo
     spec:
       # ...
       template:
         # ...
         spec:
           initContainers:
           - name: init
             image: ghcr.io/ladicle/pod-restart-demo
             command: ["echo"]
             args: ["initializing..."]
           containers:
           - name: app
             image: ghcr.io/ladicle/pod-restart-demo
             args:
             - server
             - --graceful-shutdown
             - --startup-delay=2s
             ports:
             - name: http
               containerPort: 8080
               protocol: TCP

     Corrected Deployment manifest 👍 - adding a preStop sleep & probes prevents downtime:

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: server-demo
     spec:
       # ...
       template:
         spec:
           containers:
           - name: app
             # ...
             lifecycle:
               preStop:
                 sleep:
                   seconds: 4
             startupProbe:
               httpGet:
                 path: /startz
                 port: http
               periodSeconds: 1
               failureThreshold: 5
             readinessProbe:
               httpGet:
                 path: /readyz
                 port: http
  14. Understanding Routing Updates - FALLBACK TO TERMINATING ENDPOINTS (IF NO READY ENDPOINTS EXIST)

     Case 1: a terminating endpoint is removed immediately if other endpoints are Ready.
     • Pod Delete→Create (Ready > 1): the Pod is removed from routing and shuts down after the STOPSIGNAL, while traffic is served by the other replicas; once the new Pod's container is created and marked Ready, it is added back to routing.
     • Pod Create→Delete (rollout, maxSurge > 0): the new Pod is created and marked Ready first, routing is switched to the new Pod, and only then does the old Pod receive the STOPSIGNAL and shut down.
  15. Understanding Routing Updates - FALLBACK TO TERMINATING ENDPOINTS (IF NO READY ENDPOINTS EXIST)

     Case 2: terminating endpoints still serve traffic during shutdown when no Ready endpoints exist.
     • Pod Delete→Create (Ready = 0): after the STOPSIGNAL the terminating Pod keeps receiving traffic while it shuts down; once the new Pod's container is created and marked Ready, routing switches to the new Pod.
     • Liveness probe failure → restart (Ready = 0): the Pod is removed from routing, the container is deleted and recreated, and the Pod is added back to routing only after being marked Ready again, so there is a temporary traffic loss during the restart.
     Having multiple replicas increases availability.
  16. Controller Restart Flow - WITH LEADER ELECTION

     When the leader Pod is deleted, it receives the STOPSIGNAL and shuts down while still holding leadership; it simply stops renewing the lease. The new Pod is created and marked Ready, but it acquires the leadership role only after the old lease expires. controller-runtime shuts down in this order:
     1. Stop non-leader-election runnables
     2. Stop leader-election runnables
     3. Stop cache
     4. Stop webhooks
     5. Stop HTTP servers
     6. Shut down the manager

     // import ctrl "sigs.k8s.io/controller-runtime"
     func Start(ctx context.Context, config *rest.Config) error {
         mgr, err := ctrl.NewManager(config, ctrl.Options{
             LeaderElection:   true,
             LeaderElectionID: "controller-demo", // required when LeaderElection is enabled
         })
         // ...
         ctx, cancel := signal.NotifyContext(ctx, syscall.SIGTERM)
         defer cancel()

         log.Info("Starting manager...")
         // Start graceful shutdown when the context is cancelled.
         return mgr.Start(ctx)
     }
  17. Minimizing Leader Election Disruptions - RELEASE LEADERSHIP ON CANCELLATION

     When the old leader only stops renewing the lease, the maximum handover delay is LeaseDuration (15s) + RetryPeriod (2s) = 17s.
     • LeaseDuration: the maximum time the leader can hold leadership without renewal. If this time expires, leadership is lost and a new election begins. (Default: 15s)
     • RetryPeriod: the interval between each candidate's attempts to acquire or renew leadership. (Default: 2s)
     Leader election is backed by a Lease object: each candidate tries to update the lock, and whichever Pod succeeds becomes the leader while the others fail and retry.

     apiVersion: coordination.k8s.io/v1
     kind: Lease
     metadata:
       name: controller-demo
       namespace: default
       # ...
     spec:
       holderIdentity: controller-demo-e7c7d31534-b698j_42
       leaseDurationSeconds: 15
       leaseTransitions: 1
       acquireTime: "2025-04-04T14:30:17.572844Z"
       renewTime: "2025-04-04T15:00:29.117662Z"
  18. Minimizing Leader Election Disruptions - RELEASE LEADERSHIP ON CANCELLATION

     With LeaderElectionReleaseOnCancel enabled, the leader releases the lease on shutdown by updating its LeaseDuration to 1s at the end, so the maximum handover delay drops from LeaseDuration (15s) + RetryPeriod (2s) = 17s to LeaseDuration (1s) + RetryPeriod (2s) = 3s.

     mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
         LeaderElection:                true,
         LeaderElectionReleaseOnCancel: true,
         // ...
     })
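     A minimal sketch of wiring this into a manager; the election ID reuses the Lease name from the earlier slide, while the explicitly set durations (here the defaults) and the error handling are illustrative assumptions:

     package main

     import (
         "time"

         ctrl "sigs.k8s.io/controller-runtime"
     )

     func main() {
         leaseDuration := 15 * time.Second // default
         renewDeadline := 10 * time.Second // default
         retryPeriod := 2 * time.Second    // default

         mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
             LeaderElection:   true,
             LeaderElectionID: "controller-demo",
             // Release the lease on shutdown so the next leader can take over
             // in roughly LeaseDuration(1s) + RetryPeriod instead of ~17s.
             // Only enable this when cancellation means the process really
             // exits; otherwise an old leader could overlap with the new one.
             LeaderElectionReleaseOnCancel: true,
             LeaseDuration:                 &leaseDuration,
             RenewDeadline:                 &renewDeadline,
             RetryPeriod:                   &retryPeriod,
         })
         if err != nil {
             panic(err)
         }

         // SetupSignalHandler cancels the context on SIGTERM/SIGINT, which
         // triggers the manager's graceful shutdown (and the lease release).
         if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
             panic(err)
         }
     }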
  19. Minimizing Pod Eviction Impact - USING A WORKLOAD WITH A PDB TO MITIGATE THE IMPACT OF POD EVICTION

     A Pod Disruption Budget (PDB) is a resource that restricts how many Pods can be unavailable during voluntary disruptions. With two replicas and maxUnavailable: 1, the first eviction request succeeds; while that Pod is not yet running again, a second eviction request is rejected, because the not-running Pod count (1) plus the Pod to be evicted (1) would exceed maxUnavailable (1).

     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: pdb-demo
     spec:
       replicas: 2
       selector:
         matchLabels:
           app: pdb-demo
       template:
         # ...
     ---
     apiVersion: policy/v1
     kind: PodDisruptionBudget
     metadata:
       name: pdb-demo
     spec:
       maxUnavailable: 1
       selector:
         matchLabels:
           app: pdb-demo
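     A rough sketch of what the rejected eviction looks like from an API client's point of view (the namespace, Pod name, and kubeconfig handling are placeholders): while the PDB reports no disruptions allowed, the Eviction subresource call fails with 429 TooManyRequests.

     package main

     import (
         "context"
         "fmt"

         policyv1 "k8s.io/api/policy/v1"
         apierrors "k8s.io/apimachinery/pkg/api/errors"
         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
         "k8s.io/client-go/kubernetes"
         "k8s.io/client-go/tools/clientcmd"
     )

     func main() {
         // Build a client from the local kubeconfig.
         cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
         if err != nil {
             panic(err)
         }
         client := kubernetes.NewForConfigOrDie(cfg)

         // Ask the API server to evict one pdb-demo Pod via the Eviction subresource.
         eviction := &policyv1.Eviction{
             ObjectMeta: metav1.ObjectMeta{Name: "pdb-demo-xxxxx", Namespace: "default"},
         }
         err = client.PolicyV1().Evictions("default").Evict(context.TODO(), eviction)
         switch {
         case err == nil:
             fmt.Println("eviction allowed by the PDB")
         case apierrors.IsTooManyRequests(err):
             // The PDB currently allows no further disruptions, so the request is rejected.
             fmt.Println("eviction rejected by the PDB:", err)
         default:
             panic(err)
         }
     }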
  20. Limitations of Pod Disruption Budget (PDB) - PDB DOES NOT APPLY TO EVERY POD REMOVAL SCENARIO

     • Eviction before recreation: a PDB limits disruptions, but it does not create a new Pod before evicting the existing one, unlike maxSurge in rolling updates.
     • PDB may permanently block eviction: if the PDB permanently enforces DisruptionAllowed=False, the Pod can never be evicted, which may lead to issues such as stalled node termination. Example: a workload with replicas=1 and maxUnavailable=0% or minAvailable=100%.
     • PDB applies only to the Eviction API and to preemption (best effort): other removals, such as a direct Delete, are not taken into account.
  21. Limitations of Pod Disruption Budget (PDB) - PDB DOES NOT APPLY TO EVERY POD REMOVAL SCENARIO

     For the permanently blocked case, unhealthyPodEvictionPolicy: AlwaysAllow ensures evictions are not blocked forever when an app has a bug and never reaches the Running state.

     apiVersion: policy/v1
     kind: PodDisruptionBudget
     metadata:
       name: pdb-demo
     spec:
       maxUnavailable: 1
       unhealthyPodEvictionPolicy: AlwaysAllow
       selector:
         matchLabels:
           app: pdb-demo
  22. Cases Where the Pod Termination Grace Period Is Ignored - OVERRIDING OR BYPASSING TERMINATIONGRACEPERIODSECONDS

     • Override in the deletion & eviction APIs: both APIs can optionally override the grace period, allowing a Pod to be deleted immediately.
     • Node pressure (soft and hard eviction): soft eviction (disabled by default) overrides terminationGracePeriodSeconds with the kubelet's eviction-max-pod-grace-period when a soft eviction threshold is exceeded (e.g. memory.available < 100Mi), while hard eviction deletes the Pod immediately.
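     As a small illustration of the first case (Pod name, namespace, and kubeconfig handling are placeholders), an API client can override the grace period in the delete request itself, which takes precedence over the Pod's terminationGracePeriodSeconds:

     package main

     import (
         "context"

         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
         "k8s.io/client-go/kubernetes"
         "k8s.io/client-go/tools/clientcmd"
     )

     func main() {
         cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
         if err != nil {
             panic(err)
         }
         client := kubernetes.NewForConfigOrDie(cfg)

         // GracePeriodSeconds in DeleteOptions overrides the Pod's
         // terminationGracePeriodSeconds; 0 skips the graceful shutdown entirely
         // (roughly what `kubectl delete pod --grace-period=0 --force` does).
         grace := int64(0)
         err = client.CoreV1().Pods("default").Delete(context.TODO(), "signal-demo-xxxxx",
             metav1.DeleteOptions{GracePeriodSeconds: &grace})
         if err != nil {
             panic(err)
         }
     }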
  23. Key Takeaways - HOW TO MAKE PODS RESTART-FRIENDLY

     • Handle signals for graceful shutdown: make sure your application handles SIGTERM (or a custom stop signal) properly; Dockerfile STOPSIGNAL and PreStop exec/httpGet can deliver a custom signal. Sidecars are terminated sequentially after the primary containers exit. Remember that terminationGracePeriodSeconds is not always guaranteed.
     • Minimize traffic downtime: use a preStop sleep to give traffic enough time to stop before shutting down, and set readiness/startup probes so the server receives traffic only when it is actually ready.
     • Plan for leader election restarts: controllers using leader election can speed up leadership handover by enabling LeaderElectionReleaseOnCancel, but be careful about split-brain.
     • Mitigate node maintenance disruptions: use Pod Disruption Budgets (PDBs) to limit disruption from evictions.
     Restartability lets you leverage Kubernetes' strengths: self-healing, rolling updates, and autoscaling.
  24. Appendix: Recap of Restart Scenarios

     Level      | Scenario                                        | Restarted by       | Pod Termination Grace Period  | Note
     Container  | Container exit, startup/liveness probe failure  | kubelet            | Respected                     |
     ReplicaSet | Direct deletion API call                        | API clients        | Respected / can be overridden |
     ReplicaSet | Eviction API call                               | API clients        | Respected / can be overridden | PDB respected
     ReplicaSet | Priority preemption                             | scheduler          | Respected                     | PDB best-effort
     ReplicaSet | NoExecute taint                                 | controller-manager | Respected                     |
     ReplicaSet | Soft eviction by node pressure                  | kubelet            | Overridden by kubelet config  | Disabled by default
     ReplicaSet | Hard eviction by node pressure                  | kubelet            | Ignored                       |
     Workload   | Rolling upgrade (maxSurge > 0)                  | controller-manager | Respected                     | Creates the new Pod before deleting the old one