KubeCologne keynote—Troubleshooting Kubernetes apps

Michael Hausenblas

February 08, 2019

Transcript

  1. Scope

    Hit me up on Twitter: @mhausenblas

    • Focusing on prototyping, developing, and testing applications with Kubernetes from an appops perspective (tools & techniques)
    • But not really (much) about …
      • troubleshooting installation or upgrading issues
      • performance testing or optimising containerized microservices
      • SRE-style troubleshooting (check out what Googlers say on this topic)
  2. Monoliths vs. microservices

    [Figure: timeline contrasting a monolith released as v1 and then v2 with three microservices versioned independently over the same period: µS1 reaches v4, µS2 reaches v5, and µS3 reaches v6]
  3. The TOP 10 list

    1. invalid YAML specification
    2. wrong or missing permissions
    3. wrong container image
    4. no access to container registry
    5. supposedly long-running application exits
  4. The TOP 10 list

    6. missing/bad config or secret
    7. lifecycle issues (probes fail)
    8. can’t reach service
    9. looking at the wrong place: where is localhost?
    10. failed mounts
  5. OODA loop

    • Observe: What’s in the logs? Establish baseline.
    • Orient: Formulate hypotheses. Don’t jump to conclusions.
    • Decide: Sort hypotheses by likelihood. Pick one of the hypotheses.
    • Act: Test the hypothesis you picked. If confirmed: fix it, else: continue.
  6. • Deployment seems OK
    • Pod seems OK (image found, scheduled, launched)
    • I see log output, so container is running
    • Keeps crashing after launch
  7. • Could be a resource issue (OOM, etc.)
    • Could be config/data missing
    • Could be an application logic/runtime error
  8. 1. Could be an application logic/runtime error
    2. Could be a resource issue (OOM, etc.)
    3. Could be config/data missing
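
The Decide and Act steps of the OODA loop can be sketched in a few lines of Python. This is only an illustration: the `Hypothesis` class, the likelihood values, and the stubbed `test` callables are assumptions made for the example and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    description: str
    likelihood: float         # subjective estimate between 0 and 1 (assumed values)
    test: Callable[[], bool]  # returns True when the hypothesis is confirmed


def decide_and_act(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Test hypotheses from most to least likely; return the first confirmed one."""
    ranked = sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)
    for hypothesis in ranked:
        if hypothesis.test():
            return hypothesis  # confirmed: go and fix it
    return None  # nothing confirmed: go back to observing


# Hypothetical inputs mirroring the crashing-container example; the tests are stubs.
hypotheses = [
    Hypothesis("resource issue (OOM, etc.)", 0.3, lambda: False),
    Hypothesis("config/data missing", 0.2, lambda: False),
    Hypothesis("application logic/runtime error", 0.5, lambda: True),
]

confirmed = decide_and_act(hypotheses)
print(confirmed.description if confirmed else "no hypothesis confirmed, observe again")
```

In real troubleshooting each `test` would be a manual check or a command rather than a stub, and the likelihoods are a judgment call that changes as you observe more.
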
  9. command:
    - sh
    - '-c'
    - echo "I will just print something here and then exit" && sleep 1000
  10. The How

    • Using kubectl get events
    • Using kubectl describe
    • Using kubectl exec
    • Using kubectl logs (or kubetail, stern)
    • Full-blown observability approaches
  11. Metrics

    [Figure: metrics pipeline diagram with the labels node, container runtime, app, alerts, dashboards, storage, and event router]
  12. Metrics

    • Out-of-the-box low-level metrics (CPU, memory)
    • Application-specific metrics (full-blown instrumentation vs service mesh-based approaches)
    • Options
      • Roll your own, use the industry standards: Prometheus + Grafana
      • Cloud provider native
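
To make "application-specific metrics" a little more concrete, here is a minimal sketch of hand-rolled instrumentation that renders a counter in the Prometheus text exposition format. The metric name `app_http_requests_total` and both helper functions are hypothetical. A real service would more likely use an official Prometheus client library and serve this text over HTTP for scraping.

```python
from collections import Counter

# In-memory request counter keyed by path (hypothetical application state).
requests_by_path = Counter()


def record_request(path: str) -> None:
    """Count one handled request for the given path."""
    requests_by_path[path] += 1


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP app_http_requests_total Requests handled, by path.",
        "# TYPE app_http_requests_total counter",
    ]
    for path, count in sorted(requests_by_path.items()):
        lines.append(f'app_http_requests_total{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


record_request("/")
record_request("/")
record_request("/healthz")

metrics_text = render_metrics()
print(metrics_text, end="")
```

The low-level CPU and memory metrics come for free from the platform. Counters like this one are what you have to add yourself, or obtain from a service mesh.
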
  13. Aggregated logs

    • In the app, log to stdout or, if you can’t, use an adapter
    • Options
      • Roll your own, use the industry standards: ELK/EFK stack
      • Cloud provider native such as CloudWatch or Stackdriver
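
Logging to stdout from within the app can look like the following sketch, which uses only Python's standard `logging` module. The logger name `demo-app` and the line format are assumptions chosen for the example. The point is that the container runtime captures stdout, so `kubectl logs` and log aggregators can pick the lines up without the app writing files.

```python
import logging
import sys


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes one line per record to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep records from being printed twice via the root logger
    return logger


logger = build_logger("demo-app")  # hypothetical application name
logger.info("application started")
```

If the application can only write to a file, an adapter such as a sidecar that tails the file to its own stdout is one way to achieve the same effect.
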
  14. Distributed tracing and debugging

    • Roots: need to overcome limitations of “time-synced logs”
    • Specifications: OpenCensus and OpenTracing
    • Tooling: Zipkin, Jaeger, Stackdriver
    • A must-have in a microservices setup
    • Debugging: use KubeSquash
  15. [Figure: flowchart of a pod launch across the control plane and a worker node, with troubleshooting checks]

    Launch sequence:
    • kubectl apply
    • API Server stores desired state
    • Scheduler sees new pod, selects node
    • Scheduler assigns pod to a fitting node
    • kubelet watches API server and notices new pod
    • kubelet asks container runtime via CRI to launch container(s)
    • container runtime pulls image
    • container runtime runs images
    • kubelet takes over pod lifecycle (probes)
    • pod runs until deleted or evicted
    • garbage collection

    Checks:
    • etcd happy? If no: ask cluster admin
    • does the pod get scheduled? If no: fork out more $$$
    • container runtime happy? If no: ask cluster admin
    • can access container registry? If no: fix access to registry
    • is container starting up? (init containers) If no: debug app
    • probes fine? no leaking resources? If no: soak testing, monitoring
    • container crashing after startup? If yes: debug app
  16. Authentication & authorization

    Authentication:
    • static password/token file
    • X509 client certs
    • proxy+header
    • OpenID Connect
    • custom via Webhook

    Authorization:
    • Node (kubelet)
    • ABAC (outdated)
    • RBAC
    • Webhook (external)
  17. Access control (RBAC) and policies

    • Use kubectl auth can-i to check RBAC permissions
    • Make yourself familiar with:
      • Pod Security Policies, might constrain your app too much
      • Network Policies, might be too strict for your app’s communication needs
    • See kubernetes-security.info
  18. [Figure: the pod launch and troubleshooting flowchart from transcript item 15, shown again]
  19. I think I’m having image issues …

    kubectl get events to the rescue?
  20. Dunno, just keeps crashing …

    kubectl describe and exec
  21. Oh my Lanta, something’s wrong with the app …

    kubectl logs
  22. What and how

    • Container networking in Kubernetes (CNI)
    • App-level or infra (CNI, DNS, etc.)?
    • See mhausenblas.info/cn-ref
  23. What and how

    • Storage in Kubernetes (CSI)
    • Understand storage offerings (vendor docs!)
    • Failure modes
    • See stateful.kubernetes.sh
  24. Quizzie time!

    You wrote an application server. For load-balancing purposes, where would you put a reverse proxy such as NGINX?
    A. Into the container (same Dockerfile)
    B. Into a side car container (same pod)
    C. Into a separate pod
  25. Proactive measures

    Architect your apps the cloud native way by …
    • knowing and using the Kubernetes primitives (services, deployments)
    • implementing retries & timeouts (in-tree or via service mesh)
    • avoiding hardcoded (start-up) dependencies
    • listening on 0.0.0.0 (not 127.0.0.1)
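
Two of these measures, listening on 0.0.0.0 and implementing retries and timeouts in the app itself, can be sketched with Python's standard `socket` module. The function names, the attempt count, and the backoff values are assumptions for illustration, not recommendations from the talk.

```python
import socket
import time


def listen_on_all_interfaces(port: int = 0) -> socket.socket:
    """Bind to 0.0.0.0 so the server is reachable from outside the pod,
    not only via the pod's own loopback interface (127.0.0.1)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", port))  # port 0 lets the OS pick a free port
    server.listen()
    return server


def connect_with_retries(host: str, port: int, attempts: int = 3,
                         timeout: float = 1.0, backoff: float = 0.1) -> socket.socket:
    """Connect with a per-attempt timeout and exponential backoff between attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as error:  # refused, unreachable, or timed out
            last_error = error
            if attempt < attempts - 1:
                time.sleep(backoff * 2 ** attempt)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error


server = listen_on_all_interfaces()
bound_host, port = server.getsockname()
client = connect_with_retries("127.0.0.1", port)
print("server bound to", bound_host, "and client connected on port", port)
client.close()
server.close()
```

A dependency that is not up yet then shows up as a few retried connection attempts instead of a crash at start-up, which also helps with avoiding hardcoded start-up dependencies.
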
  26. Proactive measures

    • Apply chaos engineering while all is well, and learn from it where and how your system fails
    • Provide debug tools in the image, but also consider footprint and security!
    • Automate all the things: Autoscaler, Brigade, Draft, Forge, Helm, knative, ksync, odo, Operators, Skaffold, watchpod, etc.
  27. Kubernetes Security: Operating Kubernetes Clusters and Applications Safely

    Liz Rice & Michael Hausenblas
  28. Articles, slide decks, videos

    • Kubernetes Troubleshooting site
    • Debugging microservices - Squash vs. Telepresence
    • Debugging and Troubleshooting Microservices in Kubernetes with Ray Tsang (Google)
    • Troubleshooting Kubernetes Using Logs
    • Debug a Go Application in Kubernetes from IDE
    • Troubleshooting Kubernetes Networking Issues
    • Video: CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster
    • Video: Troubleshooting & Debugging Microservices in Kubernetes
    • Slide deck: Evolution of Monitoring and Prometheus
  29. Articles, slide decks, videos

    • 10 Most Common Reasons Kubernetes Deployments Fail: Part 1 and Part 2
    • Kubernetes Application Operator Basics
    • Kubernetes: five steps to well-behaved apps
    • Kubernetes Best Practices
    • Developing on Kubernetes
    • Debugging Microservices: How Google SREs Resolve Outages
    • Debugging Microservices: Lessons from Google, Facebook, Lyft
    • Troubleshooting Java applications on OpenShift
    • Debugging Kubernetes PVCs
  30. Official Kubernetes docs

    • kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
    • kubernetes.io/docs/tasks/debug-application-cluster/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-init-containers/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-service/
    • kubernetes.io/docs/tasks/debug-application-cluster/debug-stateful-set/
    • kubernetes.io/docs/tasks/debug-application-cluster/local-debugging/