Practical Monitoring for Knative Serving / KubeCon + CloudNativeCon Japan 2025

Knative is a widely adopted, CNCF-hosted project for running serverless applications on Kubernetes. Knative Serving consists of many system components, such as the Activator, Autoscaler, Controller, Webhook, and an ingress gateway (Istio or Kourier). End users therefore need to build monitors for common error patterns and pick the most useful metrics from the many that are available. However, there is relatively little shared knowledge or resources for Knative end users.

This talk presents a production case study of monitoring Knative Serving. Specifically, it explains how we can monitor Knative control plane efficiency, reconciliation operations, pod scaling health, concurrency observation, HTTP request success rate, and more. It also covers how Knative components implement Prometheus metrics, the metrics pipeline (on Google Kubernetes Engine), dashboards, and alerts.

This case study will benefit existing Knative users and potential users considering employing Knative in their Kubernetes clusters.

Kazuki Higashiguchi

June 11, 2025

Transcript

  1. Who am I?

    Kazuki Higashiguchi
    Senior Site Reliability Engineer @ Autify (AI Platform for Software Quality Assurance)
    End-user of Knative Serving for our ML workloads in production
    /in/hgsgtk
  2. Why Monitoring Knative Matters

    ✅ Knative’s magic:
    • Serverless autoscaling, out of the box
    🚨 But in production…
    • Is autoscaling fast enough?
    • How many warm pods?
    • Enough nodes for scale?
    Gateway errors hurt user experience.
    Monitor from the user’s perspective, correlate with scaling metrics, and alert early.
    But there’s little community guidance on this, so we’re sharing our lessons learned!
  3. Knative Serving Architecture

    Request-driven pod scaling: spins up pods on demand in response to incoming requests.
    Key Components:
    • Activator
    • Autoscaler
    • Controller
    • Webhook
    • Ingress gateway - Istio, Kourier
    Knative supports Prometheus and OpenTelemetry Collector for collecting metrics.
  4. Monitor from the user’s perspective

    User experience: API-level monitors
    Key component - Activator:
    • queues incoming requests and forwards them
    • triggers the autoscaler to bring scaled-to-zero services back online
    Key Metrics:
    • request_count - The number of requests
    • request_latencies - The response time in milliseconds for successfully routed requests
  5. Alerting on Bad Gateway Errors in Knative

    502 errors often happen when scaling can’t keep up with traffic. Activator metrics detect issues early.
    ⚠ App-level error reporting won’t detect these!
    💡 Optionally monitor all 5xx errors.
    Alert threshold: activator_request_count {response_code=502}
  6. API-level Dashboard

    • Request volume (request_count)
      ◦ Total
      ◦ By Service
      ◦ By Response Code
    • Success Rate (request_count)
      ◦ Non-5xx / total
    • Response Time (request_latencies)
      ◦ By Service
      ◦ By Response Code
  7. Monitoring Autoscaler Efficiency

    Key Component - Autoscaler:
    • scales Knative services based on configuration, metrics, and incoming requests
    Key Metrics:
    • pending_pods - Pods currently pending
    • requested_pods, actual_pods, not_ready_pods, terminating_pods - Pods in various lifecycle states
    • excess_burst_capacity - Overserved burst capacity (buffer for scale)
  8. Alerting on Potential Scaling Issues

    ⚠ A high pending ratio may indicate cluster capacity issues, e.g., insufficient allocatable nodes.
    Alert threshold: autoscaler_pending_pods / “total_pods”
  9. Scaler-level Dashboard

    • Pod Counts (*_pods)
      ◦ requested|actual|not_ready|pending|terminating
    • Concurrency
      ◦ Requested (activator.request_concurrency)
      ◦ Observed (excess_burst_capacity)
  10. References

    • Architecture
      ◦ Knative Serving Architecture - https://knative.dev/docs/serving/architecture/
      ◦ Knative Serving Autoscaling System - https://github.com/knative/serving/blob/main/docs/scaling/SYSTEM.md
    • Observability
      ◦ Collecting metrics in Knative - https://knative.dev/docs/serving/observability/metrics/collecting-metrics/#import-grafana-dashboards
      ◦ Knative Serving metrics - https://knative.dev/docs/serving/observability/metrics/serving-metrics/
      ◦ Grafana Dashboards - https://github.com/knative-extensions/monitoring/tree/main/grafana
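
The ratios named on slides 5, 6, and 8 (502 count, non-5xx / total success rate, and pending pods over total pods) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the function names, the sample numbers, and the threshold values are hypothetical, and because the slides leave “total_pods” undefined, the example assumes it is the sum of the actual, not-ready, pending, and terminating pod gauges.

```python
from typing import Mapping


def bad_gateway_count(requests_by_code: Mapping[int, float]) -> float:
    """Number of requests that ended in HTTP 502 (slide 5)."""
    return requests_by_code.get(502, 0.0)


def success_rate(requests_by_code: Mapping[int, float]) -> float:
    """Non-5xx requests divided by total requests (slide 6)."""
    total = sum(requests_by_code.values())
    if total == 0:
        # Assumption: a window with no traffic is treated as healthy.
        return 1.0
    server_errors = sum(
        count for code, count in requests_by_code.items() if 500 <= code <= 599
    )
    return (total - server_errors) / total


def pending_ratio(pending_pods: float, total_pods: float) -> float:
    """Pending pods divided by total pods (slide 8)."""
    if total_pods == 0:
        # Assumption: a service scaled to zero has nothing pending.
        return 0.0
    return pending_pods / total_pods


# Hypothetical sample values for one evaluation window.
requests = {200: 950, 404: 30, 502: 15, 503: 5}
pods = {"actual": 8, "not_ready": 1, "pending": 3, "terminating": 0}

# Assumption: "total_pods" is the sum of these lifecycle-state gauges.
total_pods = sum(pods.values())

# Hypothetical thresholds; tune them to your own traffic and cluster.
MAX_BAD_GATEWAY = 10
MIN_SUCCESS_RATE = 0.99
MAX_PENDING_RATIO = 0.2

alerts = {
    "bad_gateway": bad_gateway_count(requests) > MAX_BAD_GATEWAY,
    "success_rate": success_rate(requests) < MIN_SUCCESS_RATE,
    "pending_ratio": pending_ratio(pods["pending"], total_pods) > MAX_PENDING_RATIO,
}

for name, firing in alerts.items():
    print(f"{name}: {'FIRING' if firing else 'ok'}")
```

In a real setup these ratios would normally be evaluated by the monitoring backend over a time window rather than in application code; the sketch only makes the arithmetic behind each alert explicit.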