
Grafana Alloy Best Practice

Speaker: Eric Huang
Event: COSCUP 2024

LINE Developers Taiwan

August 04, 2024

Transcript

  1. Eric Huang, SRE at LINE Taiwan
     • 2021: E-SUN Bank
     • 2022: LINE Taiwan
     • Interests: Kubernetes, Rust, eBPF
     • GitHub: titaneric (chen-yi-huang)
  2. Real User Monitoring (RUM)
     • Collect metrics from the client side and browser
     • Monitor web application performance
     • Discover errors
     • Track user behavior (sessions)
     Source: Web Vitals, User-centric performance metrics, Grafana Faro OSS
  3. (Distributed) Tracing
     • Represent the full journey of a request through a distributed environment
     • Improve the visibility of the app
     • Diagnose the source of errors
     Source: Observability primer | OpenTelemetry
  4. Grafana Alloy
     “Alloy is a flexible, high performance, vendor-neutral distribution of the OpenTelemetry Collector”
     Key features:
     • Custom components
     • Chained components
     • Debugging utilities
     Adopts the faro.receiver component together with the Faro SDK
     Source: Grafana Alloy | Grafana Alloy documentation
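     The faro.receiver pipeline named above can be sketched in Alloy configuration. This is a minimal illustration only, not the config from the talk: the listen port, component labels, and Loki/Tempo endpoints are placeholder assumptions.

     ```alloy
     // Minimal sketch: receive Faro telemetry from the browser SDK and
     // fan it out to Loki (logs) and Tempo (traces) via chained components.
     // All endpoints and labels are placeholders.
     faro.receiver "default" {
       server {
         listen_address = "0.0.0.0"
         listen_port    = 12347
       }

       output {
         logs   = [loki.write.default.receiver]
         traces = [otelcol.exporter.otlp.tempo.input]
       }
     }

     loki.write "default" {
       endpoint {
         url = "http://loki:3100/loki/api/v1/push"
       }
     }

     otelcol.exporter.otlp "tempo" {
       client {
         endpoint = "tempo:4317"
       }
     }
     ```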
  5. (image slide)

  6. Grafana Faro Web SDK
     “Grafana Faro includes a highly configurable web SDK for real user monitoring that instruments browser frontend applications to capture observability signals.”
     Key features:
     • Monitors application performance
     • Captures errors, logs, and user activity
     • Instruments performance and observes the full stack
     Source: Grafana Faro OSS | Web SDK for real user monitoring (RUM)
  7. (image slide)

  8. End-to-End Tracing
     Spans include:
     • frontend app (Next.js)
     • ingress controller (Traefik)
     • web framework (Flask)
     • HTTP client library (requests)
  9. Requirements
     Must have:
     • Integrate with the present observability platform
     • Easy, automated deployment of the Alloy service
     • Control the traffic load sent from real users
     Nice to have:
     • Easy onboarding for new tenants
     • Slack workflow
     • Sample code for SSR and CSR apps
     • Next.js-based sample app
  10. Why adopt a gateway instead of an individual ingress for each cluster?
     • Unified traffic control by SRE
     • Decouples business logic from telemetry traffic
     • Easy deployment for Alloy
     • Cuts down the security review procedure
  11. Why choose Contour instead of Traefik or another ingress controller?
     • Contour is more performant and consumes less memory
     • Envoy Gateway was considered, but the Kubernetes version is not compatible
  12. Challenges
     • Handling a large amount of incoming traffic
     • Load testing and tuning for Contour and Alloy
     • Three levels of protection:
       1. Client-side sampling
       2. Contour rate limit
       3. Grafana Alloy rate limit
     • Increasing load on Loki and Tempo
     • Continuous tuning for Loki and Tempo
     • Individual rate limits for each tenant
     Load test report: Alloy: 1,500 RPS (1 core, 1 Gi); Envoy: 10,000 connections (3 cores, 1 Gi)
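     The second protection level above, the Contour rate limit, can be sketched with Contour's local rate limiting on an HTTPProxy. A hedged illustration only: the fqdn, service name, port, and limit values are placeholders, not the production settings from the talk.

     ```yaml
     # Sketch: local rate limiting at the Contour/Envoy edge in front of
     # the Alloy collector. All names and numbers are placeholders.
     apiVersion: projectcontour.io/v1
     kind: HTTPProxy
     metadata:
       name: faro-collector
     spec:
       virtualhost:
         fqdn: faro.example.com
         rateLimitPolicy:
           local:
             requests: 1500   # requests allowed per unit
             unit: second
             burst: 100       # extra requests tolerated in a burst
       routes:
         - services:
             - name: alloy
               port: 12347
     ```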
  13. Challenges
     • Web vitals are stored in Loki instead of Prometheus
     • Adopt Loki rulers to ingest Loki query results into Prometheus
     • Faster loading of the real user monitoring dashboard
     • Constrained trace propagation in the present architecture
     • Upgrade or update trace propagation in the intermediate components
     • Block trace propagation headers at the API gateway
     • Add an allow-list for trace context headers (e.g., TraceParent, Uber-Trace-Id)
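     The Loki-ruler approach above can be sketched as a recording rule whose LogQL result the ruler remote-writes into Prometheus. A hedged sketch under assumptions: the label selector, unwrapped field, and metric name are placeholders, and the ruler must be configured with remote_write toward Prometheus.

     ```yaml
     # Sketch: turn a web-vitals LogQL query into a Prometheus series.
     # Selector, field, and metric names are placeholders.
     groups:
       - name: web-vitals
         interval: 1m
         rules:
           - record: faro:web_vitals_lcp:avg1m
             expr: |
               avg_over_time(
                 {app="faro-collector"}
                   | logfmt
                   | unwrap lcp [1m]
               )
     ```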
  14. • Upgrade Traefik to v3.0 to adopt OpenTelemetry
     • Resolve the issue of unbalanced requests to the OTel collector
     • Zero-code instrumentation via eBPF (e.g., Grafana Beyla)
     • Continuous tuning for Tempo, Loki, and Alloy