Prathamesh Sonpatki
October 26, 2023

Breaking down the Pillars of Observability: From data to outcomes


Transcript

  3. Why this matters today?
     • Workloads have changed
     • Infra is cattle - ephemeral
     • Services are dynamic
     • Cloud Native Environments
     A 3-node cluster running 10 namespaces with 5 deployments, a replica set of ~3-5, and 10 config maps emits a whopping 16,566 time series per minute using the popular kube-state-metrics library.
  4. Why this matters today?
     • Workloads have changed
     • Infra is cattle - ephemeral
     • Services are dynamic
     • Cloud Native Environments
     • Pod Metrics
     • Deployment Metrics
     • ReplicaSet Metrics
     • StatefulSet Metrics
     • DaemonSet Metrics
     • Job Metrics
     • Service Metrics
     • Namespace Metrics
     • Node Metrics
  5. Why this matters today?
     • Volume
     • Velocity
     • Variety
     • Complexity
     • C.O.S.T.
       - Cardinality
       - Operations
       - Scale
       - Toil
  7. Outcomes we want
     • To not have downtimes
     • To mitigate problems quickly
     • To debug a failure
     • To know how the system is behaving in real time
     • To correlate an outage to a hardware failure
     • To find anomalies and patterns
     • To trace a payment failure
     • To find out unknown failures before they happen
     • To avoid hampering customer experience and causing business impact
  8. Questions we ask
     • What is wrong?
     • Did we change anything?
     • What do we do so this doesn’t repeat?
  9. Answers we want
     • System Health
     • Quick Decisions
     • Time
     • Root Cause
     • Testing
     • Correctness
  10. Logs
      • Getting Started ✅
      • Adoption ✅
      • Debugging ✅
      • Relationships 🥲
      • Volume 🥲
      • Standardisation 🥲
      • Health 🥲
      • System insights 🥲
  11. Metrics
      • Getting Started 😐
      • Adoption ✅
      • Debugging 🥲
      • Relationships 🥲
      • Volume 😐
      • Standardisation ✅
      • Health ✅
      • System insights ✅
  12. Traces
      • Getting Started 😐
      • Adoption 🥲
      • Debugging ✅
      • Relationships ✅
      • Volume 😐
      • Standardisation ✅
      • Health 🥲
      • System insights 🥲
  13. Events
      • Structured logs?
      • Schema based?
      • Domain Events
      • Easier to adopt?
      • Can unlock correlation
      • Dimensionality
  14. Events
      • Getting Started 😐
      • Adoption 🥲
      • Debugging ✅
      • Relationships 🥲
      • Volume ✅
      • Standardisation 🥲
      • Health ✅
      • System insights ✅
  15. Answers we want
      • Know
      • Communicate
      • Recover
      • Analyse
      • Debug
      • Root cause
      Real Time / Post Facto
  16. Answers we want
      • Know
      • Communicate
      • Recover
      • Analyse
      • Debug
      • Root cause
      SRE/DevOps / Programmer/Developers
  17. 80% of Telemetry data is unused
      • Yet, we store it and pay for the data that is unused!
      • Slow dashboards, concurrent access woes
      • No real time alerting
      • Cost vs. Performance vs. Retention tradeoffs
  18. Control Levers
      • Treat workloads differently
      • Fast vs. Slow Data Tiers
      • Policies
      • Declarative Observability
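
The ideas on the Events and Control Levers slides (schema-based events with dimensions, treating workloads differently, fast vs. slow tiers, declarative policies) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Event` class, the `POLICIES` rule format, the tier names, and the `choose_tier` helper are all hypothetical and do not come from the talk or from any particular tool.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical schema-based event: a fixed name plus key/value dimensions."""
    name: str
    dimensions: dict
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # Dimensions are flattened next to the fixed fields so each one can be
        # queried and correlated on its own.
        return json.dumps(
            {**self.dimensions, "name": self.name, "ts": self.timestamp},
            sort_keys=True,
        )


# Hypothetical declarative policy: ordered rules, first match wins.
POLICIES = [
    {"match": {"workload": "payments"}, "tier": "fast"},
    {"match": {"env": "staging"}, "tier": "slow"},
]
DEFAULT_TIER = "slow"


def choose_tier(event: Event, policies=POLICIES, default=DEFAULT_TIER) -> str:
    """Return the tier of the first rule whose match keys all equal the event's dimensions."""
    for rule in policies:
        if all(event.dimensions.get(k) == v for k, v in rule["match"].items()):
            return rule["tier"]
    return default


if __name__ == "__main__":
    failure = Event("payment_failed", {"workload": "payments", "env": "prod"})
    print(choose_tier(failure), failure.to_json())
```

Keeping the rules as plain data rather than code is one way to read "declarative" here: a policy can be reviewed and changed without touching the services that emit the events.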