The hidden cost of instrumentation at Conf42 Devops 2023

Prathamesh
January 27, 2023

Transcript

  1. The hidden cost of the
    instrumentation
    1
    Prathamesh Sonpatki
    Last9.io
    Conf42 Devops 2023

  2. 2
    Instrumentation? 🤔 🤨

  3. 3
    Instrumentation? 🤔 🤨
    - How do you know your application is running as expected?

  4. 4
    Instrumentation? 🤔 🤨
    - How do you know your application is running as expected?
    - Service Level Agreements (SLAs)

  5. 5
    Instrumentation? 🤔 🤨
    - How do you know your application is running as expected?
    - Service Level Agreements (SLAs)
    - Good night’s sleep 😴 💤

  6. 💡“Hope is not a strategy!”
    6
    https://sre.google/sre-book/introduction/

  7. 💡The Reliability mandate starts with
    Instrumentation
    You can only improve what you measure.
    7

  8. 🌈 Landscape of the Instrumentation
    8

  9. 🌈 Landscape of the Instrumentation
    9
    - Your application is not standalone

  10. 🌈 Landscape of the Instrumentation
    10
    - Your application is not standalone
    - It’s actually a 🍔

  11. 🌈 Landscape of the Instrumentation
    11
    - Your application is not standalone
    - It’s actually a 🍔
    - The Bun (Cloud/VM)
    - Patty (application)
    - Along with Mayo sauce (RDS/DB)
    - And Ketchup (third-party services)

  12. 🌈 Landscape of the Instrumentation
    12
    - Your application is not standalone
    - It’s actually a 🍔
    - The Bun (Cloud/VM)
    - Patty (application)
    - Along with Mayo sauce (RDS/DB)
    - And Ketchup (third-party services)
    “Full stack observability” FTW!

  13. 💡Modern applications are like living
    organisms that grow and shrink in all
    possible directions.
    And also communicate with their friends!
    13

  14. Bow in the Temple of Observability
    14

  15. Bow in the Temple of Observability
    15
    - Logs
    - Metrics
    - Traces

  16. Bow in the Temple of Observability
    16
    - Logs
    - Metrics
    - Traces
    - Profiling
    - Events (External)
    - Exceptions
    https://medium.com/@YuriShkuro/temple-six-pillars-of-observability-4ac3e3deb402

  17. Bow in the Temple of Observability
    17
    - Logs
    - Metrics
    - Traces
    - Profiling
    - Events (External)
    - Exceptions
    How many people use more than three of these at the same time?

  18. There ain’t no such thing as a free lunch 💰
    18

  19. Cardinality/Churn
    19
    - Capturing monitoring data is easier than ever today.
    - A 3-node Kubernetes cluster with Prometheus will ship around 40k
    active series by default!
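
To make the cardinality point concrete, here is a minimal Python sketch (not from the talk) of how active series multiply: in the worst case, each metric yields one series per unique combination of its label values. The metric names and label counts are hypothetical, chosen only for illustration, and the product is an upper bound because not every combination occurs in practice.

```python
from math import prod


def series_count(label_value_counts):
    """Upper bound on series for one metric: the product of the
    number of distinct values of each label (1 if it has no labels)."""
    return prod(label_value_counts.values())


# Hypothetical metrics and label cardinalities (illustrative, not measured).
metrics = {
    "http_requests_total": {"method": 4, "status": 5, "path": 50, "pod": 10},
    "http_request_duration_seconds_bucket": {"le": 12, "path": 50, "pod": 10},
}

per_metric = {name: series_count(labels) for name, labels in metrics.items()}
total = sum(per_metric.values())

for name, count in per_metric.items():
    print(f"{name}: up to {count} series")
print(f"total: up to {total} series")
```

Under these assumed numbers, two metrics alone can account for up to 16,000 series, and every extra label value or restarted pod (churn) multiplies the count further.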

  20. Operations
    - Run, manage and operate the instrumentation of the entire stack.
    - One more thing to operate besides the app.
    20

  21. Scale
    - Make sure not just your app scales but also your instrumentation.
    21

  22. Tuning/Toil
    - Constant tuning of monitoring data
    - Resulting in engineering toil
    22

  23. C.O.S.T. 💸
    Cardinality/Churn, Operations, Scale, Tuning/Toil
    23

  24. But what is the hidden
    cost? 🤔
    24

  25. Distraction!
    25

  26. Distraction!
    26
    - Reduce the Datadog monitoring cost, it is going out of hand.

  27. Distraction!
    27
    - Reduce the Datadog monitoring cost, it is going out of hand.
    - Our logs have been piling up for the last 2 days, can you please treat
    it as a P0 and contain them? Otherwise the vendor will charge us double.

  28. Distraction!
    28
    - Reduce the Datadog monitoring cost, it is going out of hand.
    - Our logs have been piling up for the last 2 days, can you please treat
    it as a P0 and contain them? Otherwise the vendor will charge us double.
    - Today is New Year’s Day and our Prometheus is not getting the required
    metrics. Ignore the product release, just fix this for now, we are blind
    otherwise.

  29. 💡A modern systems engineer has to maintain
    not just their software but also the
    instrumentation of that software.
    29

  30. Fatigue!
    30

  31. Fatigue!
    31
    - Too much information de-sensitises us.
    - Duplicate alarms.
    - Focus on getting more and more data rather than on why we are even
    collecting it.
    - Debugging becomes difficult because there is just too much data; we
    don’t know where to start.

  32. What’s the way out? 🏆
    32

  33. What’s the way out? 🏆
    33
    - Focus on data that gives early warnings with the least amount of data

  34. What’s the way out? 🏆
    34
    - Focus on data that gives early warnings with the least amount of data
    - Think of an Apple Watch ⌚: it tracks only vitals such as heart rate or
    sleep metrics.

  35. What’s the way out? 🏆
    35
    - Focus on data that gives early warnings with the least amount of data
    - Think of an Apple Watch ⌚: it tracks only vitals such as heart rate or
    sleep metrics.
    - Detailed X-ray scans and ECG reports 📰 come only once the vitals go
    off track.

  36. 💡A threat of breaking is better.
    36

  37. Plan of action
    37

  38. Plan of action
    38
    - Plan what to measure and why, not how
    - Emit (only what you need)
    - Observe and Track (usage)
    - Prune (unused) aggressively
    - Store less, and for less time.
    - Focus on what gives the best value for the money
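
The "observe and track usage" and "prune unused" steps can be sketched roughly as below. This is an illustrative Python example, not the speaker's method: it assumes you can list the metric names your services emit and the query strings used by dashboards and alerts. All names and queries here are hypothetical, and the token matching is deliberately naive rather than a real PromQL parser, so a label that shares a name with a metric would count as usage.

```python
import re

# Matches identifier-like tokens (metric names, functions, label names).
TOKEN = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def referenced_metrics(queries, known_metrics):
    """Return the known metric names that appear as tokens in any query."""
    used = set()
    for query in queries:
        used |= set(TOKEN.findall(query)) & known_metrics
    return used


# Hypothetical inventory of emitted metrics.
emitted = {
    "http_requests_total",
    "queue_depth",
    "cache_evictions_total",
    "gc_pause_seconds",
}

# Hypothetical queries collected from dashboards and alert rules.
queries = [
    'sum(rate(http_requests_total{status="500"}[5m]))',
    "max(queue_depth) > 100",
]

used = referenced_metrics(queries, emitted)
prune_candidates = sorted(emitted - used)
print("never queried:", prune_candidates)
```

Metrics that no dashboard or alert ever queries are candidates for pruning or for a shorter retention period; they still deserve a human review before being dropped.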

  39. A Better Plan of action
    39
    - Access Policies
    - Data storage policies
    - Standards

  40. 💡Less is better.
    Because instrumentation is a liability.
    40

  41. Thanks
    41
    Prathamesh Sonpatki
    Last9.io
    Blog - https://prathamesh.tech
    Twitter - https://twitter.com/_cha1tanya
    Mastodon - https://hachyderm.io/@Prathamesh
    “Last9 of Reliability” Discord
