

Monitoring Microservices with Minimal Effort (GDG DevFest Ankara 2018)

I gave this talk at GDG DevFest Ankara 2018. It covers how we use Istio and OpenCensus to automatically extract request metrics and traces from applications, and how to extend a simple microservices application to export profiling data and metrics to Google Stackdriver APM.

Find more at http://cloud.google.com/stackdriver and https://opencensus.io/.

Ahmet Alp Balkan

November 17, 2018


  1. About me
    • Software Engineer at Google Cloud
    • Worked at Microsoft Azure (2012-2016) on porting Docker to Windows & Linux stuff.
    • Kubernetes/GKE, Knative developer experience
    • Twitter/GitHub: @ahmetb
  2. Service availability contracts
    • SLO (service-level objective): an agreed way to measure the performance of a service, between two parties (often internal)
    • SLA (service-level agreement): an SLO backed by a legal contract.
    • Error budget: how much can you afford to violate the SLA/SLO?
    Examples from the diagram (consumer → provider):
    • Team A → Team B's service
    • Ucuzabilet → Turkish Airlines
    • You → Google Cloud Storage API
  3. Error Budgets
    • Team X owns a critical service, ServiceA, at Google, used by other teams.
    • ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month).
  4. Error Budgets
    • Team X owns a critical service, ServiceA, at Google, used by other teams.
    • ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month).
    • ServiceA went down for 30 minutes today.
  5. Error Budgets
    • Team X owns a critical service, ServiceA, at Google, used by other teams.
    • ServiceA has an error budget of ~60 minutes of SLO violation per year in a region (~5 minutes/month).
    • ServiceA went down for 30 minutes today.
    • Team X can't ship new features to ServiceA for 6 months, until they have more error budget (or they will risk violating their SLO/SLA).
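
The arithmetic behind these slides can be checked with a short sketch. The function names below are illustrative and not from the talk; the calculation assumes a 365-day year and a budget that accrues evenly each month.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def availability_for_budget(budget_minutes_per_year: float) -> float:
    """Availability target implied by a yearly budget of SLO-violation minutes."""
    return 1 - budget_minutes_per_year / MINUTES_PER_YEAR


def months_of_budget_spent(outage_minutes: float, budget_minutes_per_year: float) -> float:
    """How many months of evenly accrued budget a single outage consumes."""
    monthly_budget = budget_minutes_per_year / 12
    return outage_minutes / monthly_budget


yearly_budget = 60  # minutes per year, as on the slide
print(f"{availability_for_budget(yearly_budget):.4%}")  # 99.9886%
print(months_of_budget_spent(30, yearly_budget))        # 6.0
```

A 30-minute outage against a 5-minute monthly budget uses up six months of budget, which is where the "6 months" on the slide comes from.
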
  6. Asking the right questions
    Is my home page responding successfully to 99.5% of requests within 100 milliseconds?
  7. Asking the right questions
    Is my home page responding successfully to 99.5% of requests within 100 milliseconds from the servers in us-east1?
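
A question phrased this precisely can be evaluated mechanically. The following is a minimal sketch, assuming request records with a region, an HTTP status, and a latency; the `Request` type, the `slo_met` helper, and the choice to count any status below 500 as "successful" are assumptions for illustration, not something the talk specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    region: str
    status: int
    latency_ms: float


def slo_met(requests, region, max_latency_ms=100.0, target=0.995):
    """True if at least `target` of the region's requests were good.

    A request is "good" when it succeeded (assumed here: status < 500)
    and finished within `max_latency_ms`.
    """
    in_region = [r for r in requests if r.region == region]
    if not in_region:
        return True  # no traffic means nothing was violated (a policy choice)
    good = sum(
        1 for r in in_region
        if r.status < 500 and r.latency_ms <= max_latency_ms
    )
    return good / len(in_region) >= target


sample = (
    [Request("us-east1", 200, 40.0)] * 996
    + [Request("us-east1", 500, 40.0)] * 2    # failed
    + [Request("us-east1", 200, 250.0)] * 2   # too slow
    + [Request("europe-west1", 500, 40.0)]    # other region, ignored
)
print(slo_met(sample, "us-east1"))  # True: 996 of 1000 requests were good
```
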
  8. At Google, everything is a service
    • Google Cloud Storage frontend → service
    • Google Fonts API → service
    • Google search index backend → internal service
    • Human resources database → internal service
    • Cafeteria menus → internal service
    ~O(10^10) requests per second in Google’s private network.
    Mostly gRPC/Protobuf-style networking (not HTTP REST/JSON APIs)
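  9. microservices vs. monoliths
    • Develop independently: microservices YES, monoliths NO
    • Scale independently: microservices YES, monoliths NO
    • Fail independently: microservices YES, monoliths NO
    • Number of things to monitor: microservices MANY, monoliths ONE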
  10. At Google, we don't write the BEST CODE, but we have world-class systems OBSERVABILITY.
  11. Time-series Metrics
    Measurement of a value over time:
    • Gauges: current value of an indicator (example: current memory usage in MB)
    • Counters: only-increasing values (example: request count)
    Examples:
    • 99th percentile latency of POST requests to /login over the past 5 minutes
    • success rate of GET requests over the past day
    • average memory usage in the past 30 minutes
    • "number of orders completed" in the past hour
  12. Time-series Metrics
    "orders_completed" counter example:
    • server 1, GET /_metrics: orders_completed[server=1] 3
    • server 2, GET /_metrics: orders_completed[server=2] 12
    • metrics collector (reads both pages)
  13. Anatomy of a metrics page
    • name
    • labels
    • value
    orders_created[server=A, region=us-central1, version=14] 563
    http_requests[status=200, method=GET, path='/', server=A, region=us-central1, version=14] 156183
    http_requests[status=200, method=GET, path='/login', server=A, region=us-central1, version=14] 560
    http_requests[status=500, method=GET, path='/login', server=A, region=us-central1, version=14] 2
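
To make the name/labels/value anatomy concrete, here is a minimal sketch of a labeled counter that renders a page in the slide's notation. The `Counter` class is hypothetical and written for illustration; it is not the Prometheus, OpenCensus, or Stackdriver client API, and labels are rendered in alphabetical order rather than the order shown on the slide.

```python
from collections import defaultdict


class Counter:
    """An only-increasing metric, keyed by its label values."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(int)

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("counters only increase")
        # Sort labels so the same label set always maps to the same series.
        key = tuple(sorted((k, str(v)) for k, v in labels.items()))
        self._values[key] += amount

    def render(self):
        """One line per label set: name[label=value, ...] value"""
        lines = []
        for key, value in sorted(self._values.items()):
            label_text = ", ".join(f"{k}={v}" for k, v in key)
            lines.append(f"{self.name}[{label_text}] {value}")
        return "\n".join(lines)


http_requests = Counter("http_requests")
http_requests.inc(status=200, method="GET", path="/login", server="A")
http_requests.inc(status=200, method="GET", path="/login", server="A")
http_requests.inc(status=500, method="GET", path="/login", server="A")
print(http_requests.render())
# http_requests[method=GET, path=/login, server=A, status=200] 2
# http_requests[method=GET, path=/login, server=A, status=500] 1
```

Each distinct combination of label values becomes its own time series, which is what lets a collector later answer questions such as "error rate of /login on server A".
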
  14. Benefits of metrics
    • Measure if you're meeting your SLOs
    • Create alerts
    • Answer difficult questions (even in a Google datacenter):
      ◦ what's the average uptime of the machines that are in the top 10%?
      ◦ how many packet drops happened in the past minutes, and from which machines?
  15. Metrics collection
    • Prometheus → Grafana (UI) → Prometheus alerting → PagerDuty, …
    • Google Stackdriver → Stackdriver alerting → Stackdriver console (UI)
  16. Tracing
    Which services does a request travel through, and for how long?
    Tracing allows you to understand call patterns between your services, and find bottlenecks.
    • Exercise: how do you optimize the Facebook home page load time?
  17. How Tracing works
    You need to update your code:
    • Incoming requests: get the trace ID from the request
    • Outgoing requests: pass the trace ID to the request
    Diagram:
    • Request to Service A: GET http://A/foo with Trace-Header: 123
    • Request from Service A to Service B: GET http://B/bar with Trace-Header: 123
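
The two code changes on this slide can be sketched as follows. This is a minimal illustration, not the OpenCensus API: the helper names are hypothetical, the `Trace-Header` name is taken from the slide (real systems use headers such as the W3C `traceparent`), and minting a new ID when none is present is an assumption about what the first service in the chain does.

```python
import uuid

TRACE_HEADER = "Trace-Header"  # header name as shown on the slide


def trace_id_from(incoming_headers):
    """Incoming request: reuse the caller's trace ID, or start a new trace."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id, extra=None):
    """Outgoing request: attach the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def handle_foo(incoming_headers, call_service_b):
    """Service A's handler for GET /foo, which calls Service B.

    `call_service_b(url, headers)` stands in for an HTTP client call.
    """
    trace_id = trace_id_from(incoming_headers)
    return call_service_b("http://B/bar", outgoing_headers(trace_id))


# Usage with a fake client that records what Service B would receive.
seen = []
handle_foo({"Trace-Header": "123"}, lambda url, headers: seen.append((url, headers)))
print(seen)  # [('http://B/bar', {'Trace-Header': '123'})]
```

Because every hop forwards the same ID, a tracing backend can stitch the per-service timings into one request trace.
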
  18. Example request trace
    • frontend.GET./home 120ms
    • ServiceA.Calculate 71ms
    • ServiceF.ComputeX 28ms
    • ServiceD.GetX 32ms
    • ServiceF.ComputeX 28ms
    • ServiceF.ComputeY 38ms
    • ServiceH.GetZ 24ms
    • ServiceF.ComputeY 20ms
  20. Profiling
    In which functions/methods is time spent in my "process"?
    • Helps you identify "slow paths" in your "fast paths"
    • You need to enable profiling in your application
    Want easy process profiling?
    • Try Stackdriver Profiling.
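
As a local illustration of function-level profiling, the sketch below uses Python's standard-library `cProfile` rather than Stackdriver, so it shows the idea and not the hosted product. The `slow_path`, `fast_path`, and `handler` functions are made up for the example.

```python
import cProfile
import io
import pstats


def slow_path(n):
    """Sum of squares below n, computed with an O(n) loop."""
    return sum(i * i for i in range(n))


def fast_path(n):
    """The same sum of squares via the closed-form formula."""
    return n * (n - 1) * (2 * n - 1) // 6


def handler():
    # A "fast path" that accidentally still calls the slow implementation.
    return slow_path(200_000) + fast_path(200_000)


profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# Report the five entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

In the printed report, `slow_path` and its generator expression account for nearly all of `handler`'s cumulative time, which is the kind of slow path a profiler is meant to surface.
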
  21. Service Topology Graph
    Which services call which other services?
    • Identify dependencies between your services.
    • Answer hard questions about service communication easily.
  22. Service Topology Graph
    Which services call which other services?
    • Identify dependencies between your services.
    • Answer hard questions about service communication easily.
    Examples:
    • Who is making requests to ServiceA?
    • How many requests-per-second (RPS) for ServiceA ⇒ ServiceB?
    • How does the latency of A⇒B compare to C⇒B?
    • What % of A⇒B requests go to B in us-west, and what % to us-east?
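
Several of these example questions reduce to simple aggregations over per-request records. The sketch below assumes each record is a (source, destination, destination region) tuple collected over a known time window; the record shape, the sample data, and the helper names are hypothetical and do not reflect how Istio stores or exposes its telemetry.

```python
from collections import Counter

# Hypothetical records observed during a 2-second window.
records = [
    ("frontend", "A", "us-west"),
    ("A", "B", "us-west"),
    ("A", "B", "us-west"),
    ("A", "B", "us-west"),
    ("A", "B", "us-east"),
    ("C", "B", "us-east"),
    ("C", "B", "us-east"),
]


def callers_of(records, service):
    """Who is making requests to `service`?"""
    return sorted({src for src, dst, _ in records if dst == service})


def edge_rps(records, src, dst, window_seconds):
    """Requests per second on the src ⇒ dst edge."""
    count = sum(1 for s, d, _ in records if (s, d) == (src, dst))
    return count / window_seconds


def region_split(records, src, dst):
    """Fraction of src ⇒ dst requests served in each destination region."""
    regions = Counter(r for s, d, r in records if (s, d) == (src, dst))
    total = sum(regions.values())
    return {region: n / total for region, n in regions.items()} if total else {}


print(callers_of(records, "B"))        # ['A', 'C']
print(edge_rps(records, "A", "B", 2))  # 2.0
print(region_split(records, "A", "B")) # {'us-west': 0.75, 'us-east': 0.25}
```
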
  23. Want easy Service Topology?
    Istio gives you a service topology graph without changing any application code.
  24. Summary
    • Read the SRE Book (free) by Google to learn about SLOs/SLAs/error budgets
    • If you're using Kubernetes, use Istio (...to get metrics without changing code)
    • Types of monitoring
      • Metrics
        ◦ Good for monitoring SLOs/SLAs (+alerting), or app health
        ◦ Try Prometheus or Stackdriver Metrics
      • Tracing
        ◦ ...which service in the call graph takes how much time
        ◦ Try OpenCensus + Stackdriver Trace
      • Profiling
        ◦ Function-level performance diagnostics in a process
  25. Thanks
    • Play with github.com/GoogleCloudPlatform/microservices-demo
    • Say hello on Twitter: @ahmetb
    • Google Cloud Startup Program ($3000 credits)
      ◦ Special offer for DevFest: http://goo.gl/XyeCQ
      ◦ g.co/cloudstartups