Ao = Tm / (Tm + Td)
Ap = Tm / (Tm + Tp)
Tp = (Tm / MTBF) × (MTTR + MLDT + MAMDT)

Ao = Operational Availability
Ap = Predictive Availability
Tm = Task Duration
Td = Downtime
Tp = Predictive Downtime
MTTR = Mean Time to Recover
MLDT = Mean Logistics Delay Time
MAMDT = Mean Active Maintenance Downtime
MTBF = Mean Time Between Failure
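As an illustration with assumed numbers (none of these figures come from the deck): a 30-day task gives Tm = 720 h; with MTBF = 240 h, MTTR = 1 h, MLDT = 2 h and MAMDT = 1 h, the predicted downtime is Tp = (720 / 240) × (1 + 2 + 1) = 12 h, so Ap = 720 / (720 + 12) ≈ 0.984. Shrinking MLDT, for example by linking playbooks to dashboards as recommended later, is often the cheapest way to raise Ap.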
infinitely arbitrary input states and across infinite mutations of the underlying platform... ...by recording granular system states continuously and feeding them back into the system as architectural changes and instrumentation that further increase visibility.

[Feedback loop: System Visibility → Understanding → Architecture & Instrumentation → more System Visibility]
[Kubernetes layers and what each provides — Ingress: logs, request count, attack prevention, rate limiting; Service: service discovery, HA, load-balancing, SOA abstraction; Deployment: orchestration, health; Pod: code execution]
Out-of-the-box
• Failed Service Rotation
• Service Discovery
• Reduction in catastrophic failures

To-Do
• Small but numerous failures from a myriad of moving parts
• Centralized logging across IaaS, PaaS, OS, Application
• Metric isolation and aggregation across multiple abstractions and virtualization layers
• Capacity Planning across 3 to 4 levels of virtualization
• Distributed Tracing across 10 to 100s of microservices
• Node-level metrics for Cluster AutoScaling
  ◦ node-exporter
• Pod-level metrics for Horizontal Pod AutoScaling
  ◦ cAdvisor
  ◦ Metrics Server
• Kubernetes platform metrics to determine the health of the Orchestration Layer
  ◦ kube-state-metrics
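A minimal Prometheus scrape-config sketch for two of these sources, assuming node-exporter runs as a DaemonSet labelled app=node-exporter and kube-state-metrics is exposed at its usual in-cluster service name and port; both are illustrative assumptions, not values from the deck:

```yaml
scrape_configs:
  # Node-level metrics (CPU, memory, disk) that feed Cluster AutoScaling decisions
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labelled app=node-exporter (assumed label)
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: node-exporter
        action: keep

  # Orchestration-layer health: deployments, replicas, pod phases
  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']
```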
[Logging pipeline: a fluentbit DaemonSet runs on every VM/node, tailing text logs from /var/log/<app>/<pod> and container stdout logs from /var/lib/docker/containers; input parsers structure the records, the output stage ships them to an ES host, and the resulting indexes back Kibana search and dashboards]
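A minimal fluentbit configuration sketch for this pipeline; the Elasticsearch host, port and index name are assumptions for illustration, not values from the diagram:

```
[SERVICE]
    Parsers_File  parsers.conf

# Tail container stdout logs written by the container runtime on each node
[INPUT]
    Name    tail
    Path    /var/lib/docker/containers/*/*.log
    Parser  docker
    Tag     kube.*

# Ship parsed records to Elasticsearch; Kibana search and dashboards sit on the resulting index
[OUTPUT]
    Name    es
    Match   kube.*
    Host    elasticsearch.logging.svc
    Port    9200
    Index   app-logs
```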
• Context = UUID + metadata
• Next Request = Payload + Context
• Baggage = Set(K1:V1, K2:V2, ...)
• Async capture:
  ◦ Timing
  ◦ Events
  ◦ Tags
• Re-create call tree from store

[Call tree: A calls B and C; C calls D and E. Baggage accumulates the path at each hop — A: service=A; B: service=A, service=B; C: service=A, service=C; D: service=A, service=C, service=D; E: service=A, service=C, service=E]
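A minimal Python sketch of this propagation, deliberately independent of any specific tracer's API; the function names and the x-trace-id / x-baggage-* header names are assumptions for illustration:

```python
# Illustrative only -- not the API of OpenTracing, Zipkin or any other tracer.
import time
import uuid


def new_context(metadata=None):
    """Context = UUID + metadata, created at the edge service (A)."""
    return {"trace_id": str(uuid.uuid4()), "metadata": metadata or {}, "baggage": {}}


def inject(context, headers):
    """Next request = payload + context: copy the trace id and baggage into outgoing headers."""
    headers["x-trace-id"] = context["trace_id"]
    for key, value in context["baggage"].items():
        headers[f"x-baggage-{key}"] = value
    return headers


def record_span(context, service, operation):
    """Asynchronously captured span: timing, events and tags keyed by the trace id."""
    return {
        "trace_id": context["trace_id"],
        "service": service,
        "operation": operation,
        "start": time.time(),
        "tags": {},
        "events": [],
    }
```

A collector that stores these spans can later group them by trace_id and re-create the A → B / A → C → D, E call tree shown above.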
Easy but Difficult

| | Logging | Metrics | Tracing |
| Processing Overhead | Default High, Sampling | V3 Invariant, Time variant | Benchmarks Vary |
| Ease of Query | Moderate | Very Easy | Easy |
| Information Quality | Rich, System Scope | Very Rich, System Scope | Request Scope, Rich |
| Cost Effectiveness | V3 Variant, Low | High, Alerts! | Higher than Logging |
...to your Volume, Variety and Velocity
  ◦ Logging is an OLAP problem
  ◦ Ensure the provider abides by RELP (Reliable Event Logging Protocol)
  ◦ ES has indexing overheads that delay log delivery
• Prometheus is much better than other TSDBs like Graphite
  ◦ Taggable metrics
  ◦ However, not for long-term storage
  ◦ Export to LTS solutions from Prometheus
  ◦ For short-lived / scheduled jobs, use the PushGateway (see the sketch below)
• Don’t use Distributed Tracing unless you have 15-20+ Microservices
  ◦ You don’t really need a Service Mesh / ESB when you have a set of 5 services
  ◦ Use Exception Trackers like Sentry before you think ‘Tracing’
  ◦ Go with OpenTracing / Zipkin before paying for a SaaS solution
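A hedged example of the PushGateway recommendation using the official Python prometheus_client; the gateway address, job name and metric are illustrative:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# A short-lived / scheduled job cannot be scraped, so it pushes its metrics once before exiting.
registry = CollectorRegistry()
duration = Gauge("batch_job_duration_seconds",
                 "Runtime of the nightly batch job",
                 registry=registry)
duration.set(42.0)

# Assumed gateway address and job name -- replace with your own.
push_to_gateway("pushgateway:9091", job="nightly-batch", registry=registry)
```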
...panacea
  ◦ You need humans to debug and architect
  ◦ But:
    ▪ n(team) != n(services)
    ▪ Complex systems evolve with high velocity
    ▪ Knowledge of complex systems evolves amongst practitioners
    ▪ More humans -> more ambiguity -> more error introduction
• Automate Playbook and Dashboard generation
  ◦ Link playbooks to dashboards to reduce MLDT
• Create a uniform Logging / Monitoring framework across your services (see the sketch below)
  ◦ Make each app log the same way, track the same metrics
  ◦ Triaging becomes uniform across the Org
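One way to bootstrap such a uniform framework is a shared logging helper that every service imports; a minimal Python sketch, with field names that are assumptions rather than an established standard:

```python
import json
import logging
import sys


def get_logger(service, version):
    """Every app logs the same JSON fields, so triaging is uniform across the org."""

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            # Uniform schema: service, version, level, message, timestamp
            return json.dumps({
                "service": service,
                "version": version,
                "level": record.levelname,
                "message": record.getMessage(),
                "timestamp": self.formatTime(record),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(service)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


# Usage: get_logger("payments", "1.4.2").info("charge accepted")
```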