October 28, 2024

Unlocking the Power of OpenTelemetry: Enhancing Design, Development, and Testing

Open Source Summit Japan 2024

Developers often face the complex challenges of designing, debugging, and testing distributed systems like microservices. Understanding failures, identifying performance issues, and ensuring system reliability from the early stages of design and development can be daunting. Observability technologies provide valuable insights not just in production but also during design and development. In this session, we will explore OpenTelemetry, a cutting-edge observability framework, and its practical applications in the design, debugging, and testing of distributed systems. Key topics include:

- Assessing the impact of incorporating a cache server on system behavior during the design phase.
- Evaluating how database failures affect both backend and frontend applications during fault testing.
- Detecting performance bottlenecks for specific requests during load testing.

Participants will gain a clear understanding of how OpenTelemetry can revolutionize their debugging and testing processes, leading to more effective experiments and increased reliability in their distributed systems.

Transcript

  1. © Hitachi, Ltd. 2024. All rights reserved. Unlocking the Power

    of OpenTelemetry: Enhancing Design, Development, and Testing Oct. 28, 2024 Takaya Ide, Yasuo Nakashima Services Computing Research Dept. Research & Development Group, Hitachi, Ltd.
  2. © Hitachi, Ltd. 2024. All rights reserved. Today’s complex distributed

    systems involve multiple interacting services. This makes design, debugging, and testing increasingly challenging. Increasing Complexity of Development 1 1. Top Software Development Trends for 2024, SIMFORM 2. Business Wire. 76% of CIOs Say It’s Impossible to Manage Digital Performance Complexity, 2018 3. DORA. (2022). State of DevOps 2022. DevOps Research and Assessment. 74% of orgs use microservice architecture 1 76% of CIOs Say It Could Become Impossible to Manage Digital Performance, as IT Complexity Soars 2 42% of orgs use hybrid cloud 3 “82% of software makers report defects associated with undiagnosed test failures causing production problems” 5 “developers say they tend to spend 25–50% of their time per year on debugging” 6 “Among projects over 500 person-months, 51.7% missed deadlines, and 40.4% exceeded budgets” 4 4. JUAS, “企業IT動向調査報告書2024 (Corporate IT Trends Survey Report 2024)”, 2024 5. Undo.io, Optimizing the software supplier and customer relationship, 2020 6. Undo, “Time spent debugging software”
  3. © Hitachi, Ltd. 2024. All rights reserved. How should we

    address such issues? Challenges Faced by Developers 2 What value will implementing a cache server bring? Or will it just add burdens in terms of operations and costs? When a DB failure occurs, how will the effects spread? What is the risk of cascade failures? What is the impact on latency? What processing is the bottleneck in OIDC authentication? It’s tough because parameter adjustments shift the bottleneck.
  4. © Hitachi, Ltd. 2024. All rights reserved. • OpenTelemetry is

    an open-source observability framework and spec. • Measuring, Collecting, Processing, Exporting telemetry signals OpenTelemetry (OTel) 4 App1 OTel API/SDK App2 OTel Auto-Inst. OTel Collector ... ... Other Monitoring Tools / Services Measuring Collecting Processing Exporting Storing Analyzing Visualizing Signals Application Monitoring Input Output Signals Signals
  5. © Hitachi, Ltd. 2024. All rights reserved. OTel should also

    be a powerful tool for design and development. Enable data-driven decision making Correlating signals across multiple applications May be considered excessive for development use Favorable Trends • Auto-instrumentation allows fast attach-detach, minimizing setup cost • Growing support from open-source and cloud services for analyzing OTel signals Unlocking OTel Beyond Operations 5 → Can we leverage OTel in design and dev. by attaching only when needed?
  6. © Hitachi, Ltd. 2024. All rights reserved. Key Features 6

    Auto- Instrumentation Signals (Semantic Convention)
  7. © Hitachi, Ltd. 2024. All rights reserved. Attach instrumentation without

    modifying program code • Achieved via monkey patching, which dynamically modifies target code • Supported: Java, JavaScript, Python, PHP, .NET, Ruby, (Go) Auto-Instrumentation 7 app.jar opentelemetry-agent.jar • Analyze intermediate code • Detect libs (e.g., spring) • Modify code Monkey Patching
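The monkey-patching idea behind the agent can be sketched in plain Python. This is a hand-rolled illustration of the technique, not the agent's actual code; `traced` and `handle_request` are made-up names.

```python
import functools
import time

def traced(func):
    """Wrap a function to record its duration, similar in spirit to
    what an auto-instrumentation agent injects via monkey patching."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration = time.perf_counter() - start
            print(f"span: {func.__name__} took {duration:.6f}s")
    return wrapper

# An existing function, "patched" at runtime without editing its source:
def handle_request(path):
    return f"200 OK {path}"

handle_request = traced(handle_request)  # the agent does this step for you
```

Real agents apply the same swap to well-known library entry points (e.g., an HTTP server's dispatch method), which is why no application code changes are needed.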
  8. © Hitachi, Ltd. 2024. All rights reserved. Languages that compile

    to binaries, like Golang or C, can’t use this, can they? Question 8
  9. © Hitachi, Ltd. 2024. All rights reserved. OpenTelemetry Go Instrumentation

    Auto-instrumentation is inherently hard to apply to binaries. Efforts are underway to solve this issue (Ref.) Auto-Instrumentation for binaries 9 [WIP]Auto-Instrumentation based on Traffic Pattern (Hitachi 2021) [WIP]opentelemetry-go-instrumentation (OTel community) https://github.com/open-telemetry/opentelemetry-go-instrumentation Golang Process eBPF Analyzer Inst. Manager Set probe. Load eBPF program This agent analyzes a target Go process and finds instrumentable functions. Then it attaches an eBPF program to hooks in those functions. Traces Detect process. Find funcs. Analyze stack & CPU register
  10. © Hitachi, Ltd. 2024. All rights reserved. OpenTelemetry Operator enables

    auto-instrumentation of containers and deploys the OTel Collector in a Kubernetes-native way OpenTelemetry Operator for Kubernetes 10 OTel Collector https://github.com/open-telemetry/opentelemetry-operator Instrumentation custom resource OpenTelemetryCollector custom resource Add OTel agent as init-container Deploy Signals Con- troller OpenTelemetry Operator App OTel agent Auto Inst. Kubernetes
  11. © Hitachi, Ltd. 2024. All rights reserved. Key Features 11

    Auto- Instrumentation Signals (Semantic Convention)
  12. © Hitachi, Ltd. 2024. All rights reserved. OTel Signals 12

    As of 2024, Auto-Instrumentation measures metrics, logs, and traces Enrich log output Metrics Logs Traces Time-series data like latency, with various values defined by Semantic Conventions. Converts standard logs into structured logs with contextual information. Visualizes process call relationships; generates large data volumes.
  13. © Hitachi, Ltd. 2024. All rights reserved. Semantic Convention 13

    https://opentelemetry.io/docs/specs/semconv/ Semantic Conventions define common attributes that give meaning to signals. E.g., Metric: http.server.request.duration Attribute Type Description Examples Requirement Level Stability http.request.method string HTTP request method. [1] GET; POST; HEAD Required url.scheme string The URI scheme component identifying the used protocol. http; https Required error.type string Describes a class of error the operation ended with. [3] timeout; java.net.UnknownHostException; server_certificate_invalid; 500 Conditionally Required If request has ended with an err. http.response.status_code int HTTP response status code. 200 Conditionally Required If and only if one was received/sent. ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ stable stable stable stable → Java Auto-Instrumentation supports not only HTTP, but also JVM and Process metrics
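Because attribute names are standardized, tooling can classify datapoints without app-specific knowledge. The dict layout below is a simplified sketch shaped after the convention, not the SDK's internal representation; `is_error` is a hypothetical helper.

```python
# A datapoint for http.server.request.duration, shaped per the
# Semantic Conventions above (a plain-dict sketch for illustration).
datapoint = {
    "name": "http.server.request.duration",
    "unit": "s",
    "value": 0.052,
    "attributes": {
        "http.request.method": "GET",        # Required
        "url.scheme": "https",               # Required
        "http.response.status_code": 200,    # Conditionally Required
    },
}

def is_error(dp):
    """Classify a datapoint as a server error using only the
    standardized attribute name, with no app-specific knowledge."""
    return dp["attributes"].get("http.response.status_code", 0) >= 500
```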
  14. © Hitachi, Ltd. 2024. All rights reserved. Examples of developer

    support using OTel 15 Overall performance and downtime can be monitored using conventional external monitoring. Req. Resp. Request generator Frontend App. Frontend Backend App. Backend http request Long Response time Internal Server Error 500 … ? ? ? However, it is difficult to evaluate each component. Load Balancer DB Cache Server Load Balancer
  15. © Hitachi, Ltd. 2024. All rights reserved. Examples of developer

    support using OTel 16 Req. Resp. Request generator App. App. http request Long Response time Internal Server Error 500 … Dashboard / Analysis by script (ex. anomaly detection, downtime calculation) OpenTelemetry Measurement Agent OpenTelemetry Measurement Agent Traces, Metrics, Logs Collecting telemetry for each component using OTel Understanding the performance and behavior of each component DB Cache Server Telemetry management platform (e.g., Amazon CloudWatch, Grafana) Frontend Backend Load Balancer Load Balancer
  16. © Hitachi, Ltd. 2024. All rights reserved. Information that can

    be obtained ✓ Start and end timings of events (failures or load spikes) ✓ Error rate, latency, resource consumption before, during, and after the event ✓ Changes in Java connection count, Java thread count, and other metrics Use case examples 1. Failure testing: Investigate how the duration of frontend errors changes during DB failover when RDS Proxy is introduced. 2. Performance testing: Test if resource consumption under load remains below the specified limits and investigate the components causing bottlenecks for potential improvements. 3. Proof of Concept: Investigate how the response time for database access changes when a cache server is introduced. Concept of analysis with OTel 17 [{ "name": "my_aurora_db", "start_time": "2024-06-19T12:00:00Z", "end_time": "2024-06-19T12:01:00Z", "abnormal_time_seconds": 60, "metrics": { "before_abnormal": { "average_latency_ms": 50, "total_errors": 0, "error_percentage": 0.0, "requests_per_second": 200, "cpu_usage_percentage": 30.0, "memory_usage_mb": 2048, "read_iops": 600, "write_iops": 400, "retry_attempts": 0, "cache_hit_ratio": 95.0, "connection_errors": 0, "transaction_rollbacks": 0 }, "under_abnormal": { "average_latency_ms": 250, "total_errors": 150, "error_percentage": 5.0, "requests_per_second": 80, "cpu_usage_percentage": 75.0, "memory_usage_mb": 4096, "read_iops": 1200, "write_iops": 800, "retry_attempts": 20, "cache_hit_ratio": 85.0, "connection_errors": 30, "transaction_rollbacks": 10 }, "after_abnormal": { ... } } }, { "name": "backend_app", ... }] Example analysis output behavior before a failure behavior during the failure anomaly time Target component behavior after the failure
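The before/during comparison behind this output can be computed with the standard library alone. The JSON fragment is abridged from the slide's example; the `impact` helper is a hypothetical sketch, not the actual analysis script.

```python
import json

# A fragment of the example analysis output from the slide (abridged).
report = json.loads("""
[{
  "name": "my_aurora_db",
  "start_time": "2024-06-19T12:00:00Z",
  "end_time": "2024-06-19T12:01:00Z",
  "abnormal_time_seconds": 60,
  "metrics": {
    "before_abnormal": {"average_latency_ms": 50,  "error_percentage": 0.0},
    "under_abnormal":  {"average_latency_ms": 250, "error_percentage": 5.0}
  }
}]
""")

def impact(component):
    """Compare behavior before vs. during the failure for one component."""
    before = component["metrics"]["before_abnormal"]
    under = component["metrics"]["under_abnormal"]
    return {
        "name": component["name"],
        "downtime_s": component["abnormal_time_seconds"],
        "latency_ratio": under["average_latency_ms"] / before["average_latency_ms"],
        "error_increase_pct": under["error_percentage"] - before["error_percentage"],
    }

for component in report:
    print(impact(component))
```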
  17. © Hitachi, Ltd. 2024. All rights reserved. Issue: Evaluating how

    database failures affect applications during fault testing Case 1: Analysis of database failures during fault testing 18 Request generator (k6) Amazon ECS Service(Tasks) App. w/ OTel agent container Collector container AWS FIS Template Fault injection (reboot DB) Aurora PostgreSQL Analysis Script (Python+boto3) Req./Res Get traces/metrics Transmit traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Target system ALB RDS Proxy ALB: Application Load Balancer RDS: Relational Database Service FIS: Fault Injection Service
  18. © Hitachi, Ltd. 2024. All rights reserved. Issue: Evaluating how

    database failures affect applications during fault testing Case 1: Analysis of database failures during fault testing 19 Request generator (k6) AWS FIS Template Fault injection (reboot DB) Analysis Script (Python+boto3) Get traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Amazon ECS Service(Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Transmit traces/metrics Target system ALB RDS Proxy 1. Select metrics about the fault with target attributes (e.g., MetricName="FaultRate", ServiceName="test-app", ServiceType=AWS::ECS::Fargate) 2. Detect the anomaly time for each component using the metrics 3. Analyze related metrics and traces around the anomaly time Req./Res
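Step 1 of the workflow above can be sketched as a plain filter. The real script queries CloudWatch via boto3; the in-memory metric list and the `select_metrics` helper here are hypothetical stand-ins.

```python
# Hypothetical metric descriptors in a CloudWatch-like shape.
metrics = [
    {"MetricName": "FaultRate",
     "Attributes": {"ServiceName": "test-app", "ServiceType": "AWS::ECS::Fargate"}},
    {"MetricName": "FaultRate",
     "Attributes": {"ServiceName": "other-app", "ServiceType": "AWS::ECS::Fargate"}},
    {"MetricName": "Latency",
     "Attributes": {"ServiceName": "test-app", "ServiceType": "AWS::ECS::Fargate"}},
]

def select_metrics(metrics, name, **attrs):
    """Keep only metrics matching the target name and all given attributes."""
    return [
        m for m in metrics
        if m["MetricName"] == name
        and all(m["Attributes"].get(k) == v for k, v in attrs.items())
    ]

selected = select_metrics(metrics, "FaultRate",
                          ServiceName="test-app",
                          ServiceType="AWS::ECS::Fargate")
```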
  19. © Hitachi, Ltd. 2024. All rights reserved. Information in metrics

    20 Metric Attribute (situation where the metric was recorded) Metric name (what the metric means) Metric datapoints … statistics of the metric (Sample count, Average, Max, Min, p99, …) *From Amazon CloudWatch
  20. © Hitachi, Ltd. 2024. All rights reserved. ◆ about HTTP

    • http.server.duration, http.client.duration ◆ about Runtime environment • jvm.threads.count, jvm.memory.usage, jvm.cpu.utilization, … ◆ about Database • db.client.connection.count, db.client.connection.create_time, db.client.connection.pending_requests, … Examples of OTel metrics 21
  21. © Hitachi, Ltd. 2024. All rights reserved. Case 1: How

    database failures affect applications 22 Request generator (k6) AWS FIS Template Fault injection (reboot DB) Analysis Script (Python+boto3) Get traces/metrics Detail Summary Start test Analysis tools Testing tools Amazon ECS Service(Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Transmit traces/metrics Target system ALB RDS Proxy Req./Res Dashboard Amazon CloudWatch, AWS X-Ray
  22. © Hitachi, Ltd. 2024. All rights reserved. Metrics provide insight

    into statistical behavior over relatively long periods of time (1 min~). Case 1: Information obtained from metrics 23 Reboot DB by FIS Metrics obtained from the API Dashboard (Amazon CloudWatch) Anomaly time detection using thresholds Example: ✓ Average Error rate (5xx) > 0 ✓ p99 of Response time > 1 second ✓ p95 of Response time > p99 of Response time under normal conditions *From Amazon CloudWatch
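The threshold rules above can be applied to metric datapoints along these lines. The datapoint layout, sample values, and the `detect_anomaly_window` helper are hypothetical; the real data comes from the CloudWatch API.

```python
def detect_anomaly_window(datapoints, p99_threshold_s=1.0, max_error_rate=0.0):
    """Return (first, last) timestamps of datapoints breaching a threshold,
    or None if no datapoint does. Each datapoint is a per-period summary."""
    abnormal = [
        d["timestamp"] for d in datapoints
        if d["p99_latency_s"] > p99_threshold_s or d["error_rate"] > max_error_rate
    ]
    if not abnormal:
        return None
    return min(abnormal), max(abnormal)

# Hypothetical per-minute statistics around a DB reboot.
points = [
    {"timestamp": 0,   "p99_latency_s": 0.2, "error_rate": 0.00},
    {"timestamp": 60,  "p99_latency_s": 2.5, "error_rate": 0.05},  # DB rebooted
    {"timestamp": 120, "p99_latency_s": 1.8, "error_rate": 0.01},  # recovering
    {"timestamp": 180, "p99_latency_s": 0.3, "error_rate": 0.00},
]
window = detect_anomaly_window(points)  # (60, 120)
```

Metric periods bound the precision here: with 1-minute datapoints, the window is only accurate to the minute, which is why the next slide refines it with traces.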
  23. © Hitachi, Ltd. 2024. All rights reserved. Case 1: Analysis

    using traces 24 Trace: Information of each request ◆Start time ◆Http Status Anomaly time detection in seconds Start time(sec) Response time(sec) Start time Http status=500 ◆ DB connection ✓ db.client.connection.count ✓ db.client.connection.wait_time ◆ Resource usage by retry ✓ jvm.memory.usage ✓ jvm.system.cpu.utilization ✓ jvm.gc.duration Analysis around the anomaly time Related metrics to check The response stopped when the database failed because no timeout was set in the application ◆ Response time of each request in trace
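A second-resolution anomaly window can be derived from per-request traces along these lines. The span layout, values, and helper name are hypothetical.

```python
def anomaly_window_from_traces(spans):
    """Return (first, last) start times of requests that failed (HTTP 5xx)
    or exceeded 1 s, or None. Each span summarizes one request."""
    bad = [s["start_s"] for s in spans
           if s["status"] >= 500 or s["duration_s"] > 1.0]
    return (min(bad), max(bad)) if bad else None

# Hypothetical request spans around the DB reboot.
spans = [
    {"start_s": 10.2, "duration_s": 0.05, "status": 200},
    {"start_s": 11.7, "duration_s": 4.80, "status": 500},  # during the reboot
    {"start_s": 13.1, "duration_s": 2.10, "status": 200},  # slow recovery
    {"start_s": 20.4, "duration_s": 0.06, "status": 200},
]
window = anomaly_window_from_traces(spans)  # (11.7, 13.1)
```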
  24. © Hitachi, Ltd. 2024. All rights reserved. Issue: Detecting performance

    bottlenecks for specific requests during load testing. Case 2: Bottleneck detection during load testing 25 Request generator (k6) Amazon ECS Service(Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Analysis Script (Python+boto3) Req./Res Get traces/metrics Transmit traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Target system ALB RDS Proxy Increase # of requests ALB: Application Load Balancer RDS: Relational Database Service
  25. © Hitachi, Ltd. 2024. All rights reserved. Case 2: Analysis using

    traces 26 *From Grafana Traces include duration information for each segment → useful for bottleneck detection
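Per-segment "self time" (a span's duration minus its children's) is one simple way to rank bottleneck candidates from such a trace. This is a sketch with a hypothetical span layout (unique span names assumed), not Grafana's algorithm.

```python
def bottleneck(spans):
    """Attribute to each span its duration minus its children's durations
    ("self time"), and return the span name holding the most self time.
    Assumes span names are unique within the trace."""
    self_time = {}
    for s in spans:
        child_total = sum(c["duration_s"] for c in spans if c["parent"] == s["name"])
        self_time[s["name"]] = s["duration_s"] - child_total
    return max(self_time, key=self_time.get)

# Hypothetical trace for an authentication request.
trace = [
    {"name": "GET /login",   "parent": None,           "duration_s": 1.20},
    {"name": "verify_token", "parent": "GET /login",   "duration_s": 0.90},
    {"name": "db_lookup",    "parent": "verify_token", "duration_s": 0.15},
]
bottleneck(trace)  # "verify_token" (0.75 s of self time)
```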
  26. © Hitachi, Ltd. 2024. All rights reserved. Issue: Assessing the

    impact of incorporating a cache server on system behavior during the design phase. Case 3: Evaluation of performance change by a cache server 27 Evaluate with / without a cache server Request generator (k6) Amazon ECS Service(Tasks) App. w/ OTel agent container Collector container Aurora PostgreSQL Analysis Script (Python+boto3) Req./Res Get traces/metrics Transmit traces/metrics Amazon CloudWatch, AWS X-Ray Dashboard Detail Summary Start test Analysis tools Testing tools Target system ALB RDS Proxy Amazon ElastiCache ALB: Application Load Balancer RDS: Relational Database Service
  27. © Hitachi, Ltd. 2024. All rights reserved. Case 3: Evaluation

    of performance change by a cache server 28 Latency statistics (Average, Min, Max, p99, p50, …) w/o a cache server w/ a cache server w/ a cache server w/o a cache server or App. w/ OTel agent container Collector container Aurora PostgreSQL RDS Proxy Amazon ElastiCache ✓ The cache reduces latency but increases operational costs. ✓ I want to evaluate the benefits quantitatively, rather than relying on intuition or experience. Will a cache server provide benefits that justify the cost? Quantitative evaluation using metrics from OTel Latency Worth the cost Not worth the cost ✓ Statistical evaluation can be done easily. → Implementation decisions can be made based on solid evidence.
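Once latency samples are exported, the with/without comparison is plain standard-library statistics. The samples below are hypothetical; `latency_summary` is a sketch of the kind of summary the analysis script produces.

```python
import statistics

def latency_summary(samples_ms):
    """Summarize latency samples with the statistics used on the slide."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"avg": statistics.mean(samples_ms),
            "p50": qs[49], "p99": qs[98],
            "min": min(samples_ms), "max": max(samples_ms)}

# Hypothetical per-request latencies (ms) from the two test runs.
without_cache = [120, 130, 125, 140, 500, 135, 128]
with_cache    = [15, 18, 16, 120, 17, 14, 19]  # misses still hit the DB

summary = {
    "without_cache": latency_summary(without_cache),
    "with_cache": latency_summary(with_cache),
}
```

Comparing `summary["with_cache"]` against `summary["without_cache"]` (and against the cache's operating cost) turns the design decision into a quantitative one.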
  28. © Hitachi, Ltd. 2024. All rights reserved. Useful telemetry for

    testing 29 1. v1.20 or later: http.server.request.duration 2. Python auto-instrumentation (v1.27.0) does not support the process.* metrics. Metrics Traces Throughput http.server.duration1 rpc.server.duration Latency http.server.duration rpc.server.duration (end_time) – (start_time) Error Rate status attribute of http.server.duration Failover time http.server.duration distribution of traces Resource Utilization jvm.threads.count, jvm.memory.usage (process.memory.usage2), jvm.cpu.utilization (process.cpu.time), … Connection to DB db.client.connection.count, db.client.connection.create_time, db.client.connection.pending_requests
  29. © Hitachi, Ltd. 2024. All rights reserved. OTel metrics(HTTP,DB) 30

    Category Information Metrics Unit HTTP Duration of HTTP server requests http.server.request.duration second DB The number of connections that are currently in the state described by the state attribute db.client.connection.count (db.client.connections.usage) # of connections The maximum/minimum number of idle open connections allowed db.client.connection.max db.client.connection.idle.min # of connections The number of current pending requests for an open connection db.client.connection.pending_requests # of connections The time it took to create a new connection db.client.connection.create_time second The time it took to obtain an open connection from the pool db.client.connection.wait_time second The time between borrowing a connection and returning it to the pool db.client.connection.use_time second *https://opentelemetry.io/docs/specs/semconv/
  30. © Hitachi, Ltd. 2024. All rights reserved. OTel metrics(JVM) 31

    Category Information Metrics Unit jvm Thread count process.runtime.jvm.threads.count # of threads Recent system-wide CPU usage process.runtime.jvm.system.cpu.utilization CPU usage Average system-wide CPU load over the past minute process.runtime.jvm.system.cpu.load_1m # of CPU cores Memory usage in use process.runtime.jvm.memory.usage Byte Garbage collection duration process.runtime.jvm.gc.duration second Process CPU usage process.runtime.jvm.cpu.utilization CPU usage Number of classes unloaded since JVM startup process.runtime.jvm.classes.unloaded # of classes Number of classes loaded since JVM startup process.runtime.jvm.classes.loaded # of classes Number of classes currently loaded process.runtime.jvm.classes.current_loaded # of classes Memory used by buffers process.runtime.jvm.buffer.usage Byte Maximum memory used by buffers process.runtime.jvm.buffer.limit Byte Number of buffers in the pool process.runtime.jvm.buffer.count # of buffers * https://opentelemetry.io/docs/specs/semconv/
  31. © Hitachi, Ltd. 2024. All rights reserved. Our practice, Tips

    32 *From AWS Cost Explorer (comparing collecting traces throughout the entire day vs. only during tests)  Even if the telemetry transmitter is OSS, costs are incurred when using a managed service on the receiver side. ◆ We use OTel as a transmitter and AWS as a receiver ✓ If distributed traces are collected without sampling, • it could cost $20/day for 50 requests per second on a 2-layer system • $0.2/day for 1 request per second per component. → Sending 1,000 req./sec to 10 components could cost $2,000/day = $60,000/month just for tracing. ✓ Be cautious about long-running tests and forgetting to stop them, and configure sampling rules to avoid such costs.
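The slide's arithmetic, as a small helper. The $0.2/day rate is the slide's own observation for its setup, not an official AWS price, and `tracing_cost_per_day` is a hypothetical name.

```python
# Observed rate from the slide: roughly $0.2/day per component
# per 1 request/sec when traces are collected without sampling.
COST_PER_COMPONENT_PER_RPS = 0.2  # USD/day

def tracing_cost_per_day(requests_per_sec, n_components, sampling_ratio=1.0):
    """Back-of-the-envelope daily tracing cost on the receiver side."""
    return COST_PER_COMPONENT_PER_RPS * requests_per_sec * n_components * sampling_ratio

tracing_cost_per_day(1000, 10)                       # 2000.0 -> $2,000/day
tracing_cost_per_day(1000, 10, sampling_ratio=0.01)  # 20.0   -> 1% sampling
```

The sampling parameter makes the slide's advice concrete: the same load at 1% head sampling brings the bill from $2,000/day down to $20/day.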
  32. © Hitachi, Ltd. 2024. All rights reserved.  Some metrics

    are optional ◆ The metrics you want to collect may not be implemented, depending on the collector or programming language. ◆ In HTTP metrics: Required: http.server.request.duration, http.client.request.duration. Optional: http.server.active_requests, http.server.request.body.size, http.server.response.body.size, http.client.request.body.size, http.client.response.body.size, http.client.open_connections, http.client.connection.duration, http.client.active_requests  In some versions of OTel, the names of metrics may have changed ◆ e.g., db.client.connections.usage (v1.24.0) → db.client.connection.count (v1.26.0) ◆ For instance, in distributions like ADOT, metrics might still be collected under older names, and there can be a lag before updates from the OSS are reflected ◆ When checking or configuring metrics, it's important to be aware of these name changes based on the version in use. Our practice, Tips 33
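One way an analysis script can cope with such renames is a small normalization map. The `RENAMES` table and `normalize` helper are hypothetical; only the db.client rename is taken from the slide, and the table should be extended for whatever versions your distribution actually emits.

```python
# Known renames between Semantic Convention versions; the first entry
# is from the slide, the second reflects the pre-v1.20 HTTP metric name.
RENAMES = {
    "db.client.connections.usage": "db.client.connection.count",  # v1.24.0 -> v1.26.0
    "http.server.duration": "http.server.request.duration",       # pre-v1.20 name
}

def normalize(metric_name):
    """Map an older metric name to its current one, if a rename is known."""
    return RENAMES.get(metric_name, metric_name)

normalize("db.client.connections.usage")  # "db.client.connection.count"
```

Normalizing names at ingestion lets the same dashboards and scripts work across agent versions and distributions.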
  33. © Hitachi, Ltd. 2024. All rights reserved.  Comment from

    a development team ◆ It is challenging to implement application logging as rich as what OTel collects. ◆ While commercial software offers high functionality, the implementation burden is significant, making it desirable to achieve the same with OSS. ◆ Since support is limited to the OSS level, careful consideration is needed when integrating it into products. Our practice, Tips 34
  34. © Hitachi, Ltd. 2024. All rights reserved. • Increasing complexity

    of development • OpenTelemetry (OTel) can help you design and develop applications • OTel is relatively easy to adopt, allowing analysis of user experience and resource changes • Be mindful of the telemetry receiver and the cost when collecting telemetry Key Takeaways 35 Let’s use OpenTelemetry to enhance our development experience!
  35. © Hitachi, Ltd. 2024. All rights reserved. • Amazon Fault

    Injection Service, Application Load Balancer, Amazon CloudWatch, AWS X-Ray, Amazon Aurora, Amazon RDS Proxy, Amazon Elastic Container Service, AWS Cost Explorer, Amazon ElastiCache, and boto3 are trademarks of Amazon Web Services, Inc. in the United States and/or other countries. • OpenTelemetry (OTel) and Kubernetes are registered trademarks of the Linux Foundation in the United States and/or other countries. • Grafana and k6 are registered trademarks of Grafana Labs in the United States and/or other countries. Trademarks 36