[Kubecon Europe 2023] Ingesting 6.5 Tb of Telemetry Data Daily Through Open Telemetry Protocol and Collectors

Presentation video: https://www.youtube.com/watch?v=aDysORX1zIs

This presentation shares how the VTEX observability team moved from a single vendor to a full Open Telemetry protocol solution that handles 6.5 terabytes of telemetry data per day (logs, system metrics, business metrics, traces and audit logs). With the CNCF community in mind, the talk shows the entire architecture, the tradeoffs, how to instrument every application inside the company, how to manage OTEL Collectors at scale, how to centralize visualization, how to extend the collectors' code and how to guarantee resiliency. Open Telemetry allowed VTEX to completely modernize its observability stack with a horizon of at least 5 years, without requiring any migration of VTEX's applications. With the architecture this talk presents, VTEX can switch backend vendors without impacting instrumented code, which allows the engineering organization to move faster. Last but not least, this solution reduced VTEX's observability costs by 40% while giving engineers a modern, safer and more efficient way to observe their applications at scale. If these topics interest you, please come to this presentation. The idea is to give back to the CNCF community what it gave to us: knowledge and cutting-edge solutions.

kubecon europe 2023
Tuesday April 18, 2023 11:55 - 12:20 CEST
Hall 7, Room E | Ground Floor | Europe Complex
Observability Day, Project-specific: Observability + Prometheus + OpenMetrics + OpenTelemetry + Fluentbit

Gustavo Pantuza

April 18, 2023

Transcript

  1. Ingesting 6.5 Tb of Telemetry Data Daily Through Open Telemetry Protocol and Collectors
    Gustavo Pantuza
  2. VTEX context
    The Enterprise Digital Commerce Platform. Build, manage, and deliver profitable ecommerce businesses with more agility and less risk.
  3. VTEX at a glance
    3,200+ active online stores; 38 countries with active customers; publicly listed company.
  4. Where we are
    Global platform; 18 locations across the globe; 1.3 k employees. As of Q2/21, ended on June 30th, 2021.
    Locations: London, Barcelona, București, Buenos Aires, Santiago, Bogotá, Ciudad de México, New York, João Pessoa, Medellín, Milan, Rio de Janeiro, São Paulo, Lima, Lisbon, Singapore, Recife, Paris.
  5. Problem
    Inefficient ingestion control; HTTP 1.1 with no encryption; unstructured telemetry data; no common fields by design; too many library implementations; telemetry data governance; inefficient way to store KPIs; single vendor for different telemetry signals.
  6. Problem
    How to evolve to a long-term o11y stack without vendor lock-in while improving o11y efficiency?
  7. Solution + Outcomes
    Open Telemetry Protocol on every possible layer; common libraries with the same interfaces; Open Telemetry Collectors as the telemetry ingestor; different data sinks for different telemetry signals; architecture sharded by telemetry signal.
  8. Solution + Outcomes
    41% reduction in observability investments; a long-term architecture that does not require developers to migrate their applications if we change o11y vendors; 6.5 TB of telemetry data ingested per day, with control knobs such as dynamic sampling.
  9. Architecture
    Shared library (Diagnostics): gRPC communication; encryption; common methods interface.

        /**
         * VTEX's Telemetry methods
         */
        service Telemetry {
          /* Logs related methods */
          rpc Info(LogRequest) returns (google.protobuf.Empty);
          rpc Warn(LogRequest) returns (google.protobuf.Empty);
          rpc Error(LogRequest) returns (google.protobuf.Empty);
          rpc Debug(LogRequest) returns (google.protobuf.Empty);

          /* Metrics related methods */
          rpc SystemMetric(Metric) returns (google.protobuf.Empty);
          rpc BusinessMetric(Metric) returns (google.protobuf.Empty);

          /* Traces related methods */
          rpc Trace(Trace) returns (google.protobuf.Empty);
        }
  10. Architecture
    Shared library (Diagnostics): structured logs by design; common fields; instrumented with Open Telemetry official libraries.

        /**
         * Common Fields on telemetry data
         */
        message Common {
          ...
          /* Name of the service that is sending telemetry data */
          string service_name = 1;
          /* Instance hostname */
          string instance_id = 2;
          /* Instance Availability zone */
          string az = 3;
          /* Instance region */
          string region = 4;
          /* Optional hash table allowing users to send extra fields */
          map<string, string> extra_fields = 5;
          ...
        }
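
To illustrate what "structured logs by design" with common fields could look like on the application side, here is a minimal Python sketch. It mirrors the five fields of the `Common` message shown on the slide; everything else (the `build_log_record` and `to_json_line` helpers, and the `level` and `message` keys) is hypothetical, since the slide does not show the `LogRequest` message or the Diagnostics library's actual API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Common:
    """Python mirror of the fields visible in the Common message."""
    service_name: str
    instance_id: str
    az: str
    region: str
    extra_fields: Dict[str, str] = field(default_factory=dict)


def build_log_record(common: Common, level: str, message: str) -> dict:
    """Hypothetical helper: attach the common fields to every log record."""
    record = asdict(common)
    record["level"] = level      # assumed key, not shown on the slide
    record["message"] = message  # assumed key, not shown on the slide
    return record


def to_json_line(common: Common, level: str, message: str) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(build_log_record(common, level, message), sort_keys=True)


common = Common(
    service_name="checkout",
    instance_id="host-1",
    az="us-east-1a",
    region="us-east-1",
    extra_fields={"tenant": "store-a"},
)
print(to_json_line(common, "info", "order placed"))
```

Because every record is built from the same `Common` value, a backend can index and filter on `service_name`, `az` or `region` without per-application parsing rules.
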
  11. Architecture
    Custom Collectors: built internally with the Open Telemetry Collectors Builder (ocb); extended with our modules; different configurations per telemetry signal.
  12. Architecture
    Custom Collectors: built internally; extended with our modules; different configurations per telemetry signal.
    builder.yaml:

        dist:
          name: otelcol
          description: OpenTelemetry Collector
          version: 0.xx.y
          otelcol_version: 0.xx.y
        receivers:
          - ...
        exporters:
          - ...
        extensions:
          - ...
        processors:
          - ...
  13. Architecture
    4 terabytes of logs per day; 150 million active time series; 2.15 billion events ingested per day.
  14. Resilience
    Horizontal Pod auto-scaling: burst requests on a single deployment can significantly increase the collectors' load. One example of a single deployment:

        autoscaling:
          enabled: true
          minReplicas: 10
          maxReplicas: 20
          targetMemoryUtilizationPercentage: 90
          targetCPUUtilizationPercentage: 60
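
As a worked example of how these autoscaling values behave, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, taking the larger result across CPU and memory and clamping it to the min/max bounds. It is a simplification: the real controller also applies a tolerance band and stabilization windows, and the utilization inputs used here are made-up numbers, not VTEX measurements.

```python
import math

# Values taken from the autoscaling block on the slide.
MIN_REPLICAS = 10
MAX_REPLICAS = 20
TARGET_MEMORY_UTILIZATION = 90  # percent of requested memory
TARGET_CPU_UTILIZATION = 60     # percent of requested CPU


def desired_replicas(current_replicas: int,
                     cpu_utilization: float,
                     memory_utilization: float) -> int:
    """Simplified HPA calculation: one proposal per metric, keep the largest,
    then clamp to [MIN_REPLICAS, MAX_REPLICAS]."""
    by_cpu = math.ceil(current_replicas * cpu_utilization / TARGET_CPU_UTILIZATION)
    by_memory = math.ceil(current_replicas * memory_utilization / TARGET_MEMORY_UTILIZATION)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, max(by_cpu, by_memory)))


print(desired_replicas(10, cpu_utilization=90, memory_utilization=45))   # 15
print(desired_replicas(10, cpu_utilization=150, memory_utilization=45))  # 20 (capped)
print(desired_replicas(10, cpu_utilization=30, memory_utilization=45))   # 10 (floor)
```

With a 60% CPU target, a burst that pushes 10 collectors to 90% CPU asks for 15 replicas, and anything beyond roughly 120% CPU hits the `maxReplicas: 20` ceiling.
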
  15. Resilience
    Sharding strategy: environments are sharded by business criteria, such as core systems vs internal services.
    Diagram labels: Shard 0, Shard 1, Shard 2, Shard 3; Logs, Metrics, Traces.
  16. Resilience
    Custom Open Telemetry Collector Processor: real time sampling (tail based sampling), with a default sampling, a sampling % per service and a skip sampling field.

        settings:
          default_sampling_percentage: 25
          skip_sampling_field: debug
        services_config:
          sampling:
            - name: service_0
              index: service_0
              percentage: 75
            - name: service_1
              index: service_1
              percentage: 0
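
The sampling knobs above can be sketched as a small decision function. This is only an illustration under stated assumptions: the slide does not show the processor's code, so the semantics here are guesses (a percentage is read as the share of records kept, and a record carrying a truthy `debug` field is assumed to bypass sampling entirely). It also makes an independent per-record random decision, which is simpler than the tail based sampling the slide names. `should_keep` and the record layout are hypothetical.

```python
import random
from typing import Callable, Mapping

# Values copied from the configuration on the slide.
DEFAULT_SAMPLING_PERCENTAGE = 25
SKIP_SAMPLING_FIELD = "debug"
SERVICE_PERCENTAGES = {"service_0": 75, "service_1": 0}


def should_keep(record: Mapping[str, object],
                rng: Callable[[], float] = random.random) -> bool:
    """Return True if the record should be forwarded to the backend.

    Assumptions (not confirmed by the slide): records with a truthy
    skip field are always kept, and a percentage is the share of kept records.
    """
    if record.get(SKIP_SAMPLING_FIELD):
        return True
    percentage = SERVICE_PERCENTAGES.get(
        str(record.get("service_name")), DEFAULT_SAMPLING_PERCENTAGE
    )
    # rng() is in [0, 1), so percentage 0 never keeps and 100 always keeps.
    return rng() * 100 < percentage


records = [
    {"service_name": "service_0", "message": "sampled at 75%"},
    {"service_name": "service_1", "message": "dropped: 0%"},
    {"service_name": "service_1", "message": "kept: skip field set", "debug": True},
    {"service_name": "other", "message": "falls back to the 25% default"},
]
kept = [r["message"] for r in records if should_keep(r)]
print(kept)
```

Passing `rng` as a parameter keeps the decision testable with a fixed value instead of real randomness.
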
  17. Resilience
    Custom Open Search exporter: Write Ahead Log (WAL); AWS S3 + Lambda functions; data backfill; one week data expiration.

        wal:
          s3_region: some-aws-region
          s3_bucket: "wal-bucket-name"
          flush:
            bytes: 10000000
            interval: 120
          cluster: shard0
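
The `flush` settings suggest a buffer that is written out when it reaches a size threshold or an age threshold, whichever comes first. The sketch below shows that pattern in Python under several assumptions: `interval` is taken to be in seconds (the slide gives no unit), the `sink` callable stands in for the S3 upload, and `WalBuffer` is a hypothetical name. It is not the exporter's actual implementation, which the slide does not show.

```python
import time
from typing import Callable, List, Optional


class WalBuffer:
    """Buffer payloads and flush them when size or age crosses a threshold."""

    def __init__(self,
                 sink: Callable[[bytes], None],
                 flush_bytes: int = 10_000_000,   # "bytes" on the slide
                 flush_interval: float = 120.0,   # "interval"; seconds assumed
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._sink = sink                  # stand-in for an S3 upload
        self._flush_bytes = flush_bytes
        self._flush_interval = flush_interval
        self._clock = clock
        self._pending: List[bytes] = []
        self._pending_bytes = 0
        self._oldest: Optional[float] = None  # arrival time of the oldest pending payload

    def append(self, payload: bytes) -> bool:
        """Add a payload; return True if this call triggered a flush."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(payload)
        self._pending_bytes += len(payload)
        return self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush if the buffer is large enough or old enough."""
        if not self._pending:
            return False
        too_big = self._pending_bytes >= self._flush_bytes
        too_old = self._clock() - self._oldest >= self._flush_interval
        if not (too_big or too_old):
            return False
        self._sink(b"".join(self._pending))
        self._pending = []
        self._pending_bytes = 0
        self._oldest = None
        return True


uploaded: List[bytes] = []
wal = WalBuffer(uploaded.append, flush_bytes=16)
wal.append(b"first-record;")
wal.append(b"second-record;")  # crosses 16 bytes, so both are flushed together
print(len(uploaded))
```

In a real exporter `maybe_flush` would also be called from a timer, so that a quiet period still flushes data once the interval has elapsed.
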
  18. Resilience
    Alerting: we monitor all collectors from all pipelines and shards. Based on this monitoring, we trigger alerts that page the team's on-call engineer.
  19. Migration tips
    RFC-like process to engage engineering teams; C-level buy-in on the project; understand your client (speak to your teams); engage your vendors on the resiliency plan; find early adopters. One step at a time.
  20. Recap
    VTEX Context: we saw the multi-tenant architecture.
    Problem: we understood the problem of using a single vendor for different telemetry data.
    Solution + Outcomes: before jumping into details, we saw the overall approach and direct outcomes.
    Architecture: then we jumped into the details of the long-term architecture and system design.
    Resilience: finally, we discussed strategies to avoid global outages and how to monitor the entire architecture.