drequena
September 27, 2024

All I wish I knew before running Istio in production

Running a production-ready service mesh is VERY hard. From fully understanding the mesh architecture and installing it to scaling, upgrading, monitoring, and securing it, it can take many months to build enough confidence to release it into production.

For more than 2 years, our Traffic team has been running an Istio-based service mesh. We faced tons of problems during the many phases of a project this big: from picking the initial features and choosing the mesh architecture, to figuring out how to hide the mesh's complexity from users, to releasing it to production.

In this talk we want to share our biggest stumbles across the installation, maintenance, monitoring, upgrading, and operation of Istio. We believe that by sharing some hard-to-find tips and tricks, we can save people and organizations a lot of time when adopting the Istio service mesh.

Currently the Mesh is our main traffic solution. We chose to run a single mesh that spans the Kubernetes clusters of all our business units, creating a virtual, flat-network-like environment. Istio is responsible for mutual TLS, traffic routing, canary releases, retry policies, outlier detection, circuit breaking, and authentication and authorization between microservices and users, and it is a core piece of software in our current multi-region initiative. We have also extended it with custom features.

Join us in this Service Mesh minefield adventure and learn how to avoid almost all of the mines.

Transcript

  1. All I wish I knew before running Istio in Production

    KCD Porto - Portugal Sep/24 by: Daniel Requena
  2. Agenda ➔ Whoami ➔ Our environment ➔ What I wish

    I knew ➔ Questions? ➔ References
  3. Dad, Husband, Nerd Bachelor in Computer Science Master Computer Engineering

    +20 years of XP in Sysadmin/DevOps/SRE…etc. Staff Engineer at iFood @traffic team Daniel Requena $Whoami
  4. Special Thanks Jhonn Frazão Eduardo Baitello Débora Berte Fagner Luz

    Jhonatan Morais Edson Almeida Fernando Junior Kelvin Lopes
  5. Our environment iFood? Big numbers • Brazilian Food Delivery Company

    • +100 million orders per month • ~5500 employees / ~2k engineers • ~250K/600k RPS • +8k deploys per month • ~3k microservices • +54 Kubernetes clusters
  6. Our environment Mesh • Istio based ◦ sidecar model •

    Kubernetes only (no VMs) • Running since Q1-2022 • Current workload adoption: +70% • Current traffic flow: +75%
  7. • Features ◦ mTLS ◦ Authn/Authz ◦ Traffic management ▪

    Canary ▪ Retry policy ▪ Circuit Break ▪ Rate Limit ◦ Telemetry ◦ Traces ◦ Service Map (?) ◦ + some custom extensions • Important role in our multi-region strategy Our environment Mesh
  8. What I wish I knew Let's divide in topics •

    Concepts and mental model • Setup/Upgrades • Scalability • Monitoring • Sidecar/Proxy stuff • Cost • Misc
  9. What I wish I knew Concepts and mental model •

    Istio is an "Envoy configurator", at least in sidecar-mode (please, don't be mad)
  10. What I wish I knew Concepts and mental model Api

    Server Istio CRDs services endpoints … xDS Protocol Istiod Remote Api Server services endpoints
  11. What I wish I knew Concepts and mental model •

    What else does it do? ◦ Adds its own rules and validations ◦ It can choose different Envoy features ◦ Has a mechanism for precedence and merge of objects ▪ local ns ▪ external ns ▪ root ns (istio-system) ▪ This rule can be affected by "ExportTo" configurations ▪ some CRDs have different merge rules
  12. What I wish I knew Concepts and mental model •

    What else does it do? ◦ Adds its own rules and validations ◦ It can choose different Envoy features ◦ Has a mechanism for precedence and merge of objects ▪ local ns ▪ external ns ▪ root ns ▪ This rule can be affected by "ExportTo" configurations ▪ some CRDs have different merge rules
  13. What I wish I knew Concepts and mental model •

    Most of the features are enforced in Client Side (sidecar mode) ◦ Load Balancing ◦ Retry ◦ Locality ◦ Timeout ◦ etc… service-b.namespace.svc.cluster.local service-b 100.127.2.1 100.127.2.1
  14. What I wish I knew Concepts and mental model service-b.namespace.svc.cluster.local

    service-b 100.127.2.1 100.127.2.2 100.127.2.3 100.127.2.4 100.127.2.1 100.127.2.2 100.127.2.3 100.127.2.4
  15. What I wish I knew Concepts and mental model service-b.namespace.svc.cluster.local

    service-b 100.127.2.1 100.127.2.2 100.127.2.3 100.127.2.4 100.127.2.1 100.127.2.2 100.127.2.3 100.127.2.4 100.127.3.1 100.127.3.2 100.127.3.3 100.127.3.4 100.127.5.1 100.127.5.2 100.127.5.3 100.127.5.4 service-d 100.127.3.1 100.127.3.2 100.127.3.3 100.127.3.4 service-e 100.127.5.1 100.127.5.2 100.127.5.3 100.127.5.4
  16. What I wish I knew Concepts and mental model •

    Envoy request workflow and "structures" Endpoint list: - 100.67.1.2 - 100.67.2.1 - 100.67.10.5 - …
  17. What I wish I knew Concepts and mental model •

    Envoy request workflow and "structures" ◦ istioctl proxy-config [structure] args… ◦ istioctl proxy-config logs ◦ istioctl proxy-status
  18. What I wish I knew Concepts and mental model •

    Envoy request workflow and "structures" ◦ istioctl proxy-config [structure] args… ◦ istioctl proxy-status
  19. What I wish I knew Setup • Choose WISELY ◦

    Mesh type ▪ Single Mesh ▪ Isolated Meshes ◦ Network Model ▪ Single ▪ Multi ◦ Control plane setup ▪ Centralized ▪ Decentralized Our Setup • Single Mesh ◦ Per environment • Multi-Cluster ◦ Business units • Multi-primary ◦ Each cluster has its own Istiod • Multi-Network ◦ AWS setup ◦ k8s network setup
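
For the multi-primary, multi-network setup listed on this slide, a per-cluster IstioOperator file might carry values along these lines. This is only a sketch built from the fields Istio's multicluster install guides use; the mesh, cluster, and network names are hypothetical placeholders, not values from the talk.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      # One mesh ID shared by every cluster in the single mesh (hypothetical name).
      meshID: mesh-prod
      multiCluster:
        # Unique per cluster; each cluster runs its own istiod (multi-primary).
        clusterName: bu-a-cluster-1
      # Clusters on different networks reach each other through east-west gateways.
      network: network-bu-a
```
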
  20. What I wish I knew • Downsides ◦ N:N K8S

    Istio ratio (scalability) ◦ Multiple upgrade processes ◦ Namespace + Service "uniqueness" ◦ Istio Service Discovery scope ◦ East-West L4 is "problematic" • Setup/maintenance processes ◦ istioctl + istiooperator.yaml file (GitOps) Setup
  21. What I wish I knew Upgrades The mesh is a

    platform on its own… • CRDs • APIs • Internal structures • Proxy behaviour
  22. What I wish I knew Upgrades The Istio upgrade monster

    👻 • Benchmarks scared us ◦ Difficult ◦ Error prone ◦ "We are far behind from supported version" • Sandbox for the win! • began 1.12 • today 1.22 Revision based FROM DAY 1
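
A revision-based install, as mentioned on this slide, could look roughly like the sketch below: the new control plane is installed next to the old one under its own revision, and namespaces are moved over by changing a label. The revision and namespace names are hypothetical.

```yaml
# istiooperator.yaml: install the new control plane next to the old one.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  revision: 1-22-0        # hypothetical revision name
---
# Opt a namespace into the new revision; workloads pick up the new proxy on restart.
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace      # hypothetical
  labels:
    istio.io/rev: 1-22-0
```
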
  24. What I wish I knew Scalability • Istio, by default,

    is greedy ◦ All namespaces and services are "consumed" ◦ Proxy configs are one of the biggest reasons for ▪ added latency ▪ resource consumption
  25. What I wish I knew Scalability • Let's "fix" that.

    ◦ discoverySelectors meshConfig: discoverySelectors: - matchExpressions: - key: istio-discovery operator: NotIn values: - disabled ◦ All kubernetes and "machinery" namespaces
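
Laid out as a file, the flattened discoverySelectors snippet on this slide would look roughly like this. The IstioOperator wrapper is an assumption based on the istiooperator.yaml workflow mentioned earlier in the deck, and the namespace shown is only an example of a "machinery" namespace being opted out.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    discoverySelectors:
      # istiod only watches namespaces that are NOT labeled istio-discovery=disabled.
      # Namespaces without the label still match a NotIn expression.
      - matchExpressions:
          - key: istio-discovery
            operator: NotIn
            values:
              - disabled
---
# Opting a "machinery" namespace out of discovery (example namespace).
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  labels:
    istio-discovery: disabled
```
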
  26. What I wish I knew Scalability • Let's "fix" that.

    ◦ Default service "ExportTo" meshConfig: defaultServiceExportTo: - "~" services: labels: networking.istio.io/exportTo: '*' ◦ Only Mesh services will be recognized
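
A sketch of the two halves of this setting: the mesh-wide default that hides every Service, and the per-Service opt-in. Service and namespace names are hypothetical. Note one assumption: Istio's resource annotation reference lists networking.istio.io/exportTo as a Service annotation, so the sketch uses metadata.annotations even though the slide groups it under labels.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    # "~" means "export to no namespace" unless a Service overrides it.
    defaultServiceExportTo:
      - "~"
---
apiVersion: v1
kind: Service
metadata:
  name: my-app                # hypothetical
  namespace: my-namespace     # hypothetical
  annotations:
    # Make this Service visible to the whole mesh again.
    networking.istio.io/exportTo: "*"
spec:
  selector:
    app.kubernetes.io/name: my-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
```
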
  27. What I wish I knew Scalability • Sidecar Object ◦

    Limits the "knowledge" of a sidecar about mesh ▪ reduces configs/cost ◦ How we solved this ▪ Pipeline code scan (meh) ▪ Consul Service Discovery 👍 ◦ Sidecar Objects DON'T WORK for Gateways ▪ see costs slides spec: egress: - hosts: - ./* - istio-system/* - '*/consumed.workload.svc.cluster.local' workloadSelector: labels: app.kubernetes.io/name: my-app
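
The spec fragment on this slide, expanded into a complete Sidecar object. Only the metadata (name and namespace) is added and hypothetical; the hosts and the selector are the ones from the slide.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: my-app              # hypothetical
  namespace: my-namespace   # hypothetical
spec:
  workloadSelector:
    labels:
      app.kubernetes.io/name: my-app
  egress:
    - hosts:
        - "./*"                                     # everything in the workload's own namespace
        - "istio-system/*"                          # the root namespace
        - "*/consumed.workload.svc.cluster.local"   # one consumed service, in any namespace
```

Every host a workload calls has to appear in the list, which is why the slide pairs this object with a way of discovering dependencies (pipeline scan or service discovery).
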
  28. What I wish I knew Scalability • Ingress Gateways ◦

    cpu, memory, connections, requests • Some components just can't scale by themselves (istiod) ◦ 30 min connection ◦ Flip-flop (unless big spike) ◦ Just create a warm up routine
  29. What I wish I knew Monitoring • 3 components ◦

    Istiod ◦ Gateways (N/S - E/W) ◦ Sidecars • But, there are A LOT of Metrics
  30. What I wish I knew Monitoring • Istiod ◦ Convergence

    Time ◦ Config errors (stall) ◦ Certificate validation and issuance • Gateways (N/S - E/W) ◦ Basic Resources ◦ Envoy Open connections • Sidecars (basic) ◦ Resources (avoid overload or restarts)
  31. What I wish I knew Configuration convergence time: • pilot_proxy_convergence_time • pilot_proxy_queue_time • pilot_xds_push_time

    API XDS and sidecar injection errors: • pilot_total_xds_internal_errors • pilot_total_xds_rejects • envoy.cluster_manager.cds.update_failure.count • sidecar_injection_failure_total Configuration consistency: • controller_sync_errors_total • pilot_duplicate_envoy_clusters • pilot_conflict_inbound_listener • pilot_no_ip • pilot_endpoint_not_ready Citadel certificate expiry, issuance and authentication errors: • citadel_server_root_cert_expiry_timestamp • citadel_server_cert_chain_expiry_timestamp • citadel_server_authentication_failure_count • citadel_server_csr_parsing_err_count Galley config validations: • galley_validation_config_update_error • galley_validation_config_load_error • galley_validation_http_error • galley_validation_failed Extra troubleshooting metrics (dashboards and stuff): • pilot_inbound_updates • pilot_push_triggers • pilot_xds_pushes • pilot_k8s_cfg_events • pilot_xds • pilot_virt_services • pilot_services • envoy_cluster_upstream_cx_active{cluster_name="xds-grpc"} • envoy_cluster_upstream_cx_rx_bytes{cluster_name="xds-grpc"} • envoy_cluster_upstream_cx_tx_bytes{cluster_name="xds-grpc"}
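
As an illustration of how two of these metrics could be turned into alerts, here is a sketch of a Prometheus rules file. The alert names, thresholds, and durations are hypothetical and not from the talk; it also assumes pilot_proxy_convergence_time is scraped as a histogram (hence the _bucket suffix) and pilot_total_xds_rejects as a counter.

```yaml
groups:
  - name: istiod-sketch
    rules:
      - alert: IstiodSlowConfigConvergence
        # p99 time between a config change and the proxies receiving it.
        expr: |
          histogram_quantile(0.99,
            sum(rate(pilot_proxy_convergence_time_bucket[5m])) by (le)) > 5
        for: 10m
      - alert: IstiodXdsRejects
        # Proxies rejecting pushed config usually means a bad object got through.
        expr: sum(rate(pilot_total_xds_rejects[5m])) > 0
        for: 5m
```
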
  32. What I wish I knew Sidecar/Proxy stuff • Start/Stop •

    HPA • Flags ◦ UH/UF/UO/NR • Connection "imbalance"
  33. What I wish I knew • Port/protocol/Network exclusions ◦ traffic.sidecar.istio.io/excludeOutboundPorts

    ◦ traffic.sidecar.istio.io/excludeOutboundIPRanges • Connection draining meshConfig: defaultConfig: proxyMetadata: MINIMUM_DRAIN_DURATION: "5s" EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true" Sidecar
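
A sketch of where the settings on this slide live. The exclusion annotations go on the pod template, and the drain settings go into the mesh-wide proxy defaults. The workload name, image, port, and CIDR are hypothetical example values; the drain values are the ones from the slide.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
      annotations:
        # Outbound traffic to these ports / CIDRs bypasses the sidecar.
        traffic.sidecar.istio.io/excludeOutboundPorts: "5432"
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "10.0.0.0/8"
    spec:
      containers:
        - name: app
          image: example.com/my-app:latest
---
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        # Drain for at least 5s, then exit once no connections remain.
        MINIMUM_DRAIN_DURATION: "5s"
        EXIT_ON_ZERO_ACTIVE_CONNECTIONS: "true"
```
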
  34. What I wish I knew Sidecar/Proxy stuff • HPA Main

    APP Sidecar HPA CPU: 80% Mem: 70% resources: request: memory: 512MB cpu: 500 resources: request: memory: 200MB cpu: 200
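
The numbers from this diagram, laid out as manifests. This sketch assumes that the diagram's 512MB / 500 and 200MB / 200 mean 512Mi / 500m for the app and 200Mi / 200m for the sidecar, and that the sidecar requests are set through Istio's sidecar.istio.io/proxyCPU and sidecar.istio.io/proxyMemory annotations. Names, image, and replica bounds are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
      annotations:
        sidecar.istio.io/proxyCPU: "200m"
        sidecar.istio.io/proxyMemory: "200Mi"
    spec:
      containers:
        - name: app
          image: example.com/my-app:latest
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```

With a Resource metric, the HPA measures utilization against the requests of all containers in the pod, so the sidecar's requests and usage count toward the 80% and 70% targets alongside the app's.
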
  35. What I wish I knew • Two major cost factors

    ◦ Sidecar resources (already "solved") ▪ CPU ▪ Memory ▪ Ambient Mesh? ◦ Data Transfer ▪ Gateways receive ALL configs ▪ Huge Mesh ▪ Lots of workloads ▪ Lots of Gateway replicas Cost
  36. What I wish I knew Cost • Data Transfer ◦

    Not a solved problem in production yet ▪ Total TX: ~160G per day ▪ AZ-isolated ASGs (gateways + istiod?) ▪ Maybe Topology Aware Routing feature? ◦ In sandbox - Pilot env: - name: PILOT_FILTER_GATEWAY_CLUSTER_CONFIG value: "true"
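
The Pilot environment variable from this slide, placed in an IstioOperator file. This is a sketch that assumes istiod is installed through the operator's pilot component, as the istiooperator.yaml workflow earlier in the deck suggests.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
          # Push to gateways only the clusters referenced by routes attached to them.
          - name: PILOT_FILTER_GATEWAY_CLUSTER_CONFIG
            value: "true"
```
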
  37. What I wish I knew Misc • Envoy filters ◦

    No compatibility guarantee during upgrades ◦ We had problems ▪ Internal structure problem (lua code) ▪ Rate limit • Default Retry 2! ◦ Highly elastic apps (↑↓) ◦ Endpoints update process can fail (503 increase) Warning: EnvoyFilter exposes internal implementation details that may change at any time. Prefer other APIs if possible, and exercise extreme caution, especially around upgrades.
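
Since the default of 2 retries applies even when nothing is configured, one way to make the behaviour explicit is a per-route retry policy. The sketch below turns retries off for a hypothetical service; whether 0 or some other value is right depends on the application, and the talk does not prescribe one.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app                # hypothetical
  namespace: my-namespace     # hypothetical
spec:
  hosts:
    - my-app.my-namespace.svc.cluster.local
  http:
    - route:
        - destination:
            host: my-app.my-namespace.svc.cluster.local
      retries:
        attempts: 0           # override the implicit default of 2 retries
```
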
  38. What I wish I knew Misc • Guardrails ◦ Block

    direct access and validate: ▪ Gateway ▪ PeerAuthentication ▪ VirtualServices ▪ DestinationRules
  39. References • [Envoy request path] https://www.envoyproxy.io/docs/envoy/latest/intro/life_of_a_request • [Deployment Models] https://istio.io/latest/docs/ops/deployment/deployment-models/

    • [CRDs Merge Policy] https://istio.io/latest/docs/ops/best-practices/traffic-management/#split-virtual-services • [Discovery Selectors] https://istio.io/v1.14/blog/2021/discovery-selectors/#discovery-selectors-vs-sidecar-resource • [Istio Ratelimit in iFood] https://www.youtube.com/watch?v=GGnCq3B2J8A • [Istio in Action - Book] https://www.manning.com/books/istio-in-action • [Envoy Lv 2] https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/level_two • [Envoy Edge] https://www.envoyproxy.io/docs/envoy/latest/configuration/best_practices/edge • [LB gRPC with Service Mesh] https://www.useanvil.com/blog/engineering/load-balancing-grpc-in-kubernetes-with-istio/ • [Kubernetes native sidecar] https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/ • [Istio sidecar k8s support] https://istio.io/latest/blog/2023/native-sidecars/ • [PILOT_FILTER_GATEWAY_CLUSTER_CONFIG discussion] https://github.com/istio/istio/issues/29131