Upgrading Log-Analytics Clusters to OpenSearch (Amitai Stern, LogzIO) | RTA Summit 2023

Here at Logz.io, the open-source observability and security company, we run Elasticsearch for over 1300 companies in highly scalable multi-cloud deployments. With the license changes of 2021 we needed to migrate to an open-source platform, and OpenSearch was where we were going to contribute and what we wanted to run in production.

Many liken upgrading from Elasticsearch to OpenSearch in production to changing the tires on a moving bus. Upgrading carries many risks, and if the cluster is in continuous production use, ingesting terabytes of data daily, those risks can seem overwhelming.

In this talk, we will cover multiple upgrade strategies, including version requirements, and their pros and cons. Additionally, we will cover a different option, which is the way we, at Logz.io, upgraded all our clusters to OpenSearch without significant extra costs while minimizing risk. Not only did we upgrade to OpenSearch, but we also migrated our AWS workloads to Graviton2 instances.

StarTree

May 23, 2023

Transcript

  1. The Drain Method. Data nodes. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.exclude._ip": "172.22.4.9" } } Node shown: 172.22.4.9
  2. The Drain Method. Data nodes. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.exclude._ip": "172.22.4.9", "indices.recovery.max_bytes_per_sec": "150mb" } } Node shown: 172.22.4.9
  3. The Drain Method. Data nodes. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": "172.33.14.1,172.22.4.9" } } Nodes shown: 172.22.4.9, 172.33.14.1
  4. The Drain Method. Data nodes. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": "172.33.14.1,172.22.4.9" } }
  5. The Drain Method: Upgrade Process Overview. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": "<Elasticsearch IPs>" } } Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  6. The Drain Method: Upgrade Process Overview. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": "<Elasticsearch IPs>", "cluster.routing.allocation.exclude._ip": "<OpenSearch IPs>" } } Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  7. The Drain Method: Upgrade Process Overview. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": "<OpenSearch IPs>", "cluster.routing.allocation.exclude._ip": "<Elasticsearch IPs>" } } Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  8. The Drain Method: Upgrade Process Overview. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": "<OpenSearch IPs>", "cluster.routing.allocation.exclude._ip": "<Elasticsearch IPs>" } } Diagram: Data nodes, Coordinator nodes, Cluster manager nodes. Also shown: "indices.recovery.max_bytes_per_sec": "300mb"
  9. The Drain Method: Upgrade Process Overview. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": "<OpenSearch IPs>", "cluster.routing.allocation.exclude._ip": "<Elasticsearch IPs>" } } Diagram: Data nodes, Coordinator nodes, Cluster manager nodes. Also shown: "indices.recovery.max_bytes_per_sec": "0mb"
  10. The Drain Method: Upgrade Process Overview. PUT _cluster/settings { "persistent": { "cluster.routing.allocation.include._ip": null, "cluster.routing.allocation.exclude._ip": null } } Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  11. The Drain Method: Upgrade Process Overview. Diagram: Data nodes, Coordinator nodes, Cluster manager nodes, load balancer, DNS record
  12. The Drain Method: Upgrade Process Overview. Steps: - New LB - OpenSearch Coordinator Nodes. Diagram: Data nodes, load balancer, Coordinator nodes, Cluster manager nodes, load balancer, DNS record
  13. The Drain Method: Upgrade Process Overview. Steps: - New LB - OpenSearch Coordinator Nodes - Override DNS record. Diagram: DNS record, Data nodes, load balancer, Coordinator nodes, Cluster manager nodes, load balancer
  14. The Drain Method: Upgrade Process Overview. Steps: - New LB - OpenSearch Coordinator Nodes - Override DNS record - Remove old Coordinating Nodes. Diagram: Data nodes, Coordinator nodes, Cluster manager nodes, load balancer
  15. The Drain Method: Upgrade Process Overview. Steps: - Add 3 more Cluster manager Nodes. Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  16. The Drain Method: Upgrade Process Overview. Steps: - Add 3 more Cluster manager Nodes - Remove the old ones one at a time (elected one last). Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  17. The Drain Method: Upgrade Process Overview. Steps: - Add 3 more Cluster manager Nodes - Remove the old ones one at a time (elected one last). Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  18. The Drain Method: Upgrade Process Overview. Steps: - Add 3 more Cluster manager Nodes - Remove the old ones one at a time (elected one last). Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  19. The Drain Method: Upgrade Process Overview. Steps: - Add 3 more Cluster manager Nodes - Remove the old ones one at a time (elected one last). Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  20. The Drain Method: Upgrade Process Overview. Steps: - Add 3 more Cluster manager Nodes - Remove the old ones one at a time (elected one last) - Await Cluster Manager Node reelection. Diagram: Data nodes, Coordinator nodes, ???, Cluster manager nodes
  21. The Drain Method: Upgrade Process Overview. DONE :) Diagram: Data nodes, Coordinator nodes, Cluster manager nodes
  22. The Drain Method: Managing risk. Diagram: Backup cluster, Backup Log-Engine, Application, Kibana, Query-Service, search, Log-Engine, ingest, Amazon S3 (cluster snapshots)
  23. The Drain Method: Managing risk. Diagram: Backup cluster, Application, Kibana, Query-Service, search, Log-Engine, ingest, Amazon S3 (cluster snapshots), Backup Log-Engine
  24. The Drain Method: Managing risk. Diagram: Backup cluster, Log-Engine, ingest, Amazon S3 (cluster snapshots), Backup Log-Engine, Application, Kibana, Query-Service, search
  25. The Drain Method: Managing risk. Diagram: Backup cluster, Application, Kibana, Query-Service, search, Log-Engine, ingest, Amazon S3 (cluster snapshots), Backup Log-Engine
  26. The Drain Method: Managing risk. Diagram: Backup cluster, Backup Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots)
  27. The Drain Method: Managing risk. Diagram: Backup Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots), Restore from Snapshot, Backup cluster
  28. The Drain Method: Managing risk. Diagram: Backup cluster, Backup Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots)
  29. Summary

    Blue/Green. Pros: Fully revertable (instantly); Can replace hardware as well. Cons: Slow upgrade (days/weeks); Complexity grows over time; Double the cluster cost for the duration.
    In Place. Pros: Fast (within a few hours); Cheap (0 extra nodes). Cons: No rolling back; No hardware change.
    Drain. Pros: Fully revertable (within hours); Rather fast (many hours); Can replace hardware as well; Cheaper than Blue/Green. Cons: Costs more than In Place; Complex upgrade process; Complex rollback.
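
The cluster-settings bodies shown on slides 1 to 10 follow one pattern: include the target nodes, exclude the nodes being drained, optionally raise the recovery rate, and finally reset both filters to null. The sketch below only builds those request bodies as Python dictionaries; it does not talk to a cluster. The function and variable names (`drain_body`, `reset_body`, `old_nodes`, `new_nodes`) are illustrative assumptions, not part of the talk; the setting keys and the two sample IPs are taken from the slides.

```python
import json

# Setting keys exactly as they appear on the slides.
INCLUDE = "cluster.routing.allocation.include._ip"
EXCLUDE = "cluster.routing.allocation.exclude._ip"
RECOVERY_RATE = "indices.recovery.max_bytes_per_sec"


def drain_body(old_ips, new_ips, recovery_rate=None):
    """Build a body that steers shards onto new_ips and off old_ips (as on slides 7 and 8)."""
    persistent = {
        INCLUDE: ",".join(new_ips),
        EXCLUDE: ",".join(old_ips),
    }
    if recovery_rate is not None:
        # Optional throttle, e.g. "150mb" or "300mb" as shown on the slides.
        persistent[RECOVERY_RATE] = recovery_rate
    return {"persistent": persistent}


def reset_body():
    """Build a body that clears both allocation filters (as on slide 10)."""
    # None serializes to JSON null, which is how the slide resets the settings.
    return {"persistent": {INCLUDE: None, EXCLUDE: None}}


if __name__ == "__main__":
    old_nodes = ["172.22.4.9"]   # node being drained on the slides
    new_nodes = ["172.33.14.1"]  # replacement node on the slides
    print("PUT _cluster/settings")
    print(json.dumps(drain_body(old_nodes, new_nodes, "300mb"), indent=2))
    print("PUT _cluster/settings")
    print(json.dumps(reset_body(), indent=2))
```

Keeping the bodies as plain data makes it easy to review each phase before sending it with whatever HTTP client the cluster tooling already uses.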
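
Slides 15 to 20 describe replacing the cluster manager nodes: add three new ones, then remove the old ones one at a time with the elected one last. A minimal sketch of that ordering rule follows. The helper name `removal_order` and the node names in the usage lines are hypothetical, and how the currently elected node is discovered is left out as an assumption about the surrounding tooling.

```python
def removal_order(old_managers, elected):
    """Return the old cluster manager nodes in removal order, elected node last."""
    if elected not in old_managers:
        raise ValueError(f"{elected!r} is not one of the old cluster manager nodes")
    # Keep the original relative order of the non-elected nodes.
    others = [node for node in old_managers if node != elected]
    return others + [elected]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    old = ["es-manager-1", "es-manager-2", "es-manager-3"]
    for node in removal_order(old, elected="es-manager-2"):
        print("remove", node)
```

Removing the elected node last means only one reelection is triggered, at the very end, which matches the "Await Cluster Manager Node reelection" step on slide 20.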