Managing scalable database clusters with the TiDB Operator

Presented during HTAP Summit 2023 in San Francisco.

Website: https://www.pingcap.com/htap-summit
Abstract page: https://events.bizzabo.com/474592/agenda/speakers/3096751
Recording TBA

Location: Computer History Museum, 1401 N Shoreline Blvd, Mountain View, CA 94043, USA

Abstract:
Why are Kubernetes and other popular cloud native projects designed so differently from previous-generation “VM-era” systems? How have the second law of thermodynamics and control theory shaped cloud native designs? How is the shift from traditionally managing servers to using Kubernetes operators (such as TiDB Operator) similar to the Industrial Revolution?

This talk offers the audience a unique perspective on some common cloud native patterns. Kubernetes and Google Spanner, for example, are often described as designed from “decades of experience”, but what that means in practice is mentioned far less often. Conversely, many newcomers find Kubernetes and similar technologies “too complex”. Why is that, or why is that the impression?

After this talk, the audience will have an improved vocabulary of cloud native philosophy terms, gained by learning the fundamental design philosophies of Kubernetes and cloud native through well-known phenomena and real-world analogies.

This talk also relates the concepts presented to features in TiKV and TiDB, such as consistency control and self-healing. After the concepts are introduced, the TiDB Operator is presented as a case study of the theory.

Lucas Käldström

September 21, 2023

Transcript

  1. Managing scalable database clusters with the TiDB Operator

    Lucas Käldström – CNCF Ambassador
    Mountain View – September 21, 2023
    © 2023 Lucas Käldström
  2. Or similarly, Cloud Native Philosophy: Why Do We Now Design Software the Way We Do?

    Lucas Käldström – CNCF Ambassador
    Mountain View – September 21, 2023
  3. $ whoami

    - Lucas Käldström, 1st-year MSc student at Aalto University, Finland
    - CNCF Ambassador, Certified Kubernetes Administrator and Emeritus Kubernetes WG/SIG Lead
    - KubeCon Speaker in Berlin, Austin, Copenhagen, Shanghai, Seattle, San Diego & Valencia
    - KubeCon Keynote Speaker in Barcelona
    - Former Kubernetes approver and subproject owner, active in the OSS community for 7+ years. Worked on e.g. SIG Cluster Lifecycle => kubeadm to GA.
    - Former Weaveworks contractor, Weave Ignite & libgitops author
    - Cloud Native Nordics co-founder & meetup organizer
    - Guild of Automation and Systems Technology corporate relations & CFO
  4. Agenda

    - Database Sysadmin Complexities
    - Kubernetes Design Architecture
    - A Sysadmin’s Best Friend: The Operator
    - The TiDB Operator
    - Demo Screenshots (not enough time for live demo)
  5. Why are we here?

    We want a database for both transaction processing and analytical processing
  6. What does this require?

    - Failure Tolerance and Capacity Demand => Multiple Replicas
    - Multiple Replicas => Consistency Control (Paxos / Raft)
    - Capacity Demands => Sharding
    - and much more!
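
To make the link between replicas and consistency control concrete, here is a minimal sketch of the majority arithmetic that Raft- and Paxos-style protocols rely on. The function names are illustrative and not taken from the talk or from any library.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica group (illustrative helper)."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can fail while a majority can still be formed."""
    return replicas - quorum_size(replicas)


for n in (1, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that going from 3 to 4 replicas does not raise the number of tolerated failures, which is one reason odd replica counts are common.
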
  7. What does this mean?

    - Multiple Nodes => Need scheduling logic
    - Consensus Algorithms => We need to take care when:
      - Scaling: Need some kind of “learner mode”
      - Upgrading: Avoid killing the consensus leader; give a proper handoff first
    - Sharding => Nodes hold varying sets of data; one node doesn’t necessarily hold all the data
    - Quickly-changing business requirements => Lots of sysadmin work
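
The "proper handoff" point can be sketched as an upgrade ordering: followers first, then a leadership transfer, then the old leader. This is a toy in-memory model under stated assumptions; `Member`, `rolling_upgrade` and the member names are hypothetical, and a real operator would call the database's own leader-transfer mechanism and wait for health checks between steps.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    version: str
    is_leader: bool = False


def rolling_upgrade(members: list[Member], target: str) -> list[str]:
    """Upgrade followers, hand leadership off, then upgrade the old leader.

    Returns a log of the steps taken (hypothetical model, no real I/O).
    """
    log: list[str] = []
    leader = next(m for m in members if m.is_leader)
    followers = [m for m in members if not m.is_leader]

    for m in followers:
        m.version = target
        log.append(f"upgrade {m.name}")

    if followers:
        # Handoff: move leadership to an already-upgraded follower
        # instead of killing the leader outright.
        successor = followers[0]
        leader.is_leader, successor.is_leader = False, True
        log.append(f"transfer leader {leader.name} -> {successor.name}")

    leader.version = target
    log.append(f"upgrade {leader.name}")
    return log


members = [
    Member("pd-0", "v7.1.0", is_leader=True),
    Member("pd-1", "v7.1.0"),
    Member("pd-2", "v7.1.0"),
]
steps = rolling_upgrade(members, "v7.1.1")
print(steps)
```
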
  8. Required sysadmin work grows faster than scale

    [Chart labels: Business scaling requirement, Sysadmin work]
  9. Kubernetes Primer

    - Kubernetes is an open source container orchestration system.
    - Project to solve sysadmin operational challenges of app orchestration
    - Already a decade old (!), the founding project of CNCF, 80,000+ contributors
    - Runs in all environments from own DC to cloud (even on Raspberry Pis!)
    - Super extensible system, you can configure literally everything
  10. Kubernetes Architecture

    [Diagram labels]
    - Single source of truth: Raft key-value store (stateful)
    - Stateless, declarative and extensible REST API
    - Stateless controllers: these controllers “make stuff happen” (<- reconcile ->)
    - Node, Node, Node, …
  11. Kubernetes: A Control Plane for (any) infrastructure

    = A set of automated controllers with operational knowledge of how to control a target system
  12. Kubernetes: A Control Plane for (any) infrastructure

    = A set of automated controllers with operational knowledge of how to control a target system

    Around 45 (!) of them in Kubernetes v1.28
  13. Kubernetes: A Control Plane for (any) infrastructure

    = A set of automated controllers with operational knowledge of how to control a target system

    - “I know how to efficiently schedule workloads to nodes”
    - “I know how to heal applications that were on failed nodes”
    - “I know how to configure dynamic service discovery”
  14. Kubernetes: A Control Plane for (any) infrastructure

    = A set of automated controllers with operational knowledge of how to control a target system
  15. “Control Through Choreography”

    All user intent is stored in the API server. Business logic is split into controllers that make user intent a reality.
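
One way to picture choreography is two controllers that never call each other and only react to shared state. The sketch below is a toy model, not Kubernetes code: the `store` dictionary stands in for the API server, and `scheduler`, `healer` and `run_until_stable` are hypothetical names.

```python
# Shared state standing in for the API server: the single source of truth.
store = {
    "nodes": {"node-a": True, "node-b": True},  # node name -> healthy?
    "pods": {"db-0": None, "db-1": None},       # pod name -> assigned node
}


def scheduler(store: dict) -> bool:
    """Assign unplaced pods to the least-loaded healthy node. True if changed."""
    healthy = sorted(n for n, ok in store["nodes"].items() if ok)
    changed = False
    for pod, node in sorted(store["pods"].items()):
        if node is None and healthy:
            load = {n: 0 for n in healthy}
            for assigned in store["pods"].values():
                if assigned in load:
                    load[assigned] += 1
            store["pods"][pod] = min(healthy, key=lambda n: (load[n], n))
            changed = True
    return changed


def healer(store: dict) -> bool:
    """Unassign pods from failed nodes so they can be placed again."""
    changed = False
    for pod, node in list(store["pods"].items()):
        if node is not None and not store["nodes"].get(node, False):
            store["pods"][pod] = None
            changed = True
    return changed


def run_until_stable(store: dict) -> None:
    # Both controllers run on every pass; the loop ends when neither acts.
    while healer(store) | scheduler(store):
        pass


run_until_stable(store)
placed = dict(store["pods"])

store["nodes"]["node-a"] = False  # simulate a node failure
run_until_stable(store)
healed = dict(store["pods"])
print(placed, healed)
```

The healer only records that a pod lost its node; placing it again is left to the scheduler, which is the split of business logic the slide describes.
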
  16. Google Finding: “Failure is the Norm”

    “deliberately leave significant headroom for workload growth, occasional ‘black swan’ events, load spikes, machine failures, hardware upgrades, and large-scale partial failures (e.g., a power supply bus duct)”

    Source: (Verma et al., 2015)
  17. Entropy: Putting order to chaos

    [Chart labels: Time, Entropy, Order, Chaos, Start, Stop, “Reversing, ordering process”]
  18. Kubernetes: The dishwasher of servers

    [Chart labels: Time, Entropy, Order, Chaos, Start, Stop, “Reversing, ordering process”]
  19. Key Takeaways

    a) Systems are inevitably becoming less ordered, thus
    b) need some periodic corrective action to steer the course towards
    c) some declared desired state of the system.
  20. Operators: Encode human-like knowledge

    = Automated reconcile loops with “human-like” operational knowledge

    Coined in 2016 by Brandon Philips, back then at CoreOS
  21. Operators: Encode human-like knowledge

    = Automated reconcile loops with “human-like” operational knowledge

    Coined in 2016 by Brandon Philips, back then at CoreOS

    Delegate “repetitive human activities that are devoid of lasting value”
  22. What should an operator do?

    - Keep infrastructure in control: continuously minimizing drift between the desired and actual state,
    - Resource scalability: codify and automate “repetitive human activities that are devoid of lasting value”, by encoding domain-specific knowledge,
    - Monitoring scalability: observe application health, metrics and logs, such that configuration can be adaptively tuned and alerts of any abnormal behavior can be sent seldom but with high importance, and
    - Knowledge scalability: provide a high-level abstraction interface such that the application can be operated by engineers without the domain-specific knowledge otherwise required
  23. TiDB Operator Capabilities

    The tidb-operator provides you with TiDB as a Service in your own cluster. It offers features such as:
    - Multi-Cluster Creation
    - Online up- and downgrades
    - Online up- and downscaling of replicas, even automatically
    - Automatic failover/self-healing
    - Dynamic monitoring
    - Re-configuration of the database
    - Backup and Restore
  24. Operator fulfils the user’s desires

    [Diagram, build 1 of 4. Components: Desired State Source, Observe and diff, Target System. Flows: 1: Desired State; 2, 6: Actual State]
  25. Operator fulfils the user’s desires

    [Diagram, build 2 of 4. Adds component: Act. Adds flows: 3: Action Plan; 4: Action]
  26. Operator fulfils the user’s desires

    [Diagram, build 3 of 4. Adds component: Report (Actual State Sink). Adds flows: 5: Result; 7: Requeue]
  27. Operator fulfils the user’s desires

    [Diagram, build 4 of 4. Components: Desired State Source, Observe and diff, Act, Report (Actual State Sink), Target System. Flows: 1: Desired State; 2, 6: Actual State; 3: Action Plan; 4: Action; 5: Result; 7: Requeue]
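
The numbered flows in the diagram can be read as one pass of a reconcile function. The following is a minimal sketch with an in-memory stand-in for the target system; `TargetSystem`, `Status` and `reconcile` are hypothetical names, and the replica count is only an example of a piece of desired state.

```python
from dataclasses import dataclass, field


@dataclass
class TargetSystem:
    """Stand-in for the real system being controlled (hypothetical)."""
    replicas: int = 0


@dataclass
class Status:
    """Actual-state sink: what gets reported back to the user."""
    observed_replicas: int = 0
    history: list = field(default_factory=list)


def reconcile(desired_replicas: int, target: TargetSystem, status: Status) -> bool:
    """Run one pass of the loop; return True if a requeue is needed."""
    # 1: desired state arrives as an argument; 2: observe the actual state.
    actual = target.replicas
    # 3: diff the two into an action plan, one small step per pass.
    if actual < desired_replicas:
        plan = 1
    elif actual > desired_replicas:
        plan = -1
    else:
        plan = 0
    # 4: act on the target system; 5: collect the result.
    target.replicas += plan
    result = target.replicas
    # 6: report the new actual state.
    status.observed_replicas = result
    status.history.append(result)
    # 7: requeue while drift remains.
    return result != desired_replicas


target, status = TargetSystem(replicas=1), Status()
while reconcile(3, target, status):
    pass

target.replicas = 5  # simulate drift introduced outside the operator
while reconcile(3, target, status):
    pass
print(status.history)
```

Taking one small step per pass and requeueing keeps each pass short and lets the loop pick up new drift between steps; a real controller would also requeue periodically even when nothing appears to have changed.
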
  28. Hardened tidb-operator setup

    In this demo, we will initially configure 3 cloud VMs for TiDB, 3 cloud VMs for PD, and 3 cloud VMs for TiKV. Further, we will
    1) install the tidb-operator through the CNCF GitOps engine, Flux
    2) set up the monitoring stack (Prometheus, Grafana) to watch performance
    3) create one TiDBCluster with the operator
    4) apply advanced configuration such as topology and upgrade tuning

    This demo is running on UpCloud; thanks for donating cloud credits for this cause!
  29. Upgrading a cluster with a 60k QPS load

    In this demo, we will:
    1) bump the version number from v7.1.0 to v7.1.1 using a GitHub Pull Request,
    2) ⇒ operator upgrades the 3*3-TiDB cluster gracefully,
    3) while serving 60k requests per second (without any reconnects!),
    4) while monitoring TiDB performance

    This demo is running on UpCloud; thanks for donating cloud credits for this cause!
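
As a rough illustration of what the version bump amounts to, the sketch below models a heavily trimmed `TidbCluster` resource as a Python dictionary and changes only `spec.version`. The metadata names are hypothetical, a real manifest needs more fields than shown (images, storage requests and so on), and `bump_version` and `changed_fields` are helpers invented for this example rather than part of the tidb-operator.

```python
import copy

# Trimmed, assumed shape of a TidbCluster resource with 3 PD, 3 TiKV, 3 TiDB.
tidb_cluster = {
    "apiVersion": "pingcap.com/v1alpha1",
    "kind": "TidbCluster",
    "metadata": {"name": "demo", "namespace": "tidb"},  # hypothetical names
    "spec": {
        "version": "v7.1.0",
        "pd": {"replicas": 3},
        "tikv": {"replicas": 3},
        "tidb": {"replicas": 3},
    },
}


def bump_version(manifest: dict, new_version: str) -> dict:
    """Return a copy of the manifest with only spec.version changed."""
    updated = copy.deepcopy(manifest)
    updated["spec"]["version"] = new_version
    return updated


def changed_fields(old: dict, new: dict, prefix: str = "") -> list[str]:
    """List dotted paths whose values differ, roughly what a PR diff shows."""
    paths = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}{key}"
        a, b = old.get(key), new.get(key)
        if isinstance(a, dict) and isinstance(b, dict):
            paths += changed_fields(a, b, prefix=f"{path}.")
        elif a != b:
            paths.append(path)
    return paths


upgraded = bump_version(tidb_cluster, "v7.1.1")
print(changed_fields(tidb_cluster, upgraded))
```

The point of the declarative style is that the reviewed change is a single field; how to roll it out safely is left to the operator.
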
  30. Step 2: Relax and watch the upgrade

    Let the upgrade do the work!
  31. What do we **not** have to do?

    - Manual service discovery (for peers, backup and monitoring)
    - Manual TLS setup
    - Manual scaling
    - Manual version upgrades
    - Manual re-configuration
    - Manual disaster recovery

    [Image: real-life footage of a sysadmin not having to run 1002 commands to upgrade the database]
  32. Check out my thesis for more details!

    Encoding human-like operational knowledge using declarative Kubernetes operator patterns

    Available openly on GitHub: https://github.com/luxas/research
    CC-BY-SA 4.0 licensed
  33. Control Theory (Vallery Lancey, QCon, 2018)

    I have another talk on control theory + declarative APIs = Kubernetes. Also check out Vallery Lancey’s great talk on the subject.