Best Practices of Production-Grade Rook/Ceph Cluster

My presentation slides from Cephalocon 2023
https://sched.co/1JKZf

Satoru Takeuchi

April 17, 2023
Transcript

  1. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  2. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  3. Our infrastructure
     ▌We have an on-premise infrastructure
     ▌The current system has many problems
       ⚫Not scalable
       ⚫A lot of day-2 work
     ▌We are developing a new infrastructure based on:
       ⚫Kubernetes (K8s)
       ⚫Ceph and Rook (Ceph orchestration in K8s)
  4. Why Ceph?
     ▌It fulfills our requirements
       ⚫Block & object storage
       ⚫Rack failure tolerance
       ⚫Bit-rot tolerance
     ▌It is open source
       ⚫Detailed evaluation and investigation of problems is possible
       ⚫We can use our own custom containers in case of emergency (see later)
  5. Why Rook?
     ▌Manage the storage system in the K8s way, like other system components
     ▌Offload a lot of work to K8s
       ⚫Lifecycle management of hardware
       ⚫MON failover
       ⚫Restarting problematic Ceph daemons
  6. Our storage system
     [Diagram: two Rook-managed Ceph clusters. The cluster for RGW serves buckets to applications, with OSD(data) on HDD and OSD(index) on SSD. The cluster for RBD serves RBD volumes, with OSDs on NVMe SSD via LVM logical volumes (LVM VolumeGroup). Both use 3 replicas and are rack-failure tolerant.]
  7. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  8. What is K8s?
     ▌A container orchestrator
     ▌All services run as “pods”
       ⚫A pod is a set of containers
     [Diagram: a Kubernetes cluster of nodes, each node running pods composed of containers]
  9. The concept of K8s
     ▌All configurations are described as resources
     ▌K8s keeps the desired state (a minimal example follows)
     [Diagram: a Pod resource and a PersistentVolumeClaim (PVC) resource (“10 GiB volume, please!”) are bound to a PersistentVolume (PV) resource (“here is a 10 GiB volume!”) provisioned by a storage provisioner (driver)]
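     As a minimal sketch of this request/fulfillment model, a hypothetical 10 GiB claim could look like the following (the resource name and storage class name are placeholders; any provisioner-backed class works):

       apiVersion: v1
       kind: PersistentVolumeClaim
       metadata:
         name: example-pvc                    # hypothetical name, for illustration only
       spec:
         accessModes: ["ReadWriteOnce"]
         resources:
           requests:
             storage: 10Gi                    # "10 GiB volume, please!"
         storageClassName: some-storage-class # assumed; provided by the storage driver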
  10. The concept of Rook
     ▌All Ceph components are described as resources
     ▌Rook keeps the desired state of Ceph clusters
     [Diagram: the admin creates CephCluster and CephBlockPool resources; the Rook pod watches them and creates MON/MGR/OSD Pod resources plus PVC/PV resources for OSDs, which become the MON, MGR, and OSD pods running on the nodes of the K8s cluster]
  11. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  12. Our Rook/Ceph clusters
     ▌Requirements
       ⚫3 replicas
       ⚫Rack failure tolerance
       ⚫All OSDs should be spread evenly over all racks/nodes
     ▌Typical operations
       ⚫Create and upgrade clusters
       ⚫Manage OSDs (add a new one, replace a damaged one)
       ⚫etc.
  13. Create a Ceph cluster
     ▌Just create the following resource (a fuller sketch follows below)

       kind: CephCluster
       metadata:
         name: ceph-ssd
       …
       mgr:
         count: 2
       …
       mon:
         count: 3
       …
       storage:
         …
         count: 3

     • All OSDs are evenly spread over all nodes and racks
     • Special configurations are necessary (see later)
     [Diagram: MON, MGR, and OSD pods spread across nodes in three racks]
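     A fuller sketch of such a manifest, assuming OSDs are placed on PVCs via storageClassDeviceSets (the namespace, device-set name, storage class, and volume size below are illustrative, not our exact values):

       apiVersion: ceph.rook.io/v1
       kind: CephCluster
       metadata:
         name: ceph-ssd
         namespace: ceph-ssd                # assumed namespace
       spec:
         cephVersion:
           image: quay.io/ceph/ceph:v17.2.6
         dataDirHostPath: /var/lib/rook
         mon:
           count: 3
         mgr:
           count: 2
         storage:
           storageClassDeviceSets:
             - name: ssd-osds               # illustrative device-set name
               count: 3                     # number of OSDs
               volumeClaimTemplates:
                 - metadata:
                     name: data
                   spec:
                     accessModes: ["ReadWriteOnce"]
                     volumeMode: Block
                     storageClassName: topolvm-provisioner   # assumed storage class
                     resources:
                       requests:
                         storage: 1Ti       # illustrative OSD size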
  14. Create an RBD pool
     ▌Just create the following resources (a fuller sketch follows below)

       kind: CephBlockPool
       metadata:
         name: block-pool
       spec:
         replicated:
           size: 3
         failureDomain: zone

       kind: StorageClass
       metadata:
         name: ceph-block
       parameters:
         clusterID: ceph-ssd
         pool: block-pool
         csi.storage.k8s.io/fstype: ext4

     • “zone” means “rack” in our clusters
     [Diagram: the “ceph-block” StorageClass and its provisioner point at the “block-pool” RBD pool (3 replicas, rack failure domain) in the Ceph cluster]
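     Spelled out with apiVersions and a provisioner, the pair of resources might look like this (the namespace and provisioner prefix are assumptions; the usual Rook CSI secret parameters are omitted for brevity):

       apiVersion: ceph.rook.io/v1
       kind: CephBlockPool
       metadata:
         name: block-pool
         namespace: ceph-ssd                    # assumed cluster namespace
       spec:
         failureDomain: zone                    # "zone" is mapped to racks in our CRUSH topology
         replicated:
           size: 3
       ---
       apiVersion: storage.k8s.io/v1
       kind: StorageClass
       metadata:
         name: ceph-block
       provisioner: ceph-ssd.rbd.csi.ceph.com   # <operator namespace>.rbd.csi.ceph.com
       parameters:
         clusterID: ceph-ssd
         pool: block-pool
         csi.storage.k8s.io/fstype: ext4
       reclaimPolicy: Delete
       allowVolumeExpansion: true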
  15. Consume an RBD image
     ▌Just create the following resources (a fuller sketch follows below)

       kind: PersistentVolumeClaim
       metadata:
         name: myclaim
       spec:
         …
         storageClassName: ceph-block

       kind: Pod
       metadata:
         name: mypod
       …
       volumes:
         - name: myvolume
           persistentVolumeClaim:
             claimName: myclaim

     [Diagram: creating the PVC makes the ceph-block provisioner create an RBD volume in block-pool; mypod consumes it]
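     Filled in, the pair could look like the following sketch (the 10 GiB request, container image, and mount path are illustrative assumptions):

       apiVersion: v1
       kind: PersistentVolumeClaim
       metadata:
         name: myclaim
       spec:
         accessModes: ["ReadWriteOnce"]
         resources:
           requests:
             storage: 10Gi            # illustrative size
         storageClassName: ceph-block
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: mypod
       spec:
         containers:
           - name: app
             image: nginx             # placeholder application image
             volumeMounts:
               - name: myvolume
                 mountPath: /data     # assumed mount point
         volumes:
           - name: myvolume
             persistentVolumeClaim:
               claimName: myclaim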
  16. Configurations
     ▌Edit the “rook-config-override” ConfigMap resource, which corresponds to “/etc/ceph/ceph.conf” (a sketch follows below)
       ⚫Restarting the Ceph pods is necessary after that
     ▌Some configurations (e.g. “ceph set”) can’t be applied this way
       ⚫Run Ceph commands in the “toolbox” pod instead

       kind: ConfigMap
       metadata:
         name: rook-config-override
       data:
         config: |
           debug rgw = 5/5
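     Fleshed out, the override ConfigMap might look like this (the namespace is an assumption, and settings are usually placed under an explicit [global] or per-daemon section):

       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: rook-config-override
         namespace: ceph-ssd        # assumed cluster namespace
       data:
         config: |
           [global]
           debug rgw = 5/5

     Settings that require Ceph commands instead of ceph.conf entries can be applied from the toolbox pod.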
  17. Expand a cluster
     ▌Just edit the CephCluster resource (see the sketch below)

       kind: CephCluster
       …
       storage:
         …
         count: 6    # changed from 3

     [Diagram: the three new OSD pods are again spread evenly over the nodes and racks]
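     Assuming the storageClassDeviceSets layout sketched earlier, only the count of the device set changes:

       spec:
         storage:
           storageClassDeviceSets:
             - name: ssd-osds
               count: 6    # was 3; Rook prepares three additional OSDs and K8s schedules them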
  18. Replace a damaged OSD
     ▌Just run our home-made script
       ⚫There is no such official job/script
     [Diagram: the home-made script removes the damaged OSD pod, and a new OSD pod is created on another node]
  19. Upgrade Rook and Ceph
     ▌Edit the CephCluster resource (see the sketch below)
     ▌All Ceph container images will be upgraded after that
     ▌Other work might be needed
       ⚫See the files under “Documentation/Upgrade” in the Rook repository

       kind: CephCluster
       …
       image: ceph/ceph:v17.2.6    # changed from “ceph/ceph:v17.2.5”
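     In the CephCluster spec this image reference lives under cephVersion; a minimal excerpt of the edit, assuming the manifest sketched earlier:

       spec:
         cephVersion:
           image: ceph/ceph:v17.2.6   # was ceph/ceph:v17.2.5; the operator rolls all Ceph daemons to the new image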
  20. Troubleshooting
     ▌Same as with other Ceph orchestrations
       ⚫Running Ceph commands
       ⚫Referring to logs and metrics
       ⚫Reporting bugs to upstream
     ▌Rook is convenient but not a silver bullet
       ⚫Rook can’t debug and fix Ceph bugs for you
  21. A challenge: even OSD deployment
     ▌K8s deploys Pods to arbitrary nodes by default
       ⚫OSDs might be spread unevenly
     [Diagram: with “storage: count: 3”, all three OSDs can land on one node, so losing that node results in data loss]
  22. Solution
     ▌Use the “TopologySpreadConstraints” feature of K8s (see the sketch below)
       ⚫Spread specific pods evenly over all nodes (or racks, and so on)

       kind: CephCluster
       …
       storage:
         …
         count: 3
         …
         topologySpreadConstraints:
           - labelSelector:
               matchExpressions:
                 - key: app
                   operator: In
                   values:
                     - rook-ceph-osd
                     - rook-ceph-osd-prepare
             topologyKey: topology.kubernetes.io/hostname

     See also: https://blog.kintone.io/entry/2020/09/18/175030
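     A complete constraint needs a few more fields than the slide shows; a sketch, with maxSkew and whenUnsatisfiable chosen here for illustration:

       topologySpreadConstraints:
         - maxSkew: 1                           # at most one more OSD on any node than on the least-loaded one
           topologyKey: topology.kubernetes.io/hostname
           whenUnsatisfiable: DoNotSchedule     # refuse to schedule rather than spread unevenly
           labelSelector:
             matchExpressions:
               - key: app
                 operator: In
                 values:
                   - rook-ceph-osd
                   - rook-ceph-osd-prepare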
  23. In our clusters
     ▌Two constraints, for both nodes and racks (see the sketch below)

       kind: CephCluster
       …
       storage:
         …
         count: 12
         …
         topologySpreadConstraints:
           - labelSelector: …
             topologyKey: topology.kubernetes.io/zone
           - labelSelector: …
             topologyKey: topology.kubernetes.io/hostname

     [Diagram: the 12 OSDs are spread evenly across the three racks and their nodes]
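     Spelled out in the same hedged form as the previous sketch, the two constraints would look like this (the whenUnsatisfiable policy is again an illustrative choice):

       topologySpreadConstraints:
         - maxSkew: 1
           topologyKey: topology.kubernetes.io/zone        # our zones are racks
           whenUnsatisfiable: DoNotSchedule
           labelSelector:
             matchExpressions:
               - key: app
                 operator: In
                 values: ["rook-ceph-osd", "rook-ceph-osd-prepare"]
         - maxSkew: 1
           topologyKey: topology.kubernetes.io/hostname
           whenUnsatisfiable: DoNotSchedule
           labelSelector:
             matchExpressions:
               - key: app
                 operator: In
                 values: ["rook-ceph-osd", "rook-ceph-osd-prepare"]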
  24. A challenge: automatic OSD deployment
     ▌OSD creation flow
       1. Rook creates a PVC resource for an OSD
       2. K8s binds a PV resource to this PVC resource
       3. Rook creates an OSD on top of the block device corresponding to the PV
     ▌The flow is suspended if no PV resource is available
     [Diagram: with “storage: count: 3”, OSD creation stalls while the storage provisioner has not provisioned the volumes yet]
  25. Solution
     ▌Use storage providers that support dynamic provisioning (see the sketch below)
       ⚫e.g. many cloud storage services, TopoLVM (for local volumes)
     ▌Or provision PV resources beforehand
       ⚫e.g. local-static-provisioner

       storage:
         …
         count: 3
         volumeClaimTemplates:
           - spec:
               storageClassName: nice-provisioner

     [Diagram: the OSDs consume volumes that are provisioned on demand or pre-provisioned by the “nice provisioner”]
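     A fuller sketch of that storage section, again assuming the storageClassDeviceSets form and treating the class name as a placeholder:

       storage:
         storageClassDeviceSets:
           - name: default
             count: 3
             volumeClaimTemplates:
               - metadata:
                   name: data
                 spec:
                   accessModes: ["ReadWriteOnce"]
                   volumeMode: Block                   # OSDs consume raw block devices
                   storageClassName: nice-provisioner  # placeholder: a dynamic class (e.g. TopoLVM) or a pre-provisioned local class
                   resources:
                     requests:
                       storage: 1Ti                    # illustrative size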
  26. In our cluster
     ▌NVMe SSD
       ⚫Use TopoLVM
       ⚫PV resources and the corresponding LVM logical volumes are created on OSD creation
     ▌HDD
       ⚫Use local persistent volumes
       ⚫Create PVs for all HDDs when deploying nodes
       ⚫PV resources are already available on OSD creation (see the sketch below)
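     For reference, a pre-provisioned local PV for one HDD might look like the following sketch; the name, class, size, device path, and node name are all illustrative (in practice tools like local-static-provisioner generate these):

       apiVersion: v1
       kind: PersistentVolume
       metadata:
         name: hdd-node1-sda              # hypothetical name; real PVs are generated per device
       spec:
         capacity:
           storage: 8Ti                   # illustrative HDD size
         volumeMode: Block                # raw block device for the OSD
         accessModes: ["ReadWriteOnce"]
         persistentVolumeReclaimPolicy: Retain
         storageClassName: hdd-local      # assumed class, referenced from volumeClaimTemplates
         local:
           path: /dev/sda                 # assumed device path on the node
         nodeAffinity:
           required:
             nodeSelectorTerms:
               - matchExpressions:
                   - key: kubernetes.io/hostname
                     operator: In
                     values: ["node1"]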
  27. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  28. Daily check of upstream Rook/Ceph
     ▌Check every update of both the Rook and Ceph projects every day
       ⚫Watch the important bugfix PRs and backport them if necessary
     ▌e.g. a data-loss bug in RGW (PR#49795)
       ⚫Some objects might be lost on bucket index resharding
       ⚫We postponed resharding operations until the upgrade to >= v17.2.6
  29. Upstream-first development
     ▌We’ve shared everything with the Rook/Ceph communities
       ⚫Reduces the long-term maintenance cost
       ⚫Makes both communities better
     ▌Major contributions
       ⚫Implemented Rook’s advanced configurations
       ⚫I’ve been working as a Rook maintainer
       ⚫Resolved some problems in containerized Ceph clusters
       ⚫https://speakerdeck.com/sat/revealing-bluestore-corruption-bugs-in-containerized-ceph-clusters
  30. Running custom containers
     ▌If there is a critical bug and we can’t wait for the next release, we want to use our own custom containers
       ⚫Official release + critical patches
     ▌We have been trying to run Teuthology in our test environment to verify custom containers
       ⚫We succeeded in running the test suite, but most tests still fail
       ⚫We’ll continue to work on this problem
  31. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  32. Conclusion
     ▌Rook is an attractive option for Ceph orchestration
       ⚫Especially if you are familiar with K8s
     ▌There are some advanced configurations to be aware of
     ▌We’ll continue to provide feedback to the Rook/Ceph communities