Best Practices of Production-Grade Rook/Ceph Cluster

My presentation slides from Cephalocon 2023
https://sched.co/1JKZf

Satoru Takeuchi

April 17, 2023
Transcript

  1. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  2. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  3. Our infrastructure
     ▌We have an on-premise infrastructure
     ▌The current system has many problems
       ⚫Not scalable
       ⚫A lot of day-2 work
     ▌We are developing a new infrastructure based on:
       ⚫Kubernetes (K8s)
       ⚫Ceph and Rook (Ceph orchestration in K8s)
  4. Why Ceph?
     ▌It fulfills our requirements
       ⚫Block & object storage
       ⚫Rack failure tolerance
       ⚫Bit-rot tolerance
     ▌It is open source
       ⚫Detailed evaluation and investigation of problems is possible
       ⚫We can use our own custom containers in case of emergency (see later)
  5. Why Rook?
     ▌Manage the storage system in the K8s way, like other system components
     ▌Offload a lot of work to K8s
       ⚫Lifecycle management of hardware
       ⚫MON failover
       ⚫Restarting problematic Ceph daemons
  6. Our storage system
     [Diagram: two Rook-managed Ceph clusters. The cluster for RGW serves buckets to applications, with OSD(data) on HDD and OSD(index) on SSD. The cluster for RBD serves RBD volumes, with OSDs on NVMe SSD via LVM logical volumes (LVM VolumeGroup). Both use 3 replicas and are rack-failure tolerant.]
  7. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  8. What is K8s?
     ▌A container orchestrator
     ▌All services run as “pods”
       ⚫A pod is a set of containers
     [Diagram: a Kubernetes cluster of nodes, each node running pods composed of containers]
  9. The concept of K8s
     ▌All configurations are described as resources
     ▌K8s keeps the desired state (a minimal example follows)
     [Diagram: a Pod resource and a PersistentVolumeClaim (PVC) resource (“10 GiB volume, please!”) are bound to a PersistentVolume (PV) resource (“here is a 10 GiB volume!”) provisioned by a storage provisioner (driver)]
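     As a minimal sketch of this request/fulfillment model, a hypothetical 10 GiB claim could look like the following (the resource name and storage class name are placeholders; any provisioner-backed class works):

       apiVersion: v1
       kind: PersistentVolumeClaim
       metadata:
         name: example-pvc                    # hypothetical name, for illustration only
       spec:
         accessModes: ["ReadWriteOnce"]
         resources:
           requests:
             storage: 10Gi                    # "10 GiB volume, please!"
         storageClassName: some-storage-class # assumed; provided by the storage driver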
  10. The concept of Rook
     ▌All Ceph components are described as resources
     ▌Rook keeps the desired state of Ceph clusters
     [Diagram: the admin creates CephCluster and CephBlockPool resources; the Rook pod watches them and creates MON/MGR/OSD Pod resources plus PVC/PV resources for OSDs, which become the MON, MGR, and OSD pods running on the nodes of the K8s cluster]
  11. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  12. Our Rook/Ceph clusters
     ▌Requirements
       ⚫3 replicas
       ⚫Rack failure tolerance
       ⚫All OSDs should be spread evenly over all racks/nodes
     ▌Typical operations
       ⚫Create and upgrade clusters
       ⚫Manage OSDs (add a new one, replace a damaged one)
       ⚫etc.
  13. Create a Ceph cluster
     ▌Just create the following resource (a fuller sketch follows below)

       kind: CephCluster
       metadata:
         name: ceph-ssd
       …
       mgr:
         count: 2
       …
       mon:
         count: 3
       …
       storage:
         …
         count: 3

     • All OSDs are evenly spread over all nodes and racks
     • Special configurations are necessary (see later)
     [Diagram: MON, MGR, and OSD pods spread across nodes in three racks]
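     A fuller sketch of such a manifest, assuming OSDs are placed on PVCs via storageClassDeviceSets (the namespace, device-set name, storage class, and volume size below are illustrative, not our exact values):

       apiVersion: ceph.rook.io/v1
       kind: CephCluster
       metadata:
         name: ceph-ssd
         namespace: ceph-ssd                # assumed namespace
       spec:
         cephVersion:
           image: quay.io/ceph/ceph:v17.2.6
         dataDirHostPath: /var/lib/rook
         mon:
           count: 3
         mgr:
           count: 2
         storage:
           storageClassDeviceSets:
             - name: ssd-osds               # illustrative device-set name
               count: 3                     # number of OSDs
               volumeClaimTemplates:
                 - metadata:
                     name: data
                   spec:
                     accessModes: ["ReadWriteOnce"]
                     volumeMode: Block
                     storageClassName: topolvm-provisioner   # assumed storage class
                     resources:
                       requests:
                         storage: 1Ti       # illustrative OSD size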
  14. Create an RBD pool
     ▌Just create the following resources (a fuller sketch follows below)

       kind: CephBlockPool
       metadata:
         name: block-pool
       spec:
         replicated:
           size: 3
         failureDomain: zone

       kind: StorageClass
       metadata:
         name: ceph-block
       parameters:
         clusterID: ceph-ssd
         pool: block-pool
         csi.storage.k8s.io/fstype: ext4

     • “zone” means “rack” in our clusters
     [Diagram: the “ceph-block” StorageClass and its provisioner point at the “block-pool” RBD pool (3 replicas, rack failure domain) in the Ceph cluster]
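     Spelled out with apiVersions and a provisioner, the pair of resources might look like this (the namespace and provisioner prefix are assumptions; the usual Rook CSI secret parameters are omitted for brevity):

       apiVersion: ceph.rook.io/v1
       kind: CephBlockPool
       metadata:
         name: block-pool
         namespace: ceph-ssd                    # assumed cluster namespace
       spec:
         failureDomain: zone                    # "zone" is mapped to racks in our CRUSH topology
         replicated:
           size: 3
       ---
       apiVersion: storage.k8s.io/v1
       kind: StorageClass
       metadata:
         name: ceph-block
       provisioner: ceph-ssd.rbd.csi.ceph.com   # <operator namespace>.rbd.csi.ceph.com
       parameters:
         clusterID: ceph-ssd
         pool: block-pool
         csi.storage.k8s.io/fstype: ext4
       reclaimPolicy: Delete
       allowVolumeExpansion: true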
  15. Consume an RBD image
     ▌Just create the following resources (a fuller sketch follows below)

       kind: PersistentVolumeClaim
       metadata:
         name: myclaim
       spec:
         …
         storageClassName: ceph-block

       kind: Pod
       metadata:
         name: mypod
       …
       volumes:
         - name: myvolume
           persistentVolumeClaim:
             claimName: myclaim

     [Diagram: creating the PVC makes the ceph-block provisioner create an RBD volume in block-pool; mypod consumes it]
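     Filled in, the pair could look like the following sketch (the 10 GiB request, container image, and mount path are illustrative assumptions):

       apiVersion: v1
       kind: PersistentVolumeClaim
       metadata:
         name: myclaim
       spec:
         accessModes: ["ReadWriteOnce"]
         resources:
           requests:
             storage: 10Gi            # illustrative size
         storageClassName: ceph-block
       ---
       apiVersion: v1
       kind: Pod
       metadata:
         name: mypod
       spec:
         containers:
           - name: app
             image: nginx             # placeholder application image
             volumeMounts:
               - name: myvolume
                 mountPath: /data     # assumed mount point
         volumes:
           - name: myvolume
             persistentVolumeClaim:
               claimName: myclaim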
  16. Configurations
     ▌Edit the “rook-config-override” ConfigMap resource, which corresponds to “/etc/ceph/ceph.conf” (a sketch follows below)
       ⚫Restarting the Ceph pods is necessary after that
     ▌Some configurations (e.g. “ceph set”) can’t be applied this way
       ⚫Run Ceph commands in the “toolbox” pod instead

       kind: ConfigMap
       metadata:
         name: rook-config-override
       data:
         config: |
           debug rgw = 5/5
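     Fleshed out, the override ConfigMap might look like this (the namespace is an assumption, and settings are usually placed under an explicit [global] or per-daemon section):

       apiVersion: v1
       kind: ConfigMap
       metadata:
         name: rook-config-override
         namespace: ceph-ssd        # assumed cluster namespace
       data:
         config: |
           [global]
           debug rgw = 5/5

     Settings that require Ceph commands instead of ceph.conf entries can be applied from the toolbox pod.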
  17. Expand a cluster
     ▌Just edit the CephCluster resource (see the sketch below)

       kind: CephCluster
       …
       storage:
         …
         count: 6    # changed from 3

     [Diagram: the three new OSD pods are again spread evenly over the nodes and racks]
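     Assuming the storageClassDeviceSets layout sketched earlier, only the count of the device set changes:

       spec:
         storage:
           storageClassDeviceSets:
             - name: ssd-osds
               count: 6    # was 3; Rook prepares three additional OSDs and K8s schedules them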
  18. Replace a damaged OSD
     ▌Just run our home-made script
       ⚫There is no such official job/script
     [Diagram: the home-made script removes the damaged OSD pod, and a new OSD pod is created on another node]
  19. Upgrade Rook and Ceph
     ▌Edit the CephCluster resource (see the sketch below)
     ▌All Ceph container images will be upgraded after that
     ▌Other work might be needed
       ⚫See the files under “Documentation/Upgrade” in the Rook repository

       kind: CephCluster
       …
       image: ceph/ceph:v17.2.6    # changed from “ceph/ceph:v17.2.5”
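     In the CephCluster spec this image reference lives under cephVersion; a minimal excerpt of the edit, assuming the manifest sketched earlier:

       spec:
         cephVersion:
           image: ceph/ceph:v17.2.6   # was ceph/ceph:v17.2.5; the operator rolls all Ceph daemons to the new image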
  20. Troubleshooting
     ▌Same as with other Ceph orchestrations
       ⚫Running Ceph commands
       ⚫Referring to logs and metrics
       ⚫Reporting bugs to upstream
     ▌Rook is convenient but not a silver bullet
       ⚫Rook can’t debug and fix Ceph bugs for you
  21. A challenge: even OSD deployment
     ▌K8s deploys Pods to arbitrary nodes by default
       ⚫OSDs might be spread unevenly
     [Diagram: with “storage: count: 3”, all three OSDs can land on one node, so losing that node results in data loss]
  22. Solution
     ▌Use the “TopologySpreadConstraints” feature of K8s (see the sketch below)
       ⚫Spread specific pods evenly over all nodes (or racks, and so on)

       kind: CephCluster
       …
       storage:
         …
         count: 3
         …
         topologySpreadConstraints:
           - labelSelector:
               matchExpressions:
                 - key: app
                   operator: In
                   values:
                     - rook-ceph-osd
                     - rook-ceph-osd-prepare
             topologyKey: topology.kubernetes.io/hostname

     See also: https://blog.kintone.io/entry/2020/09/18/175030
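     A complete constraint needs a few more fields than the slide shows; a sketch, with maxSkew and whenUnsatisfiable chosen here for illustration:

       topologySpreadConstraints:
         - maxSkew: 1                           # at most one more OSD on any node than on the least-loaded one
           topologyKey: topology.kubernetes.io/hostname
           whenUnsatisfiable: DoNotSchedule     # refuse to schedule rather than spread unevenly
           labelSelector:
             matchExpressions:
               - key: app
                 operator: In
                 values:
                   - rook-ceph-osd
                   - rook-ceph-osd-prepare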
  23. In our clusters
     ▌Two constraints, for both nodes and racks (see the sketch below)

       kind: CephCluster
       …
       storage:
         …
         count: 12
         …
         topologySpreadConstraints:
           - labelSelector: …
             topologyKey: topology.kubernetes.io/zone
           - labelSelector: …
             topologyKey: topology.kubernetes.io/hostname

     [Diagram: the 12 OSDs are spread evenly across the three racks and their nodes]
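     Spelled out in the same hedged form as the previous sketch, the two constraints would look like this (the whenUnsatisfiable policy is again an illustrative choice):

       topologySpreadConstraints:
         - maxSkew: 1
           topologyKey: topology.kubernetes.io/zone        # our zones are racks
           whenUnsatisfiable: DoNotSchedule
           labelSelector:
             matchExpressions:
               - key: app
                 operator: In
                 values: ["rook-ceph-osd", "rook-ceph-osd-prepare"]
         - maxSkew: 1
           topologyKey: topology.kubernetes.io/hostname
           whenUnsatisfiable: DoNotSchedule
           labelSelector:
             matchExpressions:
               - key: app
                 operator: In
                 values: ["rook-ceph-osd", "rook-ceph-osd-prepare"]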
  24. A challenge: automatic OSD deployment
     ▌OSD creation flow
       1. Rook creates a PVC resource for an OSD
       2. K8s binds a PV resource to this PVC resource
       3. Rook creates an OSD on top of the block device corresponding to the PV
     ▌The flow is suspended if no PV resource is available
     [Diagram: with “storage: count: 3”, OSD creation stalls while the storage provisioner has not provisioned the volumes yet]
  25. Solution
     ▌Use storage providers that support dynamic provisioning (see the sketch below)
       ⚫e.g. many cloud storage services, TopoLVM (for local volumes)
     ▌Or provision PV resources beforehand
       ⚫e.g. local-static-provisioner

       storage:
         …
         count: 3
         volumeClaimTemplates:
           - spec:
               storageClassName: nice-provisioner

     [Diagram: the OSDs consume volumes that are provisioned on demand or pre-provisioned by the “nice provisioner”]
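     A fuller sketch of that storage section, again assuming the storageClassDeviceSets form and treating the class name as a placeholder:

       storage:
         storageClassDeviceSets:
           - name: default
             count: 3
             volumeClaimTemplates:
               - metadata:
                   name: data
                 spec:
                   accessModes: ["ReadWriteOnce"]
                   volumeMode: Block                   # OSDs consume raw block devices
                   storageClassName: nice-provisioner  # placeholder: a dynamic class (e.g. TopoLVM) or a pre-provisioned local class
                   resources:
                     requests:
                       storage: 1Ti                    # illustrative size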
  26. In our cluster
     ▌NVMe SSD
       ⚫Use TopoLVM
       ⚫PV resources and the corresponding LVM logical volumes are created on OSD creation
     ▌HDD
       ⚫Use local persistent volumes
       ⚫Create PVs for all HDDs when deploying nodes
       ⚫PV resources are already available on OSD creation (see the sketch below)
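     For reference, a pre-provisioned local PV for one HDD might look like the following sketch; the name, class, size, device path, and node name are all illustrative (in practice tools like local-static-provisioner generate these):

       apiVersion: v1
       kind: PersistentVolume
       metadata:
         name: hdd-node1-sda              # hypothetical name; real PVs are generated per device
       spec:
         capacity:
           storage: 8Ti                   # illustrative HDD size
         volumeMode: Block                # raw block device for the OSD
         accessModes: ["ReadWriteOnce"]
         persistentVolumeReclaimPolicy: Retain
         storageClassName: hdd-local      # assumed class, referenced from volumeClaimTemplates
         local:
           path: /dev/sda                 # assumed device path on the node
         nodeAffinity:
           required:
             nodeSelectorTerms:
               - matchExpressions:
                   - key: kubernetes.io/hostname
                     operator: In
                     values: ["node1"]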
  27. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  28. Daily check of upstream Rook/Ceph
     ▌Check every update of both the Rook and Ceph projects every day
       ⚫Watch the important bugfix PRs and backport them if necessary
     ▌e.g. a data-loss bug in RGW (PR#49795)
       ⚫Some objects might be lost on bucket index resharding
       ⚫We postponed resharding operations until the upgrade to >= v17.2.6
  29. Upstream-first development
     ▌We’ve shared everything with the Rook/Ceph communities
       ⚫Reduces the long-term maintenance cost
       ⚫Makes both communities better
     ▌Major contributions
       ⚫Implemented Rook’s advanced configurations
       ⚫I’ve been working as a Rook maintainer
       ⚫Resolved some problems in containerized Ceph clusters
       ⚫https://speakerdeck.com/sat/revealing-bluestore-corruption-bugs-in-containerized-ceph-clusters
  30. Running custom containers
     ▌If there is a critical bug and we can’t wait for the next release, we want to use our own custom containers
       ⚫Official release + critical patches
     ▌We have been trying to run Teuthology in our test environment to verify custom containers
       ⚫We succeeded in running the test suite, but most tests still fail
       ⚫We’ll continue to work on this problem
  31. Agenda ▌Cybozu and our storage system ▌Quick introduction to K8s and Rook ▌Advanced Rook/Ceph cluster ▌Efforts and challenges ▌Conclusion
  32. Conclusion
     ▌Rook is an attractive option for Ceph orchestration
       ⚫Especially if you are familiar with K8s
     ▌There are some advanced configurations to be aware of
     ▌We’ll continue to provide feedback to the Rook/Ceph communities