cgroup v2 internals

Kenta Tada
December 05, 2020

Transcript

  1. About me
    ⚫ Kenta Tada
    ⚫ Software Engineer, Sony
    ⚫ CloudNative Days Tokyo 2020
      • https://speakerdeck.com/kentatada/embedded-container-runtime-using-linux-capabilities-seccomp-cgroups
  2. Unified hierarchy
    ⚫ Since cgroup v2, all controllers live under the same hierarchy.
    ⚫ If the parent cgroup disables a controller, it cannot be enabled in the descendant cgroups.
      • The cgroup.subtree_control file controls which controllers are enabled in the descendant cgroups.
    Diagram (flattened): /sys/fs/cgroup, /cgtest1 (cpu.*, memory.*), /cgtest2 (cpu.*, memory.*)
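The rule on this slide can be modelled in a few lines of userspace C. This is only a sketch, not kernel code: the controller bit values and the helper names child_available() and can_enable() are invented for illustration, and it assumes that a child's available controllers are exactly the ones its parent lists in cgroup.subtree_control.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical controller bits for illustration (not the kernel's ids). */
enum { CTRL_CPU = 1u << 0, CTRL_MEMORY = 1u << 1, CTRL_IO = 1u << 2 };

/* Model: what a child can see is what the parent enabled for its subtree. */
uint16_t child_available(uint16_t parent_subtree_control)
{
    return parent_subtree_control;
}

/* A set of controllers can be enabled only if all of them are available. */
bool can_enable(uint16_t available, uint16_t wanted)
{
    return (available & wanted) == wanted;
}
```

With a parent that enables only cpu and memory, can_enable() accepts cpu and rejects io, which is the "cannot be enabled in the descendant cgroups" case.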
  3. New features
    ⚫ nsdelegate and cgroup namespaces for unprivileged containers
      • But some controllers still need privileges because they use eBPF.
    ⚫ cgroup-aware OOM killer
    ⚫ PSI per cgroup
    ⚫ (NEW) Utilization clamping support
      • Assigns computational power to task groups while taking into account the actual CPU frequency (which depends on how schedutil operates) and asymmetric-capacity systems such as Arm's big.LITTLE.
      • https://github.com/torvalds/linux/commit/2480c093130f64ac3a410504fa8b3db1fc4b87ce
  4. The difference between v1 and v2 at kernel 5.9

    v1 subsystem  Kernel object name      Source                               v2 subsystem
    blkio         io_cgrp_subsys          block/blk-cgroup.c                   io
    cpuacct       cpuacct_cgrp_subsys     kernel/sched/cpuacct.c               cpu
    cpu           cpu_cgrp_subsys         kernel/sched/core.c                  cpu
    cpuset        cpuset_cgrp_subsys      kernel/cgroup/cpuset.c               cpuset
    devices       devices_cgrp_subsys     security/device_cgroup.c             Using eBPF
    freezer       freezer_cgrp_subsys     (v1) kernel/cgroup/legacy_freezer.c  freezer
                                          (v2) kernel/cgroup/freezer.c
    hugetlb       hugetlb_cgrp_subsys     mm/hugetlb_cgroup.c                  hugetlb
    memory        memory_cgrp_subsys      mm/memcontrol.c                      memory
    net_cls       net_cls_cgrp_subsys     net/core/netclassid_cgroup.c         Using eBPF
    net_prio      net_prio_cgrp_subsys    net/core/netprio_cgroup.c            Using eBPF
    perf_event    perf_event_cgrp_subsys  kernel/events/core.c                 perf_event
    pids          pids_cgrp_subsys        kernel/cgroup/pids.c                 pids
    rdma          rdma_cgrp_subsys        kernel/cgroup/rdma.c                 rdma

    https://events.static.linuxfound.org/sites/events/files/slides/cgroup_and_namespaces.pdf
  5. How to confirm the available controllers
    ⚫ cgroup.controllers
      • Ex. /sys/fs/cgroup/cgroup.controllers
      • Each cgroup has a "cgroup.controllers" file which lists all controllers available for the cgroup to enable.
    ⚫ But some controllers are not listed in that file.
    ⚫ Why?
    https://www.kernel.org/doc/html/v5.9/admin-guide/cgroup-v2.html
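Checking whether a controller is listed comes down to scanning a space-separated line. The helper below is a minimal sketch of that check in userspace C; the name has_controller() is hypothetical, and the caller is assumed to have already read the single line of cgroup.controllers into a string (for example with fgets()).

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `name` appears as a whole word in the space-separated
 * controller list `list` (the layout of a cgroup.controllers file). */
bool has_controller(const char *list, const char *name)
{
    size_t n = strlen(name);
    const char *p = list;

    if (n == 0)
        return false;
    while ((p = strstr(p, name)) != NULL) {
        /* Reject partial matches such as "cpu" inside "cpuset". */
        bool starts = (p == list) || p[-1] == ' ';
        bool ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return true;
        p += n;
    }
    return false;
}
```

The whole-word test matters because controller names overlap: a naive substring search would report cpu as present whenever cpuset is listed.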
  6. Implicit or inhibited controllers
    ⚫ Some controllers are not supported in the default hierarchy.
      • cgrp_dfl_inhibit_ss_mask
    ⚫ Some controllers are implicitly enabled on the default hierarchy.
      • cgrp_dfl_implicit_ss_mask
    ⚫ When the system boots, the kernel sets up the above masks.

        if (ss->implicit_on_dfl)
                cgrp_dfl_implicit_ss_mask |= 1 << ss->id;
        else if (!ss->dfl_cftypes)
                cgrp_dfl_inhibit_ss_mask |= 1 << ss->id;

    https://github.com/torvalds/linux/blob/v5.9/kernel/cgroup/cgroup.c#L5740-L5743
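To see how the two masks are filled, the boot-time branch can be replayed over a small array in userspace. This is a sketch under assumptions: struct subsys_model and build_dfl_masks() are made-up stand-ins, and the boolean has_dfl_cftypes replaces the kernel's dfl_cftypes pointer check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct cgroup_subsys (illustrative). */
struct subsys_model {
    int id;               /* bit position in the masks */
    bool implicit_on_dfl; /* implicitly enabled on the default hierarchy */
    bool has_dfl_cftypes; /* true if the controller defines v2 interface files */
};

struct dfl_masks {
    uint16_t implicit;
    uint16_t inhibit;
};

/* Apply the same if / else-if as the kernel snippet to each model entry. */
struct dfl_masks build_dfl_masks(const struct subsys_model *ss, size_t n)
{
    struct dfl_masks m = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (ss[i].implicit_on_dfl)
            m.implicit |= (uint16_t)(1u << ss[i].id);
        else if (!ss[i].has_dfl_cftypes)
            m.inhibit |= (uint16_t)(1u << ss[i].id);
    }
    return m;
}
```

A controller that is neither implicit nor lacking v2 interface files lands in neither mask, so it stays eligible to be listed.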
  7. cgroup.controllers in kernel
    ⚫ cgroup core interface files are defined as below:

        .name = "cgroup.controllers",
        .seq_show = cgroup_controllers_show,

    ⚫ cgroup_controllers_show() calls cgroup_control().
    ⚫ cgroup_control() returns the visible controllers using the masks.

        static u16 cgroup_control(struct cgroup *cgrp)
        {
                …
                if (cgroup_on_dfl(cgrp))
                        root_ss_mask &= ~(cgrp_dfl_inhibit_ss_mask |
                                          cgrp_dfl_implicit_ss_mask);
                return root_ss_mask;
        }

    https://github.com/torvalds/linux/blob/v5.9/kernel/cgroup/cgroup.c
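The masking step is what answers the earlier "Why?": inhibited and implicit controllers are cleared from the set before it is printed. A minimal userspace model of that one expression follows; visible_controllers() is a hypothetical name, and the elided part of cgroup_control() is not reproduced.

```c
#include <stdint.h>

/* Model of the default-hierarchy branch of cgroup_control(): drop every
 * bit that is set in either the inhibit mask or the implicit mask. */
uint16_t visible_controllers(uint16_t root_ss_mask,
                             uint16_t inhibit_mask,
                             uint16_t implicit_mask)
{
    return root_ss_mask & (uint16_t)~(inhibit_mask | implicit_mask);
}
```

For example, with three controllers in the root mask (bits 0 to 2), one inhibited (bit 2) and one implicit (bit 1), only bit 0 remains visible.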
  8. Device controller in cgroup v1
    ⚫ The device controller allows or denies access to devices for each cgroup.
    ⚫ There are three files to control behavior.
      • devices.allow is the allowlist of devices.
      • devices.deny is the denylist of devices.
      • devices.list shows available devices.
    ⚫ Interface
      • Ex. Allow cgroup 1 to read and mknod /dev/null as below:

        # echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow

    https://www.kernel.org/doc/html/v5.9/admin-guide/cgroup-v1/devices.html
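The rule string written to devices.allow has the shape "type major:minor access". The sketch below only builds that string; format_device_rule() is a hypothetical helper, it does not write to any cgroup file, and it does not cover wildcard ('*') major or minor numbers.

```c
#include <stdio.h>

/* Format a cgroup v1 device rule such as "c 1:3 mr".
 * type: 'c' (char), 'b' (block) or 'a' (all); access: any of "r", "w", "m".
 * Returns the snprintf() result; output is truncated if buf is too small. */
int format_device_rule(char *buf, size_t len, char type,
                       unsigned int major, unsigned int minor,
                       const char *access)
{
    return snprintf(buf, len, "%c %u:%u %s", type, major, minor, access);
}
```

Calling it with 'c', 1, 3 and "mr" reproduces the rule from the slide, which grants mknod and read on /dev/null.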
  9. Device controller in cgroup v2
    ⚫ cgroup v2 uses eBPF for several purposes.
      • Ex. To control network access
    ⚫ The eBPF program is attached to a specific cgroup.
    ⚫ BPF_PROG_TYPE_CGROUP_DEVICE was introduced in kernel version 4.15.
  10. What is eBPF
    ⚫ eBPF is a revolutionary technology that can run sandboxed programs in the Linux kernel without changing kernel source code.
    https://ebpf.io/
    http://www.brendangregg.com/ebpf.html
  11. Attach device control settings inside kernel space
    ⚫ Attach the eBPF program to a cgroup
      • https://github.com/torvalds/linux/blob/v5.9/kernel/bpf/syscall.c#L4187
        → bpf_prog_attach at kernel/bpf/syscall.c#L2839
        → cgroup_bpf_prog_attach at kernel/bpf/cgroup.c#L762
        → cgroup_bpf_attach at kernel/cgroup/cgroup.c#L6496
        → __cgroup_bpf_attach at kernel/bpf/cgroup.c#L433
  12. __cgroup_bpf_attach

        /**
         * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and
         *                         propagate the change to descendants
         * @cgrp: The cgroup which descendants to traverse
         * @prog: A program to attach
         * @link: A link to attach
         * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set
         * @type: Type of attach operation
         * @flags: Option flags
         *
         * Exactly one of @prog or @link can be non-null.
         * Must be called with cgroup_mutex held.
         */
        int __cgroup_bpf_attach(struct cgroup *cgrp,
                                struct bpf_prog *prog, struct bpf_prog *replace_prog,
                                struct bpf_cgroup_link *link,
                                enum bpf_attach_type type, u32 flags)
  13. Check device permissions inside kernel space
    ⚫ For example, when mknod(2) is executed:
      • https://github.com/torvalds/linux/blob/v5.9/fs/namei.c#L3528
        → devcgroup_inode_mknod at include/linux/device_cgroup.h#L40
        → devcgroup_check_permission at security/device_cgroup.c#L835
        → BPF_CGROUP_RUN_PROG_DEVICE_CGROUP at include/linux/bpf-cgroup.h#L295
        → __cgroup_bpf_check_dev_permission at kernel/bpf/cgroup.c#L1125
  14. __cgroup_bpf_check_dev_permission

        int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
                                              short access, enum bpf_attach_type type)
        {
                struct cgroup *cgrp;
                struct bpf_cgroup_dev_ctx ctx = {
                        .access_type = (access << 16) | dev_type,
                        .major = major,
                        .minor = minor,
                };
                int allow = 1;

                rcu_read_lock();
                cgrp = task_dfl_cgroup(current);
                allow = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], &ctx,
                                           BPF_PROG_RUN);
                rcu_read_unlock();

                return !allow;
        }
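The access_type field packs two values into one 32-bit word: the access bits in the upper half and the device type in the lower half. The userspace sketch below shows the packing and the matching unpacking an eBPF program would do. The constant values are intended to mirror the BPF_DEVCG_* enums in the kernel's uapi linux/bpf.h, but they are redefined here under different names so the example compiles without kernel headers; the function names are hypothetical.

```c
#include <stdint.h>

/* Local copies of the uapi values (assumed to match BPF_DEVCG_ACC_* and
 * BPF_DEVCG_DEV_*); renamed to avoid clashing with <linux/bpf.h>. */
enum { DEVCG_ACC_MKNOD = 1, DEVCG_ACC_READ = 2, DEVCG_ACC_WRITE = 4 };
enum { DEVCG_DEV_BLOCK = 1, DEVCG_DEV_CHAR = 2 };

/* Pack as the kernel snippet does: access in the high 16 bits,
 * device type in the low 16 bits. */
uint32_t pack_access_type(uint16_t access, uint16_t dev_type)
{
    return ((uint32_t)access << 16) | dev_type;
}

/* Recover the device type from the low 16 bits. */
uint16_t unpack_dev_type(uint32_t access_type)
{
    return (uint16_t)(access_type & 0xFFFF);
}

/* Recover the access bits from the high 16 bits. */
uint16_t unpack_access(uint32_t access_type)
{
    return (uint16_t)(access_type >> 16);
}
```

Note that the kernel function returns !allow, so a non-zero return value means the attached programs denied the access.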
  15. Demo: Load device control settings from user space
    ⚫ Observe kernel functions when the kernel test code is executed.
    ⚫ The test code (test_dev_cgroup.c) only permits the following:
      • major: 1, minor: 5 (/dev/zero)
      • major: 1, minor: 9 (/dev/urandom)
    ⚫ bpftrace shows the values which are actually checked in the kernel.

        assert(system("mknod /tmp/test_dev_cgroup_null c 1 3"));
        assert(system("mknod /tmp/test_dev_cgroup_zero c 1 5") == 0);
        assert(system("dd if=/dev/urandom of=/dev/zero count=64") == 0);
        assert(system("dd if=/dev/urandom of=/dev/full count=64"));
        assert(system("dd if=/dev/random of=/dev/zero count=64"));

    The test code is from https://github.com/torvalds/linux/blob/v5.10-rc3/tools/testing/selftests/bpf/test_dev_cgroup.c

    [Expected outputs]
      1. major: 1, minor: 3
      2. major: 1, minor: 5
      3. major: 1, minor: 9
      4. major: 1, minor: 5
      5. major: 1, minor: 9 (if is allowed)
      6. major: 1, minor: 7 (of is forbidden)
      7. major: 1, minor: 8 (if is forbidden)
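The allow/deny outcomes in the demo follow from a very small policy. The function below is a userspace model of that policy as described on the slide, not the eBPF program itself: demo_policy_allows() is a hypothetical name, the device-type constants are local stand-ins, and restricting the rule to character devices is an assumption based on the devices used in the test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the device type values (illustrative). */
enum { MODEL_DEV_BLOCK = 1, MODEL_DEV_CHAR = 2 };

/* Allow only character devices 1:5 (/dev/zero) and 1:9 (/dev/urandom);
 * everything else is denied. */
bool demo_policy_allows(uint16_t dev_type, uint32_t major, uint32_t minor)
{
    if (major != 1 || dev_type != MODEL_DEV_CHAR)
        return false;
    return minor == 5 || minor == 9;
}
```

Run against the device numbers in the expected outputs, the model denies 1:3, 1:7 and 1:8 and allows 1:5 and 1:9, which matches which of the five assert() calls expect a zero exit status.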
  16. How to use eBPF-based device controller
    ⚫ Many tools provide an abstraction layer.
    ⚫ OCI runtime spec
      • This specification was originally designed for cgroup v1.
      • But some container runtimes can translate the cgroup v1 configuration for v2.
    ⚫ libcgroup is currently developing support for cgroup v2 interfaces.
      • https://github.com/libcgroup/libcgroup/issues/12
  17. Key takeaways
    ⚫ There are a lot of features in cgroup v2.
      • Features such as uclamp are useful not only for cloud systems but also for embedded systems.
    ⚫ cgroup v2 changed the interfaces and the way resources are controlled.
      • Some cgroup v2 controllers are not supported in the default hierarchy.
    ⚫ eBPF is important for cgroup v2.