Nested Docker inside LXC 3.0 on Debian 10 (Buster) with CGroups V2 (Unified Hierarchy)

I have been using systemd-nspawn on my Debian servers as a simple isolation tool, but there is one problem: it does not play nicely with Docker nested inside. I had to use a dirty hack to make it work, and the hack basically makes the namespace isolation pointless. This seems to be a limitation of cgroups v1, i.e. the “legacy hierarchy”, but for some reason enabling the cgroups v2 “unified hierarchy” makes it impossible to use Docker inside at all, even with the hack. With Debian's migration to cgroups v2 on the horizon, I figured it would be better to start preparing for the transition now rather than later. Since systemd-nspawn did not work in my previous attempt, I decided to try LXC, as I had heard that it supports nesting Docker even on cgroups v1.

Enabling CGroups V2 in Systemd

To use pure cgroups v2, we have to make sure systemd uses only the unified hierarchy; otherwise, as of Debian 10, it defaults to the “hybrid hierarchy”. That is not good enough here, because the goal is to prepare for the transition to pure cgroups v2.

It is fairly straightforward to enable cgroups v2 support in systemd: add systemd.unified_cgroup_hierarchy=1 to the kernel command line (for GRUB, via GRUB_CMDLINE_LINUX in /etc/default/grub), then regenerate your bootloader configuration (for GRUB, run update-grub).
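For GRUB, the edit can be scripted. The following is a minimal sketch, not a definitive implementation: the function name enable_unified_cgroups is made up for illustration, and it assumes the defaults file contains a double-quoted GRUB_CMDLINE_LINUX="..." line and that GNU sed is available.

```shell
#!/bin/sh
# Hypothetical helper: add systemd.unified_cgroup_hierarchy=1 to the
# GRUB_CMDLINE_LINUX line of a GRUB defaults file, unless already present.
# Assumes a double-quoted GRUB_CMDLINE_LINUX="..." line and GNU sed (-i).
enable_unified_cgroups() {
    grub_file="$1"
    param="systemd.unified_cgroup_hierarchy=1"

    # Nothing to do if the parameter is already somewhere in the file
    if grep -qF "$param" "$grub_file"; then
        return 0
    fi

    # Insert the parameter right after the opening quote of the variable
    sed -i "s/^GRUB_CMDLINE_LINUX=\"/&$param /" "$grub_file"

    # Report failure if the expected line was not found and nothing changed
    grep -qF "$param" "$grub_file"
}
```

As root, this would be used as enable_unified_cgroups /etc/default/grub && update-grub, followed by a reboot. Keep a backup of the file before trying it.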

After rebooting, run mount | grep cgroup and check that the output looks like this:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

Notice the cgroup2 at the start and after type.
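If you would rather not eyeball the output, the same check can be sketched as a small filter over the mount table. The function name is_unified is an assumption for illustration; it only looks for a cgroup2 filesystem mounted exactly at /sys/fs/cgroup in text formatted like the output of mount.

```shell
#!/bin/sh
# Hypothetical helper: read mount(8)-style output on stdin and succeed only
# if a cgroup2 filesystem is mounted exactly at /sys/fs/cgroup.
# Expected line shape: "cgroup2 on /sys/fs/cgroup type cgroup2 (options)"
is_unified() {
    awk '
        $1 == "cgroup2" && $3 == "/sys/fs/cgroup" && $5 == "cgroup2" { found = 1 }
        END { exit !found }
    '
}
```

Usage would be along the lines of mount | is_unified && echo "pure cgroups v2". On a hybrid setup the check fails, because cgroup2 is then mounted on a subdirectory of /sys/fs/cgroup rather than on /sys/fs/cgroup itself.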

LXC Configuration for CGroups V2

I won't go into the details of setting up an LXC container: you can do it the “template” way, or you can generate the rootfs yourself with something like debootstrap and write the configuration by hand, as described in this article. Either way, you'll have to apply the following additional configuration yourself.

Debian has an excellent wiki page about LXC cgroups v2 compatibility, but it is for LXC 4.0, which will probably be included in Debian 11 “Bullseye”. As of now, Debian 10 “Buster” comes with LXC 3.0, which contains some bugs in terms of cgroups v2 compatibility. Specifically,

lxc.mount.auto = cgroup:rw:force

is broken on LXC 3.0: it mounts cgroup2 at /sys/fs/cgroup/cgroup instead of the correct path /sys/fs/cgroup (reported here and fixed in 4.0). The workaround is to mount cgroup2 at the correct path manually:

lxc.mount.entry = cgroup2 sys/fs/cgroup cgroup2 create=dir,rw 0 0

(Note: theoretically, you can use the alternative solution from the Debian wiki page, lxc.init.cmd = /sbin/init systemd.unified_cgroup_hierarchy=1, but this did not work for me with a Debian 10 rootfs inside the container either. I am not sure why, but force-mounting the cgroup2 filesystem seems to work just fine.)

In addition, the automatically generated AppArmor profile is broken for cgroups v2. Some systemd services inside the container, such as systemd-networkd, crash because they cannot set up their service-specific namespaces. For now, the dirty fix is to put the container in the unconfined profile:

lxc.apparmor.profile = unconfined

This is, obviously, not exactly great for security, but rest assured you shouldn't need this once Debian 11 is released and you can upgrade to LXC 4.0.

I have put together a configuration file that you can simply include into your own configuration via the lxc.include directive:

# Mount cgroup2 at ${CONTAINER_ROOT}/sys/fs/cgroup
# to force systemd to use cgroups v2
# TODO: After LXC 4.0, we only need
#   lxc.mount.auto = cgroup:rw:force
# for this. The line above does not work in LXC 3.0
lxc.mount.entry = cgroup2 sys/fs/cgroup cgroup2 create=dir,rw 0 0

# Clear cgroups v1 rules
lxc.cgroup.devices.allow =
lxc.cgroup.devices.deny =

# LXC 3.0 has apparmor bugs for cgroup v2
# TODO: Remove this after LXC 4.0
lxc.apparmor.profile = unconfined
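
Hooking the file above into a container can be sketched as follows. The function name add_include and both paths in the usage note are assumptions for illustration; adjust them to wherever you keep the shared file and the container's config.

```shell
#!/bin/sh
# Hypothetical helper: append an lxc.include line to a container config,
# unless exactly that line is already there.
# $1: container config file, $2: path of the shared configuration file
add_include() {
    config="$1"
    fragment="$2"
    line="lxc.include = $fragment"

    # -x matches whole lines only, -F treats the line as a fixed string
    grep -qxF "$line" "$config" || printf '%s\n' "$line" >> "$config"
}
```

For example, add_include /var/lib/lxc/mycontainer/config /etc/lxc/cgroup2.conf, where both the container name and the location of the shared file are placeholders. Appending at the end is deliberate: the empty lxc.cgroup.devices entries can only clear rules that were set earlier in the configuration.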

Docker nested in LXC

Actually, with the configuration above, you should already be able to use Docker inside the container. The problem is that Docker has only supported cgroups v2 since version 20.10, and Debian 10 ships a much older version. To use Docker 20.10 on Debian 10, you'll need to add

deb [arch=amd64] https://download.docker.com/linux/debian buster stable

to your APT sources list, and then install docker-ce from it.
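
As a rough sketch, adding the source line can be scripted like this. The function name write_docker_source and the docker.list file name are assumptions for illustration. APT will also refuse the repository until its signing key is trusted, which this sketch does not handle; see Docker's installation instructions for Debian for that step.

```shell
#!/bin/sh
# Hypothetical helper: write the Docker APT source line for Debian 10
# into the given sources list file (overwriting it).
# Does NOT install the repository signing key.
write_docker_source() {
    list_file="$1"
    printf '%s\n' \
        'deb [arch=amd64] https://download.docker.com/linux/debian buster stable' \
        > "$list_file"
}
```

Inside the container, as root, this could be run as write_docker_source /etc/apt/sources.list.d/docker.list, followed by apt-get update and apt-get install docker-ce once the key is in place.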

Remember that for Docker to behave correctly inside a container, you must load the overlay kernel module on the host; otherwise Docker will fall back to vfs, a very inefficient storage driver.
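
One way to check this on the host is to look for overlay in the kernel's list of registered filesystems. This is a minimal sketch; the function name overlay_available is made up for illustration, and the optional argument exists only so the check can be pointed at a different file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if "overlay" appears as a registered
# filesystem. Reads /proc/filesystems by default; lines there look like
# "nodev<TAB>overlay" or "<TAB>ext4", so the name is always the last field.
overlay_available() {
    awk '$NF == "overlay" { found = 1 } END { exit !found }' \
        "${1:-/proc/filesystems}"
}
```

On the host, as root, overlay_available || modprobe overlay loads the module when it is missing. To have it loaded on every boot, a file such as /etc/modules-load.d/overlay.conf containing the single line overlay should do on a systemd-based host.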

To verify your Docker installation is working, run docker info and make sure it shows

 Cgroup Driver: systemd
 Cgroup Version: 2
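
The same check can be sketched as a filter over the docker info output, for use in a provisioning script. The function name check_docker_cgroups is an assumption for illustration; it simply looks for the two lines shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `docker info` output on stdin and succeed only
# if it reports the systemd cgroup driver and cgroup version 2.
check_docker_cgroups() {
    info=$(cat)
    printf '%s\n' "$info" |
        grep -Eq '^[[:space:]]*Cgroup Driver: systemd[[:space:]]*$' &&
    printf '%s\n' "$info" |
        grep -Eq '^[[:space:]]*Cgroup Version: 2[[:space:]]*$'
}
```

Inside the container: docker info | check_docker_cgroups && echo "Docker is using cgroups v2".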