If you want to make containers under Linux, plenty of high-level options exist. [Lucavallin] wanted to learn more about how containers really work, so he decided to tackle the problem using the low-level kernel functions, and he shared the code with us on GitHub.
Containers are more isolated than processes but not quite full virtual machines. While a virtual machine creates a fake computer, a container is more like a fake operating system. Applications can run with their own idea of libraries, devices, and other resources, but the container doesn’t try to abstract the underlying hardware.
[Lucavallin] tells us that the key features include namespaces, which allow different kernel resources to be grouped into related sets and control access to the different features. The seccomp facility controls what system calls a process may make, while the capabilities system controls what root can do in the container. Finally, the cgroups system allows you to limit resources so one container gets a fair share of things like CPU time or disk I/O.
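To make the cgroups part a little more concrete, here is a minimal Python sketch of how resource limits are expressed under cgroups v2. It is not taken from [Lucavallin]'s code. It assumes the unified hierarchy is mounted at /sys/fs/cgroup, and the function names and the group name passed to apply_limits are made up for illustration. The cpu.max, memory.max, and cgroup.procs files are the standard cgroups v2 interface files.

```python
from pathlib import Path

# Usual cgroups v2 mount point (an assumption; some systems mount it elsewhere).
CGROUP_ROOT = Path("/sys/fs/cgroup")


def cpu_max_value(cpu_fraction: float, period_us: int = 100_000) -> str:
    """Format a cpu.max entry: '<quota> <period>', both in microseconds."""
    if cpu_fraction <= 0:
        raise ValueError("cpu_fraction must be positive")
    quota_us = int(cpu_fraction * period_us)
    return f"{quota_us} {period_us}"


def memory_max_value(mebibytes: int) -> str:
    """Format a memory.max entry as a plain byte count."""
    return str(mebibytes * 1024 * 1024)


def apply_limits(group: str, cpu_fraction: float, mebibytes: int, pid: int) -> None:
    """Create a cgroup, set its limits, and move a process into it.

    Needs root (or a delegated subtree), and the cpu and memory controllers
    must be enabled in the parent's cgroup.subtree_control.
    """
    path = CGROUP_ROOT / group  # hypothetical group name supplied by the caller
    path.mkdir(exist_ok=True)
    (path / "cpu.max").write_text(cpu_max_value(cpu_fraction))
    (path / "memory.max").write_text(memory_max_value(mebibytes))
    (path / "cgroup.procs").write_text(str(pid))
```

Calling cpu_max_value(0.5) produces the string you would write to cpu.max to cap a group at half of one CPU. The apply_limits helper only works with sufficient privileges, so treat it as an outline of the steps rather than something to paste into production.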
These features are available in the kernel starting with version 6.0.x, so you’ll need that. In addition, namespaces and cgroups v2 have to be on. If you aren’t sure, skim your /boot/config-* file (use the one that matches what uname -a tells you). For the user namespace, for example, you should find CONFIG_USER_NS set to y. You can also look at /proc/self/ns and see if it has the namespace object you are looking for. If you want to be sure cgroups v2 is enabled, try “grep cgroup /proc/filesystems” and you should see a “cgroup2” entry.
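If you would rather script those checks, something like the following Python sketch could do it. The helper names are invented for this example, and it assumes your distribution keeps the kernel config at /boot/config-<release>, which not every distribution does. When a file is missing, the script just reports False rather than failing.

```python
import os
import platform
from pathlib import Path


def config_enables(config_text: str, option: str) -> bool:
    """Return True if a kernel config listing sets `option` to y."""
    return any(line.strip() == f"{option}=y" for line in config_text.splitlines())


def has_cgroup2(filesystems_text: str) -> bool:
    """Return True if /proc/filesystems-style text lists cgroup2."""
    # Each line is "[nodev]<TAB><fstype>", so the type is the last field.
    return any(line.split()[-1:] == ["cgroup2"] for line in filesystems_text.splitlines())


def read_text_or_empty(path: Path) -> str:
    """Read a file, returning an empty string if it is missing or unreadable."""
    try:
        return path.read_text()
    except OSError:
        return ""


if __name__ == "__main__":
    release = platform.release()  # the kernel release that uname also reports
    config = read_text_or_empty(Path(f"/boot/config-{release}"))
    print("CONFIG_USER_NS=y:", config_enables(config, "CONFIG_USER_NS"))

    ns_dir = Path("/proc/self/ns")
    names = sorted(os.listdir(ns_dir)) if ns_dir.is_dir() else []
    print("namespaces:", names)

    filesystems = read_text_or_empty(Path("/proc/filesystems"))
    print("cgroup2:", has_cgroup2(filesystems))
```

The parsing lives in small functions that take text, so you can try them against a config file copied from another machine without touching /boot or /proc.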