I can’t belive its been more than a year since my last composefs blog. So much has changed, yet the goal is the same. However, finally things are starting to settling down, so I think it is time to do an update on the current state.
First some background, like what even is Composefs?
The initial version of Composefs was an actual linux kernel filesystem module. But during development and upstream discussions it has become clear that a better approach is to use existing kernel features, with some minor feature additions rather than a completely new filesystem. However, from a user perspective, it still looks like a filesystem.
Composefs is not a regular filesystem like ext4 though. It is not meant to be stored on a disk partition, but instead as a file that gets downloaded and mounted. The typical usecase is that you have some directory structure, call it an “image”, and you want to package up this image so it can easily be used elsewhere without needing to unpack it. For example, this could be a container image, or a rootfs image.
The closest thing that we have today is a loopback mount of a read-only filesystem (such as squashfs or erofs). However, composefs has two big advantages over this: file sharing and integrity validation.
A composefs image is made up of two things, the filesystem image and a directory of backing files. For example, suppose you have this directory:
$ tree example-dir/ example-dir/ ├── data.txt └── subdir └── other.txt
With the mkcomposefs tool you can create an image from this:
$ mkcomposefs --digest-store=objects /the/example-dir example.cfs $ tree ├── example.cfs └── objects ├── 9e │ └── 3ba51c3a07352a7c32a27a17604d8118724a16... └── a2 └── 447bfab34972328b1f67e3d06e82c990c64f12...
The example.cfs image has all the information about the files and directories, like filenames, permissions, mtimes, xattrs, etc. However, the actual file data is stored in the individual backing files in the objects subdirectory, and are only accessed when needed (i.e. when respective file is opened).
We can mount it like this (using the mount.composefs helper):
$ sudo mount -t composefs -o basedir=objects example.cfs /mnt $ ls -l /mnt/ -rw-r--r--. 1 alex alex 18 11 jul 14.25 data.txt drwxr-xr-x. 2 alex alex 48 11 jul 14.26 subdir
Note that the backing files are named by the checksum of their content. This means that if you create multiple images with a shared objects directory, then the same backing file will be used for any file that is shared between images.
Not only does this mean that images that share files can be stored more efficiently on disk, it also means that any such shared files will be stored only once in page-cache (i.e. ram). A container system using this would allow more containers to run on the same hardware, because libraries shared between unrelated images can be shared.
Additionally, composefs supports using fs-verity for both the image file, and all the backing files. This means that if you specify the expected digest when you mount the composefs image, it will be validated before use. Then the image itself contains the expected fs-verity digests of the backing files, and these will be also be verified at use time. Using this we get the integrity benefit of something like dm-verity, while still allowing fine-grained disk and memory sharing.
Composefs itself consists of a few pieces:
- Userspace support, including mkcomposefs and mount.composefs
- The image format, which is based on an erofs image with overlayfs xattrs
- New overlayfs kernel feature for supporting the composefs usecase
The userspace is available at https://github.com/containers/composefs and is now in a pretty good state. The current releases are marked pre-release, because we don’t fully want to mark it stable until all the overlayfs changes are in a kernel release so we can fully rely on the format being long-term stable.
On the erofs side, all the features we need are in kernel 5.15 or later.
For overlayfs there are two features the we rely on, the new “data-only” lower directory support, which has landed in 6.5-rc1, and the support for storing fs-verity digests in the overlay.metacopy xattr, which is queued for 6.6 in the overlayfs-next branch. However, these kernel changes are only needed for the integrity checking, so if you don’t need those, then current kernels work.
One of the main usecases for composefs is OSTree. Initial work for composefs integration landed in OSTree 2023.4, and further work is ongoing to support validation of composefs images using ed25519 signatures. This will allow secure boot to extend the chain of trust from the kernel/initrd to the whole of userspace.
This integration with OSTree is intentionally done with minimal changes, because the goal is to allow current users of OSTree to easily benefit from the advantages of composefs. However, in parallel there are long term plans to radically redo and simplify some parts of OSTree on top of composefs. See this upstream issue for more discussions about this.
Another important usecase for composefs has always been OCI containers. Recently the initial work on this landed in containers/storage, which is the first step in making podman use composefs for images, thus allowing higher density and integrity for container images.
These are only the initial steps for composefs. I think the future holds even more interesting ideas. For example, given that both ostree and podman are converging on the same storage model, there is hope that we will be able to share backing files between ostree and podman, such that files identical in the host OS and a container image are stored only once.
We also hope that as composefs starts being deployed more widely people will come up with new exciting use cases.