Composefs state of the union

I can’t believe it’s been more than a year since my last composefs blog. So much has changed, yet the goal is the same. However, things are finally starting to settle down, so I think it is time for an update on the current state.

Background

First some background, like what even is Composefs?

The initial version of Composefs was an actual Linux kernel filesystem module. But during development and upstream discussions it became clear that a better approach is to use existing kernel features, with some minor feature additions, rather than a completely new filesystem. However, from a user perspective, it still looks like a filesystem.

Composefs is not a regular filesystem like ext4 though. It is not meant to be stored on a disk partition, but instead as a file that gets downloaded and mounted. The typical usecase is that you have some directory structure, call it an “image”, and you want to package up this image so it can easily be used elsewhere without needing to unpack it. For example, this could be a container image, or a rootfs image.

The closest thing that we have today is a loopback mount of a read-only filesystem (such as squashfs or erofs). However, composefs has two big advantages over this: file sharing and integrity validation.

A composefs image is made up of two things, the filesystem image and a directory of backing files. For example, suppose you have this directory:

$ tree example-dir/
example-dir/
├── data.txt
└── subdir
    └── other.txt

With the mkcomposefs tool you can create an image from this:

$ mkcomposefs --digest-store=objects /the/example-dir example.cfs
$ tree
├── example.cfs
└── objects
    ├── 9e
    │   └── 3ba51c3a07352a7c32a27a17604d8118724a16...
    └── a2
        └── 447bfab34972328b1f67e3d06e82c990c64f12...

The example.cfs image has all the information about the files and directories, like filenames, permissions, mtimes, xattrs, etc. However, the actual file data is stored in the individual backing files in the objects subdirectory, and is only accessed when needed (i.e. when the respective file is opened).

We can mount it like this (using the mount.composefs helper):

$ sudo mount -t composefs -o basedir=objects example.cfs /mnt
$ ls -l /mnt/
-rw-r--r--. 1 alex alex 18 11 jul 14.25 data.txt
drwxr-xr-x. 2 alex alex 48 11 jul 14.26 subdir

Note that the backing files are named by the checksum of their content. This means that if you create multiple images with a shared objects directory, then the same backing file will be used for any file that is shared between images.

Not only does this mean that images that share files can be stored more efficiently on disk, it also means that any such shared files will be stored only once in page-cache (i.e. ram). A container system using this would allow more containers to run on the same hardware, because libraries shared between unrelated images can be shared.
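
To make the sharing concrete, here is a minimal sketch of a content-addressed object store in plain shell. It is not how mkcomposefs is implemented: store_object is a hypothetical helper, the image-a and image-b directories are made up for the demo, and plain SHA-256 via sha256sum is an assumption standing in for whatever digest mkcomposefs actually uses.

```shell
set -eu

# Hypothetical helper (not part of composefs): copy file $1 into store $2,
# named by its SHA-256 checksum, with the first two hex digits as a
# subdirectory like in the tree output above.
store_object() {
    sum=$(sha256sum "$1" | cut -d' ' -f1)
    prefix=$(printf '%s' "$sum" | cut -c1-2)
    rest=$(printf '%s' "$sum" | cut -c3-)
    mkdir -p "$2/$prefix"
    # Skip the copy when the object already exists: identical content
    # coming from another image reuses the same backing file.
    [ -e "$2/$prefix/$rest" ] || cp "$1" "$2/$prefix/$rest"
}

workdir=$(mktemp -d)
mkdir -p "$workdir/image-a" "$workdir/image-b"
echo "shared library" > "$workdir/image-a/libfoo.so"
echo "shared library" > "$workdir/image-b/libfoo.so"   # same content as in image-a
echo "only in b"      > "$workdir/image-b/extra.txt"

for f in "$workdir"/image-a/* "$workdir"/image-b/*; do
    store_object "$f" "$workdir/objects"
done

# Three source files went in, but only distinct contents are stored.
object_count=$(find "$workdir/objects" -type f | wc -l | tr -d ' ')
echo "objects stored: $object_count"
```

The two copies of libfoo.so collapse into a single object, which is the property that lets images with common files share both disk space and page cache.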

Additionally, composefs supports using fs-verity for both the image file and all the backing files. This means that if you specify the expected digest when you mount the composefs image, it will be validated before use. The image itself then contains the expected fs-verity digests of the backing files, and these will also be verified at use time. Using this we get the integrity benefit of something like dm-verity, while still allowing fine-grained disk and memory sharing.
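
The two-level chain of trust described above can be imitated in userspace. The sketch below is only an analogy and makes several assumptions: a text file called manifest stands in for the image, sha256sum stands in for fs-verity (which the kernel enforces, and which uses a different digest construction), and the verify function is a hypothetical helper.

```shell
set -eu

workdir=$(mktemp -d)
cd "$workdir"
mkdir objects
echo "hello" > objects/data
echo "world" > objects/other

# Stand-in for the image: it records the expected digest of each backing file.
sha256sum objects/data objects/other > manifest
# Stand-in for the digest given at mount time: the one value trusted out of band.
trusted=$(sha256sum manifest | cut -d' ' -f1)

# Hypothetical helper: succeed only if the whole chain checks out.
verify() {
    # Step 1: the manifest itself must match the trusted digest.
    [ "$(sha256sum manifest | cut -d' ' -f1)" = "$trusted" ] || return 1
    # Step 2: every backing file must match the digest in the manifest.
    sha256sum -c manifest > /dev/null 2>&1
}

if verify; then before=ok; else before=failed; fi
echo "tampered" > objects/other      # modify a backing file behind our back
if verify; then after=ok; else after=failed; fi

echo "before tampering: $before, after tampering: $after"
```

Only one digest has to be trusted up front. Everything else is checked against data that has already been verified, which is what allows a single expected digest at mount time to cover the whole tree.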

Composefs status

Composefs itself consists of a few pieces:

Userspace support, including mkcomposefs and mount.composefs
The image format, which is based on an erofs image with overlayfs xattrs
New overlayfs kernel features for supporting the composefs usecase

The userspace is available at https://github.com/containers/composefs and is now in a pretty good state. The current releases are marked pre-release, because we don’t want to mark it stable until all the overlayfs changes are in a kernel release and we can fully rely on the format being long-term stable.

On the erofs side, all the features we need are in kernel 5.15 or later.

For overlayfs there are two features that we rely on: the new “data-only” lower directory support, which has landed in 6.5-rc1, and the support for storing fs-verity digests in the overlay.metacopy xattr, which is queued for 6.6 in the overlayfs-next branch. However, these kernel changes are only needed for the integrity checking, so if you don’t need that, then current kernels work.
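
The version requirements above can be turned into a quick check. This is a rough sketch under stated assumptions: version_ge and composefs_features are hypothetical helpers, the feature labels are made up, the 6.6 threshold assumes the verity support ships as queued, and distribution kernels with backports may not match their version number. It relies on sort -V for version comparison.

```shell
set -eu

# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# After a version sort, the smaller of the two comes first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print one (made-up) label per feature that kernel version $1 should have,
# based on the versions named in the text.
composefs_features() {
    if version_ge "$1" 5.15; then echo "erofs-image"; fi
    if version_ge "$1" 6.5;  then echo "data-only-lower"; fi
    if version_ge "$1" 6.6;  then echo "verity-digest-xattr"; fi
}

# Check the running kernel, keeping only major.minor (e.g. 6.5 from 6.5.0-rc1).
composefs_features "$(uname -r | cut -d. -f1,2)"
```

A version check like this is only a first approximation. Actually trying the mount is the reliable way to find out what a given kernel supports.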

OSTree status

One of the main usecases for composefs is OSTree. Initial work for composefs integration landed in OSTree 2023.4, and further work is ongoing to support validation of composefs images using ed25519 signatures. This will allow secure boot to extend the chain of trust from the kernel/initrd to the whole of userspace.

This integration with OSTree is intentionally done with minimal changes, because the goal is to allow current users of OSTree to easily benefit from the advantages of composefs. However, in parallel there are long term plans to radically redo and simplify some parts of OSTree on top of composefs. See this upstream issue for more discussions about this.

Container backend

Another important usecase for composefs has always been OCI containers. Recently the initial work on this landed in containers/storage, which is the first step in making podman use composefs for images, thus allowing higher density and integrity for container images.

Going forward

These are only the initial steps for composefs. I think the future holds even more interesting ideas. For example, given that both ostree and podman are converging on the same storage model, there is hope that we will be able to share backing files between ostree and podman, such that files identical in the host OS and a container image are stored only once.

We also hope that as composefs starts being deployed more widely, people will come up with exciting new use cases.

4 thoughts on “Composefs state of the union”

MayeulC says:

July 11, 2023 at 5:23 pm

Hi, this is the first time I’m hearing about this.

It sounds quite promising, but I’m puzzled by the apparent lack of deduplication/chunking (leveraging a rolling hash) in the presented file structure.

As far as I know, Borg, Git, OSTree and others leverage this, so moving the OSTree back end to this would make it a regression?

I was also anticipating some comparison with tar, and this sentence left me expecting the final structure to be a flat file:

> It is not meant to be stored on a disk partition, but instead as a file that gets downloaded and mounted

Other than that, the readup was intriguing and interesting, though I don’t have much interest in locked-down devices.

1. alexl says:
  
  July 12, 2023 at 7:47 am
  
  OSTree only does chunked deduplication in the static deltas, which is an over-the-air feature that minimizes download times. On disk it is always a content-addressed store that does de-duplication on a per-file level. So, the composefs version of ostree will not regress. In fact, you can use an existing ostree repo on disk with the composefs images, as they are identical.
  
  If you want a flat file I recommend using a dm-verity validated loopback-mounted erofs image. That is what e.g. systemd extensions use. Tar is a horrible format and should not be used for anything. However, the entire point of composefs is lost if you go to a flat format, so I don’t understand why you expected that?
  
Alex says:

July 11, 2023 at 9:57 pm

Terrific work, kudos!

> such that files identical in the host OS and a container image are stored only once

This makes me wonder, is there any chance we can have the same files shared between host and Flatpak runtimes/apps? The fact that I need so many things, like Qt, duplicated between the host and a Flatpak runtime always gave me a sense of imperfection.

1. alexl says:
  
  July 12, 2023 at 7:50 am
  
  Endless OS initially had the system ostree repo merged with flatpak in order to do this. However, the system ostree repo is different (bare vs bare-user format), so this didn’t quite work out. These days they use a regular separated repo.
  
  However, with composefs, it should be possible to use a bare-user repo for the system repos, and then this may work again.