composefs – Alexander Larsson

Testing composefs in Silverblue

As of the version 39 of Fedora Silverblue all the basic code is merged to support a composefs-based root filesystem.

To try it, do:

- - Update to the latest version (I tested 39.20240115.0)
  - Configure ostree to create and use composefs images:
    $ sudo ostree config set ex-integrity.composefs yes
  - Trigger a manual (re)deploy of the current version:
    $ sudo ostree admin deploy fedora/39/x86_64/silverblue
  - Reboot into the new deploy
  - If using ext4 filesystem for rootfs (not needed for btrfs), enable “verity” feature on it:
    $ sudo tune2fs -O verity /dev/vda3 # Change to right root disk
  - Enable fs-verity on all pre-existing ostree repo files:
    $ sudo ostree admin post-copy

At this point, the rootfs should be be a composefs mount. You can verify it by looking at the mount, which should look like this:

$ findmnt /
TARGET SOURCE  FSTYPE  OPTIONS
/ overlay overlay ro,relatime,seclabel,lowerdir=/run/ostree/.private/cfsroot-lower::/sysroot/ostree/repo/objects,redirect_dir=on,metacopy=on

So, what does this mean?

First of all, it means the rootfs is truly read-only:

# touch /usr/new_file
touch: cannot touch '/usr/new_file': Read-only file system

The above error message happens also with regular ostree, but in that case it is only a read-only mount flag, and a root user can re-mount it read-write to modify it (or modify the backing directories in /ostree). However, when using composefs, the root filesystem is a combination of a erofs mount (from /ostree/deploy/fedora/deploy/*/.ostree.cfs) and an overlayfs with no writable directories, and neither of these have any ability to write to disk.

In addition, the system is set up to validate all file accesses, as the composefs image has recorded the expected fs-verity checksums for all files and overlayfs can validate them on use.

To fully complete the validation, Silverblue will just need a few additions (which I hope will be done soon):

Each build should generate a one-use signature keypair
The ostree commit should be signed with the private key
Add public key as /etc/ostree/initramfs-root-binding.key
Add /usr/lib/ostree/prepare-root.conf with this content:
```
[composefs]
enabled=yes
signed=yes
```
These files will be copied into the initrd, and during boot the public key will be used to validate the composefs image, which in turn guarantee that all file accesses give the correct, unchanged data.

To further improve security, the initramfs and the kernel can be combined into a Unified Kernel Image and signed. Then SecureBoot can guarantee that your system will not boot any other initramfs, and thus no other userspace.

Announcing composefs 1.0

As of Linux 6.6-rc1, which contains the overlayfs fs-verity support, all the kernel changes that was required for composefs are upstream. This allows us to finalize the composefs image format and give guarantees of its future stability.

This means that we are happy to welcome Composefs 1.0 to the world!

The main feature of 1.0 is the stability of the file format and the library API, however, there are a few new major features in 1.0:

Various tweaks make the image format more efficient.
The library and the tools now has the ability to inspect composefs image files. This includes listing what basedir object files they refer to which makes it easy to figure out what objects are missing (and has to be downloaded).
The use of the built-in kernel fs-verity signature verification has been dropped on recommendation from the fs-verity maintainer. Instead we recommended to use userspace libraries to verify fs-verity digests.

For more details and download links, see the release notes. For a short introduction to composefs, see this earlier blog entry.

There is also ongoing work in the wider community to use composefs:

Ostree 2023.6 and rpm-ostree 2023.6 together allow for end-to-end signed and validated composefs ostree deployments. The code is still marked experimental and composefs needs to be enabled manually on the host, but the feature is compiled in and available by default.

containers/storage contains initial work on supporting composefs in the overlayfs backend. Once this is finalized and used in podman, it will be possible to use the cross-image de-duplication and tamper-proofing features of composefs for all podman containers. This will lead to improved container density and security.

Composefs state of the union

I can’t belive its been more than a year since my last composefs blog. So much has changed, yet the goal is the same. However, finally things are starting to settling down, so I think it is time to do an update on the current state.

Background

First some background, like what even is Composefs?

The initial version of Composefs was an actual linux kernel filesystem module. But during development and upstream discussions it has become clear that a better approach is to use existing kernel features, with some minor feature additions rather than a completely new filesystem. However, from a user perspective, it still looks like a filesystem.

Composefs is not a regular filesystem like ext4 though. It is not meant to be stored on a disk partition, but instead as a file that gets downloaded and mounted. The typical usecase is that you have some directory structure, call it an “image”, and you want to package up this image so it can easily be used elsewhere without needing to unpack it. For example, this could be a container image, or a rootfs image.

The closest thing that we have today is a loopback mount of a read-only filesystem (such as squashfs or erofs). However, composefs has two big advantages over this: file sharing and integrity validation.

A composefs image is made up of two things, the filesystem image and a directory of backing files. For example, suppose you have this directory:

$ tree example-dir/
 example-dir/
  ├── data.txt
  └── subdir
      └── other.txt

With the mkcomposefs tool you can create an image from this:

$ mkcomposefs --digest-store=objects /the/example-dir example.cfs
$ tree
  ├── example.cfs
  └── objects
       ├── 9e
       │   └── 3ba51c3a07352a7c32a27a17604d8118724a16...
       └── a2
           └── 447bfab34972328b1f67e3d06e82c990c64f12...

The example.cfs image has all the information about the files and directories, like filenames, permissions, mtimes, xattrs, etc. However, the actual file data is stored in the individual backing files in the objects subdirectory, and are only accessed when needed (i.e. when respective file is opened).

We can mount it like this (using the mount.composefs helper):

$ sudo mount -t composefs -o basedir=objects example.cfs /mnt
$ ls -l /mnt/
-rw-r--r--. 1 alex alex 18 11 jul 14.25 data.txt
drwxr-xr-x. 2 alex alex 48 11 jul 14.26 subdir

Note that the backing files are named by the checksum of their content. This means that if you create multiple images with a shared objects directory, then the same backing file will be used for any file that is shared between images.

Not only does this mean that images that share files can be stored more efficiently on disk, it also means that any such shared files will be stored only once in page-cache (i.e. ram). A container system using this would allow more containers to run on the same hardware, because libraries shared between unrelated images can be shared.

Additionally, composefs supports using fs-verity for both the image file, and all the backing files. This means that if you specify the expected digest when you mount the composefs image, it will be validated before use. Then the image itself contains the expected fs-verity digests of the backing files, and these will be also be verified at use time. Using this we get the integrity benefit of something like dm-verity, while still allowing fine-grained disk and memory sharing.

Composefs status

Composefs itself consists of a few pieces:

Userspace support, including mkcomposefs and mount.composefs
The image format, which is based on an erofs image with overlayfs xattrs
New overlayfs kernel feature for supporting the composefs usecase

The userspace is available at https://github.com/containers/composefs and is now in a pretty good state. The current releases are marked pre-release, because we don’t fully want to mark it stable until all the overlayfs changes are in a kernel release so we can fully rely on the format being long-term stable.

On the erofs side, all the features we need are in kernel 5.15 or later.

For overlayfs there are two features the we rely on, the new “data-only” lower directory support, which has landed in 6.5-rc1, and the support for storing fs-verity digests in the overlay.metacopy xattr, which is queued for 6.6 in the overlayfs-next branch. However, these kernel changes are only needed for the integrity checking, so if you don’t need those, then current kernels work.

OSTree status

One of the main usecases for composefs is OSTree. Initial work for composefs integration landed in OSTree 2023.4, and further work is ongoing to support validation of composefs images using ed25519 signatures. This will allow secure boot to extend the chain of trust from the kernel/initrd to the whole of userspace.

This integration with OSTree is intentionally done with minimal changes, because the goal is to allow current users of OSTree to easily benefit from the advantages of composefs. However, in parallel there are long term plans to radically redo and simplify some parts of OSTree on top of composefs. See this upstream issue for more discussions about this.

Container backend

Another important usecase for composefs has always been OCI containers. Recently the initial work on this landed in containers/storage, which is the first step in making podman use composefs for images, thus allowing higher density and integrity for container images.

Going forward

These are only the initial steps for composefs. I think the future holds even more interesting ideas. For example, given that both ostree and podman are converging on the same storage model, there is hope that we will be able to share backing files between ostree and podman, such that files identical in the host OS and a container image are stored only once.

We also hope that as composefs starts being deployed more widely people will come up with new exciting use cases.

Using Composefs in OSTree

Recently I’ve been looking at what options there are for OSTree based systems to be fully cryptographically sealed, similar to dm-verity. I really like the efficiency and flexibility of the ostree storage model, but it currently has this one weakness compared to image-based systems. See for example the FAQ in Lennarts recent blog about image-based OSes for a discussions of this.

This blog post is about fixing this weakness, but lets start by explaining the problem.

An OSTree boot works by encoding in the kernel command-line the rootfs to use, like this:

ostree=/ostree/boot.1/centos/993c66dedfed0682bc9471ade483e2f57cc143cba1b7db0f6606aef1a45df669/0

Early on in the boot some code runs that reads this and mount this directory (called the deployment) as the root filesystem. If you look at this you can see a long hex string. This is actually a sha256 digest from the signed ostree commit, which covers all the data in the directory. At any time you can use this to verify that the deployment is correct, and ostree does so when downloading and deploying. However, once the deployment has been written to disk, it is not verified again, as doing so is expensive.

In contrast, image-based systems using dm-verity compute the entire filesystem image on the server, checksum it with a hash-tree (that allows incremental verification) and sign the result. This allows the kernel to validate every single read operation and detect changes. However, we would like to use the filesystem to store our content, as it is more efficient and flexible.

Luckily, there is something called fs-verity that we can use. It is a checksum mechanism similar to dm-verity, but it works on file contents instead of partition content. Enabling fs-verity on a file makes it immutable and computes a hash-tree for it. From that point on any read from the file will return an error if a change was detected.

fs-verity is a good match for OSTree since all files in an the repo are immutable by design. Since some time ostree supportes fs-verity. When it is enabled the files in the repo get fs-verity enabled as they are added. This then propagates to the files in the deployment.

Isn’t this enough then? The files in the root fs are immutable and verified by the kernel.

Unfortunately no. fs-verity only verifies the file content, not the file or directory metadata. This means that a change there will not be detected. For example, its possible to change permissions on a file, add a file, remove a file or even replace a file in the deploy directories. Hardly immutable…

What we would like is to use fs-verity to also seal the filesystem metadata.

Enter composefs

Composefs is a Linux filesystem that Giuseppe Scrivano and I have been working on, initially with a goal of allowing deduplication for container image storage. But, with some of the recent changes it is also useful for the OSTree usecase.

The basic idea of composefs is that we have a set of content files and then we want to create directories with files based on it. The way ostree does this is to create an actual directory tree with hardlinks to the repo files. Unfortunately this has certain limitations. For example, the hardlinks share metadata like mtime and permission, and if these differ we can’t share the content file. It also suffer from not being an immutable representation.

So, instead of creating such a directory, we create a “composefs image”, which is a binary blob that contains all the metadata for the directory (names, structure, permissions, etc) as well as pathnames to the files that have the actual file contents. This can then be mounted wherever you want.

This is very simple to use:

# tree rootfs
rootfs
├── file-a
└── file-b
# cat rootfs/file-a
file-a
# mkcomposefs rootfs rootfs.img
# ls -l rootfs.img
-rw-r--r--. 1 root root 272 Jun 2 14:17 rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=rootfs mnt

At this point the mnt directory is now a frozen version of the rootfs directory. It will not pick up changes to the original directory metadata:

# ls mnt/
file-a file-b
# rm mnt/file-a
rm: cannot remove 'mnt/file-a': Read-only file system
# echo changed > mnt/file-a
bash: mnt/file-a: Read-only file system#
# touch rootfs/new-file
# ls rootfs mnt/
mnt/:
file-a file-b

rootfs:
file-a file-b new-file

However, it is still using the original files for content (via the basedir= option), and these can be changed:

# cat mnt/file-a
file-a
# echo changed > rootfs/file-a
# cat mnt/file-a
changed

To fix this we enable the use of fs-verity, by passing the --compute-digest option to mkcomposefs:

# mkcomposefs rootfs --compute-digest rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=rootfs mnt

Now the image will have the fs-verity digests recorded and the kernel will verify these:

# cat mnt/file-a
cat: mnt/file-a: Input/output error
WARNING: composefs backing file 'file-a' unexpectedly had no fs-verity digest

Oops, turns out we didn’t actually use fs-verity on that file. Lets remedy that:

# fsverity enable rootfs/file-a
# cat mnt/file-a
changed

We can now try to change the backing file (although fs-verity only lets us completely replace it). This will fail even if we enable fs verity on the new file:

# echo try-change > rootfs/file-a
bash: rootfs/file-a: Operation not permitted
# rm rootfs/file-a
# echo try-change > rootfs/file-a
# cat mnt/file-a
cat: mnt/file-a: Input/output error
WARNING: composefs backing file 'file-a' unexpectedly had no fs-verity digest
# fsverity enable rootfs/file-a
# cat mnt/file-a
cat: mnt/file-a: Input/output error
WARNING: composefs backing file 'file-a' has the wrong fs-verity digest

In practice, you’re likely to use composefs with a content-addressed store rather than the original directory hierarchy, and mkcomposefs has some support for this:

# mkcomposefs rootfs --digest-store=content rootfs.img
# tree content/
content/
├── 0f
│   └── e37b4a7a9e7aea14f0254f7bf4ba3c9570a739254c317eb260878d73cdcbbc
└── 76
└── 6fad6dd44cbb3201bd7ebf8f152fecbd5b0102f253d823e70c78e780e6185d
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=content mnt
# cat mnt/file-b
file-b

As you can see it automatically copied the content files into the store named by the fs-verity digest and enabled fs-verity on all the content files.

Is this enough now? Unfortunately no. We can still modify the rootfs.img file, which will affect the metadata of the filesystem. But this is easy to solve by using fs-verity on the actual image file:

# fsverity enable rootfs.img
# fsverity measure rootfs.img
sha256:b92d94aa44d1e0a174a0c4492778b59171703903e493d1016d90a2b38edb1a21 rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=content,digest=b92d94aa44d1e0a174a0c4492778b59171703903e493d1016d90a2b38edb1a21 mnt

Here we passed the digest of the rootfs.img file to the mount command, which makes composefs verify that the image matches what was expected.

Back to OSTree

That was a long detour into composefs. But how does OSTree use this?

The idea is that instead of checking out a hardlinked directory and passing that on the kernel commandline we build a composefs image, enable fs-verity on it and put its filename and digest on the kernel command line instead.

For additional trust, we also generate the composefs image on the server when building the ostree commit. Then we add the digest of that image to the commit metadata before signing it. Since building the composefs image is fully reproducible, we will get the exact same composefs image on the client and can validate it against the signed digest before using it.

This has been a long post, but now we are at the very end, and we have a system where every bit read from the “root filesystem” is continuously verified against a signed digest which is passed as a kernel command line. Much like dm-verity, but much more flexible.

The Containers usecase

As I mentioned before, composefs was originally made for a different usecase, namely container image storage. The goal there is that as we unpack container image layers we can drop the content files into a shared directory, and then generate composefs files for the image themselves.

This way identical files between any two installed images will be shared on the local machine. And the sharing would be both on disk and in memory (i.e. in the page cache), This will allow higher density on your cluster, and smaller memory requirements on your edge nodes.