Recently I’ve been looking at what options there are for OSTree-based systems to be fully cryptographically sealed, similar to dm-verity. I really like the efficiency and flexibility of the OSTree storage model, but it currently has this one weakness compared to image-based systems. See for example the FAQ in Lennart’s recent blog post about image-based OSes for a discussion of this.
This blog post is about fixing this weakness, but let’s start by explaining the problem.
An OSTree boot works by encoding in the kernel command line the rootfs to use, like this:
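For example, a command line might look something like this (an illustrative sketch: the OS name and the checksum are placeholders, not real values):

```
root=UUID=... rw ostree=/ostree/boot.1/fedora/<sha256-hex-checksum>/0
```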
Early on in the boot some code runs that reads this and mounts this directory (called the deployment) as the root filesystem. If you look at this you can see a long hex string. This is actually a sha256 digest from the signed ostree commit, which covers all the data in the directory. At any time you can use this to verify that the deployment is correct, and ostree does so when downloading and deploying. However, once the deployment has been written to disk, it is not verified again, as doing so is expensive.
In contrast, image-based systems using dm-verity compute the entire filesystem image on the server, checksum it with a hash-tree (that allows incremental verification) and sign the result. This allows the kernel to validate every single read operation and detect changes. However, we would like to use the filesystem to store our content, as it is more efficient and flexible.
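The hash-tree idea that makes incremental verification possible can be sketched in a few lines of Python. This is an illustrative toy, not the dm-verity on-disk format: each fixed-size block is hashed independently, so any single block can be checked on read without rereading the whole image.

```python
import hashlib

BLOCK_SIZE = 4096

def block_hashes(data: bytes) -> list[bytes]:
    """Hash every fixed-size block of the image independently."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, max(len(data), 1), BLOCK_SIZE)]

def root_hash(data: bytes) -> bytes:
    """A single root digest over all the block hashes (a two-level tree).
    This is the value that would be signed on the server."""
    return hashlib.sha256(b"".join(block_hashes(data))).digest()

def verify_block(data: bytes, index: int, trusted: list[bytes]) -> bool:
    """Verify one block against the trusted hashes, without reading the rest."""
    block = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    return hashlib.sha256(block).digest() == trusted[index]

image = b"A" * 10000
hashes = block_hashes(image)
assert verify_block(image, 1, hashes)         # untouched block verifies
tampered = image[:5000] + b"X" + image[5001:]
assert not verify_block(tampered, 1, hashes)  # the modified block is detected
assert verify_block(tampered, 0, hashes)      # other blocks still pass
```

The real dm-verity tree has multiple levels so that even the list of block hashes need not be read in full, but the principle is the same: one signed root transitively covers every block.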
Luckily, there is something called fs-verity that we can use. It is a checksum mechanism similar to dm-verity, but it works on file contents instead of partition content. Enabling fs-verity on a file makes it immutable and computes a hash-tree for it. From that point on any read from the file will return an error if a change was detected.
fs-verity is a good match for OSTree, since all files in the repo are immutable by design. OSTree has supported fs-verity for some time: when it is enabled, files get fs-verity applied as they are added to the repo. This then propagates to the files in the deployment.
Isn’t this enough then? The files in the root fs are immutable and verified by the kernel.
Unfortunately no. fs-verity only verifies the file content, not the file or directory metadata. This means that a change there will not be detected. For example, it’s possible to change permissions on a file, add a file, remove a file or even replace a file in the deploy directories. Hardly immutable…
What we would like is to use fs-verity to also seal the filesystem metadata.
Composefs is a Linux filesystem that Giuseppe Scrivano and I have been working on, initially with a goal of allowing deduplication for container image storage. But, with some of the recent changes it is also useful for the OSTree usecase.
The basic idea of composefs is that we have a set of content files and then want to create directory trees with files based on them. The way OSTree does this today is to create an actual directory tree with hardlinks to the repo files. Unfortunately this has certain limitations. For example, the hardlinks share metadata like mtime and permissions, and if these differ we can’t share the content file. It also suffers from not being an immutable representation.
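The hardlink limitation is easy to demonstrate: links to the same inode share one set of metadata, so a permission change through one name is visible through all of them. A small Python illustration (the file names here are made up):

```python
import os, stat, tempfile

# Two hardlinked names share one inode, and therefore one set of metadata.
d = tempfile.mkdtemp()
a = os.path.join(d, "repo-object")     # stands in for the repo file
b = os.path.join(d, "checkout-file")   # stands in for the deployment file

with open(a, "w") as f:
    f.write("content")
os.link(a, b)        # hardlink: same inode, same metadata

os.chmod(a, 0o600)   # change permissions through one name...
mode_b = stat.S_IMODE(os.stat(b).st_mode)
assert mode_b == 0o600   # ...and the other name sees the change too
print(oct(mode_b))       # prints 0o600
```

So two deployments that need the same content with different permissions cannot share one repo object via hardlinks.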
So, instead of creating such a directory, we create a “composefs image”, which is a binary blob that contains all the metadata for the directory (names, structure, permissions, etc) as well as pathnames to the files that have the actual file contents. This can then be mounted wherever you want.
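Conceptually the image is just “metadata here, content by reference”. A toy sketch of that split (JSON stands in for the real composefs binary format, and plain sha256 for the content reference):

```python
import hashlib, json, os

def build_image(root: str) -> str:
    """Walk a directory and record metadata plus a content reference per
    file. A toy, JSON-based stand-in for a composefs-style descriptor."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic walk order -> reproducible output
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            entry = {
                "path": os.path.relpath(path, root),
                "mode": st.st_mode,
                "uid": st.st_uid,
                "gid": st.st_gid,
                "mtime": 0,  # fixed timestamp keeps the image reproducible
            }
            if name in filenames:
                with open(path, "rb") as f:
                    # reference the content by digest instead of embedding it
                    entry["content"] = hashlib.sha256(f.read()).hexdigest()
            entries.append(entry)
    return json.dumps(entries, sort_keys=True)
```

Because the walk order and timestamps are fixed, building the same tree twice yields byte-identical output; that reproducibility matters later, when the client rebuilds the image the server signed.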
This is very simple to use:
```
# tree rootfs
rootfs
├── file-a
└── file-b
# cat rootfs/file-a
file-a
# mkcomposefs rootfs rootfs.img
# ls -l rootfs.img
-rw-r--r--. 1 root root 272 Jun  2 14:17 rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=rootfs mnt
```
At this point the mnt directory is now a frozen version of the rootfs directory. It will not pick up changes to the original directory metadata:
```
# ls mnt/
file-a  file-b
# rm mnt/file-a
rm: cannot remove 'mnt/file-a': Read-only file system
# echo changed > mnt/file-a
bash: mnt/file-a: Read-only file system
# touch rootfs/new-file
# ls rootfs mnt/
mnt/:
file-a  file-b

rootfs:
file-a  file-b  new-file
```
However, it is still using the original files for content (via the basedir= option), and these can be changed:
```
# cat mnt/file-a
file-a
# echo changed > rootfs/file-a
# cat mnt/file-a
changed
```
To fix this we enable the use of fs-verity, by passing the --compute-digest option to mkcomposefs:

```
# mkcomposefs --compute-digest rootfs rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=rootfs mnt
```
Now the image will have the fs-verity digests recorded and the kernel will verify these:
```
# cat mnt/file-a
cat: mnt/file-a: Input/output error

WARNING: composefs backing file 'file-a' unexpectedly had no fs-verity digest
```
Oops, turns out we didn’t actually use fs-verity on that file. Let’s remedy that:
```
# fsverity enable rootfs/file-a
# cat mnt/file-a
changed
```
We can now try to change the backing file (although fs-verity makes it immutable, so we can only completely replace it). This will fail even if we enable fs-verity on the new file:
```
# echo try-change > rootfs/file-a
bash: rootfs/file-a: Operation not permitted
# rm rootfs/file-a
# echo try-change > rootfs/file-a
# cat mnt/file-a
cat: mnt/file-a: Input/output error

WARNING: composefs backing file 'file-a' unexpectedly had no fs-verity digest
# fsverity enable rootfs/file-a
# cat mnt/file-a
cat: mnt/file-a: Input/output error

WARNING: composefs backing file 'file-a' has the wrong fs-verity digest
```
In practice, you’re likely to use composefs with a content-addressed store rather than the original directory hierarchy, and mkcomposefs has some support for this:
```
# mkcomposefs --digest-store=content rootfs rootfs.img
# tree content/
content/
├── 0f
│   └── e37b4a7a9e7aea14f0254f7bf4ba3c9570a739254c317eb260878d73cdcbbc
└── 76
    └── 6fad6dd44cbb3201bd7ebf8f152fecbd5b0102f253d823e70c78e780e6185d
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=content mnt
# cat mnt/file-b
file-b
```
As you can see it automatically copied the content files into the store named by the fs-verity digest and enabled fs-verity on all the content files.
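The store layout is the familiar content-addressed scheme: the digest names the file, with the first two hex characters split off as a fan-out subdirectory. A sketch of what such a store does (plain sha256 stands in for the fs-verity digest the real tool uses):

```python
import hashlib, os, shutil, tempfile

def store_path(store: str, digest: str) -> str:
    """First two hex characters become a fan-out subdirectory."""
    return os.path.join(store, digest[:2], digest[2:])

def add_to_store(store: str, path: str) -> str:
    """Copy a file into the store under its content digest."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    dest = store_path(store, digest)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):  # identical content is stored only once
        shutil.copyfile(path, dest)
    return digest

store = tempfile.mkdtemp()
src = tempfile.mkdtemp()
with open(os.path.join(src, "file-b"), "w") as f:
    f.write("file-b\n")
digest = add_to_store(store, os.path.join(src, "file-b"))
print(store_path("content", digest))  # e.g. content/<2 hex chars>/<62 hex chars>
```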
Is this enough now? Unfortunately no. We can still modify the rootfs.img file, which will affect the metadata of the filesystem. But this is easy to solve by using fs-verity on the image file itself:
```
# fsverity enable rootfs.img
# fsverity measure rootfs.img
sha256:b92d94aa44d1e0a174a0c4492778b59171703903e493d1016d90a2b38edb1a21 rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=content,digest=b92d94aa44d1e0a174a0c4492778b59171703903e493d1016d90a2b38edb1a21 mnt
```
Here we passed the digest of the rootfs.img file to the mount command, which makes composefs verify that the image matches what was expected.
Back to OSTree
That was a long detour into composefs. But how does OSTree use this?
The idea is that instead of checking out a hardlinked directory and passing that on the kernel commandline we build a composefs image, enable fs-verity on it and put its filename and digest on the kernel command line instead.
For additional trust, we also generate the composefs image on the server when building the ostree commit. Then we add the digest of that image to the commit metadata before signing it. Since building the composefs image is fully reproducible, we will get the exact same composefs image on the client and can validate it against the signed digest before using it.
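The client-side check can be sketched like this (the helper and metadata key names are hypothetical; in reality the signed digest comes from verified OSTree commit metadata, and the image digest is the fs-verity digest rather than plain sha256):

```python
import hashlib

def image_digest(image_bytes: bytes) -> str:
    # stand-in for the fs-verity digest of the composefs image file
    return hashlib.sha256(image_bytes).hexdigest()

def ok_to_deploy(commit_metadata: dict, locally_built_image: bytes) -> bool:
    """Only use the locally built composefs image if it reproduces the
    digest that was recorded and signed on the server."""
    expected = commit_metadata["composefs.digest"]  # hypothetical key name
    return image_digest(locally_built_image) == expected

# The build is reproducible, so the client's image matches the server's:
server_image = b"...composefs image bytes..."
metadata = {"composefs.digest": image_digest(server_image)}
assert ok_to_deploy(metadata, server_image)
assert not ok_to_deploy(metadata, b"tampered image")  # anything else is rejected
```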
This has been a long post, but now we are at the very end, and we have a system where every bit read from the “root filesystem” is continuously verified against a signed digest passed on the kernel command line. Much like dm-verity, but much more flexible.
The Containers usecase
As I mentioned before, composefs was originally made for a different usecase, namely container image storage. The goal there is that as we unpack container image layers we can drop the content files into a shared directory, and then generate composefs images for the container images themselves.
This way identical files between any two installed images will be shared on the local machine. The sharing is both on disk and in memory (i.e. in the page cache), which allows higher density on your cluster and smaller memory requirements on your edge nodes.
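A toy model of that sharing (dicts stand in for the shared store and the per-image composefs metadata; the paths and contents are made up):

```python
import hashlib

def unpack_layer(layer: dict, store: dict) -> dict:
    """Drop each file's content into a shared, digest-keyed store and keep
    only a digest reference in the per-image metadata (a toy model)."""
    image = {}
    for path, content in layer.items():
        digest = hashlib.sha256(content).hexdigest()
        store[digest] = content   # identical content is stored only once
        image[path] = digest      # the image just references it by digest
    return image

store = {}
image_a = unpack_layer({"/bin/sh": b"shell", "/etc/os-release": b"A"}, store)
image_b = unpack_layer({"/bin/sh": b"shell", "/etc/os-release": b"B"}, store)

# /bin/sh is byte-identical in both images, so both reference one store entry:
assert image_a["/bin/sh"] == image_b["/bin/sh"]
assert len(store) == 3  # two unique os-release files, one shared /bin/sh
```

On a real system the shared store entry is one inode, so the kernel also keeps only one copy of it in the page cache no matter how many images use it.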