Recently I’ve been looking at what options there are for OSTree-based systems to be fully cryptographically sealed, similar to dm-verity. I really like the efficiency and flexibility of the OSTree storage model, but it currently has this one weakness compared to image-based systems. See for example the FAQ in Lennart’s recent blog post about image-based OSes for a discussion of this.
This blog post is about fixing this weakness, but let’s start by explaining the problem.
An OSTree boot works by encoding in the kernel command line the rootfs to use, like this:
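For example, a command line might look something like this (an illustrative sketch: the OS name and the checksum are placeholders, not real values):

```
root=UUID=... rw ostree=/ostree/boot.1/fedora/<sha256-hex-checksum>/0
```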
Early on in the boot some code runs that reads this and mounts this directory (called the deployment) as the root filesystem. If you look at this you can see a long hex string. This is actually a sha256 digest from the signed ostree commit, which covers all the data in the directory. At any time you can use this to verify that the deployment is correct, and ostree does so when downloading and deploying. However, once the deployment has been written to disk, it is not verified again, as doing so is expensive.
In contrast, image-based systems using dm-verity compute the entire filesystem image on the server, checksum it with a hash-tree (that allows incremental verification) and sign the result. This allows the kernel to validate every single read operation and detect changes. However, we would like to use the filesystem to store our content, as it is more efficient and flexible.
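The hash-tree idea that makes incremental verification possible can be sketched in a few lines of Python. This is an illustrative toy, not the dm-verity on-disk format: each fixed-size block is hashed independently, so any single block can be checked on read without rereading the whole image.

```python
import hashlib

BLOCK_SIZE = 4096

def block_hashes(data: bytes) -> list[bytes]:
    """Hash every fixed-size block of the image independently."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, max(len(data), 1), BLOCK_SIZE)]

def root_hash(data: bytes) -> bytes:
    """A single root digest over all the block hashes (a two-level tree).
    This is the value that would be signed on the server."""
    return hashlib.sha256(b"".join(block_hashes(data))).digest()

def verify_block(data: bytes, index: int, trusted: list[bytes]) -> bool:
    """Verify one block against the trusted hashes, without reading the rest."""
    block = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    return hashlib.sha256(block).digest() == trusted[index]

image = b"A" * 10000
hashes = block_hashes(image)
assert verify_block(image, 1, hashes)         # untouched block verifies
tampered = image[:5000] + b"X" + image[5001:]
assert not verify_block(tampered, 1, hashes)  # the modified block is detected
assert verify_block(tampered, 0, hashes)      # other blocks still pass
```

The real dm-verity tree has multiple levels so that even the list of block hashes need not be read in full, but the principle is the same: one signed root transitively covers every block.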
Luckily, there is something called fs-verity that we can use. It is a checksum mechanism similar to dm-verity, but it works on file contents instead of partition content. Enabling fs-verity on a file makes it immutable and computes a hash-tree for it. From that point on any read from the file will return an error if a change was detected.
fs-verity is a good match for OSTree, since all files in the repo are immutable by design. OSTree has supported fs-verity for some time: when it is enabled, files get fs-verity applied as they are added to the repo. This then propagates to the files in the deployment.
Isn’t this enough then? The files in the root fs are immutable and verified by the kernel.
Unfortunately no. fs-verity only verifies the file content, not the file or directory metadata. This means that a change there will not be detected. For example, it’s possible to change permissions on a file, add a file, remove a file or even replace a file in the deploy directories. Hardly immutable…
What we would like is to use fs-verity to also seal the filesystem metadata.
Composefs is a Linux filesystem that Giuseppe Scrivano and I have been working on, initially with a goal of allowing deduplication for container image storage. But, with some of the recent changes it is also useful for the OSTree usecase.
The basic idea of composefs is that we have a set of content files and then want to create directory trees with files based on them. The way OSTree does this today is to create an actual directory tree with hardlinks to the repo files. Unfortunately this has certain limitations. For example, the hardlinks share metadata like mtime and permissions, and if these differ we can’t share the content file. It also suffers from not being an immutable representation.
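The hardlink limitation is easy to demonstrate: links to the same inode share one set of metadata, so a permission change through one name is visible through all of them. A small Python illustration (the file names here are made up):

```python
import os, stat, tempfile

# Two hardlinked names share one inode, and therefore one set of metadata.
d = tempfile.mkdtemp()
a = os.path.join(d, "repo-object")     # stands in for the repo file
b = os.path.join(d, "checkout-file")   # stands in for the deployment file

with open(a, "w") as f:
    f.write("content")
os.link(a, b)        # hardlink: same inode, same metadata

os.chmod(a, 0o600)   # change permissions through one name...
mode_b = stat.S_IMODE(os.stat(b).st_mode)
assert mode_b == 0o600   # ...and the other name sees the change too
print(oct(mode_b))       # prints 0o600
```

So two deployments that need the same content with different permissions cannot share one repo object via hardlinks.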
So, instead of creating such a directory, we create a “composefs image”, which is a binary blob that contains all the metadata for the directory (names, structure, permissions, etc) as well as pathnames to the files that have the actual file contents. This can then be mounted wherever you want.
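Conceptually the image is just “metadata here, content by reference”. A toy sketch of that split (JSON stands in for the real composefs binary format, and plain sha256 for the content reference):

```python
import hashlib, json, os

def build_image(root: str) -> str:
    """Walk a directory and record metadata plus a content reference per
    file. A toy, JSON-based stand-in for a composefs-style descriptor."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic walk order -> reproducible output
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            entry = {
                "path": os.path.relpath(path, root),
                "mode": st.st_mode,
                "uid": st.st_uid,
                "gid": st.st_gid,
                "mtime": 0,  # fixed timestamp keeps the image reproducible
            }
            if name in filenames:
                with open(path, "rb") as f:
                    # reference the content by digest instead of embedding it
                    entry["content"] = hashlib.sha256(f.read()).hexdigest()
            entries.append(entry)
    return json.dumps(entries, sort_keys=True)
```

Because the walk order and timestamps are fixed, building the same tree twice yields byte-identical output; that reproducibility matters later, when the client rebuilds the image the server signed.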
This is very simple to use:
```
# tree rootfs
rootfs
├── file-a
└── file-b
# cat rootfs/file-a
file-a
# mkcomposefs rootfs rootfs.img
# ls -l rootfs.img
-rw-r--r--. 1 root root 272 Jun  2 14:17 rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=rootfs mnt
```
At this point the mnt directory is now a frozen version of the rootfs directory. It will not pick up changes to the original directory metadata:
```
# ls mnt/
file-a  file-b
# rm mnt/file-a
rm: cannot remove 'mnt/file-a': Read-only file system
# echo changed > mnt/file-a
bash: mnt/file-a: Read-only file system
# touch rootfs/new-file
# ls rootfs mnt/
mnt/:
file-a  file-b

rootfs:
file-a  file-b  new-file
```
However, it is still using the original files for content (via the basedir= option), and these can be changed:
```
# cat mnt/file-a
file-a
# echo changed > rootfs/file-a
# cat mnt/file-a
changed
```
To fix this we enable the use of fs-verity, by passing the --compute-digest option to mkcomposefs:

```
# mkcomposefs --compute-digest rootfs rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=rootfs mnt
```
Now the image will have the fs-verity digests recorded and the kernel will verify these:
```
# cat mnt/file-a
cat: mnt/file-a: Input/output error

WARNING: composefs backing file 'file-a' unexpectedly had no fs-verity digest
```
Oops, turns out we didn’t actually use fs-verity on that file. Let’s remedy that:
```
# fsverity enable rootfs/file-a
# cat mnt/file-a
changed
```
We can now try to change the backing file (although fs-verity makes it immutable, so we can only completely replace it). This will fail even if we enable fs-verity on the new file:
```
# echo try-change > rootfs/file-a
bash: rootfs/file-a: Operation not permitted
# rm rootfs/file-a
# echo try-change > rootfs/file-a
# cat mnt/file-a
cat: mnt/file-a: Input/output error

WARNING: composefs backing file 'file-a' unexpectedly had no fs-verity digest
# fsverity enable rootfs/file-a
# cat mnt/file-a
cat: mnt/file-a: Input/output error

WARNING: composefs backing file 'file-a' has the wrong fs-verity digest
```
In practice, you’re likely to use composefs with a content-addressed store rather than the original directory hierarchy, and mkcomposefs has some support for this:
```
# mkcomposefs --digest-store=content rootfs rootfs.img
# tree content/
content/
├── 0f
│   └── e37b4a7a9e7aea14f0254f7bf4ba3c9570a739254c317eb260878d73cdcbbc
└── 76
    └── 6fad6dd44cbb3201bd7ebf8f152fecbd5b0102f253d823e70c78e780e6185d
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=content mnt
# cat mnt/file-b
file-b
```
As you can see it automatically copied the content files into the store named by the fs-verity digest and enabled fs-verity on all the content files.
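The store layout is the familiar content-addressed scheme: the digest names the file, with the first two hex characters split off as a fan-out subdirectory. A sketch of what such a store does (plain sha256 stands in for the fs-verity digest the real tool uses):

```python
import hashlib, os, shutil, tempfile

def store_path(store: str, digest: str) -> str:
    """First two hex characters become a fan-out subdirectory."""
    return os.path.join(store, digest[:2], digest[2:])

def add_to_store(store: str, path: str) -> str:
    """Copy a file into the store under its content digest."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    dest = store_path(store, digest)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    if not os.path.exists(dest):  # identical content is stored only once
        shutil.copyfile(path, dest)
    return digest

store = tempfile.mkdtemp()
src = tempfile.mkdtemp()
with open(os.path.join(src, "file-b"), "w") as f:
    f.write("file-b\n")
digest = add_to_store(store, os.path.join(src, "file-b"))
print(store_path("content", digest))  # e.g. content/<2 hex chars>/<62 hex chars>
```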
Is this enough now? Unfortunately no. We can still modify the rootfs.img file, which will affect the metadata of the filesystem. But this is easy to solve by using fs-verity on the image file itself:
```
# fsverity enable rootfs.img
# fsverity measure rootfs.img
sha256:b92d94aa44d1e0a174a0c4492778b59171703903e493d1016d90a2b38edb1a21 rootfs.img
# mount composefs -t composefs -o descriptor=rootfs.img,basedir=content,digest=b92d94aa44d1e0a174a0c4492778b59171703903e493d1016d90a2b38edb1a21 mnt
```
Here we passed the digest of the rootfs.img file to the mount command, which makes composefs verify that the image matches what was expected.
Back to OSTree
That was a long detour into composefs. But how does OSTree use this?
The idea is that instead of checking out a hardlinked directory and passing that on the kernel commandline we build a composefs image, enable fs-verity on it and put its filename and digest on the kernel command line instead.
For additional trust, we also generate the composefs image on the server when building the ostree commit. Then we add the digest of that image to the commit metadata before signing it. Since building the composefs image is fully reproducible, we will get the exact same composefs image on the client and can validate it against the signed digest before using it.
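The client-side check can be sketched like this (the helper and metadata key names are hypothetical; in reality the signed digest comes from verified OSTree commit metadata, and the image digest is the fs-verity digest rather than plain sha256):

```python
import hashlib

def image_digest(image_bytes: bytes) -> str:
    # stand-in for the fs-verity digest of the composefs image file
    return hashlib.sha256(image_bytes).hexdigest()

def ok_to_deploy(commit_metadata: dict, locally_built_image: bytes) -> bool:
    """Only use the locally built composefs image if it reproduces the
    digest that was recorded and signed on the server."""
    expected = commit_metadata["composefs.digest"]  # hypothetical key name
    return image_digest(locally_built_image) == expected

# The build is reproducible, so the client's image matches the server's:
server_image = b"...composefs image bytes..."
metadata = {"composefs.digest": image_digest(server_image)}
assert ok_to_deploy(metadata, server_image)
assert not ok_to_deploy(metadata, b"tampered image")  # anything else is rejected
```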
This has been a long post, but now we are at the very end, and we have a system where every bit read from the “root filesystem” is continuously verified against a signed digest passed on the kernel command line. Much like dm-verity, but much more flexible.
The Containers usecase
As I mentioned before, composefs was originally made for a different usecase, namely container image storage. The goal there is that as we unpack container image layers we can drop the content files into a shared directory, and then generate composefs images for the container images themselves.
This way identical files between any two installed images will be shared on the local machine. The sharing is both on disk and in memory (i.e. in the page cache), which allows higher density on your cluster and smaller memory requirements on your edge nodes.
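A toy model of that sharing (dicts stand in for the shared store and the per-image composefs metadata; the paths and contents are made up):

```python
import hashlib

def unpack_layer(layer: dict, store: dict) -> dict:
    """Drop each file's content into a shared, digest-keyed store and keep
    only a digest reference in the per-image metadata (a toy model)."""
    image = {}
    for path, content in layer.items():
        digest = hashlib.sha256(content).hexdigest()
        store[digest] = content   # identical content is stored only once
        image[path] = digest      # the image just references it by digest
    return image

store = {}
image_a = unpack_layer({"/bin/sh": b"shell", "/etc/os-release": b"A"}, store)
image_b = unpack_layer({"/bin/sh": b"shell", "/etc/os-release": b"B"}, store)

# /bin/sh is byte-identical in both images, so both reference one store entry:
assert image_a["/bin/sh"] == image_b["/bin/sh"]
assert len(store) == 3  # two unique os-release files, one shared /bin/sh
```

On a real system the shared store entry is one inode, so the kernel also keeps only one copy of it in the page cache no matter how many images use it.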