Quadlet, an easier way to run system containers

Kubernetes and the like are an excellent way to run containers in the cloud. And for development and testing, manually running podman is very useful (although do check out toolbox). But sometimes you really want to run a system service using a container. This could be on your laptop, NUC, or maybe some kind of edge or embedded device. The container should automatically start at boot, restart on errors, etc.

The recommended way to do this is to run podman from a systemd service. A lot of work has gone into podman to make this work well (and it constantly improves), and there is lots of documentation around the internet on how to do this. Additionally, podman itself has some tools to help you get started (see podman generate systemd). But the end result of all of these is that you get a complex, hard-to-understand systemd unit file with a very long “podman run” command that you have to maintain.

There has to be a simpler way!
Enter quadlet.

Quadlet is a systemd generator that takes a container description and automatically generates a systemd service file from it. The container description is in the systemd unit file format and describes how you want to run the container (i.e. which image to use, which ports to expose, etc), as well as standard systemd options, like dependencies. However, it doesn’t need to bother with the technical details of how a container gets created or how it integrates with systemd, which makes the file much easier to understand and maintain.

This is most easily demonstrated with an example:

[Unit]
Description=Redis container

[Container]
Image=docker.io/redis
PublishPort=6379:6379
User=999

[Service]
Restart=always

[Install]
WantedBy=local.target

If you install the above in a file called /etc/containers/systemd/redis.container (or /usr/share/containers/systemd/redis.container) then, during boot (and at systemctl daemon-reload time), it is used to generate the file /run/systemd/generator/redis.service, which is then made available as a regular service.
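
Once the file is in place you manage the result like any other unit. A quick sketch of the standard systemctl workflow (the unit name simply follows from the file name):

$ sudo systemctl daemon-reload         # re-run the generators, picking up redis.container
$ sudo systemctl start redis.service   # start the generated service
$ sudo systemctl status redis.service  # inspect it like any other unit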

To get a feeling for this, the above container file generates the following service file:

# Automatically generated by quadlet-generator
[Unit]
Description=Redis container
RequiresMountsFor=%t/containers
SourcePath=/etc/containers/systemd/redis.container

[X-Container]
Image=docker.io/redis
PublishPort=6379:6379
User=999

[Service]
Restart=always
Environment=PODMAN_SYSTEMD_UNIT=%n
KillMode=mixed
ExecStartPre=-rm -f %t/%N.cid
ExecStopPost=-/usr/bin/podman rm -f -i --cidfile=%t/%N.cid
ExecStopPost=-rm -f %t/%N.cid
Delegate=yes
Type=notify
NotifyAccess=all
SyslogIdentifier=%N
ExecStart=/usr/bin/podman run --name=systemd-%N --cidfile=%t/%N.cid --replace --rm -d --log-driver journald --pull=never --runtime /usr/bin/crun --cgroups=split --tz=local --init --sdnotify=conmon --security-opt=no-new-privileges --cap-drop=all --mount type=tmpfs,tmpfs-size=512M,destination=/tmp --user 999 --uidmap 999:999:1 --uidmap 0:0:1 --uidmap 1:362144:998 --uidmap 1000:363142:64538 --gidmap 0:0:1 --gidmap 1:362144:65536 -p=6379:6379 docker.io/redis

[Install]
WantedBy=local.target

Once started it looks like a regular service:

● redis.service - Redis container
Loaded: loaded (/etc/containers/systemd/redis.container; generated)
Active: active (running) since Tue 2021-10-12 12:34:14; 1s ago
Main PID: 1559371 (conmon)
Tasks: 8 (limit: 38373)
Memory: 32.0M
CPU: 387ms
CGroup: /system.slice/redis.service
├─container
│ ├─1559375 /dev/init -- docker-entrypoint.sh redis-server
│ └─1559489 "redis-server *:6379"
└─supervisor
  └─1559371 /usr/bin/conmon --api-version 1 -c 24184463a9>

In practice you don’t need to care about the generated file; all you need to maintain is the container file. In fact, over time, as the podman/systemd integration improves, quadlet may generate slightly different files to take advantage of new features.

In addition to being easier to understand, quadlet comes with a set of defaults for how the container is run that better fit the use case of running system services. For example, it drops all capabilities by default, runs a basic init process in the container, uses the journald log driver, and sets up the cgroups in a mode that best matches what systemd needs.

Right now this is a separate project, but I’ve been in touch with the podman developers, and there are discussions about making this feature part of podman instead. But until then you can use it from github.com/containers/quadlet, and I have made a COPR build available for experimenting.

For more information see the docs linked from the README.

Scaling Flathub 100x

Flatpak relies on OSTree to distribute apps. This means that flatpak repositories, such as Flathub, are really just OSTree repositories. At the core of an OSTree repository is the summary file, which describes the content of the repository.  This is similar to the metadata that “apt-get update” downloads.

Every time you do a flatpak install it needs the information in the summary file. The file is cached between operations, but any time the repository changes the local copy needs to be updated.

This can be pretty slow, with Flathub having around 1000 applications (times 4 architectures). In addition, the more applications there are, the more likely it is that one has been updated since last time, which means the summary needs to be re-downloaded.

This isn’t yet a major problem for Flathub, but it’s just a matter of time before it is, as apps keep getting added:

This is particularly problematic if we want to add new architectures, as that multiplies the number of applications.

So, for the last month I’ve been working on OSTree and Flatpak to solve this by changing the flatpak repository format. Today I released Flatpak 1.9.2, which is the first version to support the new format, and Flathub is already serving it (and the old format for older clients).

The new format is not only more efficient, it is also split by architecture meaning each user only downloads a subset of the total summary. Additionally there is a delta-based incremental update method for updating from a previous version.

Here are some numbers for the latest Flathub summary:

  • Original summary: 6.6M (1.8M compressed)
  • New (x86-64) summary: 2.7M (554k compressed)
  • Delta from previous summary: 20k

So, if you’re able to use the delta, you need about 100 times less network bandwidth compared to the original (compressed) summary, and updates will be much faster.

Also, this means we can finally start looking at supporting other architectures in Flathub, as doing so will not inconvenience users of the major architectures.

To the future and beyond!

Compatibility in a sandboxed world

Compatibility has always been a complex problem in the Linux world. With the advent of containers/sandboxing it has become even more complicated. Containers help solve compatibility problems, but there are still remaining issues, especially on the Linux desktop, where things are highly interconnected. In fact, containers even create some problems that we didn’t use to have.

Today I’ll take a look at the issues in more detail and give some ideas on how best to think about compatibility in this post-container world, focusing on desktop use with technologies like flatpak and snap.

Forward and backwards compatibility

First, let’s take a look at what we mean by compatibility. Generally we’re considering changing some part of the system that interacts with other parts (for example command-line arguments, a library, a service or even a file format). We say the change is compatible if the complete system still works as intended after the change.

Most of the time when compatibility is mentioned it refers to backwards compatibility. This means that a change is done such that all old code keeps working. However, after the update other things may start to rely on the new version, and there is no guarantee that they will still work against the old, unchanged version. A simple example of a backwards compatible change is a file format change, where the new app can read the old files, but the old app may not necessarily read the new files.

However, there is also another concept called forward compatibility. This is a property of the design rather than of any particular change. If something is designed to be forward compatible, any later change to it will not cause earlier versions to stop working. For example, a file format is designed to be forward compatible if it guarantees that a file produced by a new app is still readable by an older app (possibly somewhat degraded due to the lack of new features).

The two concepts are complementary in the sense that if something is forward compatible, you can “upgrade” to an older version and that change will be backwards compatible.

API compatibility

API stands for Application Programming Interface and it defines how a program interfaces with some other code. When we talk about API compatibility we mean compatibility at the programming level. In other words, a change is API compatible if some other source code can be recompiled against your changed code and still builds and works.

Since the source code is very abstract and flexible this means quite a lot of changes are API compatible. For example, the memory layout of a structure is not defined at the source code level, so that can change and still be API compatible.

API compatibility is mostly interesting to programmers, or people building programs from source. What affects regular users instead is ABI compatibility.

ABI compatibility

ABI means Application Binary Interface, and it describes how binaries compiled from source are compatible. Once the source code is compiled a lot more details are made explicit, such as the layout of memory. This means that a lot of changes that are API compatible are not ABI compatible.

For example, changing the layout of a (public) structure is API compatible, but not ABI compatible as the old layout is encoded in the compiled code. However, with sufficient care by the programmer there are still a lot of changes that can be made that are ABI backward compatible. The most common example is adding new functions.
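
As a hand-written sketch (a hypothetical libfoo, not any real library), consider a public struct gaining a field between releases:

/* Public header of a hypothetical libfoo, sketched for two releases. */
#ifdef LIBFOO_1_0
struct foo_size {
  int width;
  int height;
};
#else /* libfoo 2.0 */
struct foo_size {
  int width;
  int height;
  int depth;   /* new field: the struct's size and layout change */
};
#endif

/* Existed in 1.0.  Callers that recompile against the 2.0 header keep
 * building and working (API compatible), but a binary built against 1.0
 * still allocates the smaller layout, so 2.0 library code touching
 * 'depth' corrupts memory (not ABI compatible). */
void foo_size_reset (struct foo_size *size);

/* New in 2.0.  Adding a function changes no existing layout, so this
 * part of the change is ABI backwards compatible. */
int foo_size_volume (const struct foo_size *size);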

Symbol versioning

One thing that often gets mentioned when talking about ABI compatibility is symbol versioning. This is a feature of the ELF executable format that allows the creation of multiple versions of a function in the binary with the same name. Code built against older versions of the library calls the old function, and code built against the new version will call the new function. This is a way to extend what is possible to change and still be backwards ABI compatible.

For example, using this it may be possible to change the layout of a structure, but keep a copy of the previous structure too. Then you have two functions that each work on its own particular layout, meaning the change is still ABI compatible.
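
With the GNU toolchain this is typically done with the .symver assembler directive plus a linker version script. Continuing the hypothetical libfoo sketch from above, the library can carry one copy of foo_size_reset() per struct layout:

/* libfoo.c: one copy of foo_size_reset() per struct layout, both
 * exported under the same name but with different versions. */

struct foo_size_v1 { int width; int height; };
struct foo_size_v2 { int width; int height; int depth; };

/* Binaries linked against LIBFOO_1.0 keep resolving to this copy,
 * which only touches the fields that exist in the old layout. */
void foo_size_reset_v1 (struct foo_size_v1 *size)
{
  size->width = 0;
  size->height = 0;
}
__asm__ (".symver foo_size_reset_v1,foo_size_reset@LIBFOO_1.0");

/* Newly built binaries resolve to this one ('@@' marks the default). */
void foo_size_reset_v2 (struct foo_size_v2 *size)
{
  size->width = 0;
  size->height = 0;
  size->depth = 0;
}
__asm__ (".symver foo_size_reset_v2,foo_size_reset@@LIBFOO_2.0");

The shared library is then linked with something like gcc -shared -fPIC libfoo.c -Wl,--version-script=libfoo.map, where the map file declares the LIBFOO_1.0 and LIBFOO_2.0 version nodes.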

Symbol versioning is powerful, but it is not a solution for all problems. It is mostly useful for small changes. For example, the above change is workable if only one function uses the structure. However, if the modified structure is passed to many functions then all those functions need to be duplicated, and that quickly becomes unmanageable.

Additionally, symbol versioning silently introduces problems with forward compatibility. Even if an application doesn’t rely on the feature that was introduced in the new version of the library, a simple rebuild will pick up a dependency on the new version, making it unnecessarily incompatible with older versions. This is why Linux applications must be built against the oldest version of glibc they want to support running against, rather than against the latest.

ABI domains

When discussing ABI compatibility there is normally an implicit context that is assumed to be fixed. For example, we naturally assume both builds are for the same CPU architecture.

For any particular system there are a lot of details that affect this context, such as:

  • Supported CPU features
  • Kernel system call ABI
  • Function calling conventions
  • Compiler/Linker version
  • System compiler flags
  • ABI of all dependent modules (such as e.g. glibc or libjpeg versions)

I call any fixed combination of these an ABI domain.

Historically the ABI domain has not been very important in the context of a particular installation, as it is the responsibility of the distribution to ensure that all these properties stay compatible over time. However, the fact that ABI domains differ is one of the primary causes of incompatibility between different distributions (or between different versions of the same distribution).

Protocol compatibility

Another type of compatibility is that of communication protocols. Two programs that talk to each other using a networking API (which could be on two different machines, or locally on the same machine) need to use a protocol to understand each other. Changes to this protocol need to be carefully considered to ensure they are compatible.

In the remote case this is pretty obvious, as it is very hard to control what software two different machines use. However, even for local communication between processes care has to be taken. For example, a local service could be using a protocol that has several implementations and they all need to stay compatible.

Sometimes local services are split into a service and a library and the compatibility guarantees are defined by the library rather than the service. Then we can achieve some level of compatibility by ensuring the library and the service are updated in lock-step. For example a distribution could ship them in the same package.

Enter containers

Containers are (among other things) a way to have multiple ABI domains on the same machine. This allows running a single build of an application on any distribution. This solves a lot of ABI compatibility issues in one fell swoop.

And there was much rejoicing!

There are still some remnants of ABI compatibility issues left; for example, CPU features and kernel system calls still need to be compatible. Additionally, it is important to keep the individual ABI domains internally compatible over time. For flatpak this ends up being the responsibility of the maintainers of the runtime, whereas for docker it is up to each container image. But, by and large, we can consider ABI compatibility “solved”.

However, the fact that we now run multiple ABI domains on one machine brings up some new issues that we really didn’t need to care much about before, namely protocol compatibility and forward compatibility.

Protocol compatibility in a container world

Server containers are very isolated from each other, relying mainly on the stability of things like DNS/HTTP/SQL. But the desktop world is a lot more interconnected. For example, it relies on X11/Wayland/OpenGL for graphics, PulseAudio for audio, and CUPS for printing. Lots of desktop services use DBus to expose commonly used desktop functionality (including portals), and there is a plethora of file formats shared between apps (such as icons, mime-types, themes, etc).

If there is only one ABI domain on the machine then all these local protocols and formats need not consider protocol compatibility at all, because services and clients can be upgraded in lock-step. For example, the evolution-data-server service has an internal versioned DBus API, but client apps just use the library. If the service protocol changes we just update the library and apps continue to work.

However, with multiple ABI domains, we may have different versions of the library, and if the library of one domain doesn’t match the version of the running service then it will break.

So, in a containerized world, any local service running on the desktop needs to consider protocol stability and extensibility much more carefully than we used to. Fortunately protocols are generally much more limited and flexible than library ABIs, so it’s much easier to make compatible changes, and failures generally result in error messages rather than crashes.

Forward compatibility in a container world

Historically, forward compatibility has mainly needed to be considered for file formats. For instance, you might share a home directory between different machines and then use different versions of an app to open a file on the two machines, and you don’t want the newer version to write files the older one can’t read.

However, with multiple ABI domains it is going to be much more common that some of the software on the machine updates faster than the rest. For example, you might be running a very up-to-date app on an older distribution. In this case it is important that any host services were designed to be forward compatible, so that they don’t break when talking to newer clients.

Summary

So, in this new world we need to have new rules. And the rule is this:

Any interface that spans ABI domains (for example between system and app, or between two different apps) should be backwards compatible and try to be forward compatible as much as possible.

Any consumer of such an interface should be aware of the risks involved with version differences and degrade gracefully in case of a mismatch.

Putting container updates on a diet

For infrastructure reasons the Fedora flatpaks are delivered using the container registry, rather than OSTree (which is normally used for flatpaks). Container registries are quite inefficient for updates, so we have been getting complaints about large downloads.

I’ve been working on ways to make this more efficient. This would help not only the desktop use case; smaller downloads are also important for things like IoT devices. Also, less bandwidth used could lead to significant cost savings for large scale registries.

Containers already have some features designed to save download size. Let’s take a look at them in more detail to see why that often isn’t enough.

Consider a very simple Dockerfile:

FROM fedora:32
RUN dnf -y install httpd
COPY index.html /var/www/html/index.html
ENTRYPOINT /usr/sbin/httpd

This will produce a container image that looks like this:

The nice thing about this setup is that if you change the html layer and re-deploy you only need to download the last layer, because the other layers are unchanged.

However, as soon as one of the other layers changes, you need to download the changed layer, and every layer built on top of it, from scratch. For example, if there is a security issue and you need to update the base image, all layers will change.

In practice, such updates actually change very little in the image. Most files are the same as the previous version, and the few that change are still similar to the previous version. If a client is doing an update from the previous version the old files are available, and if they could be reused that would save a lot of work.

One complexity of this is that we need to reconstruct the exact tar file that we would otherwise download, rather than the extracted files. This is because we need to checksum it and verify the checksum to protect against accidental or malicious modifications. For containers, the checksum that clients use is the checksum of the uncompressed tarball. Being uncompressed is fortunate for us, because reproducing identical compression is very painful.
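
Concretely, the digest a client verifies is just a SHA-256 over the uncompressed layer tar (the layer’s “DiffID” recorded in the image config), so reconstructing the tar byte-for-byte is enough to pass verification:

$ gunzip -k layer.tar.gz
$ sha256sum layer.tar   # this is the digest that must match after reconstruction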

To handle such reconstruction, I wrote a tool called tar-diff which does exactly what is needed here:

$ tar-diff old.tar.gz new.tar.gz delta.tardiff
$ tar xf old.tar.gz -C extracted/
$ tar-patch delta.tardiff extracted/ reconstructed.tar
$ gunzip new.tar.gz
$ cmp new.tar reconstructed.tar

In other words, it can use the extracted data from an old version, together with a small diff file, to reconstruct the uncompressed tar file.

tar-diff uses knowledge of the tar format, as well as the bsdiff binary diff algorithm and zstd compression to create very small files for typical updates.

Here are some size comparisons for a few representative images. This shows the relative size of the deltas compared to the size of the changed layers:

  • Red Hat Universal Base Image 8.0 and 8.1
  • fluent/fluentd, a Ruby application on top of a small base image
  • OpenShift Enterprise prometheus releases
  • Fedora 30 flatpak runtime updates

These are some pretty impressive figures. It’s clear from this that some updates are really very small, yet we are all downloading massive files anyway. Some updates are larger, but even for those the deltas are in the realm of 10–15% of the original size. So, even in the worst case, deltas give around a 10x improvement.

For this to work we need to store the deltas on a container registry and have a way to find them when pulling an image. Fortunately it turns out that the OCI specification is quite flexible, and there is a new project called OCI artifacts specifying how to store other types of binary data in a container registry.

So, I was able to add support for this in skopeo and podman, allowing them to both generate deltas and use them to speed up downloads. Here is a short screen-cast of using this to generate and use deltas between two images stored on the Docker Hub:

All this is work in progress and the exact details of how to store deltas on the registry are still being discussed. However, I wanted to give a heads up about this because I think it is some really powerful technology that a lot of people might be interested in.

Introducing GVariant schemas

GLib supports a binary data format called GVariant, which is commonly used to store various forms of application data. For example, it is used to store the dconf database and as the on-disk data in OSTree repositories.

The GVariant serialization format is very interesting. It has a recursive type system (based on the DBus types) and is very compact. At the same time it includes padding to correctly align types for direct CPU reads and has constant-time element lookup for arrays and tuples. This makes GVariant a very good format for efficient in-memory read-only access.

Unfortunately the APIs that GLib has for accessing variants are not always great. They are based on using type strings and accessing children via integer indexes. While this is very dynamic and flexible (especially when creating variants) it isn’t a great fit for the case where you have serialized data in a format that is known ahead of time.

Some negative aspects are:

  • Each g_variant_get_child() call allocates a new object.
  • There is a lot of unavoidable (atomic) refcounting.
  • It always uses generic codepaths even if the format is known.
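
To make those points concrete, here is roughly what reading one nested pair out of a larger variant looks like with the generic API (a hand-written snippet, not from any particular program):

#include <glib.h>

/* Read an (ii) pair stored as the second child of a larger tuple the
 * generic way: every step needs a type string or a child index, and
 * each child lookup allocates a new reference-counted GVariant. */
static void
read_size (GVariant *v, gint32 *width, gint32 *height)
{
  GVariant *size = g_variant_get_child_value (v, 1);  /* allocates */
  g_variant_get (size, "(ii)", width, height);
  g_variant_unref (size);
}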

If you look at some other binary formats, like Google protobuf or Cap’n Proto, they work by describing the types your program uses in a schema, which is compiled into code that you then use to work with the data.

For many use-cases this kind of setup makes a lot of sense, so why not do the same with the GVariant format?

With the new GVariant Schema Compiler you can!

It uses an interface definition language where you define the types, including extra information like field names and other attributes, from which it generates C code.

For example, given the following schema:

type Gadget {
  name: string;
  size: {
    width: int32;
    height: int32;
  };
  array: []int32;
  dict: [string]int32;
};

It generates (among other things) these accessors:

const char *    gadget_ref_get_name   (GadgetRef v);
GadgetSizeRef   gadget_ref_get_size   (GadgetRef v);
Arrayofint32Ref gadget_ref_get_array  (GadgetRef v);
const gint32 *  gadget_ref_peek_array (GadgetRef v,
                                       gsize    *len);
GadgetDictRef   gadget_ref_get_dict   (GadgetRef v);

gint32 gadget_size_ref_get_width  (GadgetSizeRef v);
gint32 gadget_size_ref_get_height (GadgetSizeRef v);

gsize  arrayofint32_ref_get_length (Arrayofint32Ref v);
gint32 arrayofint32_ref_get_at     (Arrayofint32Ref v,
                                    gsize           index);

gboolean gadget_dict_ref_lookup (GadgetDictRef v,
                                 const char   *key,
                                 gint32       *out);

Not only are these accessors easier to use and understand due to using C types and field names instead of type strings and integer indexes, they are also a lot faster.
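
Usage then looks roughly like this. Note that this is just a sketch: gadget_from_gvariant() is a hypothetical name for whatever entry point the compiler generates to wrap an existing variant, so check the generated header for the real one.

/* Sketch of using the generated accessors on a Gadget variant. */
static void
print_gadget (GVariant *variant)
{
  GadgetRef gadget = gadget_from_gvariant (variant);  /* hypothetical constructor */
  GadgetSizeRef size = gadget_ref_get_size (gadget);
  gint32 weight;

  g_print ("%s: %dx%d\n",
           gadget_ref_get_name (gadget),
           gadget_size_ref_get_width (size),
           gadget_size_ref_get_height (size));

  if (gadget_dict_ref_lookup (gadget_ref_get_dict (gadget), "weight", &weight))
    g_print ("weight = %d\n", weight);
}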

I wrote a simple performance test that just decodes a structure over and over. It’s clearly a very artificial test, but the generated code is over 600 times faster than the code using g_variant_get(), which I think still says something.

Additionally, the compiler has a lot of other useful features:

  • You can add a custom prefix to all generated symbols.
  • All fixed size types generate C struct types that match the binary format, which can be used directly instead of the accessor functions.
  • Dictionary keys can be declared sorted: [sorted string] { ... } which causes the generated lookup function to use binary search.
  • Fields can declare endianness: foo: bigendian int32 which will be automatically decoded when using the generated getters.
  • Typenames can be declared ahead of time and used like foo: []Foo, or declared inline: foo: [] 'Foo { ... }. If you don’t name the type it will be named based on the field name.
  • All types get generated format functions that are (mostly) compatible with g_variant_print().

Gthree – ready to play

Today I made a new release of Gthree, version 0.2.0.

Newly added in this release is support for Raycaster, which is important if you’re making interactive 3D applications. For example, it’s used if you want clicks on the window to pick a 3D object from the scene. See the interactive demo for an example of this.

Also new is support for shadow maps. This allows objects between a light source and a target to cast shadows on the target. Here is an example from the demos:

I’ve been looking over the list of features that we support, and in this release I think all the major things you might want to do in a 3D app are supported to at least a basic level.

So, if you ever wanted to play around with 3D graphics, now would be a great time to do so. Maybe just build gthree and study/tweak the code in the examples subdirectory. That will give you a decent introduction to what is possible.

If you just want to play, I added a couple of new features to gnome-hexgl based on the new release. Check out how the track casts shadows on the buildings!

Gaming with GThree

The last couple of weeks I’ve been on holiday, and I spent some of that time hacking on gthree. Gthree is a port of three.js, and a good way to get some testing of it is to port a three.js app. Benjamin pointed out HexGL, a WebGL racing game similar to F-Zero.

This game uses a bunch of cool features like shaders, effects, sprites, particles, etc, so it was a good target. I had to add a bunch of features to gthree and fix some bugs, but it’s now at a state where it looks pretty cool as a demo. However, it needs more work to be playable as a game.

Check out this screenshot:

Or this (lower resolution) video:

If you’re interested in playing with it, the code is on github. It needs the latest git versions of graphene and gthree to build.

I hope to have a playable version of this for GUADEC. See you there!

Gthree update, It moves!

Recently I have been backporting some missing three.js features and fixing some bugs. In particular, gthree now supports:

  • An animation system based on keyframes and interpolation.
  • Skinning, where a model can have a skeleton and modifying a bone affects the whole model.
  • Support in the glTF loader for the above.

This is pretty cool as it enables us to easily load and animate character models. Check out this video:

Gthree is alive!

A long time ago in a galaxy far far away I spent some time converting three.js into a Gtk+ library, called gthree.

It was never really used for anything and wasn’t really product quality. However, I really like the idea of having an easy way to add some 3D effects to Gnome apps.

Also, time has moved on, and three.js got cool new features like PBR materials and glTF support.

So, last week I spent some time refreshing gthree with the latest shaders from three.js, adding a glTF loader and a few new features, porting to meson, and doing some general cleanup/polish. I still want to add some more features like skinning, morphing and animations, but the new rendering alone shows just how cool this stuff is:

Here is a screencast showing off the model viewer:

Cool, eh!

I have to do some other work too, but I hope to get some more time in the future to work on gthree and to use it for something interesting.

Broadway adventures in Gtk4

One of my long running side projects is a Gtk backend called “Broadway”. Instead of rendering to the screen, this backend creates an HTTP server that you can connect to, which then exposes the UI remotely in the browser.

The original version of broadway essentially streamed image frames, although there were various ways to optimize what got sent. This matches pretty well with how Gtk 3 rendering works, particularly on Wayland. Every frame, Gtk calls out to all widgets, letting them draw on top of a buffer, and then sends the final frame to the compositor. Broadway just inserts some image delta computation and JavaScript magic in the middle of this.

Enter Gtk 4, breaking everything!

However, time moves on, and the current development branch of Gtk (which will be Gtk 4) has completely changed how rendering works, with the goal of doing efficient rendering on modern GPUs.

In the new model widgets don’t directly render to a buffer. Instead they build up a model of how the final result should look in terms of something called render nodes. These describe the rendering as a tree of high-level operations. The backend (we have software, OpenGL and Vulkan backends) then knows how to take this description and submit it to the GPU in an efficient way. This is somewhat similar to the Firefox WebRender project.

It would be possible to implement the broadway backend by hooking up the software renderer, letting it generate a buffer and then sending that to the browser. However, that is pretty lame!

CSS comes to the rescue!

Instead I’ve been looking at making the browser actually draw the render nodes. Gtk defines a lot of its UI in terms of CSS these days, and that means that the render nodes are actually very close to the CSS rendering model. For example, the basic drawing operations are things like rounded boxes with borders, shadows, etc.

So, I was thinking: could we not take these render nodes, turn them into actual DOM nodes with CSS styles, and send them to the browser? Then every frame we can just diff the DOM trees, sending only the minimal changes necessary.

Sounds crazy, right? But it turns out to work pretty well.
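
As a rough, hand-written illustration (not actual Broadway output), a “rounded box with border and shadow” render node could map to a single absolutely-positioned div:

<div style="position: absolute; left: 10px; top: 10px;
            width: 120px; height: 32px;
            border: 1px solid #cdc7c2; border-radius: 6px;
            box-shadow: 0 1px 2px rgba(0, 0, 0, 0.2);
            background: linear-gradient(to bottom, #f6f5f4, #edebe9);">
</div>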

Check out this example page which I created with the magic of “save as”. In particular, try zooming into that page in the browser, and play with the developer tools inspector to see the nodes. Here is a part of it zoomed in:

The icons and the text are not CSS, so they don’t scale, but look at those gorgeous borders, shadows and gradients!

Entering the 3rd dimension!

Particularly interesting is the support in Gtk for general 3D transforms. These map well to CSS transforms in the browser.

Check out this example of a spinning-cube transition. If you open up the browser inspector you can see that each individual element in the cube is still a regular CSS box.

Some technical notes

If you look at the examples above, they all use data: URIs for images. This is a custom mode that lets “screenshots” like the above work. Normally broadway uses blobs for the images.

Also, looking at the examples, they seem very heavy in terms of images, as all the text is rendered as images. However, in a typical frame most of the render tree is identical to the previous frame, meaning any label that was used in the last frame need not be sent again. In fact, even if it changes position in the tree due to a parent node changing (scrolling, cube-switching, etc) it can still be reused as-is.

However, text is clearly the weak point here. Unfortunately HTML/CSS has no low-level text rendering API we could use. I’m considering generating a texture atlas of pre-rendered glyphs that can be reused (like CSS sprites) when rendering text; that would at least mean we have to download less data. If anyone has other ideas, I would love to hear about them.