Flatpak is fundamentally a bundling system, and as such there are worries about bloat compared to non-bundling systems. Obviously there is some truth to this, as bundling do generally increase size. It is my opinion that the advantages of bundling offset these costs, however that is a discussion for another time. Today we’re going to look into what flatpak does to minimize bloat.
On the highest level, flatpak avoids duplication by having a runtime/application split. This means that applications don’t have to bundle everything, and that common dependencies are shared between applications.
However, runtimes were not primarily added in order to minimize bloat. Runtimes were added to have a way to do shared ownership, maintainership and releases of core stuff. Lowlevel things that application authors are not interested in nor experienced with, letting the applications focus on the application specific stuff, while still getting security updates of the core.
In fact, flatpak uses a system called ostree for the application storage which means, even without runtime sharing, a lot of duplication automatically go away.
When I try to explain what ostree is I always start with “Its like git, but for directories with large binaries”. This helps because most linux users these days have some knowledge about git. However, most people only know how to use git, and not its underlying data model. These details are important to understand when discussing ostree, so I will try to give a short explanation.
Lets start by looking at a simple ostree repository, with a single branch called “master”, containing a single file “a-file.txt”. (Note: I manually shortened all IDs to 8 characters below to make it more readable)
Here is how the repository looks on disk:
├── config
├── objects
│ ├── 03
│ │ └── 02f468.file
│ ├── 7a
│ │ └── f088f5.commit
│ ├── 3e
│ │ └── ca26e9.dirtree
│ └── d5
│ └── 85282a.dirmeta
└── refs
└── heads
└── master
Pretty weird eh? Lets explore it a bit. The adventure starts with looking up the ref for the branch master:
$ cat repo/refs/heads/master
7af088f5
This references the id of the commit object that the master branch currently points to. The commit object info can be found in objects/7a/f088f5.commit. This is a binary file, but the contents are something like:
Root metadata: d585282a
Root files: 3eca26e9
Committer: Alex
Commit date: ...
The root metadata is a reference to objects/d5/85282a.dirmeta, which contains uid/gid/permissions for the root directory.
The root files contains a list of file and directory names and their object ids. In the case of subdirectories that would be a .dirtree object, but in this case there is only a file, 0302f468, which contains the actual file data:
$ cat repo/objects/03/02f468.file
a-file contents
So, we understand how to interpret the repository data, but it still seems weird. What are all those long ids for, and whats the reason for the whole thing? It turns out that the ids are the main point. They are in fact sha256 checksums of the object contents.
Whenever ostree commits a file it computes the checksum of the file and uses that as the id of the object in the repository. This means that if for whatever reason two files with identical contents are committed to the same repo, then they will use the same object. And this continues to directories too, if all the files and directories in two dirtree objects are identical (and have the same names) then the same dirtree object will be used.
In the example repo above, suppose that another branch was added, which contained the same file, plus some other file. Only the commit, the new file and the new dirtree for the root would be added, all the other objects would be reused.
What about flatpak?
This ostree feature is very interesting, because in flatpak each installed runtime and application is a branch in a single ostree repository (combined with a checked out version that has hard-links to the files in the repository). So, any files that are identical are automatically shared between applications.
It should be noted that this sharing does not only affect disk space use. Since shared files are hard-linked, any RAM used for disk cache and memory mappings are automatically shared too. Additionally, when applications are downloaded we don’t have to download any objects that are already locally available, so network bandwidth is shared too.
How well does this work in practice? One way to look at this is comparing the gnome and the freedesktop runtimes. The gnome runtime is based on the freedesktop runtime, starting with it a copy of it, removing some files and then adding some.
We can check the size of the individual runtimes:
$ du -sh org.gnome.Platform/x86_64/3.26
665M org.gnome.Platform/x86_64/3.26
$ du -sh org.freedesktop.Platform/x86_64/1.6
435M org.freedesktop.Platform/x86_64/1.6
But we can also see how much space they use together, which takes into account files that are just hard-linked copies of each other:
$ du -sh org.gnome.Platform/x86_64/3.26 \
org.freedesktop.Platform/x86_64/1.6
665M org.gnome.Platform/x86_64/3.26
18M org.freedesktop.Platform/x86_64/1.6
Here we see that installing both runtimes just adds an extra 18M due to the shared files. Similar measurements are true for the KDE runtime, which is based on the same freedesktop runtime build. And since they use the same base the gnome and KDE runtimes share this space too.
The above example is the ideal case for sharing, but even in unrelated builds this helps a lot. For example, let us compare the x86-64 and i386 builds of the gnome runtime:
$ du -sh org.gnome.Platform/i386/3.26
682M org.gnome.Platform/i386/3.26
$ du -sh org.gnome.Platform/x86_64/3.26
665M org.gnome.Platform/x86_64/3.26
$ du -sh org.gnome.Platform/x86_64/3.26 \
org.gnome.Platform/i386/3.26
665M org.gnome.Platform/x86_64/3.26
462M org.gnome.Platform/i386/3.26
Here we saved 220M in two completely unrelated builds. This is caused by things like fonts, icons, and app data being the identical in the two builds.
Thats all for todays deep dive into flatpak internals. I hope you found this interesting.