Flatpak External Data Checker

(This post is a slightly longer version of a lightning talk I gave at GUADEC 2019.)

Many non-free applications’ binaries cannot be redistributed (particularly not in modified form), so they cannot be included directly in a Flatpak. To work around this, Flatpak supports the concept of “extra data”: files which will be downloaded and unpacked from a third-party URI when the app is installed. The URI is accompanied by a checksum and a size, to provide some hope that the data unpacked on the user’s system is the same as what the packager tested. This is used by, for example, the Dropbox Flatpak.

Of course, the Flatpak needs to be kept up to date when new versions of the app are released. At best, the old URL will still point to the same file, so at least the old version of the app will continue to be installed; in some cases, however, vendors publish new versions of the app at the same URL, which means the Flatpak cannot be installed until it is updated.

Some time ago, Joaquim Rocha started work on Flatpak External Data Checker to periodically check a Flatpak manifest and report when it needs updating. As well as just checking that a URL is reachable and has the expected size and checksum, it also knows how to follow a redirect to a stable URI for the latest version (a helpful pattern some apps use), or to find the latest package in an apt repository. I subsequently taught it how to determine the new app’s version, update the AppData file, commit the necessary changes to Git, and send a pull request (like this one).

I tried moderately hard to preserve YAML and XML comments and formatting. For JSON, I gave up trying to preserve formatting (let alone json-glib’s non-standard extensions); the output is at least deterministic, so once it’s reformatted the JSON, the diffs will be smaller in future.

At Endless, we run this for a short list of apps on Flathub (and a few on Endless’s Flatpak repo). If you want to get PRs for an app you maintain, add the necessary metadata to your Flathub application’s manifest, then send a pull request to update the list of repos we check. I hope that in the medium term we could move this over to Flathub’s build infrastructure and run it on every repo (with some way to opt out).

There are a fair few open issues – PRs, suggestions and bug reports all very welcome!

Age rating data for Flathub apps

OARS (Open Age Ratings Service) defines a scheme to include content rating information in apps’ AppData/AppStream file. GNOME Software and similar tools use this metadata to show age ratings for applications. In Endless OS, we also support restricting which applications a given user can install based on this data – see this page, and the reports it links to, for a bit more information about this feature and its future.

Screenshot: “Age Rating: 7. The application was rated this way because it features: Characters in aggressive conflict easily distinguishable from reality”

Every new app on Flathub should include OARS metadata, but there are many existing apps which don’t have this data, so it’s not (yet) enforced at build time. Edit: Bartłomiej tells me that it has been enforced at build time for a little over a month. (See App Requirements and AppData Guidelines on the Flathub wiki for some more information on what’s required and recommended; this branch of appstream-glib is modified to enforce Flathub’s policies.) My colleague Andre Magalhaes crunched the data and opened a tracker task for Flathub apps without OARS metadata.1 This information is much more useful if it can be relied upon to be present.

If you’re familiar with an app on this list, generating the OARS data is a simple process: open this generator in your browser, answer some questions about the app, and receive some XML. The next step is to put that OARS data into the AppData.2 Take a look at the app’s Flathub repo and check whether it has an .appdata.xml file.

If it doesn’t, then the app’s AppData must be maintained upstream. Great! Find the upstream repository for the project, and send a merge request there. (Here’s one I sent earlier, for D-Feet.) You can either add the same patch to the Flathub packaging (as I did for D-Feet) or wait for a new upstream release and then update the Flathub packaging to that version (as I also did for D-Feet, a couple of days later).

If the appdata is maintained in the Flathub repo, make the relevant changes there directly. (Here’s a PR I opened for Tux, of Math Command while writing this post.) Ideally, the appdata would make its way upstream, but there are a fair few apps on Flathub which do not have active upstreams.

You might well find that the appdata requirements have become more strict about things other than OARS since the app was last updated, and these will have to be fixed too. In both the example cases above, I had to add release information, which has become mandatory for Flathub apps since these were last updated.

  1. We have a similar list for our in-house apps. []
  2. If you’re the upstream maintainer for the app, you probably already know how to do all this, and can stop reading here! []

Rebasing downstream translations

At Endless, we maintain downstream translations for an number of GNOME projects, such as gnome-software, gnome-control-center and gnome-initial-setup. These are projects where our (large) downstream modifications introduce new user-facing strings. Sometimes, our translation for a string differs from the upstream translation for the same string. This may be due to:

  • a deliberate downstream style choice – such as tú vs. usted in Spanish
  • our fork of the project changing the UI so that the upstream translation does not fit in the space available in our UI – “Suspend” was previously translated into German as „In Bereitschaft versetzen“, which we changed to „Bereitschaft“ for this reason
  • the upstream translation being incorrect
  • the whim of a translator

Whenever we update to a new version of GNOME, we have to reconcile our downstream translations with the changes from upstream. We want to preserve our intentional downstream changes, and keep our translations for strings that don’t exist upstream; but we also want to pull in translations for new upstream strings, as well as improved translations for existing strings. Earlier this year, the translation-rebase baton was passed to me. My predecessor would manually reapply our downstream changes for a set of officially-supported languages, but unlike him, I can pretty much only speak English, so I needed something a bit more mechanical.

I spoke to various people from other distros about this problem.1 A common piece of advice was to not maintain downstream translation changes: appealing, but not really an option at the moment. I also heard that Ubuntu follows a straightforward rule: once the translation for a string has been changed downstream, all future upstream changes to the translation for that string are ignored. The assumption is that all downstream changes to a translation must have been made for a reason, and should be preserved. This is essentially a superset of what we’ve done manually in the past.

I wrote a little tool to implement this logic, pomerge translate-o-tron 3000 (or “t3k” for short).2 Its “rebase” mode takes the last common upstream ancestor, the last downstream commit, and a working copy with the newest downstream code. For each locale, for each string in the translation in the working copy, it compares the old upstream and downstream translations – if they differ, it merges the latter into the working copy. For example, Endless OS 3.5.x was based on GNOME 3.26; Endless OS 3.6.x is based on GNOME 3.32. I might rebase the translations for a module with:

$ cd src/endlessm/gnome-control-center

# The eos3.6 branch is based on the upstream 3.32.1 tag
$ git checkout eos3.6

# Update the .pot file
$ ninja -C build meson-gnome-control-center-2.0-pot

# Update source strings in .po files
$ ninja -C build meson-gnome-control-center-2.0-update-po

# The eos3.6 branch is based on the upstream 3.26.1 tag;
# merge downstream changes between those two into the working copy
$ t3k rebase `pwd` 3.26.1 eos3.5

# Optional: Python's polib formats .po files slightly differently to gettext;
# reformat them back. This has no semantic effect.
$ ninja -C build meson-gnome-control-center-2.0-update-po

$ git commit -am 'Rebase downstream translations'

It also has a simpler “copy” mode which copies translations from one .po file to another, either when the string is untranslated in the target (the default) or for all strings. In some cases, we’ve purchased translations for strings which have not been translated upstream; I’ve used this to submit some of those upstream, such as Arabic translations of the myriad OARS categories, and hope to do more of that in the future now I can make a computer do the hard work.

$ t3k copy \
    ~/src/endlessm/gnome-software/po/ar.po \
  1. I’d love to credit the individuals I spoke to but my memory is awful. Please let me know if you remember being on the other side of these conversations and I’ll update this post! []
  2. Thanks to Alexandre Franke for pointing out the existence of at least one existing tool called “pomerge”. In my defence, I originally wrote this script on a Eurostar with no internet connection, so couldn’t search for conflicts at the time. []

The devil makes work for idle processes

TLDR: in Endless OS, we switched the IO scheduler from CFQ to BFQ, and set the IO priority of the threads doing Flatpak downloads, installs and upgrades to “idle”; this makes the interactive performance of the system while doing Flatpak operations indistinguishable from when the system is idle.

At Endless, we’ve been vaguely aware for a while that trying to use your computer while installing or updating apps is a bit painful, particularly on spinning-disk systems, because of the sheer volume of IO performed by the installation/update process. This was never particularly high priority, since app installations are user-initiated, and until recently, so were app updates.

But, we found that users often never updated their installed apps, so earlier this year, in Endless OS 3.3.10, we introduced automatic background app updates to help users take advantage of “new features and bug fixes” (in the generic sense you so often see in iOS/Android app release notes). This fixed the problem of users getting “stuck” on old app versions, but made the previous problem worse: now, your computer becomes essentially unusable at arbitrary times when app updates happen. It was particularly bad when users unboxed a system with an older version of Endless OS (and hence a hundred or so older apps) pre-installed, received an automatic OS update, then rebooted into a system that’s unusable until all those apps have been updated.

At first, I looked for logic errors in (our versions of) GNOME Software and Flatpak that might cause unneccessary IO during app updates, without success. We concluded that heavy IO load when updating a large app or runtime is largely unavoidable,1 so I switched to looking at whether we could mitigate this by tweaking the IO scheduler.

The BFQ IO scheduler is supposed to automatically prioritize interactive workloads over bulk workload, which is pretty much exactly what we’re trying to do. The specific example its developers give is watching a video, without hiccups, while copying a huge file in the background. I spent some time with the BFQ developers’ own suite of benchmarks on two test systems: a Lenovo Yoga 900 (with an Intel i5-6200U @ 2.30GHz and a consumer-grade M.2 SSD) and an Endless Mission One (an older system with a Celeron CPU and a laptop-class spinning disk). Neither JP nor I were able to reproduce any interesting results for the dropped-frames benchmark: with either BFQ or CFQ (the previous default IO scheduler), the Yoga essentially never dropped frames, whereas the IO workloads immediately rendered the Mission totally unusable. I had rather more success with a benchmark which measures the time to launch LibreOffice:

  • On the Yoga, when the system was idle, the mean launch time went from 2.838s under CFQ to 2.98s under BFQ (a slight regression), but with heavy background IO, the mean launch time went from 16s with CFQ (standard deviation 0.11) to 3s with BFQ (standard deviation 0.51).
  • On the Mission, with modest background IO, the mean launch time was 108 seconds under BFQ, which sounds awful; but under CFQ, I gave up waiting for LibreOffice to start after 8 minutes!

Emboldened by these results, I went on to look at how the same “time to launch LibreOffice” benchmark fared when the background IO load is “installing and uninstalling a Lollipop Flatpak bundle in a loop”. I also looked at using ionice -c3 to set the IO priority of the install/uninstall loop to idle, which does what its name suggests: BFQ essentially will never serve IO at the idle priority if there is IO pending at any higher priority. You can see some raw data or look at some extended discussion copied from our internal issue tracker to a Flatpak pull request, but I suggest just looking at this chart:

What does it all mean?

  • The coloured bars represent median launch time in seconds for LibreOffice, across 15/30 trials for Yoga/Mission respectively.
  • The black whiskers show the minimum and maximum launch times observed. I know this should have been a box-and-whiskers or violin plot, but I realised too late that multitime does not give enough information to draw those.
  • “unloaded” refers to the performance when the system is otherwise idle.
  • “shell-loop” refers to running while true; do flatpak install -y /home/wjt/Downloads/org.gnome.Lollypop.flatpak; flatpak uninstall -y org.gnome.Lollypop/x86_64/stable; done; “long-lived” refers to performing the same operations with the Flatpak API in a long-lived process. I tried this because I understood that BFQ gives new processes a slight performance boost, but on a real system the GNOME Software and Flatpak system helper processes are long-lived. As you can see, the behaviour under BFQ is actually the other way around in the worst case, and identical for CFQ and in the median case.
  • The “ionice-” prefix means the Flatpak operation was run under ionice -c3.
  • Switching from CFQ to BFQ makes the worst case a little worse at the default IO priority, but the median case much better.
  • Setting the IO priority of the Flatpak process(es) to idle erases that worst-case regression under BFQ, and dramatically improves the median case under CFQ.
  • In combination, the time to launch LibreOffice while performing Flatpak operations in the background on the Mission went from 24 seconds to 12 seconds by switching to BFQ & setting the IO priority to idle.

So, by switching to BFQ and setting IO priorities appropriately, the system’s interactive performance while performing background updates is now essentially indistinguishable from when the system is idle. To implement this in practice, Rob McQueen wrote some patches to set the IO priority of the Flatpak system helper and GNOME Software’s worker threads to idle (both changes are upstream) and changed Endless OS’s default IO scheduler to BFQ where available. As Matthias put it on #flatpak when shown this chart and that first link: “not bad for a 1-line change”.

Of course, this means apps take a bit longer to install, even on a mostly-idle system. No, I don’t have numbers on how big the impact is: this work happened months ago and it’s taken me this long to write it up because I couldn’t find the time to collect more data. But my colleague Umang is working on eliminating up to half of the disk IO performed during Flatpak installations so that should more than make up for it!

  1. modulo Umang’s work mentioned in the coda []

Wandering in the symlink forest forever

Last week, Philip Withnall told me that Meson has built-in support for generating code coverage reports: just configure with -Db_coverage=true, run your tests with ninja test, then run ninja coverage-{text,html,xml} to generate the report in the format of your choice. The XML format is compatible with Cobertura’s output, which is convenient since Endless’s Jenkins is already configure to consume Cobertura XML generated by Autotools projects using our EOS_COVERAGE_REPORT macro. So it was a simple matter of adding gcovr to the build enviroment, running ninja coverage-xml after the tests, and moving the report to the right place for Jenkins to find it. It worked well on the projects I tested, so I decided to enable it for all Meson projects built in our CI. Sure, I thought, it’s not so useful for our forks of GNOME and third-party projects, but it’s harmless and saves adding per-project config, right?

Fast-forward to yesterday, when someone noticed that a systemd build had been stuck on the ninja coverage-xml step for 16 hours. Uh oh.

It turns out that gcovr follows symlinks when scanning for coverage files, but didn’t check for cycles. systemd’s test suite generates a fake sysfs tree, with many circular references via symlinks. For example, there are 64 self-referential ttyX trees:

$ ls -l build/test/sys/devices/virtual/tty/tty1
total 12
-rw-r--r-- 1 wjt wjt    4 Oct  9 12:16 dev
drwxr-xr-x 2 wjt wjt 4096 Oct  9 12:16 power
lrwxrwxrwx 1 wjt wjt   21 Oct  9 12:16 subsystem -> ../../../../class/tty
-rw-r--r-- 1 wjt wjt   16 Oct  9 12:16 uevent
$ ls -l build/test/sys/devices/virtual/tty/tty1/subsystem/tty1
lrwxrwxrwx 1 wjt wjt 30 Oct  9 12:16 build/test/sys/devices/virtual/tty/tty1/subsystem/tty1 -> ../../devices/virtual/tty/tty1
$ readlink -f build/test/sys/devices/virtual/tty/tty1/subsystem/tty1

And, worse, all other ttyY trees are accessible via the symlinks from each ttyX tree. The kernel caps the number of symlinks per path to 40 before lookups fail with ELOOP, but that’s still 6440 paths to resolve, just for the fake ttys. Quite a big number!

The fix is straightforward: maintain a set of visited (st_dev, st_ino) pairs while walking the tree, and prune subtrees we’ve already visited. I tried adding a similar highly self-referential symlink graph to the gcovr test suite, so that it would run in reasonable time if the fix works and essentially never terminate if it does not. Unfortunately, pytest has exactly the same bug: while searching for tests to run, it gets lost wandering in the symlink forest forever.

This bug is a good metaphor for my habit of starting supposedly-quick side-projects.

They should have called it Mirrorball

TL;DR: there’s now an rsync server at rsync://images-dl.endlessm.com/public from which mirror operators can pull Endless OS images, along with an instance of Mirrorbits to redirect downloaders to their nearest—and hopefully fastest!—mirror. Our installer for Windows and the eos-download-image tool baked into Endless OS both now fetch images via this redirector, and from the next release of Endless OS our mirrors will be used as BitTorrent web seeds too. This should improve the download experience for users who are near our mirrors.

If you’re interested in mirroring Endless OS, check out these instructions and get in touch. We’re particularly interested in mirrors in Southeast Asia, Latin America and Africa, since our mission is to improve access to technology for people in these areas.

Big thanks to Niklas Edmundsson, who administers the mirror at Academic Computer Club, Umeå University, who recommended Mirrorbits and provided the nudge needed to get this work going, and to dotsrc.org and Mythic Beasts who are also mirroring Endless OS already.

Read on if you are interested in the gory details of setting this up.

We’ve received a fair few offers of mirroring over the years, but without us providing an rsync server, mirror operators would have had to fetch over HTTPS using our custom JSON manifest listing the available images: extra setup and ongoing admin for organisations who are already generously providing storage and bandwidth. So, let’s set up an rsync server! One problem: our images are not stored on a traditional filesystem, but in Amazon S3. So we need some way to provide an rsync server which is backed by S3.

I decided to use an S3-backed FUSE filesystem to mount the bucket holding our images. It needs to provide a 1:1 mapping from paths in S3 to paths on the mounted filesystem (preserving the directory hierarchy), perform reasonably well, and ideally offer local caching of file contents. I looked at two implementations (out of the many that are out there) which have these features:

  • s3fs-fuse, which is packaged for Debian as s3fs. Debian is the predominant OS in our server infrastructure, as well as the base for Endless OS itself, so it’s convenient to have a package.1
  • goofys, which claims to offer substantially better performance for file metadata than s3fs.

I went with s3fs first, but it is a bit rough around the edges:

  • Our S3 bucket name contains dots, which is not uncommon. By default, if you try to use one of these with s3fs, you’ll get TLS certificate errors. This turns out to be because s3fs accesses S3 buckets as $NAME.s3.amazonaws.com, and the certificate is for *.s3.amazonaws.com, which does not match foo.bar.s3.amazonaws.com. s3fs has a -o use_path_request_style flag which avoids this problem by putting the bucket name into the request path rather than the request domain, but this use of that parameter is only documented in a GitHub Issues comment.
  • If your bucket is in a non-default region, AWS serves up a redirect, but s3fs doesn’t follow it. Once again, there’s an option you can use to force it to use a different domain, which once again is documented in a comment on an issue.
  • Files created with s3fs have their permissions stored in an x-amz-meta-mode header. Files created by other tools (which is to say, all our files) do not have this header, so by default get mode 0000 (ie unreadable by anybody), and so the mounted filesystem is completely unusable (even by root, with FUSE’s default settings).

There are two ways to fix this last problem, short of adding this custom header to all existing and future files:

  1. The -o complement_stat option forces files without the magic header to have mode 0400 (user-readable) and directories 0500 (user-readable and -searchable).
  2. The -o umask=0222 option (from FUSE) makes the files and directories world-readable (an improvement on complement_stat in my opinion) at the cost of marking all files executable (which they are not)

I think these are all things that s3fs could do by itself, by default, rather than requiring users to rummage through documentation and bug reports to work out what combination of flags to add. None of these were showstoppers; in the end it was a catastrophic memory leak (since fixed in a new release) that made me give up and switch to goofys.

Due to its relaxed attitude towards POSIX filesystem semantics where performance would otherwise suffer, goofys’ author refers to it as a “Filey System”.2 In my testing, throughput is similar to s3fs, but walking the directory hierarchy is orders of magnitude faster. This is due to goofys making more assumptions about the bucket layout, not needing to HEAD each file to get its permissions (that x-amz-meta-mode thing is not free), and having heuristics to detect traversals of the directory tree and optimize for that case.3

For on-disk caching of file contents, goofys relies on catfs, a separate FUSE filesystem by the same author. It’s an interesting design: catfs just provides a write-through cache atop any other filesystem. The author has data showing that this arrangement performs pretty well. But catfs is very clearly labelled as alpha software (“Don’t use this if you value your data.”) and, as described in this bug report with a rather intemperate title, it was not hard to find cases where it DOSes itself or (worse) returns incorrect data. So we’re running without local caching of file data for now. This is not so bad, since this server only uses the file data for periodic synchronisation with mirrors: in day-to-day operation serving redirects, only the metadata is used.

With this set up, it’s plain sailing: a little rsync configuration generator that uses the filter directive to only publish the last two releases (rather than forcing terabytes of archived images onto our mirrors) and setting up Mirrorbits. Our Mirrorbits instance is configured with some extra “mirrors” for our CloudFront distribution so that users who are closer to a CloudFront-served region than any real mirrors are directed there; it could also have been configured to make the European mirrors (which they all are, as of today) only serve European users, and rely on its fallback mechanism to send the rest of the world to CloudFront. It’s a pretty nice piece of software.

If you’ve made it this far, and you operate a mirror in Southeast Asia, South America or Africa, we’d love to hear from you: our mission is to improve access to technology for people in these areas.

  1. Do not confuse s3fs-fuse with fuse-s3fs, a totally separate project packaged in Fedora. It uses its own flattened layout for the S3 bucket rather than mapping S3 paths to filesystem paths, so is not suitable for what we’re doing here. []
  2. This inspired a terrible “joke”. []
  3. I’ve implemented a similar optimization elsewhere in our codebase: since we have many "directories", it takes many less requests to ask S3 for the full contents of the bucket and transform that list into a tree locally than it does to list each directory individually. []

Everything In Its Right Place

Back in July, I wrote about trying to get Endless OS working on DVDs. To recap: we have published live ISO images of Endless OS for a while, but until recently if you burned one to a DVD and tried to boot it, you’d get the Endless boot-splash, a lot of noise from the DVD drive, and not much else. Definitely no functioning desktop or installer!

I’m happy to say that Endless OS 3.3 boots from a DVD. The problems basically boiled down to long seek times, which are made worse by data not being arranged in any particular order on the disk. Fixing this had the somewhat unexpected benefit of improving boot performance on fixed disks, too. For the gory details, read on!

The initial problem that caused the boot process to hang was that the D-Bus system bus took over a minute to start. Most D-Bus clients assume that any method call will get a reply within 25 seconds, and fail particularly badly if method calls to the bus itself time out. In particular, systemd calls a number of methods on the system bus right after it launches it; if these calls fail, D-Bus service activation will not work. iotop and systemd-analyze plot strongly suggested that dbus-daemon was competing for IO with systemd-udevd, modprobe incantations, etc. Booting other distros’ ISOs, I noticed local-fs.target had a (transitive) dependency on systemd-udev-settle.service, which as the name suggests waits for udev to settle down.1 This gets most hardware discovery out of the way before D-Bus and friends get started; doing the same in our ISOs means D-Bus starts relatively quickly and the boot process can continue.

Even with this change, and many smaller changes to remove obviously-unnecessary work from the boot sequence, DVDs took unacceptably long to reach the first-boot experience. This is essentially due to reading lots of small files which are scattered all over the disk: the laser has to be physically repositioned whenever you need to seek to a different part of the DVD, which is extremely slow. For example, initialising IBus involves running ibus-engine-m17n --xml which reads hundreds of tiny files. They’re all in the same directory, but are not necessarily physically close to one another on the disk. On an otherwise idle system with caches flushed, running this command from an loopback-mounted ISO file on an SSD took 0.82 seconds, which we can assume is basically all squashfs decompression overhead. From a DVD, this command took 40 seconds!

What to do? Our systemd is patched to resurrect systemd-readahead (which was removed upstream some time ago) because many of our target systems have spinning disks, and readahead improves boot performance substantially on those systems. It records which files are accessed during the boot sequence to a pack file; early in the next boot, the pack file is replayed using posix_fadvise(..., POSIX_FADV_WILLNEED); to instruct the kernel that these files will be accessed soon, allowing them to be fetched eagerly, in an order matching the on-disk layout. We include a pack file collected from a representative system in our OS images to have something to work from during the first boot.

This means we already have a list of all2 files which are accessed during the boot process, so we can arrange them contiguously on the disk. The main stumbling block is that our ISOs (like most distros’) contain an ext4 filesystem image, inside a GPT disk image, inside a squashfs filesystem image, and ext4 does not (to my knowledge!) provide a straightforward way to move certain files to a particular region of the disk. To work around this, we adapt a trick from Fedora’s livecd-tools, and create the ext4 image in two passes. First, we calculate the size of the files listed in the readahead pack file (it’s about 200MB), add a bit for filesystem overhead, create an ext4 image which is only just large enough to hold these files, and copy them in. Then we grow the filesystem image to its final size (around 10GB, uncompressed, for a DVD-sized image) and copy the rest of the filesystem contents. This ensures that the files used during boot are mostly contiguous, near the start of the disk.3

Does this help? Running ibus-engine-m17n --xml on a DVD prepared this way takes 5.6 seconds, an order of magnitude better than the 40 seconds observed on an unordered DVD, and booting the DVD is more than a minute faster than before this change. Hooray!

Due to the way our image build and install process works, the GPT disk image inside the ISO is the same one that gets written to disk when you install Endless OS. So: how will this trick affect the installed system? One potential problem is that mke2fs uses the filesystem size to determine various attributes, like block and inode sizes, and 200MB is small enough to trigger the small profile. So we pass -T default to explicitly select more appropriate parameters for the final filesystem size.4 As far as I can tell, the only impact on installed systems is positive: spinning disks also have high seek latency, and this change cuts 15% off the boot time on a Mission One. Of course, this will gradually decay when the OS is updated, since new files used at boot will not be contiguous, but it’s still nice to have. (In the back of my mind, I’ve always wondered why boot times always get worse across the lifetime of a device; this is the first time I’ve deliberately caused this to be the case.)

The upshot: from Endless OS 3.3 onwards, ISOs boot when written to DVD. However, almost all of our ISOs are larger than 4.7 GB! You can grab the Basic version, which does fit, from the Linux/Mac tab on our website and give it a try. I hope we’ll make more DVD-sized ISOs available in a future release. New installations of Endless OS 3.3 or newer should boot a bit more quickly on rotating hard disks, too. (Running the dual-boot installer for Windows from a DVD doesn’t work yet; for a workaround, copy all the files off the DVD and run them from the hard disk.)

Oh, and the latency simulation trick I described? Since it delays reads, not seeks, it is actually not a good enough simulation when the difference between the two matters, so I did end up burning dozens of DVD+Rs. Accurate simulation of optical drive performance would be a nice option in virtualisation software, if any Boxes or VirtualBox developers are reading!

  1. Fedora’s is via dmraid-activation.service, which may or may not be deliberate; anecdotally, SUSE DVDs deliberately add a dependency for this reason. []
  2. or at least the majority of []
  3. When I described this technique internally at Endless, Juan Pablo pointed out that DVDs can actually read data faster from the end (outside) of the disk. The outside of the disk has more data per rotation than the centre, and the disk spins at a constant rotation speed. A quick test with dd shows that my drive is twice as fast reading data from the end of the disk compared to the start. It’s harder to put the files at the end of the ext4 image, but we might be able to artificially fragment the squashfs image to put the first few hundred MBs of its contents at the end. []
  4. After Endless OS is installed, the filesystem is resized again to fill the free space on disk. []

Endless Reddit AMA

Along with many colleagues from all across the company (and globe), I’m taking part in Endless’s first Reddit Ask Me Anything today. From my perspective in London it starts at 5pm on Wednesday 11th; check our website for a countdown or this helpful table of time conversions.

Have you been wanting to ask about our work and products in a public forum, but never found a socially-acceptable opportunity? Now’s your chance!

Simulating read latency with device-mapper

Like most distros, Endless OS is available as a hybrid ISO 9660 image. The main uses (in my experience) of these images are to attach to a virtual machine’s emulated optical drive, or to write them to a USB flash drive. In both cases, disk access is relatively fast.

A few people found that our ISOs don’t always boot properly when written to a DVD. It seems to be machine-dependent and non-deterministic, and the journal from failed boots shows lots of things timing out, which suggests that it’s something to do with slower reads – and higher seek times – on optical media. I dug out my eight-year-old USB DVD-R drive, but didn’t have any blank discs and really didn’t want to have to keep burning DVDs on a hot summer day. It turned out to be pretty easy to reproduce using qemu-kvm plus device-mapper’s delay target.

According to AnandTech, DVD seek times are somewhere in the region of 90-135ms. It’s not a perfect simulation but we can create a loopback device backed by the ISO image (which lives on a fast SSD), then create a device-mapper device backed by the loopback device that delays all reads by 125 ms (for the sake of argument), and boot it:

$ sudo losetup --find --show \
$ echo "0 $(sudo blockdev --getsize /dev/loop0)" \
  "delay /dev/loop0 0 125" \
  | sudo dmsetup create delayed-loop0
$ qemu-kvm -cdrom /dev/mapper/delayed-loop0 -m 1GB

Sure enough, this fails with exactly the same symptoms we see booting a real DVD. (It really helps to remember the -m 1GB because modern desktop Linux distros do not boot very well if you only allow them QEMU’s default 128MB of RAM.)

My next EP will be released as a corrupted GPT image

Since July last year I’ve been working at Endless Computers on the downloadable edition of Endless OS.1 A big part of my work has been the Endless Installer for Windows: a Wubi-esque tool that “installs” Endless OS as a gigantic image file in your Windows partition2, sparing you the need to install via a USB stick and make destructive changes like repartitioning your drive. It’s derived from Rufus, the Reliable USB Formatting Utility, and our friends at Movial did a lot of the heavy lifting of turning it to our installer.

Endless OS is distributed as a compressed disk image, so you just write it to disk to install it. On first boot, it resizes itself to fill the whole disk. So, to “install” it to a file we decompress the image file, then extend it to the desired length. When booting, in principle we want to loopback-mount the image file and treat that as the root device. But there’s a problem: NTFS-3G, the most mature NTFS implementation for Linux, runs in userspace using FUSE. There are some practical problems arranging for the userspace processes to survive the transition out of the initramfs, but the bigger problem is that accessing a loopback-mounted image on an NTFS partition is slow, presumably because every disk access has an extra round-trip to userspace and back. Is there some way we can avoid this performance penalty?

Robert McQueen and Daniel Drake came up with a neat solution: map the file’s contents directly, using device mapper. Daniel wrote a little tool, ntfsextents, which uses the ntfs-3g library to find the position and size (in bytes) within the partition of each chunk of the Endless OS image file.3 We feed these to dm-setup to create a block device corresponding to the Endless OS image, and then boot from that – bypassing NTFS entirely! There’s no more overhead than an LVM root filesystem.

This is safe provided that you disallow concurrent modification of the image file via NTFS (which we do), and provided that you get the mapping right. If you’ve ensured that the image file is not encrypted, compressed, or sparse, and if ntfsextents is bug-free, then what could go wrong?

Unfortunately, we saw some weird problems as people started to use this installation method. At first, everything would work fine, but after a few days the OS image would suddenly stop booting. For some reason, this always seemed to happen in the second half of the week. We inspected some affected image files and found that, rather than ending in the secondary GPT header as you’d expect, they ended in zeros. Huh?

We were calling SetEndOfFile to extend the image file. It’s documented to “[set] the physical file size for the specified file”, and “if the file is extended, the contents of the file between the old end of the file and the new end of the file are not defined”. For our purposes this seems totally fine: the extended portion will be used as extra free space by Endless OS, so its contents don’t matter, but we need it to be fully physically allocated so we can use the extra space. But we missed an important detail! NTFS maintains two lengths for each file: the allocation size (“the size of the space that is allocated for a file on a disk”), and the valid data length (“the length of the data in a file that is actually written”).4 SetEndOfFile only updates the former, not the latter. When using an NTFS driver, reads past the valid data length return zero, rather than leaking whatever happens to be on the disk. When you write past the valid data length, the NTFS driver initializes the intervening bytes to zero as needed. We’re not using an NTFS driver, so were happily writing into this twilight zone of allocated-but-uninitialized bytes without updating the valid data length; but when the file is defragmented, the physical contents past the valid data length are not copied to their new home on the disk (what would be the point? it’s just uninitialized data, right?). So defragmenting the file would corrupt the Endless OS image.

One could fix this in our installer in two ways: write a byte at the end of the file (forcing the NTFS driver to write tens of gigabytes of zeros to initialize the file), or use SetFileValidData to mark the unused space as valid without actually initializing it. We chose the latter: installing a new OS is already a privileged operation, and the permissions on the Endless OS image file are set to deny read access to mere mortals, so it’s safe to avoid the cost of writing ten billion zeros.5

We weren’t quite home and dry yet, though: some users were still seeing their Endless OS image file corrupting itself after a few days. Having been burned once, we guessed this might be the defragmenter at work again. It turned out to be a quirk of how chunks of a file which happen to be adjacent can be represented, which we were not handling correctly in ntfsextents, leading us to map parts of the file more than once, like a glitchy tape loop. (We got lucky here: at least all the bytes we mapped really were part of the image file. Imagine if we’d mapped some arbitrary other part of the Windows system drive and happily scribbled over it…)

(Oh, why did these problems surface in the second half of any given week? By default, Windows defragments the system drive at 1am every Wednesday, or as soon as possible after that.)

  1. If you’re not familiar with Endless OS, it’s a GNOME- and Debian-derived desktop distribution, focused on reliable, easy-to-use computing for everyone. There was lots of nice coverage from CES last week. People seem particularly taken by the forthcoming “flip the window to edit the app” feature. []
  2. and configures a bootloader – more on this in a future post… []
  3. See debian/patches/endless*.patch in our ntfs-3g source package. []
  4. I gather many other filesystems do the same. []
  5. A note on the plural of “zero”: I conducted a poll on Twitter but chose to disregard the result when it was pointed out that MATLAB and NumPy both spell it without an “e”. See? No need to blindly implement the result of a non-binding referendum! []