My next EP will be released as a corrupted GPT image

Since July last year I’ve been working at Endless on the downloadable edition of Endless OS. ((If you’re not familiar with Endless OS, it’s a GNOME- and Debian-derived desktop distribution, focused on reliable, easy-to-use computing for everyone. There was lots of nice coverage from CES last week. People seem particularly taken by the forthcoming “flip the window to edit the app” feature.)) A big part of my work has been the Endless Installer for Windows: a Wubi-esque tool that “installs” Endless OS as a gigantic image file in your Windows partition ((and configures a bootloader – more on this in a future post…)), sparing you the need to install via a USB stick and make destructive changes like repartitioning your drive. It’s derived from Rufus, the Reliable USB Formatting Utility, and our friends at Movial did a lot of the heavy lifting of turning it into our installer.

Endless OS is distributed as a compressed disk image, so you just write it to disk to install it. On first boot, it resizes itself to fill the whole disk. So, to “install” it to a file we decompress the image file, then extend it to the desired length. When booting, in principle we want to loopback-mount the image file and treat that as the root device. But there’s a problem: NTFS-3G, the most mature NTFS implementation for Linux, runs in userspace using FUSE. There are some practical problems arranging for the userspace processes to survive the transition out of the initramfs, but the bigger problem is that accessing a loopback-mounted image on an NTFS partition is slow, presumably because every disk access has an extra round-trip to userspace and back. Is there some way we can avoid this performance penalty?

Robert McQueen and Daniel Drake came up with a neat solution: map the file’s contents directly, using device mapper. Daniel wrote a little tool, ntfsextents, which uses the ntfs-3g library to find the position and size (in bytes) within the partition of each chunk of the Endless OS image file. ((See debian/patches/endless*.patch in our ntfs-3g source package.)) We feed these to dmsetup to create a block device corresponding to the Endless OS image, and then boot from that – bypassing NTFS entirely! There’s no more overhead than an LVM root filesystem.
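
To make the mapping concrete, here is a minimal sketch of how a list of extents could be turned into a device-mapper table of "linear" targets. This is not the actual ntfsextents or initramfs code: the function name linear_table, the example extents, and the device path /dev/sda2 are all made up for illustration. The only thing taken as given is the format of a linear target line (logical start sector, length in sectors, the word linear, the backing device, and the start sector on that device), with everything counted in 512-byte sectors.

```python
SECTOR = 512  # device-mapper tables are expressed in 512-byte sectors


def linear_table(extents, device):
    """Build a device-mapper table with one `linear` line per extent.

    extents: list of (byte_offset_in_partition, byte_length), in file order.
    device:  path of the partition holding the image (hypothetical here).
    """
    lines = []
    logical = 0  # running start sector within the mapped device
    for offset, length in extents:
        if offset % SECTOR or length % SECTOR:
            raise ValueError("extent is not sector-aligned")
        sectors = length // SECTOR
        # <logical start> <length> linear <device> <start on device>
        lines.append(f"{logical} {sectors} linear {device} {offset // SECTOR}")
        logical += sectors
    return "\n".join(lines)


if __name__ == "__main__":
    # Two made-up extents: 4 KiB at 1 MiB, then 8 KiB at 8 MiB.
    print(linear_table([(1048576, 4096), (8388608, 8192)], "/dev/sda2"))
```

A table like this can be piped to dmsetup create on standard input; the resulting block device reads the extents back to back, in file order.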

This is safe provided that you disallow concurrent modification of the image file via NTFS (which we do), and provided that you get the mapping right. If you’ve ensured that the image file is not encrypted, compressed, or sparse, and if ntfsextents is bug-free, then what could go wrong?

Unfortunately, we saw some weird problems as people started to use this installation method. At first, everything would work fine, but after a few days the OS image would suddenly stop booting. For some reason, this always seemed to happen in the second half of the week. We inspected some affected image files and found that, rather than ending in the secondary GPT header as you’d expect, they ended in zeros. Huh?

We were calling SetEndOfFile to extend the image file. It’s documented to “[set] the physical file size for the specified file”, and “if the file is extended, the contents of the file between the old end of the file and the new end of the file are not defined”. For our purposes this seems totally fine: the extended portion will be used as extra free space by Endless OS, so its contents don’t matter, but we need it to be fully physically allocated so we can use the extra space. But we missed an important detail! NTFS maintains two lengths for each file: the allocation size (“the size of the space that is allocated for a file on a disk”), and the valid data length (“the length of the data in a file that is actually written”). ((I gather many other filesystems do the same.)) SetEndOfFile only updates the former, not the latter. When using an NTFS driver, reads past the valid data length return zero, rather than leaking whatever happens to be on the disk. When you write past the valid data length, the NTFS driver initializes the intervening bytes to zero as needed. We’re not using an NTFS driver, so we were happily writing into this twilight zone of allocated-but-uninitialized bytes without updating the valid data length. But when the file is defragmented, the physical contents past the valid data length are not copied to their new home on the disk (what would be the point? it’s just uninitialized data, right?). So defragmenting the file would corrupt the Endless OS image.
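
The failure is easier to see in a toy model. The sketch below is not NTFS and not the Windows defragmenter: ToyFile and its methods are invented for illustration, and it assumes, to match the zeros we found, that the file's new location reads as zeros wherever nothing was copied. It models only the two behaviours described above: writes through the driver advance the valid data length, while writes that go straight to the disk do not.

```python
class ToyFile:
    """Toy model of one file: its bytes on disk plus a valid data length."""

    def __init__(self, size):
        self.disk = bytearray(size)  # physical bytes allocated to the file
        self.vdl = 0                 # how much the driver believes was written

    def driver_write(self, pos, data):
        """Write through the filesystem driver: zero any gap, advance the VDL."""
        if pos > self.vdl:
            self.disk[self.vdl:pos] = bytes(pos - self.vdl)
        self.disk[pos:pos + len(data)] = data
        self.vdl = max(self.vdl, pos + len(data))

    def raw_write(self, pos, data):
        """Write straight to the underlying blocks: the VDL is not updated."""
        self.disk[pos:pos + len(data)] = data

    def defragment(self):
        """Move the file, copying only the bytes below the VDL."""
        moved = bytearray(len(self.disk))  # assumption: new home reads as zeros
        moved[:self.vdl] = self.disk[:self.vdl]
        self.disk = moved


if __name__ == "__main__":
    f = ToyFile(16)
    f.driver_write(0, b"OS")   # what the installer wrote through NTFS
    f.raw_write(12, b"GPT!")   # what the booted OS wrote via device mapper
    print(bytes(f.disk[12:]))  # the raw write is on disk
    f.defragment()
    print(bytes(f.disk[12:]))  # after the move, it is gone
```

In this model the bytes written with raw_write survive until defragment is called, which mirrors an image that boots fine for a few days and then stops.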

One could fix this in our installer in two ways: write a byte at the end of the file (forcing the NTFS driver to write tens of gigabytes of zeros to initialize the file), or use SetFileValidData to mark the unused space as valid without actually initializing it. We chose the latter: installing a new OS is already a privileged operation, and the permissions on the Endless OS image file are set to deny read access to mere mortals, so it’s safe to avoid the cost of writing ten billion zeros. ((A note on the plural of “zero”: I conducted a poll on Twitter but chose to disregard the result when it was pointed out that MATLAB and NumPy both spell it without an “e”. See? No need to blindly implement the result of a non-binding referendum!))

We weren’t quite home and dry yet, though: some users were still seeing their Endless OS image file corrupting itself after a few days. Having been burned once, we guessed this might be the defragmenter at work again. It turned out to be a quirk of how chunks of a file which happen to be adjacent can be represented, which we were not handling correctly in ntfsextents, leading us to map parts of the file more than once, like a glitchy tape loop. (We got lucky here: at least all the bytes we mapped really were part of the image file. Imagine if we’d mapped some arbitrary other part of the Windows system drive and happily scribbled over it…)
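
A cheap sanity check on the extent list would catch this class of mistake before anything is mapped. The sketch below is not how ntfsextents was fixed; check_extents is a hypothetical helper which assumes extents are given as (byte offset in partition, byte length) pairs. It checks two things: that the extents add up to exactly the file's length, and that no two of them cover the same bytes of the partition.

```python
def check_extents(extents, file_length):
    """Raise ValueError if the extents cannot be a faithful map of the file.

    extents:     list of (byte_offset_in_partition, byte_length) pairs.
    file_length: expected length of the image file in bytes.
    """
    total = sum(length for _, length in extents)
    if total != file_length:
        raise ValueError(f"extents cover {total} bytes, expected {file_length}")

    # Sort by physical position; any overlap then shows up between neighbours.
    ordered = sorted(extents)
    for (a_off, a_len), (b_off, _) in zip(ordered, ordered[1:]):
        if a_off + a_len > b_off:
            raise ValueError(f"extent at {b_off} overlaps extent at {a_off}")


if __name__ == "__main__":
    check_extents([(0, 4096), (4096, 4096)], 8192)  # adjacent chunks: fine
    try:
        check_extents([(0, 8192), (4096, 4096)], 12288)  # same bytes twice
    except ValueError as err:
        print("rejected:", err)
```

Adjacent extents pass, since touching is not overlapping; only a list that maps the same physical bytes twice, or whose total length is wrong, is rejected.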

(Oh, why did these problems surface in the second half of any given week? By default, Windows defragments the system drive at 1am every Wednesday, or as soon as possible after that.)