Performing mounts securely on user-owned directories

While working on a feature for snapd, we needed to perform a "secure bind mount". In this context, "secure" meant:

- The source and/or target of the mount is owned by a less privileged user.
- User processes will continue to run while we're performing the mount (so solutions that involve suspending all user processes are out).
- While we can't prevent the user from moving the mount point, they should not be able to trick us into mounting to locations they don't control (e.g. by replacing the path with a symbolic link).

The main problem is that the mount system call uses string path names to identify the mount source and target. While we can perform checks on the paths before the mounts, we have no way to guarantee that the paths don't point to another location by the time we reach the mount() system call: a classic time-of-check to time-of-use race condition.

One suggestion was to modify the kernel to add an MS_NOFOLLOW flag to prevent symbolic link attacks. This turns out to be harder than it would appear, since the kernel is documented as ignoring any flags other than MS_BIND and MS_REC when performing a bind mount. So even if a patched kernel recognised MS_NOFOLLOW, a caller would have no way to distinguish its behaviour from that of an unpatched kernel, which would silently ignore the flag. Fixing this properly would probably require a new system call, which is a rabbit hole I don't want to dive down.

So what can we do using the tools the kernel gives us? The common way to reuse a reference to a file between system calls is the file descriptor. We can securely open a file descriptor for a path using the following algorithm:

1. Break the path into segments, and check that none are empty, ".", or "..".
2. Open the root directory with open("/", O_PATH|O_DIRECTORY).
3. Open the first segment with openat(parent_fd, "segment", O_PATH|O_NOFOLLOW|O_DIRECTORY).
4. Repeat for each of the remaining segments, closing parent descriptors as needed.

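The descriptor-walking algorithm above can be sketched in a few lines. This is only an illustration in Python, not snapd's actual code: the function names `split_segments` and `secure_open_dir` are hypothetical, and it assumes a Linux system where `os.O_PATH` is available. Python's `os.open(..., dir_fd=...)` maps onto the openat system call.

```python
import os


def split_segments(path):
    """Split an absolute path into segments, rejecting unsafe ones.

    Hypothetical helper: rejects relative paths and any empty, "." or ".."
    segment, as described in step 1 of the algorithm.
    """
    if not path.startswith("/"):
        raise ValueError(f"path must be absolute: {path!r}")
    if path == "/":
        return []
    segments = path[1:].split("/")
    for seg in segments:
        if seg in ("", ".", ".."):
            raise ValueError(f"unsafe path segment {seg!r} in {path!r}")
    return segments


def secure_open_dir(path):
    """Return an O_PATH descriptor for `path` without following symlinks.

    Each segment is opened relative to its parent descriptor, so a
    symbolic link swapped in at any level makes the open fail instead
    of redirecting us somewhere else.
    """
    segments = split_segments(path)
    fd = os.open("/", os.O_PATH | os.O_DIRECTORY)
    try:
        for seg in segments:
            # Equivalent to openat(fd, seg, O_PATH|O_NOFOLLOW|O_DIRECTORY).
            child = os.open(
                seg,
                os.O_PATH | os.O_NOFOLLOW | os.O_DIRECTORY,
                dir_fd=fd,
            )
            os.close(fd)  # the parent is no longer needed
            fd = child
    except BaseException:
        os.close(fd)
        raise
    return fd
```

The caller owns the returned descriptor and is responsible for closing it. If any segment turns out to be a symbolic link or a non-directory, the open raises `OSError` rather than following it.
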
Now we just need to find a way to use these file descriptors with the mount system call. I came up with two strategies to achieve this.

Use the current working directory

The first idea I tried was to make use of the fact that the mount system call accepts relative paths. We can use the fchdir system call to change to a directory identified by a file descriptor, and then refer to it as ".". Putting those together, we can perform a secure bind mount as a multi-step process:

1. fchdir to the mount source directory.
2. Perform a bind mount from "." to a private stash directory.
3. fchdir to the mount target directory.
4. Perform a bind mount from the private stash directory to ".".
5. Unmount the private stash directory.

While this works, it has a few downsides:

- It requires a third intermediate location to stash the mount.
- It could interfere with anything else that relies on the working directory.
- It also only works for directory bind mounts,…