This is the first part of a series talking about the approach flatpak takes to security and sandboxing.
First of all, a lot of people think of container technology like docker, rkt or systemd-nspawn when they think of linux sandboxing. However, flatpak is fundamentally different to these in that it is unprivileged.
What I mean is that all of the above run as root, and to use them you either have to be root or have access that is equivalent to root. For instance, if you have access to the docker socket then you can get a full root shell with a command like:
docker run -t -i --privileged -v /:/host fedora chroot /host
Flatpak instead runs everything as the regular user. To do this it uses a project called bubblewrap which is like a super-powered version of chroot, only you don’t have to be root to run it.
Bubblewrap can do more than just change the root; it lets you construct a custom filesystem mount tree for the process. Additionally it lets you create namespaces to further isolate things from the host. For instance, if you use --unshare-pid then your process will not see any processes from outside the sandbox.
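To make the mount tree and the PID namespace a bit more concrete, here is a minimal sketch of a bubblewrap call. It assumes bwrap is installed and that the host allows it to run. The function name run_in_pid_sandbox and the BWRAP override variable are made up for this illustration; the options used (--ro-bind, --dev, --proc, --unshare-pid) are regular bwrap options.

```shell
# Command to invoke; can be overridden (for example with "echo") for a dry run.
BWRAP="${BWRAP:-bwrap}"

# Run a command with a read-only view of the host filesystem, a minimal /dev,
# a fresh /proc and its own PID namespace.
run_in_pid_sandbox() {
    "$BWRAP" \
        --ro-bind / / \
        --dev /dev \
        --proc /proc \
        --unshare-pid \
        "$@"
}
```

Calling run_in_pid_sandbox ps ax should then list only the processes started inside the sandbox, because the new /proc belongs to the new PID namespace.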
Now, chroot is a root-only operation. How can it be that bubblewrap lets you do the same thing but doesn’t require root privileges? The answer is that it uses unprivileged user namespaces.
Inside such a user namespace you get a lot of capabilities that you don’t have outside it, such as creating new bind mounts or calling chroot. However, in order to be allowed to use this you have to set up a few process limits. In particular you need to set a process flag called PR_SET_NO_NEW_PRIVS. This causes all forms of privilege escalation (like setuid) to be disabled, which means the normal ways to escape a chroot jail don’t work.
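The effect of the flag can be observed from a shell. The sketch below is only an illustration: bubblewrap sets the flag itself through prctl, whereas this example uses the setpriv tool from util-linux as a stand-in, and it assumes a kernel new enough (4.10 or later) to expose the NoNewPrivs field in /proc/PID/status. Both function names are hypothetical.

```shell
# Print the no_new_privs flag (0 or 1) found in a status file.
# Defaults to the status of the reading process, which inherits the flag
# from the calling shell. Prints nothing if the kernel lacks the field.
no_new_privs_status() {
    awk '/^NoNewPrivs:/ { print $2 }' "${1:-/proc/self/status}"
}

# Run a command with the flag set, using setpriv from util-linux.
# The flag is inherited by all children and cannot be cleared again.
run_without_new_privs() {
    setpriv --no-new-privs "$@"
}
```

For example, run_without_new_privs grep NoNewPrivs /proc/self/status should report a value of 1, and a setuid binary started from such a process would run without gaining any extra privileges.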
Actually, I lied a bit above. We do use unprivileged user namespaces if we can, but many distributions disable them. The reason is that user namespaces open up a whole new attack surface against the kernel, allowing an unprivileged user access to lots of things that may not be perfectly adapted to user access. For instance CVE-2016-3135 was a local root exploit which used a memory corruption in an iptables call. This is normally only accessible by root, but user namespaces made it user exploitable.
If user namespaces are disabled, bubblewrap can be built as a setuid helper instead. This still only lets you use the same features as before, and in many ways it is actually safer this way, because only a limited subset of the full functionality is exposed. For instance you cannot use bubblewrap to exploit the iptables bug above because it doesn’t set up iptables (and if it did it wouldn’t pass untrusted data to it).
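Which of the two modes applies on a given machine can be guessed from a shell. The following is a rough heuristic, not a definitive test: it only looks at two sysctl files (user/max_user_namespaces, and the Debian-specific kernel/unprivileged_userns_clone) and at the setuid bit of the bwrap binary. Other restrictions, such as security modules or a kernel built without user namespace support, are not covered. The function names and the optional directory argument are assumptions made for this sketch.

```shell
# Heuristic: succeed unless a known sysctl says that unprivileged user
# namespaces are disabled. The optional argument replaces /proc/sys,
# which makes the check easy to try against a fake directory.
userns_allowed() {
    sysctl_root="${1:-/proc/sys}"
    max_file="$sysctl_root/user/max_user_namespaces"
    clone_file="$sysctl_root/kernel/unprivileged_userns_clone"
    # A limit of 0 namespaces per user means they are turned off.
    if [ -r "$max_file" ] && [ "$(cat "$max_file")" = "0" ]; then
        return 1
    fi
    # Debian-style switch: 0 means unprivileged use is disabled.
    if [ -r "$clone_file" ] && [ "$(cat "$clone_file")" = "0" ]; then
        return 1
    fi
    return 0
}

# Succeed if the bwrap found in PATH has the setuid bit set.
# Returns 2 if no bwrap is installed at all.
bwrap_is_setuid() {
    bwrap_path=$(command -v bwrap) || return 2
    [ -u "$bwrap_path" ]
}
```

If userns_allowed fails and bwrap_is_setuid succeeds, the installed bubblewrap is most likely running in the setuid helper mode.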
Long story short, flatpak uses bubblewrap to create a filesystem namespace for the sandbox. This starts out with a tmpfs as the root filesystem, and in this we bind-mount read-only copies of the runtime on /usr and the application data on /app. Then we mount various system things like a minimal /dev, our own instance of /proc and symlinks into /usr from /lib and /bin. We also enable all the available namespaces so that the sandbox cannot see other processes/users or access the network.
On top of this we use seccomp to filter out syscalls that are risky, for instance ptrace, perf, and recursive use of namespaces, as well as weird network families like DECnet.
In order for the application to be able to write data anywhere we bind mount $HOME/.var/app/$APPID/ into the sandbox, but this is the only persistent writable location.
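The setup described above can be approximated by hand with bubblewrap. This is a simplified sketch and not what flatpak literally executes: the seccomp filter is left out, the function name flatpak_like_sandbox and its arguments (a runtime directory, an application directory, an application id) are hypothetical, and mounting the data directory at the same path inside the sandbox is an assumption of this example. Bubblewrap already starts from an empty tmpfs root, so that step needs no option.

```shell
# Command to invoke; can be overridden (for example with "echo") for a dry run.
BWRAP="${BWRAP:-bwrap}"

# usage: flatpak_like_sandbox RUNTIME_DIR APP_DIR APP_ID COMMAND [ARGS...]
flatpak_like_sandbox() {
    runtime_dir=$1
    app_dir=$2
    app_id=$3
    shift 3

    # The only persistent writable location; it must exist before binding.
    data_dir="$HOME/.var/app/$app_id"
    mkdir -p "$data_dir"

    # Read-only runtime on /usr and application on /app, symlinks from /lib
    # and /bin into /usr, a minimal /dev, a private /proc, the writable data
    # directory, and all available namespaces unshared.
    # Note: many 64-bit runtimes would also need a /lib64 symlink.
    "$BWRAP" \
        --ro-bind "$runtime_dir" /usr \
        --ro-bind "$app_dir" /app \
        --symlink usr/lib /lib \
        --symlink usr/bin /bin \
        --dev /dev \
        --proc /proc \
        --bind "$data_dir" "$data_dir" \
        --unshare-all \
        "$@"
}
```

Setting BWRAP=echo before calling the function prints the resulting argument list instead of starting a sandbox, which is a convenient way to inspect what would be mounted where.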
In this sandbox we then spawn the application (after having dropped all increased permissions). This is a very limited environment, and there isn’t much the application can do. In the next part of this series we’ll start looking into how things can be opened up to allow the app to do more.
Hm, would it make sense to use bubblewrap to run a web service?
Say I have nginx running and passing along requests to a set of custom processes (e.g. a PHP process interpreting a specific script or Python running a Django WSGI script), each of which I’d ideally want to have sandboxed in such a manner that if someone manages to break into one of them, they’re stuck there?
Or perhaps run something like Postgres – if someone breaks into that, they can read the data in the database, but that’s it.
I think most Linux distributions have a weak form of permission management by creating a separate user for each service, but I wonder if a better sandbox model could end up being the norm. The user thing really seems like a hack.
Ole: For sure, bubblewrap is useful not only for desktop apps, although for servers that are launched and operated by the sysadmin you could also use tools like systemd namespacing configuration, systemd-nspawn or even docker. It’s up to you to pick what fits you best.
One very nice area for bubblewrap is for containing builds, for instance in CI systems or just for general paranoia.