Designing for Sandboxes

One of the things I covered in my talk at Scale 17x is that a number of upcoming platform features are inevitable.

One of those is application sandboxing.

But not every component within an application is created equal or deserves equal access to user data and system configuration. Building the next big application increasingly requires thinking about how you segment it into security domains.

Given the constraints of our current operating systems, that generally means processes. Google’s Chrome was one of the first major applications to do this. The Chrome team created a series of processes, each focused on different features. Each of those processes had capabilities (such as network or GPU access) removed to reduce the damage of an attack.

Recently Google released sandboxed-api, which is an interesting idea around automatically sandboxing libraries on Linux. While interesting, limiting myself to designs that are Linux only is not currently realistic for my projects.

Since I happen to work on an IDE, one of the technologies I’ve had to become familiar with is Microsoft’s Language Server Protocol. It’s a design for worker processes to provide language-specific features.

It usually works like this:

  • Spawn a worker process, with a set of pipe()s for stdin/stdout you control
  • Use JSONRPC over the pipe()s with some well-formatted JSON commands
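The two steps above can be sketched in plain POSIX C. This is only an illustration under a few assumptions: `spawn_worker`, `frame_message` and `demo_roundtrip` are hypothetical names, `cat` stands in for a real language server, and the framing is the `Content-Length` header that the Language Server Protocol puts in front of each JSON-RPC payload.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn argv[0] with its stdin/stdout connected to pipes we control.
 * Returns the child's pid, or -1 on failure. */
static pid_t
spawn_worker (char *const argv[], int *to_worker, int *from_worker)
{
  int in_pipe[2];   /* parent writes, worker reads on stdin  */
  int out_pipe[2];  /* worker writes on stdout, parent reads */
  pid_t pid;

  assert (argv != NULL && argv[0] != NULL);

  if (pipe (in_pipe) != 0)
    return -1;
  if (pipe (out_pipe) != 0)
    {
      close (in_pipe[0]);
      close (in_pipe[1]);
      return -1;
    }

  pid = fork ();
  if (pid == 0)
    {
      /* Child: wire the pipes to stdin/stdout, then become the worker. */
      dup2 (in_pipe[0], STDIN_FILENO);
      dup2 (out_pipe[1], STDOUT_FILENO);
      close (in_pipe[0]);
      close (in_pipe[1]);
      close (out_pipe[0]);
      close (out_pipe[1]);
      execvp (argv[0], argv);
      _exit (127);
    }

  /* Parent: close the worker's ends (and ours too if fork failed). */
  close (in_pipe[0]);
  close (out_pipe[1]);
  if (pid < 0)
    {
      close (in_pipe[1]);
      close (out_pipe[0]);
      return -1;
    }
  *to_worker = in_pipe[1];
  *from_worker = out_pipe[0];
  return pid;
}

/* Frame one JSON-RPC payload with a Content-Length header.
 * Returns the framed length, or -1 if buf is too small. */
static int
frame_message (const char *json, char *buf, size_t buflen)
{
  int n = snprintf (buf, buflen, "Content-Length: %zu\r\n\r\n%s",
                    strlen (json), json);
  return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}

/* Send one framed request to `cat` (a stand-in worker) and check that
 * it comes back intact. Returns 0 on success. */
static int
demo_roundtrip (void)
{
  char *argv[] = { "cat", NULL };
  char msg[128], back[128];
  int to_worker, from_worker, len, status;
  ssize_t got = 0, r;
  pid_t pid;

  len = frame_message ("{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"ping\"}",
                       msg, sizeof msg);
  if (len < 0)
    return -1;
  pid = spawn_worker (argv, &to_worker, &from_worker);
  if (pid < 0)
    return -1;
  if (write (to_worker, msg, (size_t) len) != len)
    return -1;
  close (to_worker);  /* EOF tells the stand-in worker to exit */
  while ((r = read (from_worker, back + got, sizeof back - (size_t) got)) > 0)
    got += r;
  close (from_worker);
  if (waitpid (pid, &status, 0) != pid)
    return -1;
  return (got == len && memcmp (msg, back, (size_t) len) == 0) ? 0 : -1;
}
```

A real client would keep both pipes open and parse replies incrementally rather than reading until EOF.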

This design can be good for sandboxing because it lets you spawn subprocesses that have reduced system capabilities, makes it easy to clean up after them, and provides an IPC format. Despite having written jsonrpc-glib and a number of helpers to make writing JSON from C rather clean, I’m still unhappy with it for a number of reasons. Those reasons range from performance to correctness to the brittleness of nonconforming implementations.

I’d like to use this design in more than just Builder, but those applications are more demanding. They require passing FDs across the process boundary. (Also, I’m sick of hand-writing JSON RPCs and I don’t want to do that anymore.)

Thankfully, we’ve had this great RPC system for years that fits the bill if you reuse the serialization format: DBus.

  • No ties to a DBus daemon
  • GDBus in GLib has a full implementation that plays well with async/sync code
  • gdbus-codegen can generate our RPC stubs and proxies
  • Well defined interfaces in XML files
  • Generated code does type enforcement to ensure contracts
  • We can easily pass FDs across the process boundary, useful for memfd/tmpfs/shm
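To make the FD-passing point concrete, here is a sketch of the underlying kernel mechanism: a file descriptor sent over a Unix socket as SCM_RIGHTS ancillary data. This is not the GDBus API (with GDBus you would attach the FD to a message and let the library do this for you), and `send_fd`, `recv_fd` and `demo_fd_passing` are hypothetical helper names.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Properly aligned buffer for one control message carrying one int. */
typedef union
{
  struct cmsghdr align;
  char buf[CMSG_SPACE (sizeof (int))];
} FdControl;

/* Send fd across sock, along with one dummy payload byte. */
static int
send_fd (int sock, int fd)
{
  char byte = 'F';
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  FdControl control;
  struct msghdr msg;
  struct cmsghdr *cmsg;

  assert (sock >= 0 && fd >= 0);
  memset (&control, 0, sizeof control);
  memset (&msg, 0, sizeof msg);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control.buf;
  msg.msg_controllen = sizeof control.buf;

  cmsg = CMSG_FIRSTHDR (&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));
  memcpy (CMSG_DATA (cmsg), &fd, sizeof (int));

  return sendmsg (sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive an fd sent by send_fd(). Returns the new fd, or -1. */
static int
recv_fd (int sock)
{
  char byte;
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  FdControl control;
  struct msghdr msg;
  struct cmsghdr *cmsg;
  int fd;

  memset (&control, 0, sizeof control);
  memset (&msg, 0, sizeof msg);
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control.buf;
  msg.msg_controllen = sizeof control.buf;

  if (recvmsg (sock, &msg, 0) != 1)
    return -1;
  cmsg = CMSG_FIRSTHDR (&msg);
  if (cmsg == NULL ||
      cmsg->cmsg_level != SOL_SOCKET ||
      cmsg->cmsg_type != SCM_RIGHTS)
    return -1;
  memcpy (&fd, CMSG_DATA (cmsg), sizeof fd);
  return fd;
}

/* Pass the read end of a pipe over a socketpair and read through the
 * received descriptor. Returns 0 on success. */
static int
demo_fd_passing (void)
{
  int sv[2], p[2], received;
  char buf[6] = { 0 };
  int ok;

  if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) != 0 || pipe (p) != 0)
    return -1;
  if (write (p[1], "hello", 5) != 5 || send_fd (sv[0], p[0]) != 0)
    return -1;
  received = recv_fd (sv[1]);
  if (received < 0)
    return -1;
  ok = read (received, buf, 5) == 5 && strcmp (buf, "hello") == 0;

  close (received);
  close (p[0]);
  close (p[1]);
  close (sv[0]);
  close (sv[1]);
  return ok ? 0 : -1;
}
```

In a real worker setup the two socket ends would live in different processes; the demo keeps both in one process only to stay short.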

To set up the sandboxes, we can use tools like flatpak-spawn or bwrap on Linux to restrict process capabilities before launching the target process. Stdin/stdout is left untouched so that we can communicate with the subprocess even after capabilities are dropped.
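As an illustration of what launching through bwrap can look like, the sketch below only builds the argument vector; you would then hand it to whatever spawns the worker. The policy shown (read-only root, no network, die with the parent) is an assumption chosen for the example, and `build_sandbox_argv` and the worker path are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill argv with a bwrap command line that wraps worker_path.
 * Returns the number of arguments (excluding the trailing NULL),
 * or -1 if argv is too small. */
static int
build_sandbox_argv (const char *worker_path,
                    const char **argv,
                    size_t argv_len)
{
  /* Example policy only: adjust the flags to what the worker needs. */
  const char *prefix[] = {
    "bwrap",
    "--ro-bind", "/", "/",   /* whole filesystem, read-only        */
    "--unshare-net",         /* no network access                  */
    "--die-with-parent",     /* clean up if the application exits  */
  };
  size_t n = sizeof prefix / sizeof prefix[0];

  assert (worker_path != NULL && argv != NULL);

  if (argv_len < n + 2)
    return -1;
  memcpy (argv, prefix, sizeof prefix);
  argv[n] = worker_path;
  argv[n + 1] = NULL;
  return (int) (n + 1);
}
```

Because nothing here redirects stdin or stdout, the worker keeps the pipes it inherited and the RPC channel keeps working inside the sandbox.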

Before I (re)settled on DBus, I tried a number of other prototypes. That included writing an interface language/codegen for JSONRPC, using libvarlink, Thrift’s c_glib compiler and protobufs. I’m actually surprised I was happiest with the DBus implementation, but that’s how it goes sometimes.

While I don’t expect a lot of sandboxing around our Git support in Builder, I did use it as an opportunity to prototype what this multi-process design looks like. If you’re interested in checking it out, you can find the worker sources here.

What excites me about the future is how this type of design could be used to sandbox image loaders like GdkPixbuf. One could quite trivially have an RPC that passes a sealed memfd for compressed image contents and returns a memfd for the decoded framebuffer or pre-compressed GPU textures. Keep that process around a little while to avoid fork()/exec() overhead, and we gain a bit of robustness with very few performance drawbacks.
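For readers who have not used sealed memfds, here is a minimal Linux-only sketch of the sending side, assuming glibc 2.27 or newer for `memfd_create()`. The helper names and the "image-data" label are made up for the example, and it says nothing about how GdkPixbuf would actually be wired up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy data into a new memfd and seal it so that neither side can
 * modify or resize it afterwards. Returns the fd, or -1 on failure. */
static int
create_sealed_memfd (const void *data, size_t len)
{
  int fd;

  assert (data != NULL);

  fd = memfd_create ("image-data", MFD_CLOEXEC | MFD_ALLOW_SEALING);
  if (fd < 0)
    return -1;

  if (write (fd, data, len) != (ssize_t) len ||
      fcntl (fd, F_ADD_SEALS,
             F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_SEAL) != 0)
    {
      close (fd);
      return -1;
    }

  return fd;
}

/* Check that the sealed memfd rejects writes but can still be read.
 * Returns 0 on success. */
static int
demo_sealed_memfd (void)
{
  const char payload[] = "compressed image bytes";
  char readback[sizeof payload] = { 0 };
  int fd = create_sealed_memfd (payload, sizeof payload);
  int ok;

  if (fd < 0)
    return -1;

  /* With F_SEAL_WRITE set, write() fails with EPERM. */
  ok = write (fd, "x", 1) == -1 && errno == EPERM;
  ok = ok && pread (fd, readback, sizeof readback, 0)
               == (ssize_t) sizeof readback;
  ok = ok && memcmp (payload, readback, sizeof payload) == 0;

  close (fd);
  return ok ? 0 : -1;
}
```

The receiving process can check the seals with `fcntl (fd, F_GET_SEALS)` before mapping the contents, so it knows the data cannot change underneath it.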

Compiler complexities

The other day I found myself perusing some disassembly to get an idea of the code’s complexity. I do that occasionally because I find it the quickest way to determine if something is out of whack.

While I was there, I noticed a rather long _get_type() function. More importantly, I only saw one exit point (retq instruction on x86_64).

That piqued my interest because _get_type() functions are expected to be fast. In the fast-path (when they’ve already been registered), I’d expect them to check for a non-zero static GType type_id and if so return it. That is just a handful of instructions at most, so what gives?
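As a rough sketch of that expected shape, here is the pattern in plain C11. It is not what GLib's type macros expand to: `TypeId`, `next_type_id` and `foo_register_type` are hypothetical stand-ins for GType and the real registration machinery.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for GType and the type registry. */
typedef size_t TypeId;

static _Atomic TypeId next_type_id = 1;
static _Atomic TypeId foo_type_id;   /* 0 means "not registered yet" */

TypeId foo_get_type (void);

/* Slow path: only reached until the first registration succeeds. */
static TypeId
foo_register_type (void)
{
  TypeId fresh = atomic_fetch_add (&next_type_id, 1);
  TypeId expected = 0;

  if (atomic_compare_exchange_strong (&foo_type_id, &expected, fresh))
    return fresh;

  /* Another thread registered first; use its id (ours is wasted). */
  return expected;
}

TypeId
foo_get_type (void)
{
  /* Fast path: one load, one test, one return. */
  TypeId id = atomic_load_explicit (&foo_type_id, memory_order_acquire);

  if (id != 0)
    return id;

  return foo_register_type ();
}
```

After the first call, every later call takes the load-test-return branch at the top, which is why only a handful of instructions are expected there.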

The first thing that came to mind is that I use -O0 -g -fno-omit-frame-pointer on my local builds so that I get good debug symbols and ensure that Linux-perf can unwind the stack when profiling. So let’s disable that.

Now I’ve got everything rebuilt with the defaults (-O2 -fomit-frame-pointer). I see a couple of exit points, but still what appears to be too many for the fast path. And what is this __stack_chk_fail@plt I see?

A quick search yields some information about -fstack-protector, which is a recent (well, half-decade-old) compiler feature that employs various tricks to detect stack corruption. Distributions seem to enable this by default using -fstack-protector-strong. That tries to only add stack checks to code it thinks is accessing stack-allocated data.

So a quick recompile with -fno-stack-protector to disable the security feature and, sure enough, a proper fast path emerges. We drop from 15 instructions (with 2 conditional jumps) to 5 instructions (with 1 conditional jump).

So the next question is: “Well okay, is this worth making faster?”

That’s a complicated question. The code was certainly faster before that feature was enabled by default. The more instructions you execute, the more code has to be loaded into the instruction cache (which is very small). To load instructions means pushing others out. So even if the code itself isn’t much faster, it can prevent the surrounding code from being faster.

But furthermore, we do a lot of _get_type() calls. They are used when doing function precondition checks, virtual methods, signals, type checks, interface lookups, checking a type for conformance, altering the reference count, marshaling data, accessing properties, … you get the idea.

So I mucked through about 5 different ways of trying to make things faster without disabling the stack protector, without much luck. The way types are registered accesses some local data via macros. Nothing seemed to get me any closer to those magic 5 instructions.

GCC, since version 4.4, has allowed you to disable the stack-protector on a per-function basis. Just add __attribute__((optimize("no-stack-protector"))) to the function prototype.

#if G_GNUC_CHECK_VERSION(4, 4)
# define G_GNUC_NO_STACK_PROTECTOR \
  __attribute__((optimize("no-stack-protector")))
#else
# define G_GNUC_NO_STACK_PROTECTOR
#endif

GType foo_get_type (void) G_GNUC_NO_STACK_PROTECTOR;

Now we get back to our old (faster) version of the code.

 48 8b 05 b9 15 20 00    mov    0x2015b9(%rip),%rax   
 48 85 c0                test   %rax,%rax
 74 0c                   je     400ac8
 48 8b 05 ad 15 20 00    mov    0x2015ad(%rip),%rax   
 c3                      retq   

But what’s the difference you ask?

I put together a test that I can run on a number of machines. It was faster in every case (as expected), in some by as much as 50% (likely due to various caches).

Arch    OS            Type                      Speedup
ARM     Ubuntu 14.04  Odroid X2                 +24%
x86_64  Fedora 28     X1 Carbon Gen3            +25.5%
x86_64  Fedora 28     i7 gen7 NUC               +12.25%
x86_64  Fedora 28     Surface Book 2 i7 (gen8)  +12.5%
x86_64  Fedora 27     Onda Tablet               +50.6%
x86     Debian        netbook                   +12.5%
It’s my opinion that one place where it makes sense to keep things very fast (and reduce instruction cache blow-out) is a type system. That code gets run a lot intermixed between all the code you really care about.

GTask and Threaded Workers

GTask is super handy, but it’s important you’re very careful with it when threading is involved. For example, the normal threaded use case might be something like this:

state = g_slice_new0 (State);
state->frob = get_frob_state (self);
state->baz = get_baz_state (self);

task = g_task_new (self, cancellable, callback, user_data);
g_task_set_task_data (task, state, state_free);
g_task_run_in_thread (task, state_worker_func);

The idea here is that you create your state upfront, and pass that state to the worker thread so that you don’t race accessing self-> fields from multiple threads. The “shared nothing” approach, if you will.

However, even this isn’t safe if self has thread usage requirements. For example, if self is a GtkWidget or some other object that is expected to only be used from the main-thread, there is a chance your object could be finalized in a thread.

Furthermore, the task_data you set could also be finalized in the thread. If your task data also holds references to objects which have thread requirements, those too can be unref’d from the thread (thereby cascading through the object graph should you hit this undesirable race).

This can happen when you call g_task_return_pointer() or any of the other return variants from the worker thread. That call will queue the result to be dispatched to the GMainContext that created the task. If your CPU task-switches to that context’s thread, and it finishes with the task before the worker thread has released its reference, the worker thread ends up holding the last reference to the task.

In that situation self and task_data will both be finalized in that worker thread.
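The hazard is easier to see in a stripped-down model. The sketch below uses pthreads and a hand-rolled reference count instead of GObject and GTask, and every name in it is hypothetical. It forces the unlucky ordering by having the main thread release its reference first.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
  atomic_int ref_count;
  pthread_t  owner;               /* thread that should finalize us */
  bool      *finalized_on_owner;  /* written when finalization runs */
} Object;

static void
object_unref (Object *obj)
{
  /* Whoever drops the last reference runs the finalizer. */
  if (atomic_fetch_sub (&obj->ref_count, 1) == 1)
    {
      *obj->finalized_on_owner =
        pthread_equal (pthread_self (), obj->owner) != 0;
      free (obj);
    }
}

static void *
worker (void *data)
{
  /* The worker releases its reference after the main thread did. */
  object_unref (data);
  return NULL;
}

/* Returns true if the object was finalized on the main thread. */
static bool
demo_finalized_on_main (void)
{
  bool on_owner = true;
  Object *obj = malloc (sizeof *obj);
  pthread_t thread;
  int rc;

  assert (obj != NULL);
  atomic_init (&obj->ref_count, 2);  /* one for main, one for worker */
  obj->owner = pthread_self ();
  obj->finalized_on_owner = &on_owner;

  object_unref (obj);  /* main thread is done with it first */
  rc = pthread_create (&thread, NULL, worker, obj);
  assert (rc == 0);
  pthread_join (thread, NULL);

  return on_owner;
}
```

Here `demo_finalized_on_main()` returns false: the finalizer ran on the worker thread, which is exactly what must not happen to something like a GtkWidget.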

Addressing this in Builder

We already have various thread pools in Builder for work items so it would be nice if we could both fix the issue in our usage as well as unify the thread pools. Additionally, there are cases where it would be nice to “chain task results” to avoid doing duplicate work when two subsystems request the same work to be performed.

So now Builder has IdeTask which is very similar in API to GTask but provides some additional guarantees that would be very difficult to introduce back into the GTask implementation (without breaking semantics). We do this by passing the result and the thread’s last ownership reference to the IdeTask back to the GMainContext at the same time, ensuring the last unref happens in the expected context.
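The idea of handing the reference back together with the result can be modeled without GLib. The following is a sketch with pthreads, not IdeTask's actual code: the `Mailbox` plays the role of the GMainContext queue, and all names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
  atomic_int ref_count;
  pthread_t  owner;
  bool      *finalized_on_owner;
} Task;

/* Stand-in for the main context's queue: result and reference travel
 * together. */
typedef struct
{
  pthread_mutex_t lock;
  pthread_cond_t  cond;
  bool            ready;
  int             result;
  Task           *task_ref;
} Mailbox;

typedef struct
{
  Task    *task;
  Mailbox *mailbox;
} WorkerArgs;

static void
task_unref (Task *task)
{
  if (atomic_fetch_sub (&task->ref_count, 1) == 1)
    {
      *task->finalized_on_owner =
        pthread_equal (pthread_self (), task->owner) != 0;
      free (task);
    }
}

static void *
worker (void *data)
{
  WorkerArgs *args = data;

  pthread_mutex_lock (&args->mailbox->lock);
  args->mailbox->result = 42;
  args->mailbox->task_ref = args->task;  /* transfer, never unref here */
  args->mailbox->ready = true;
  pthread_cond_signal (&args->mailbox->cond);
  pthread_mutex_unlock (&args->mailbox->lock);
  return NULL;
}

/* Returns true if the task was finalized on the main thread. */
static bool
demo_handoff_finalizes_on_main (void)
{
  bool on_owner = false;
  Mailbox mailbox = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                      false, 0, NULL };
  Task *task = malloc (sizeof *task);
  WorkerArgs args;
  pthread_t thread;
  int rc;

  assert (task != NULL);
  atomic_init (&task->ref_count, 2);  /* one for main, one for worker */
  task->owner = pthread_self ();
  task->finalized_on_owner = &on_owner;
  args.task = task;
  args.mailbox = &mailbox;

  task_unref (task);  /* main drops its own reference early */
  rc = pthread_create (&thread, NULL, worker, &args);
  assert (rc == 0);

  pthread_mutex_lock (&mailbox.lock);
  while (!mailbox.ready)
    pthread_cond_wait (&mailbox.cond, &mailbox.lock);
  pthread_mutex_unlock (&mailbox.lock);

  task_unref (mailbox.task_ref);  /* last unref happens on main */
  pthread_join (thread, NULL);

  return on_owner && mailbox.result == 42;
}
```

Even with the same unlucky ordering as before, the final unref now runs on the thread that owns the object, because the worker never drops its reference itself.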

While I was at it, I added a bunch of debugging tools for myself which caught some bugs in my previous usage of GTask. Bugs were filed, GTask has been improved, yadda yadda.

But I anticipate the threading situation to remain in GTask and you should be aware of that if you’re writing new code using GTask.

Secure from whom

Side-channel attacks are a thing, this is true. But they also cost a lot of time and money to develop. If you want something that can be applied to more than just a single target, that cost explodes. That is why the two most common places where side-channel attacks are developed are nation states and universities specializing in that research.

What is not helpful, beyond informing people of the existence of them, is to simply state that side-channel attacks exist and therefore nothing is secure. Even more so without demonstrating how they are real-world applicable and how that information should alter the direction of development.

Security is a nebulous word and is almost always used as an incomplete sentence. It lacks an important qualifier: secure from whom?

Creating a side-channel attack almost always requires knowing a bit about your target. Doubly so for something as delicate as timing attacks. Also, don’t forget to take into account development time for said attacks. If the software changes at a rate faster than you can develop your exploit, well, that’s noteworthy.

Making it more difficult for an application to extract information from outside the containment zone does in fact protect the user from practical attacks which do not require a nation state to develop. It also most certainly cannot protect you from everything. Such is the reality of existence. I’m not safe from a meteorite hitting me but my risk assessment shows everything is going fine and it is not worth the mental stress to worry about.

So in summation, I’m far more interested in focusing on our ability to get security fixes out to users in a timely fashion. Herd immunity can work for software too.