Concurrency, Parallelism, I/O Scheduling, Thread Pooling, and Work-Stealing

Around 15 years ago I worked on some interesting pieces of software which are unfortunately still not part of my daily toolbox in GNOME and GTK programming. At the time, the world was going through changes in how we wrote threaded software, particularly with regard to wait-free programming and thread pooling.

New trends like work-stealing were becoming a big thing, and machines with multiple CPUs spread across multiple NUMA nodes were becoming easy to acquire. We were all learning that CPU frequency scaling was going to stall and that heterogeneous CPUs were going to be the “Next Big Thing”.

To handle those changes gracefully, we were told that we needed to write software differently. Intel pushed that forward with Threading Building Blocks (TBB). Python had been doing things with Twisted, which had an ergonomic API, and of course “Stackless Python” and the like were a thing. Apple eventually came out with Grand Central Dispatch. Microsoft Research had the Concurrency and Coordination Runtime (CCR), which I think came out of robotics work.

Meanwhile, we had GThreadPool, which honestly hasn’t changed that much since. Eventually the _async/_finish pairing we’re familiar with today emerged, followed by GTask to provide a more ergonomic API on top of it.

A bit before the GTask we all know today, I had written libgtask, which was more of a Python Twisted-style API providing “deferreds” and nice ways to combine them. That didn’t make it into GLib, unfortunately. To add to the pain, what we got in the form of GTask has some serious side-effects which make it unsuitable as a general construct, in my humble opinion.

After realizing libgtask was eclipsed by GLib itself, I set off on another attempt in the form of libiris. That took a different approach, trying to merge the mechanics of CCR (message passing, message ports, coordination arbiters, actors), the API ergonomics of Python Twisted’s Deferred, and Apple’s NSOperation. It provided a wait-free work-stealing scheduler to boot. But it shared a major drawback with GLib’s GTask (beyond the correctness bugs that plague GTask today): the thread pool can only process its work queue, so if you need to combine that with poll() or GSources attached to a GMainContext, your code-flow has to repeatedly bounce between threads.

This is because you can simplify a thread pool worker to while (has_work()) do_work ();. Any GSource or I/O either needs to bounce to the main thread where the application’s GMainContext lives, or to another I/O worker thread if doing synchronous I/O. On Linux, for a very long time, synchronous I/O was the “best” option if you wanted to actually use the page cache provided by the kernel, so that’s largely what GLib and GIO do.
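To make that concrete, here is a minimal sketch of the kind of worker loop described above (my own illustration built on GAsyncQueue, not GLib’s actual GThreadPool internals). The blocking pop is exactly the problem: while the thread is parked inside the queue’s mutex/condition wait, it has no way to also poll() file-descriptors or iterate a GMainContext.

```c
#include <glib.h>

typedef struct {
  void (*func) (gpointer data);
  gpointer data;
} WorkItem;

/* Sentinel pushed onto the queue to ask a worker to exit. */
static WorkItem stop_item;

/* A simplification of "while (has_work()) do_work();". The thread blocks
 * inside g_async_queue_pop() until an item arrives, so it cannot wait on
 * file-descriptors at the same time. */
static gpointer
worker_thread (gpointer user_data)
{
  GAsyncQueue *queue = user_data;

  for (;;)
    {
      WorkItem *item = g_async_queue_pop (queue);

      if (item == &stop_item)
        break;

      item->func (item->data);
      g_free (item);
    }

  return NULL;
}
```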

The reason we couldn’t do something else is that removing an item from the global work queue required acquiring a GMutex and blocking until an item was available. On Linux at least, we didn’t have APIs that let us wait on a futex while also poll()ing a set of file-descriptors. (As a note for posterity, FUTEX_FD was a thing for a short while, but it was never really usable.)

In the years that followed, we got a lot of new features in Linux that allowed improvements to the stack. We got signalfd() to be able to poll() on incoming Unix signals. We got eventfd(), which provides a rather low-overhead way to notify coordinating code through a poll()able file-descriptor. Then EFD_SEMAPHORE was added so that we could implement sem_t-like behavior with a file-descriptor. It even supports O_NONBLOCK.

The EFD_SEMAPHORE case is so interesting to me because it provides the ability to do something similar to what IRIX did 20+ years ago: a pollable semaphore! Look for usnewpollsema() if you’re interested.
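As a rough illustration of the idea (my own sketch, not code from GLib or the library described below): an eventfd created with EFD_SEMAPHORE acts like a counting semaphore whose “wait” is a read() that decrements the counter by one, and because it is just a file-descriptor you can park in poll() until the count is non-zero.

```c
#include <sys/eventfd.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* A counting semaphore as a file-descriptor: with EFD_SEMAPHORE each
   * successful read() returns 1 and decrements the counter by 1. */
  int sem_fd = eventfd (0, EFD_SEMAPHORE | EFD_NONBLOCK | EFD_CLOEXEC);
  uint64_t val = 2;

  /* "Post" twice by adding 2 to the counter. */
  if (write (sem_fd, &val, sizeof val) != sizeof val)
    return 1;

  /* "Wait" twice: poll() until the counter is non-zero, then read() one unit. */
  for (int i = 0; i < 2; i++)
    {
      struct pollfd pfd = { .fd = sem_fd, .events = POLLIN };

      if (poll (&pfd, 1, -1) > 0 &&
          (pfd.revents & POLLIN) &&
          read (sem_fd, &val, sizeof val) == sizeof val)
        printf ("acquired one unit\n");
    }

  close (sem_fd);
  return 0;
}
```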

There was even some work in GLib toward supporting epoll(), but that seems to have stalled out. And honestly, making it use io_uring might be the smarter option now.

After finishing the GTK 4 port of GNOME Builder I realized how much code I’ve been writing in the GAsyncReadyCallback style. I don’t particularly love it and I can’t muster the energy to write more code in that style. I feel like I’m writing top-half/bottom-half interrupt handlers, yet I lack the precision to pin things to a particular thread, and I have to be very delicate with ownership to tightly control object finalization. That last part is so bad that we basically don’t use GLib’s GTask in Builder, in favor of IdeTask, which is smart about when and where to release object references so that finalization is guaranteed to happen on a particular thread.
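For anyone who hasn’t written much code in this style, here is a small example of what the split looks like, using the real g_file_load_contents_async()/_finish() pair from GIO; the “state” object it carries across is purely illustrative. The call site is the top half, the callback is the bottom half, and ownership has to be threaded between them by hand.

```c
#include <gio/gio.h>

/* Bottom half: runs later, on whatever GMainContext was thread-default when
 * the _async call was made. Releasing user_data here is our responsibility. */
static void
load_contents_cb (GObject      *object,
                  GAsyncResult *result,
                  gpointer      user_data)
{
  GFile *file = G_FILE (object);
  g_autofree char *contents = NULL;
  gsize length = 0;
  g_autoptr(GError) error = NULL;

  if (g_file_load_contents_finish (file, result, &contents, &length, NULL, &error))
    g_print ("Read %" G_GSIZE_FORMAT " bytes\n", length);
  else
    g_warning ("Failed: %s", error->message);

  /* Illustrative state object (assumed to be a GObject owned by the caller). */
  g_object_unref (user_data);
}

/* Top half: start the operation and return immediately. */
static void
begin_load (GFile   *file,
            GObject *state)
{
  g_file_load_contents_async (file,
                              NULL /* cancellable */,
                              load_contents_cb,
                              g_object_ref (state));
}
```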

One thing that all these previous projects, and the many hundreds of thousands of lines of async-oriented C code written since, have taught me is that all these components are interlinked. Trying to solve one of them without the others is restrictive.

That brings me to 2022, where I’m foolishly writing another C library that solves this for the ecosystem I care most about, GTK. Its goal is to provide the pieces I need both in applications and in toolkit authoring. For example, if I were writing another renderer for GTK, this time I’d probably build it on something like this. Given those requirements, some restrictions follow.

  • I know that I need GMainContext to work on thread pool threads if I have any hope of intermixing workloads I care about on worker threads.
  • I 100% don’t care about solving distributed computing workloads or HTTP socket servers. I refuse to race for requests-per-second at the cost of API ergonomics or power usage.
  • I know that I want work stealing between worker threads and that it should be wait-free to avoid lock contention.
  • Worker threads should try to pin similar work to their own thread to avoid bouncing between NUMA nodes. This increases data cache hits and reduces the chance of allocations and frees moving between threads (something you want to minimize).
  • I know that if I have 32 thread pool threads and 4 jobs are added to the global queue, I don’t want 32 threads waking up from poll() to try to take those work items.
  • The API needs to be simple, composable, and obvious. There is certainly a lot of inspiration to be taken from things like std::future, JavaScript’s Promise, and Python’s work on deferred execution.
  • GObject has a lot of features, and because of that it goes to great lengths to provide correctness. That comes at a great cost for something you want to feel like a new “primitive”, so avoiding it makes sense. We can still use GTypeInstance though, following in the footsteps of GStreamer and GTK’s Scene Kit.
  • Cancellation is critical: not only should it cancel the work you created, it should also propagate to the things your work depended on, unless other non-cancelled work still depends on them.

I’ve built most of the base of this already. The primary API you interact with is DexFuture, and there are many types of futures. There are futures for I/O (using io_uring), futures for Unix signals, and a pollable semaphore where dex_semaphore_wait() returns a future. It can also wrap _async/_finish pairs and provide the result as a future. The thread pools are efficient at waiting on work (staying dormant until necessary, with minimal thread wake-ups) while still coordinating to complete work items.
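To give a flavor of how this is meant to compose, here is a rough sketch. dex_semaphore_wait() is named above, but the chaining helper and callback shape shown here are my assumptions about the API, modeled on Twisted’s Deferred and JavaScript’s Promise, rather than its settled interface.

```c
#include <glib.h>
#include <libdex.h> /* header name assumed */

/* Hypothetical continuation: invoked on a pool thread once the awaited
 * future resolves. Returning another future would keep the chain going;
 * the exact callback signature is an assumption for illustration. */
static DexFuture *
do_guarded_work (DexFuture *future,
                 gpointer   user_data)
{
  g_print ("semaphore acquired, doing work\n");
  return NULL;
}

static void
queue_guarded_work (DexSemaphore *semaphore)
{
  /* dex_semaphore_wait() returning a future is described in the post;
   * dex_future_then() is a hypothetical chaining helper. */
  DexFuture *acquired = dex_semaphore_wait (semaphore);
  DexFuture *chained = dex_future_then (acquired, do_guarded_work, NULL, NULL);

  /* The chained future could itself be awaited, combined, or cancelled. */
  (void) chained;
}
```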

There is still a lot of work to do, and so far I’ve only focused on the abstractions and the Linux implementations. But I feel like there is promise (no pun intended) and I’m hoping to port swaths of code in Builder to this in the coming months. If you’d like to help, I’d be happy to have you, especially if you’d like to focus on alternate DexAioBackends, a DexSemaphore built on something other than eventfd() for BSD/Solaris/macOS/Windows, or additional future types. Additionally, work on making GLib’s GMainContext support io_uring directly would be appreciated.

You can find the code here, but it will likely change in the near future.