Designing software that is both fast and available to higher-level languages generally means you end up writing C. There are guiding principles you should follow when doing so to ensure that you give your software the best chance for success.
Let's start with a look into my past. When I was employed at MongoDB a few years back, I was tasked with writing the modern, fast, C client library. The secondary goal was to speed up the other drivers that used bits of C “for performance reasons”. However, the C components in those drivers were only a meager 1-2x faster than implementing the same thing in the higher-level language. This is what happens when we fail to see the big picture, which is the first step in understanding.
A thunk in and out of the language runtime is reasonably cheap these days. But when you do lots of them, they quickly add up. (A thunk is simply a wrapper around a call to another function, possibly with setup/teardown and marshaling to perform.)
In the example above, the reason for the meager gains in performance from C was simple. The drivers were encoding/decoding each individual BSON document by calling into C (and then back up into python, ruby, etc) rather than handling the documents as a set. Imagine you get a result from the server containing 1000 documents. In this case you’d cross the language barrier at least 1000 times. Now what if, while decoding those documents, you have to create structures that are owned by the language runtime (calling back into the runtime to allocate)? Your 1000 crossings could have just turned into 3000 at best, and more likely, many times worse.
However, if you simply dive into C once to decode the whole stream, you cut a large number of thunks out of the equation. If instead you move the whole database client, socket handling, encryption, etc into C, you can avoid even more thunks. This is why wrapping the libmongoc C library in python was closer to 10-15x faster than the native python version compared to the meager 1-2x faster with per-document decoding.
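To make the arithmetic concrete, here is a toy model rather than real driver code. The `Boundary` class and both `decode_*` functions are hypothetical stand-ins: each `Boundary.call()` represents one thunk into C, and nothing here touches BSON or libmongoc.

```python
# Toy model of the language boundary. Every Boundary.call() stands in
# for one thunk from the language runtime into C. All names here are
# hypothetical; this only counts crossings, it does not measure time.


class Boundary:
    """Counts simulated crossings between the runtime and C."""

    def __init__(self):
        self.crossings = 0

    def call(self, func, *args):
        self.crossings += 1
        return func(*args)


def decode_one(raw):
    # Stand-in for a C routine that decodes a single document.
    return {"value": raw}


def decode_many(raws):
    # Stand-in for a C routine that decodes a whole server reply.
    return [decode_one(raw) for raw in raws]


reply = list(range(1000))  # pretend the server sent 1000 documents

# Per-document decoding: one crossing for every document.
per_document = Boundary()
docs_a = [per_document.call(decode_one, raw) for raw in reply]

# Batched decoding: one crossing for the whole reply.
batched = Boundary()
docs_b = batched.call(decode_many, reply)

print(per_document.crossings, batched.crossings)  # prints "1000 1"
```

Both paths produce the same documents; only the number of crossings differs. The model ignores the extra callbacks into the runtime for allocation, which would push the per-document count higher still.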
By maximizing the time you are in C, you give yourself the largest potential for performance improvement. Where you draw your language boundary is just as important as the data structures you choose.
We use GObject across the board in GNOME. And for a living piece of software that is nearly old enough to drink, that is a good thing. Like all type systems designed in the 1990s, it has some warts. But generally, it gets the job done and provides the inter-language features we want with very little effort.
But you need to be careful when designing APIs if you intend for them to be accessible from multiple languages. For example, if your API relies on gsignal (what other languages often call “events”), you should at least think about the costs.
For example, imagine that the callback connected to your signal is in python. Your C code knows nothing of python and therefore likely does not hold the GIL (global interpreter lock). That means that when your signal fires and tries to thunk to the python callback, it must first marshal parameters (possibly copying), and then acquire the python GIL (generally fine). Now imagine you do this many times per second because your design emits signals everywhere (GtkWidget, for example). All of a sudden you are entering/exiting the language barrier many times in rapid succession. The thunks add up.
A closely related and equally important concern is the use of main loop timeouts and idle callbacks. In GLib-based code, we generally use some form of g_timeout_add_full() or g_idle_add_full(), each of which registers a new GSource. First off, for every one of these we have to wake up the main loop, mutate data structures, detect level-triggered poll events, and possibly destroy the source at the end of the main loop cycle (for one-shot sources). And that doesn’t even include the callback into your language runtime. Now imagine you do this on every frame of an animation. Now imagine that for every frame of the animation you update multiple actors in your scene graph. All of a sudden your thunk costs went through the roof, and you haven’t done any actual work yet.
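The per-frame cost can be sketched with another toy model. `MainLoop`, `Actor`, and the frame and actor counts below are all hypothetical and do not use GLib; each dispatch stands in for one one-shot source being registered, dispatched into the language runtime, and destroyed.

```python
# Toy model of one-shot main loop sources. Each dispatch stands in for
# one source that is added, called back into the runtime, and removed.
# MainLoop and Actor are hypothetical; no GLib API is used here.


class MainLoop:
    def __init__(self):
        self.sources = []
        self.dispatches = 0

    def add_idle(self, callback):
        self.sources.append(callback)

    def iterate(self):
        # One-shot semantics: every pending source runs once, then is dropped.
        pending, self.sources = self.sources, []
        for callback in pending:
            self.dispatches += 1
            callback()


class Actor:
    def __init__(self):
        self.x = 0

    def advance(self):
        self.x += 1


FRAMES = 60
N_ACTORS = 50

# Design A: every actor registers its own source on every frame.
loop_a = MainLoop()
actors_a = [Actor() for _ in range(N_ACTORS)]
for _ in range(FRAMES):
    for actor in actors_a:
        loop_a.add_idle(actor.advance)
    loop_a.iterate()

# Design B: a single source per frame advances every actor.
loop_b = MainLoop()
actors_b = [Actor() for _ in range(N_ACTORS)]


def advance_all():
    for actor in actors_b:
        actor.advance()


for _ in range(FRAMES):
    loop_b.add_idle(advance_all)
    loop_b.iterate()

print(loop_a.dispatches, loop_b.dispatches)  # prints "3000 60"
```

Both designs leave every actor in the same state after 60 frames, but the first pays for 3000 source round trips and the second for 60. In this sketch the loop over actors still runs in python; in a real scene graph that loop would ideally live on the C side, so only one crossing per frame remains.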
Designing for success
So, how do we design APIs that don’t suffer from these issues? Well, first off, really consider whether the use of gsignal is beneficial.
- Avoid gsignal when a single callback function will suffice. gsignal synchronizes all emissions via the global lock used to locate signal information. Obviously we can optimize this, but I’m not sure it changes anything.
- If you find yourself calling into C functions in a tight loop, stop and think about what you are doing.