Let's take a Fedora beta for a spin and see what looks out of place!
This time, I made the mistake of typing. Oops!
Input Performance
That led me to an unreasonable amount of overhead in ibus-daemon. Seems odd that something so critical to our typing latency would require so many CPU samples.
So I dove into the source of ibus to see what is going on and made a few observations.
- It is doing a lot of D-Bus. That’s okay though, because if it were busy in logic that’d be a different thing to optimize.
- It uses a GDBusMessageFilterFunction on the GDBusConnection which can run on the D-Bus I/O thread. At that point the GDBusMessages it wants to modify are locked and require a copy.
- After recording the messages on the private ibus D-Bus socket I can tell it’s doing a lot of communication with GNOME Shell, which implements the Wayland text protocol for applications.
- There seems to be a huge amount of memory allocation involved, seen when running sysprof-cli --memprof, and that needs investigating.
- Quite a bit of time is spent in GVariantTypeInfo and hash tables.
TL;DR
Before I go any further, there are many merge requests awaiting review now which improve the situation greatly.
Preparing by Prototyping
The first thing I do when diving into a problem like this is to record a few things with Sysprof and just have a look around the flamegraphs. That helps me get a good feel for the code and how the components interact.
The next thing I do after finding what I think is a major culprit, is to sort of rewrite a minimal version of it to make sure I understand the problem with some level of expertise.
Last time, I did this by writing a minimal terminal emulator so I could improve VTE (and apparently that made people … mad?). This time, my eyes were set on GVariantBuilder.
A faster GVariantBuilder
I was surprised to learn that GVariantBuilder does not write to a serialized buffer while building. Instead, it builds a tree of GVariant*.
That sets off my antennae because I know from experience that GVariant uses GBytes and GBytes uses malloc and that right there is 3 separate memory allocations (each aligned to 2*sizeof(void*)) just to maybe store a 4-byte int32.
So for a rectangle of (iiii) serialized to GVariant and built with GVariantBuilder you’re looking at 3 * 4 + 3 minimum allocations, if you ignore the handful of others in the builder structures too.
So first step, prototype the same API but writing to a single buffer.
It’s about 3-4x faster in a tight loop. There are things you could do with a different API that would easily get you into the 10x range, but what use is that if no software out there is using it?
So that is what we’re shooting for to guide our changes. I’ll stop when I get in that range as it’s likely diminishing returns.
A faster (slower) GVariantBuilder
I hesitate to just replace something like GVariantBuilder for the same reason we don’t just rewrite the world in a new language. There are so many corner cases and so much history built up here that if you can get close without a complete rewrite, that is the less risky solution.
Instead, let’s just shake out a bunch of things and see how fast we can make the venerable GVariantBuilder.
GVariantTypeInfo
I saw a lot of time spent in g_variant_type_info_get(), specifically inside a GHashTable. A quick printf() later we can see that for some reason the GVariantTypeInfo which contains information about a GVariantType (a glorified type string) is being re-created over and over.
There is a cache, so that cache must not be working very well. Specifically, I see types being created a dozen times or more on every key-press or release.
Caches generally don’t work well if you release items from them when the last reference is dropped. This is because the only way for them to last across uses is for multi-threading to be in play keeping them alive.
So here we patch it to keep a number of them alive to improve the serial use-case.
Side Quests through HashTables
While looking at that code it was clear we were making copies of strings to do lookups in the hashtable. It’s not great when you have to malloc() to do a hashtable lookup, so maybe we can write custom hash/equal functions to handle that.
Yup. Few more percent.
Keeping Release Code Fast
We have knobs to keep release builds fast. Sometimes that means compiling out expensive assertions which are there to help us catch bugs during development or to make the next bug hunter’s life easier.
For example, in my career I’ve done that to assert correctness of B-trees on every change, ensure file-system state is correct, validate all sorts of exotic data-structures and more. These are things that are helpful in debug builds that you very much don’t want in release code.
So I made some old-GVariant code which didn’t really follow the standard macros we use elsewhere do what you’d expect. Here and here.
Format String Fast-Paths
Sometimes you have a code-path that is so extremely hot that it warrants some sort of fast-path in and out. The format strings you use for building a GVariant using GVariantBuilder are one such example. We can see that normally that function will malloc() and it would be a whole lot faster if it didn’t in our fast-path.
So make it do that.
Another fun fact about GVariantType is that a simple definite type is a sub-type of itself. That is fancy lingo for saying "i" is an "i".
Given that all GVariants are eventually made up of these sorts of basic definite types, they probably get hammered a lot when type-checking things at runtime.
In fact, yes, yes they do.
Doing Two Things at Once
It is extremely common in the GVariant
code-base to walk strings multiple times. Sometimes, those are type strings, sometimes those are user-provided UTF-8 encoded strings.
The API boundary checks for valid UTF-8 which requires walking the string and pulling it into memory. You can see that as basically:
g_return_val_if_fail (g_utf8_validate (string, -1, &endptr), NULL);
Then later, it does strlen() on string. Maybe just skip the strlen() and do (endptr - string).
Playing Nicely with Malloc
GVariantBuilder does this nice thing where, if you’re building an array of children, it will grow your destination array by powers of two so that it doesn’t need to call realloc() each time. That is great for dynamically sized arrays.
At the end, it does this even nicer thing in that it shrinks the allocation to only what you needed by calling realloc() again with the final size. Nice not to waste space, surely!
However, in extremely common cases we know the exact number of children up front so the realloc() is pure overhead. Just check for that condition and save some more cycles.
Locks, Everywhere
I was surprised to find that GVariant uses the g_bit_lock()/g_bit_unlock() API to manage whether or not a GVariant is “floating”. I have things to say about floating references but that is for another day.
What I noticed though is that I see way too many g_bit_lock() samples in my profiles considering there is virtually never a valid reason to have a race condition in sinking a GVariant’s floating reference. In fact, g_variant_take_ref() doesn’t even bother with it.
So save a heaping amount of time doing nothing by being clever.
More Ref Counting
With g_variant_ref_sink() faster, we still spend a lot of samples in g_bytes_ref(). That sounds odd because we really only need a single reference and then ultimately g_bytes_unref() when the variant is released.
A quick look and yup, we’re referencing only to immediately unref afterwards. Pick up a few more percent by transferring ownership to the callee.
Reducing Allocations
We still haven’t gotten rid of all those allocations and that makes me sad. Let’s do something about those 3 allocations for a simple 32-bit integer.
Step One, GBytes
Our GVariant structures are referencing a GBytes which makes buffer management easy. It allows you to slice into smaller and smaller buffers and reference count the same larger data buffer safely.
What if we made it embed small allocations within its own allocation using flexible arrays? We just need to be careful to try to match the whole 2*sizeof(void*) expectations that people get from malloc() so we don’t break any code with that assumption.
Finally, from 3 allocations down to 2. This one was a whopping 10% off wallclock time alone.
But that of course should give us a good idea of what would happen if we did the same thing to GVariant directly to go from 2 allocations to 1.
Step Two, GVariant
So here we go again and that’s another 10% on top of the previously stacked patches. Not bad.
Bonus Round
Calling g_variant_builder_init() makes a copy of the GVariantType because theoretically you could free that type string before you call g_variant_builder_end(). Realistically, nobody does that and it probably is just there to deal with bindings which may use g_variant_builder_new().
Either way, we don’t break ABI here, so add g_variant_builder_init_static() with different requirements for a few more percent by doing less malloc().
This is the one spot where if you want a few extra cycles, you gotta change your code.
How’d we do?
For my contrived benchmarks this gets us a 2.5x speedup over what was released in GNOME 47 when everything is built in “release mode fashion”. If you implement the bonus round you can get closer to 2.75x.
If you’d like to recreate the benchmark, just create a loop with a few million iterations generating some rather small/medium sized GVariants with GVariantBuilder.
Still a bit off from that 4x though, so if you like challenges perhaps replacing GVariantBuilder with a serialized writer is just the project for you.