debuginfod-enabled Sysprof

Based on some initial work by Barnabás Pőcze, Sysprof has gained support for symbolizing stack traces using debuginfod.

If you don’t want to install debuginfo packages for your entire system but still want really useful function names, this is for you. The system-configured debuginfod servers will provide you access to those debuginfo-enabled ELF binaries so that we can discover the appropriate symbol names.

Much like in gdb, you can cancel the downloads if you don’t care, and those symbols will fall back to the “In File lib.so+offset” naming you’re used to.

A screenshot showing a popover of symbols being downloaded.

I expect a bit more UI work around this before GNOME 48, such as preferences to disable it or to configure alternate debuginfod servers.

Happy profiling!

GCC/Clang Indirect Functions

There is this extremely neat feature in GCC 10+ and Clang 9+ that allows you to determine which function should be patched in at runtime. Process startup code will call your user-defined function to figure out which function should be used for the environment.

For example, take the following:

/* Hypothetical fast and slow implementations plus a selector flag. */
static gboolean fast_version (const char *str, gssize len, const char **end);
static gboolean slow_version (const char *str, gssize len, const char **end);
static gboolean go_fast;

/* The resolver: returns a pointer to the implementation to bind.
 * It runs once, when the dynamic loader resolves the symbol. */
static gboolean (*choose_utf8_impl (void)) (const char *, gssize, const char **)
{
  if (go_fast)
    return fast_version;
  else
    return slow_version;
}

gboolean
g_utf8_validate (const char  *str,
                 gssize       len,
                 const char **end)
  __attribute__((ifunc("choose_utf8_impl")));

This can be handy when you’re dealing with complex branching based on the system. For example, imagine you need to check if you’re running under Valgrind as one does occasionally to satisfy an incorrect assumption about accessing past heap boundaries.

The typical way to handle this is to #include "valgrind.h" and branch based on if (RUNNING_ON_VALGRIND) ....

However, that inlines a bunch of assembly you probably don’t want in your ultra-fast new utf8-validation implementation. It might be better to just do that once at process startup and save yourself a bunch of CPU cycles.
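As a rough sketch (reusing the hypothetical names from the example above), the Valgrind check fits naturally into the resolver, so the client-request machinery only runs once when the loader binds the symbol:

#include "valgrind.h"

static gboolean (*choose_utf8_impl (void)) (const char *, gssize, const char **)
{
  /* Evaluated a single time at startup rather than on every call. */
  if (RUNNING_ON_VALGRIND)
    return slow_version;

  return fast_version;
}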

UTF-8 validation performance

One of the things that GLib-based APIs got right from early on is that generally speaking, our API boundaries expect UTF-8. When transitioning between API boundaries, UTF-8 validation is a common procedure.

Therefore, it is unsurprising that GLib does a lot of UTF-8 string validation of all sorts of string lengths.

The implementation we’ve had is fairly straightforward to read and reason about. Though it is not winning any performance awards.

UTF-8 validation performance has been on my radar for some time. Back when I was working on VTE performance it was an issue there, and recently with GVariant I had to look for ways to mitigate it.

There are many options out there to borrow from or take inspiration from, with varying levels of complexity.

Deterministic Finite Automaton

Björn Höhrmann wrote a branchless UTF-8 validator years ago which is the basis for many UTF-8 validators. Though once you really start making it integrate well with other API designs you’ll often find you need to add branches outside of it.

This DFA approach can be relatively fast, but sometimes slower than the venerable g_utf8_validate().

VTE uses a modified version of this currently for its UTF-8 validation. It allows some nice properties when processing streaming input like that of a PTY character stream.

Chromium uses a modified version of this which removes the ternary for even better performance.

simdutf

One of the more boundary-pushing implementations is simdutf used by the WebKit project. Simdutf is written in C++ and performance-wise it is extremely well regarded.

Being written in C++ does present challenges to integrate into a C abstraction library, though not insurmountable.

The underlying goal of a SIMD (single-instruction, multiple-data) approach is to look at more than a single character at a time using wide-registers and clever math.

c-utf8

Another implementation out there is the c-utf8 project. It is used by dbus-broker and libvarlink.

Like simdutf it attempts to take advantage of SIMD by using some very simple compiler built-ins. It is also simple C code (no assembly) with just a few GCC-isms, all of which can be removed or replaced with alternatives for other compilers such as MSVC.
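To give a feel for the approach, here is a minimal sketch of the word-at-a-time idea (not the actual c-utf8 code): load eight bytes at once and test every high bit in a single operation.

#include <string.h>
#include <glib.h>

/* If any of the eight bytes has its high bit set, the chunk is not
 * pure ASCII and we fall back to byte-wise validation. */
static gboolean
chunk_is_ascii (const char *p)
{
  guint64 w;

  memcpy (&w, p, sizeof w);   /* compilers lower this to a single load */

  return (w & G_GUINT64_CONSTANT (0x8080808080808080)) == 0;
}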

What to Choose

Just to throw up an idea to see if it could be shot down, I suggested we look at these for GLib (issue #3481) to replace what we have.

I haven’t been able to get as good of performance with the DFA approach as the SIMD approach. So I felt I needed to choose between c-utf8 and simdutf.

Given the ease of integration I went with the c-utf8 implementation. It required removing the case 0x01 ... case 0x7F: GCC-ism. Additionally I simplified a number of the macros which are part of sibling c-util projects in favor of GLib equivalents.

I opted to drop __builtin_assume_aligned(expr,n) and use __attribute__((aligned(n))) instead. It generates the same code on GCC but has the benefit of an MSVC equivalent, __declspec(align(n)), accepted in the same syntax position (before the declarator). That means a simple per-compiler preprocessor macro gets the same effect.
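The resulting macro looks something like this (a sketch, not the exact patch):

#if defined(_MSC_VER)
# define ALIGNED(n) __declspec(align(n))
#else
# define ALIGNED(n) __attribute__((aligned(n)))
#endif

/* Both compilers accept the qualifier before the declarator. */
static ALIGNED(16) const unsigned char table[16];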

Performance Numbers

The c-utf8 project has a nice benchmark suite available for testing UTF-8 validation performance. It benchmarks against a trivial implementation, which matches what GLib has at the time of writing, and against strlen(), which is an extremely optimized piece of the stack.

I modeled the same tests but using g_utf8_validate() with my integration patches to ensure we were getting the same performance as c-utf8, which we are.

For long ASCII strings, you can expect more than 10x speed-ups. (On my Xeon it was 13x and on my Apple Silicon it was 12x). For long multi-byte strings (2-to-4 byte sequences) you can expect 4x-6x improvements.

I wouldn’t expect too much difference for small strings as you won’t see much of a benefit until you can get to word-size operations. As long as we don’t regress there, I think this is a good direction to go in.

Merge Request !4319.

Messaging Needs

Let’s take a Fedora beta for a spin and see what looks out of place!

This time, I made the mistake of typing. Oops!

Input Performance

That led me to an unreasonable amount of overhead in ibus-daemon. Seems odd that something so critical to our typing latency would require so many CPU samples.

So I dove into the source of ibus to see what is going on and made a few observations.

  • It is doing a lot of D-Bus. That’s okay though, because if it were busy in logic that’d be a different thing to optimize.
  • It uses a GDBusMessageFilterFunction on the GDBusConnection which can run on the D-Bus I/O thread. At this point the GDBusMessage instances it wants to modify are locked and require a copy (see the sketch after this list).
  • After recording the messages on the private ibus D-Bus socket I can tell it’s doing a lot of communication with GNOME Shell, which implements the Wayland text protocol for applications.
  • There seems to be a huge number of memory allocations involved, visible when running sysprof-cli --memprof, and that needs investigating.
  • Quite a bit of time is spent in GVariantTypeInfo and hash tables.
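For context, a message filter looks roughly like this (a sketch, not the actual ibus code); because it runs on the D-Bus I/O thread, a locked message has to be copied before it can be modified:

#include <gio/gio.h>

static GDBusMessage *
filter_func (GDBusConnection *connection,
             GDBusMessage    *message,
             gboolean         incoming,
             gpointer         user_data)
{
  /* Called on the GDBus worker thread, not the main thread. */
  if (g_dbus_message_get_locked (message))
    {
      /* Locked messages are immutable; modifying one means copying
       * the whole message (headers, body, the lot) first. */
      GDBusMessage *copy = g_dbus_message_copy (message, NULL);
      g_object_unref (message);
      message = copy;
    }

  /* ... inspect or modify @message here ... */

  return message;
}

/* Installed with:
 *   g_dbus_connection_add_filter (connection, filter_func, NULL, NULL);
 */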

TL;DR

Before I go any further, there are many merge requests awaiting review now which improve the situation greatly.

Preparing by Prototyping

The first thing I do when diving into a problem like this is to record a few things with Sysprof and just have a look around the flamegraphs. That helps me get a good feel for the code and how the components interact.

The next thing I do after finding what I think is a major culprit, is to sort of rewrite a minimal version of it to make sure I understand the problem with some level of expertise.

Last time, I did this by writing a minimal terminal emulator so I could improve VTE (and apparently that made people … mad?). This time, my eyes were set on GVariantBuilder.

A faster GVariantBuilder

I was surprised to learn that GVariantBuilder does not write to a serialized buffer while building. Instead, it builds a tree of GVariant*.

That sets off my antennae because I know from experience that GVariant uses GBytes and GBytes uses malloc and that right there is 3 separate memory allocations (each aligned to 2*sizeof(void*)) just to maybe store a 4-byte int32.

So for a rectangle of (iiii) serialized to GVariant and built with GVariantBuilder you’re looking at 3 * 4 + 3 minimum allocations, if you ignore the handful of others in the builder structures too.
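The shape in question looks roughly like this (x, y, width and height being whatever your rectangle holds); each of the four child integers costs a GVariant, a GBytes and a heap buffer before the tuple itself is built:

GVariantBuilder builder;
GVariant *rect;

g_variant_builder_init (&builder, G_VARIANT_TYPE ("(iiii)"));
g_variant_builder_add (&builder, "i", x);
g_variant_builder_add (&builder, "i", y);
g_variant_builder_add (&builder, "i", width);
g_variant_builder_add (&builder, "i", height);
rect = g_variant_builder_end (&builder);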

So first step, prototype the same API but writing to a single buffer.

It’s about 3-4x faster in a tight loop. There are things you could do with a different API that would easily get you into the 10x range, but what use is that if no software out there is using it?

So that is what we’re shooting for to guide our changes. I’ll stop when I get in that range as it’s likely diminishing returns.

A faster (slower) GVariantBuilder

I hesitate to just replace something like GVariantBuilder for the same reason we don’t just rewrite the world in a new language. There are so many corner cases and history built up here that if you can get close without a complete rewrite, that is the less risky solution.

Instead, let’s just shake out a bunch of things and see how fast we can make the venerable GVariantBuilder.

GVariantTypeInfo

I saw a lot of time spent in g_variant_type_info_get() specifically inside a GHashTable. A quick printf() later we can see that for some reason the GVariantTypeInfo which contains information about a GVariantType (a glorified type string) is being re-created over and over.

There is a cache, so that cache must not be working very well. Specifically, I see types being created a dozen times or more on every key-press or release.

Caches generally don’t work well if you release items from them when the last reference is dropped. This is because the only way for them to last across uses is for multi-threading to be in play keeping them alive.

So here we patch it to keep a number of them alive to improve the serial use-case.

Side Quests through HashTables

While looking at that code it was clear we were making copies of strings to do lookups in the hashtable. It’s not great when you have to malloc() to do a hashtable lookup, so maybe we can write custom hash/equal functions to handle that.
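The trick is the usual one: make the lookup key a small stack-allocated struct carrying a pointer and a length so the hash and equality functions never need a NUL-terminated copy. A sketch with made-up names, not the actual patch:

#include <string.h>
#include <glib.h>

typedef struct {
  const char *str;
  gsize       len;
} TypeKey;

static guint
type_key_hash (gconstpointer key)
{
  const TypeKey *k = key;
  guint hash = 5381;

  /* djb2 over exactly @len bytes; no copy of the string required. */
  for (gsize i = 0; i < k->len; i++)
    hash = hash * 33 + (guchar) k->str[i];

  return hash;
}

static gboolean
type_key_equal (gconstpointer a, gconstpointer b)
{
  const TypeKey *ka = a;
  const TypeKey *kb = b;

  return ka->len == kb->len && memcmp (ka->str, kb->str, ka->len) == 0;
}

/* Lookups then happen with a key on the stack, no malloc() involved:
 *
 *   TypeKey key = { type_string, type_len };
 *   info = g_hash_table_lookup (cache, &key);
 */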

Yup. Few more percent.

Keeping Release Code Fast

We have knobs to keep release builds fast. Sometimes that means compiling expensive assertions out which are there to help us catch bugs during development or to make the next bug hunter’s life easier.

For example, in my career I’ve done that to assert correctness of B-trees on every change, ensure file-system state is correct, validate all sorts of exotic data-structures and more. These are things that are helpful in debug builds that you very much don’t want in release code.
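In GLib terms the knob is G_DISABLE_ASSERT, which compiles g_assert() away entirely. A sketch of the pattern, with the B-tree names being hypothetical:

static void
my_btree_insert (MyBTree *tree, gint key)
{
  /* ... perform the actual insertion ... */

  /* Walks the entire tree to verify its invariants.  Invaluable in
   * debug builds, compiled out completely when G_DISABLE_ASSERT is
   * defined for release builds. */
  g_assert (my_btree_is_consistent (tree));
}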

So I made some old GVariant code, which didn’t really follow the standard macros we use elsewhere, do what you’d expect. Here and here.

Format String Fast-Paths

Sometimes you have a code-path that is so extremely hot that it warrants some sort of fast-path in and out. The format strings you use for building a GVariant using GVariantBuilder are one such example. We can see that normally that function will malloc() and it would be a whole lot faster if it didn’t in our fast-path.

So make it do that.

Another fun fact about GVariantType is that a simple definite type is a sub-type of itself. That is fancy lingo for saying "i" is an "i".

Given that all GVariants are eventually made up of these sorts of basic definite types, they probably get hammered a lot when type-checking things at runtime.

In fact, yes, yes they do.

Doing Two Things at Once

It is extremely common in the GVariant code-base to walk strings multiple times. Sometimes, those are type strings, sometimes those are user-provided UTF-8 encoded strings.

The API boundary checks for valid UTF-8 which requires walking the string and pulling it into memory. You can see that as basically:

g_return_val_if_fail (g_utf8_validate (string, -1, &endptr), NULL);

Then later, it does strlen() on string. Maybe just skip the strlen() and do (endptr - string).
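In other words, something along these lines:

const char *endptr;
gsize length;

g_return_val_if_fail (g_utf8_validate (string, -1, &endptr), NULL);

/* g_utf8_validate() already walked to the terminating NUL, so the
 * length falls out for free instead of walking the string again. */
length = endptr - string;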

Playing Nicely with Malloc

GVariantBuilder does this nice thing where if you’re building an array of children, it will grow your destination array by powers of two so that it doesn’t need to call realloc() each time. That is great for dynamically sized arrays.

At the end, it does this even nicer thing in that it shrinks the allocation to only what you needed by calling realloc() again with the final size. Nice not to waste space, surely!

However, in extremely common cases we know the exact number of children up front so the realloc() is pure overhead. Just check for that condition and save some more cycles.
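Conceptually the fix is a one-line check (a sketch with made-up variable names, not the exact GLib code):

/* Only trim the allocation when we actually over-allocated.  When the
 * number of children was known up front, the buffer is already the
 * exact size and the final realloc() would be pure overhead. */
if (n_children < allocated)
  children = g_renew (GVariant *, children, n_children);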

Locks, Everywhere

I was surprised to find that GVariant uses the g_bit_lock()/g_bit_unlock() API to manage whether or not a GVariant is “floating”. I have things to say about floating references but that is for another day.

What I noticed though is that I see way too many g_bit_lock() samples in my profiles considering there is virtually never a valid reason to have a race condition in sinking a GVariant’s floating reference. In fact, g_variant_take_ref() doesn’t even bother with it.

So save a heaping amount of time doing nothing by being clever.

More Ref Counting

With g_variant_ref_sink() faster we still spend a lot of samples in g_bytes_ref(). That sounds odd because we really only need a single reference and then ultimately g_bytes_unref() when the variant is released.

A quick look and yup, we’re referencing only to immediately unref afterwards. Pick up a few more percent by transferring ownership to the callee.
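The pattern is the usual ownership hand-off (the helper names here are hypothetical):

/* Before: take a reference for the callee, then drop our own. */
variant_set_bytes (variant, g_bytes_ref (bytes));
g_bytes_unref (bytes);

/* After: transfer our reference directly to the callee, saving two
 * atomic operations per child. */
variant_set_bytes_take (variant, g_steal_pointer (&bytes));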

Reducing Allocations

We still haven’t gotten rid of all those allocations and that makes me sad. Let’s do something about those 3 allocations for a simple 32-bit integer.

Step One, GBytes

Our GVariant structures are referencing a GBytes which makes buffer management easy. It allows you to slice into smaller and smaller buffers and reference count the same larger data buffer safely.

What if we made it embed small allocations within its own allocation using flexible arrays? We just need to be careful to try to match the whole 2*sizeof(void*) expectations that people get from malloc() so we don’t break any code with that assumption.
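Roughly the idea, shown as a standalone sketch rather than the actual GBytes internals:

#include <string.h>
#include <glib.h>

typedef struct {
  gatomicrefcount ref_count;
  gsize           size;

  /* The payload follows the header within the same allocation.  Care
   * is needed to keep this offset at a multiple of 2 * sizeof (gpointer),
   * which is what callers implicitly expect from a malloc()'d buffer. */
  guint8          data[];
} InlineBytes;

static InlineBytes *
inline_bytes_new (gconstpointer data, gsize size)
{
  /* One allocation for header plus payload instead of two. */
  InlineBytes *bytes = g_malloc (sizeof (InlineBytes) + size);

  g_atomic_ref_count_init (&bytes->ref_count);
  bytes->size = size;

  if (size > 0)
    memcpy (bytes->data, data, size);

  return bytes;
}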

Finally, from 3 allocations to two. This one was a whopping 10% off wallclock time alone.

But that of course should give us a good idea of what would happen if we did the same thing to GVariant directly to go from 2 allocations to 1.

Step Two, GVariant

So here we go again and that’s another 10% on top of the previously stacked patches. Not bad.

Bonus Round

Calling g_variant_builder_init() makes a copy of the GVariantType because theoretically you could free that type string before you call g_variant_builder_end(). Realistically, nobody does that and it probably is just there to deal with bindings that may use g_variant_builder_new().

Either way, we don’t break ABI here, so add g_variant_builder_init_static() with different requirements and gain a few more percent by doing less malloc().

This is the one spot where if you want a few extra cycles, you gotta change your code.
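When the type is a compile-time constant the change in calling code is a one-liner (assuming the same arguments as g_variant_builder_init()):

GVariantBuilder builder;

/* The type string is static, so no defensive copy is needed; the
 * caller guarantees it outlives the builder. */
g_variant_builder_init_static (&builder, G_VARIANT_TYPE ("a{sv}"));
g_variant_builder_add (&builder, "{sv}", "some-key", g_variant_new_int32 (42));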

How’d we do?

For my contrived benchmarks this gets us a 2.5x speedup over what was released in GNOME 47 when everything is built in “release mode fashion”. If you implement the bonus round you can get closer to 2.75x.

If you’d like to recreate the benchmark, just create a loop with a few million iterations generating some rather small/medium sized GVariant with GVariantBuilder.
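A sketch of the kind of loop I mean; the exact shape of the variant doesn’t matter much as long as it is small-to-medium sized:

gint64 begin = g_get_monotonic_time ();

for (guint i = 0; i < 5000000; i++)
  {
    GVariantBuilder builder;
    GVariant *value;

    g_variant_builder_init (&builder, G_VARIANT_TYPE ("a{sv}"));
    g_variant_builder_add (&builder, "{sv}", "x", g_variant_new_int32 (i));
    g_variant_builder_add (&builder, "{sv}", "y", g_variant_new_int32 (i));
    value = g_variant_builder_end (&builder);

    g_variant_unref (g_variant_ref_sink (value));
  }

g_print ("%f seconds\n",
         (g_get_monotonic_time () - begin) / (double) G_USEC_PER_SEC);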

Still a bit off from that 4x though, so if you like challenges perhaps replacing GVariantBuilder with a serialized writer is just the project for you.

Threaded Spellchecking

Last I mentioned I was doing an ABI cleanup of libspelling as it was really just a couple hour hack that got used by people.

Part of that was to make way for performing spellchecking off the GTK thread. That just landed in today’s libspelling 0.3.0 release. In one contrived benchmark, spellchecking a 10,000 line document was 8x faster. YMMV.

One of the reasons it’s faster is we don’t need to use the GtkTextIter API. Pango provides all the things we need to do that without the PangoLayout overhead. So we just skip right past that and implement iterators off-thread.

I also extracted the GtkTextBuffer adapter so that we may add a GtkEditable adapter in the future. Internally, the adapters provide a series of callbacks to the SpellingEngine.

You might wonder what happens if you make an edit that collides with a region being spellchecked off thread. The answer is quite simple: it checks for collisions and either adapts by adjusting offsets or discards irreconcilable collisions.

Both Text Editor and Builder have been migrated to using libspelling instead of its ancestor code. If you’re running nightly Flatpak of those, take a crack at it and report bugs.

“It testing could use.”

Ptyxis in Fedora 40

Ptyxis has arrived in Fedora 40 thanks to my Red Hat colleague Nieves Montero Fernandez. You can sudo dnf install ptyxis to get it.

Discussions are ongoing to make Ptyxis the default terminal for Fedora Workstation (and Silverblue) though it has already been approved by FESCo. One benefit of this is that we can finally remove the downstream patches to gnome-terminal and return it to a more upstream experience.

Additionally, Ptyxis has landed in the developing CentOS Stream (c10s). Project Bluefin also ships Ptyxis as the default terminal.

Some distributions are configuring it with -Dgeneric=terminal which means it will look like a regular ol’ terminal without the Ptyxis name or branding. In fact it uses the previous “prompt” iconography.

If you ship GNOME on your distribution and want help switching to Ptyxis, feel free to reach out.

libspelling ABI cleanup

In the summer of ’23 I quickly put together a minimal library for spellchecking in GTK 4 from what I built for Text Editor and Builder. It has a fun little hybrid piecetable/B+Tree (bplus) which is probably the most complicated part of it.

And by quickly, I mean it was created over the course of about two hours because I got tired of hearing the same reasons why people weren’t porting to GTK 4.

Needless to say, not much thought went into naming things nicely for a library. But I’d like to progress towards actually committing to an ABI for libspelling and I’ve just made the changes I want in that.

Consider this a notice that the next release of libspelling will break ABI. However, it almost certainly won’t affect you. I don’t know of any consumers of it that are doing anything beyond connecting the GtkTextBuffer adapter whose API was untouched.

The changes were mostly cosmetic and will aid future contributors’ ability to understand the project.

As always, you can read the documentation.

Bisect’ing outside of the box

There is a sort of thrill to a bug hunt once you dig your heels in deep enough. This is one of those stories.

Earlier in the week a fellow Red Hatter messaged me about a bug in Ptyxis where if you right click on a tab, the tab stays in the active state. Clearly not a very good look. My immediate thought was that we don’t do anything special there, so maybe the issue is lower in the stack.

A screenshot of Ptyxis with 4 terminal tabs open. Two of which appear "focused" due to an overzealous active state on the tab widget.

Oopsie!

Libadwaita?

The first layer beneath that is of course libadwaita and so naturally I’ll test Text Editor to see if I get the same behavior. Lo-and-behold it exists there. Cool, file an issue in libadwaita while I simultaneously see if I can fix it.

Nothing stands out as too obvious, but I try a few things which make it slightly better, though I can still get into this state with enough attempts. Bummer. Also, Alice thinks the real issue is in GTK because we’ve seen this in a number of places. It appears to be something wrong with active state tracking.

GTK?

No worries, I know that code-base well, so we hop on over there and see what is going on. First things first though, where does active state tracking happen? A quick git grep later we see it is largely in gtkmain.c in response to incoming GdkEvent.

GTK or GDK?

Having written a large portion of the macOS back-end for GTK 4, I know that events can be extremely platform specific. So let’s cut this problem in half by determining if the issue really is on the GTK side or the GDK side. So we run the test again with GDK_BACKEND=x11 ptyxis -s and see if we can reproduce.

The issue appears to go away, lovely! Although, let’s test it on macOS too just to be certain. And again, it does not reproduce there. So that must mean the issue lies somewhere in gdk/wayland/ or related.

Regressions?

Another important detail that Alice mentioned was that it seemed to be a regression. That would be great because if so, it would mean that we can bisect to find an answer. However, since both gdk/wayland/ and compositors move forward at a somewhat lock-step pace, that can be quite difficult to prove reliably enough for bisect.

So instead I move on to something a bit whacky. I have a CentOS 7 installation here that I use to test that Ptyxis’ crazy glibc hacks work for ptyxis-agent. That has GNOME 3.28, which can run as a Wayland compositor if you install the appropriate session file. Thus I do, and what do you know, it has the same issue there. That was sure to hit the “old Wayland protocol” code-paths, which are a bit different from the newer extensions, and that was what I had hoped to test.

Either way, very old Mutter still equals broken. Womp.

Weston Enters the Ring

We have a “reference compositor” in the form of Weston which can help shake out protocol issues (assuming diligence toward Weston’s correctness is upheld). So I try running Ptyxis on my local machine inside a nested Weston instance (which appears to be 13.0.0). Surprisingly, things actually work!

Okay, but I still have that CentOS 7 VM running, I wonder if it has Weston available? It does not. Bummer. But I also have an Alma Linux 9 VM around that I use for testing occasionally and that must have it. Let’s try there.

In fact it does, but it’s version 8.0.0. Let’s see if it works there too!

A screenshot of GNOME Boxes running Alma Linux 9 in a GNOME session with a nested Weston compositor containing Ptyxis with 2 of 3 tabs broken.

Bisecting Weston

So now we know Weston 8.0.0 is broken and 13.0.0 works. Now we have something I can bisect!

So I fire up meson locally and ensure that my local build fails when it is 8.0.0. Indeed it does. And 13.0.0 appears to work. Bisect commence!

We find the commit which “fixes” things from a Weston point of view to be “Set grab client before calling grab functions“.

That gives me enough information to start commenting out lines of code in Mutter like the uncivilized monkey I am. Eventually, Shakespeare has been written and the bug goes away. Though certainly not in an upstream-able fashion. Bugs filed, lines of code pointed to, and here I hope for our lovely Mutter maintainers to take over.

Back to GDK

Lots of amazing things have happened since we got Wayland compositors. One of the more undeniable difficulties though has been equal event ordering amongst them. It’s almost certainly the case that other compositors are doing things in slightly different ways and it may not even be possible to change that based on their use-case/designs.

So some sort of mitigation will be required on the GDK side.

After some interpretation of what is really going on here, a very small patch to drop certain leave events.

And that’s the story of how I bisected Weston to find an issue in Mutter (and GDK).

Manuals on Flathub

Manuals contains the documentation engine from Builder as a standalone application. Not only does it browse documentation organized by SDK, but it can install additional SDKs too. This is done using the same techniques Builder uses to manage your project SDKs.

It should feel very familiar if you’re already using the documentation tooling in Builder.

In the past, we would just parse all the *.devhelp2 files up-front when loading. GMarkupParseContext is fast enough that it isn’t too much overhead at start-up for a couple hundred files.

However, once you start dealing with SDKs and multiple versions of all these files the startup performance can take quite a hit. So Manuals indexes these files into SQLite using GOM and performs queries using that instead. It conveniently makes cross-referencing easy too so you can jump between SDK revisions for a particular piece of documentation.

Enjoy!

Red Hat Day of Learning

Occasionally at Red Hat we have a “Day of Learning” where we get to spend time learning about technology of our choice.

I spent some time listening to various AI explanations which were suggested readings for the day. Nothing too surprising but also not exactly engaging to me. Maybe that’s because I grew up with a statistics professor for a father.

So while that was playing I spent a little time learning how the GitLab API works. Immediately it stood out that one of the primary challenges in presenting UI for such an API would be in bridging GListModel to their implementation.

So I spent a little time on the architecture for how you might want to do that in the form of a Gitlab-GLib library.

Pagination

There are essentially two modes of pagination supported by GitLab depending on the result set size. Both methods use HTTP headers to denote information about the result set.

Importantly, any design would want to have an indirection object (in this case GitlabListItem) which can have its resource dynamically loaded.

Resources (such as a GitlabProject, or GitlabIssue) are “views” into the JSON result set. They are only used for reading, not for mutating or creating resources on the API server.

Offset-based Pagination

This form is somewhat handy because you can know the entire result-set size up front. Then you can back-fill entries as accessed by the user using bucketed lazy-loading.

When the bucketed page loads, the indirection objects are supplied with their resource object which provides a typed API over the JsonNode backing it.

This is the “ideal” form from the consuming standpoint but can put a great deal of load on the GitLab server instance.

Progressive Pagination

The other form of pagination lets you fetch the next page after each subsequent request. The header provides the next page number.

One might expect that you could use this to still jump around to the appropriate page. However, if the server does not provide the “number of pages” header, there is not much you can do to clamp your page range.
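Assuming libsoup for the transport (Gitlab-GLib could use anything, of course), probing which mode you are in looks roughly like this, using GitLab’s documented pagination headers:

#include <libsoup/soup.h>

SoupSession *session = soup_session_new ();
SoupMessage *msg = soup_message_new ("GET",
    "https://gitlab.gnome.org/api/v4/projects?page=1&per_page=50");

/* body holds the JSON for this page of results. */
GBytes *body = soup_session_send_and_read (session, msg, NULL, NULL);
SoupMessageHeaders *headers = soup_message_get_response_headers (msg);

/* Present for offset-based pagination; omitted for large result sets,
 * at which point only the next-page style is available. */
const char *total_pages = soup_message_headers_get_one (headers, "X-Total-Pages");
const char *next_page   = soup_message_headers_get_one (headers, "X-Next-Page");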

Conclusion

This was a fun little side-project to get to know some of the inner workings of the API backing what those of us in GNOME use every day. I have no idea if anything will come of it, but it certainly could be useful from Builder if anyone has time to run with it.

For example, imagine having access to common GitLab operations from the header bar.

A screenshot of GNOME Builder with a new GtkMenuButton in the headerbar containing a GitLab icon and menu.