December Projects

Not all of my projects this December are code related. In fact, a lot of them have been house maintenance things, the joy of home ownership and all.

This week was spent building my new office and music space. I wanted a way to have my amplifiers and guitars more accessible while also creating a sort of “dark academia” feeling for working.

The first step was to get the guitars mounted on the walls. I was looking for something that blends artistic showpiece and functional use.

After that I quickly framed them in. Just quarter round with the hard edge inwards, 45° miter, some caulking, easy peasy.

My last office had Hale Navy as the color, but the sheen was so high that it was difficult to actually see the color. This time I went flat and color drenched the space (so ceilings, trim, etc. all in matching tone).

The somewhat-final result is here. I still want to have a lighting story for these that doesn’t involve a battery, so some electrical fish taping is likely in my future.

I also converted the wide closet into a workstation area with the studio monitors for recording. But that is only partially finished, as I need to plane all the slats for the slat wall, frame the built-in, and attach the countertop.

Layered Settings

Early on, Builder had the concept of layered settings. You had an application default layer the user could control. You also had a project layer which allowed the user to change settings just for that project. But that was about the extent of it. Additionally, these settings were stored in your normal GSettings data repository, so there was no sharing of settings with other project collaborators. Boo!

With Foundry, I’d like to have a bit more flexibility and control. Specifically, I want three layers. One layer for the user’s preferences at the application level. Then project settings which can be bundled with the project by the maintainer for needs specific to the project. Lastly, a layer of user overrides which takes maximum preference.

Of course, it should still continue to use GSettings under the hood because that makes writing application UI rather easy. As mentioned previously, we’ll have a .foundry directory we place within the project with storage for both user and project data. That means we can use a GKeyFile back-end to GSettings and place the data there.
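A minimal sketch of what that wiring can look like; the keyfile path here is made up, and the schema id matches the examples below:

#define G_SETTINGS_ENABLE_BACKEND
#include <gio/gio.h>
#include <gio/gsettingsbackend.h>

static GSettings *
load_project_settings (void)
{
  GSettingsBackend *backend;

  /* Store the settings in the project's .foundry directory so they
   * can be committed and shared with collaborators */
  backend = g_keyfile_settings_backend_new (".foundry/project/settings.keyfile",
                                            "/app/devsuite/foundry/",
                                            NULL);

  return g_settings_new_with_backend ("app.devsuite.foundry.project", backend);
}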

You can git commit your project settings if you’re the maintainer and ensure that your project’s conventions are shared with your collaborators.

Of course, since this is all command-line based right now, there are tab-completable commands for this, which, again, makes unit testing this stuff easier.

# Reads the app.devsuite.foundry.project config-id gsetting
# taking into account all layers
$ foundry settings get project config-id

# Sets the config-id setting for just this user
$ foundry settings set project config-id "'org.example.app.json'"

# Sets the config-id for the project default which might
# be useful if you ship multiple flatpak manifests like GTK does
$ foundry settings set --project project config-id "'org.example.app.json'"

# Or maybe set a default for the app
$ foundry settings set --global project stop-signal SIGKILL

That code is now wired up to the FoundryContext via foundry_context_load_settings().

Next time I hope to cover the various sub-systems you might need in an IDE and how those services are broken down in Foundry.

CLI Command Tree

A core tenet of Foundry is a pleasurable command-line experience. And one of the best creature comforts there is tab-completion.

But how you go about doing that is pretty different across every shell. In Flatpak, they use a hidden internal command called “complete” which takes a few arguments and then does magic to figure out what you wanted.

Implementing that when you have one layer of commands is not too difficult, even to brute force. But imagine for a second that every command may have sub-commands, and it can get much more difficult. Especially if each of those sub-commands has options that must be applied before diving into the next sub-command.

Such is the case with foundry, because I much prefer foundry config switch over foundry config-switch. Particularly because you may have other commands like foundry config list. It feels much more spatially aware to me.

There will be a large number of commands implemented over time, so keeping the code at the call-site rather small is necessary. Even more so when the commands could be getting proxied from another process or awaiting futures to complete.

With all those requirements in mind, I came up with FoundryCliCommandTree. The tree is built as an n-ary tree using GNode where you register a command vtable with the command parts like ["foundry", "config", "switch"].

At each layer you can have GOptionEntry like you normally use with GLib-based projects but in this case they will end up in a FoundryCliOptions very similar to what GApplicationClass.local_command_line() does.
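To make that concrete, here is a minimal sketch of how such a tree might be assembled with GNode. The CliCommand struct and helper names are my own illustration, not the actual FoundryCliCommandTree API:

#include <glib.h>

typedef struct
{
  const char *name;
  const GOptionEntry *options;           /* parsed at this layer */
  int (*run) (int argc, char **argv);    /* NULL on interior nodes */
} CliCommand;

static GNode *
find_child (GNode      *node,
            const char *name)
{
  for (GNode *child = node->children; child != NULL; child = child->next)
    if (g_str_equal (((CliCommand *)child->data)->name, name))
      return child;

  return NULL;
}

static void
tree_register (GNode               *root,
               const char * const  *path,
               const GOptionEntry  *options,
               int                (*run) (int argc, char **argv))
{
  GNode *node = root;

  /* Walk the command parts, creating interior nodes as needed */
  for (guint i = 0; path[i] != NULL; i++)
    {
      GNode *child = find_child (node, path[i]);

      if (child == NULL)
        {
          CliCommand *stub = g_new0 (CliCommand, 1);
          stub->name = path[i];
          child = g_node_append_data (node, stub);
        }

      node = child;
    }

  /* The leaf gets the vtable */
  ((CliCommand *)node->data)->options = options;
  ((CliCommand *)node->data)->run = run;
}

/* tree_register (root,
 *                (const char * const []) { "foundry", "config", "switch", NULL },
 *                switch_entries, switch_run);
 */

Completion then becomes a walk of this tree against the words typed so far, consuming the options registered at each layer along the way.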

So now foundry has a builtin “complete” command like Flatpak’s, and it works fairly similarly, though with the added complexity to support my ideal ergonomics.

Vacation? What’s that?

I tend to bulk most of my vacation at the end of the year because it creates enough space and time for fun projects. Last year, however, our dog Toby went paraplegic and so we were care-taking every three hours for about two months straight. Erratic sleep, erratic self-care, but in the end he could walk again so it was definitely worth it.

That meant I didn’t really get to do my fun end-of-year hacks beyond just polishing Ptyxis which I had just prototyped for RHEL/CentOS/Bluefin (and more recently Fedora).

This year I’m trying something I’ve wondered about for a while. What would it look like if you shoved a full-featured IDE into the terminal?

The core idea that makes this possible is using a sub-shell with a persistent parent process. So just like you might “jhbuild shell” you can “foundry enter” to enter the “IDE”.

In the JHBuild case it would exec over itself after setting things up. In the foundry case it maintains an ancestor process and spawns a sub-shell beneath that.

When running foundry commands from a sub-shell it will proxy that work to the ancestor instance. This all happens over a private peer-to-peer D-Bus connection. So you can have multiple of these in place across different terminal tabs eventually.

This is all built with a “libfoundry” that I could consume from Builder in the future to provide the same feature from a full-blown GTK-based IDE too. Not to mention the IDE becomes instantly script-able from your shell. It also becomes extremely easy to unit test.

Since I originally created Builder, I wrote a library to make doing futures, concurrency, and fibers much easier in C. That is libdex. I tend to make things more concurrent while also reducing bug counts when using it. Especially for the complex logic parts which can be written in synchronous looking C even though it is asynchronous in nature.

So the first tenet of the new code is that it will be heavily based on DexFuture.

The second tenet is going to be a reversal of something I tried hard to avoid in Builder. That is a “dot” directory in projects. I never liked how IDEs would litter projects with state files. But since all the others continue to do so, I don’t see much value in tying our hands behind our backs out of my own OCD purity. Instead, we’ll drop a .foundry directory with appropriate VCS ignore files. This gives us convenient space for a tmpdir, project-wide settings, and user settings.

The project is just getting started, but you can follow along at chergert/foundry and I’ll try to write more tidbits as I go.

Next time, we’ll cover how the command line tools are built as an N-ary tree to make tab-completion from bash easy.

Ptyxis Progress Support

The upcoming systemd v257 release will have support for a feature originating from ConEmu (a terminal emulator for Windows) which was eventually adopted by Windows Terminal.

Specifically, it is an OSC (Operating System Command) escape sequence which defines progress state.
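As a sketch of what an application emits (this follows ConEmu’s OSC 9;4 scheme; the state values are taken from its documentation, so treat them as an assumption rather than gospel):

#include <stdio.h>

/* state: 0 = clear, 1 = normal (percent is valid), 2 = error,
 *        3 = indeterminate, 4 = paused */
static void
set_progress (int state,
              int percent)
{
  printf ("\033]9;4;%d;%d\033\\", state, percent);
  fflush (stdout);
}

int
main (void)
{
  set_progress (1, 75);   /* show 75% progress */
  /* ... do the work ... */
  set_progress (0, 0);    /* clear the indicator */
  return 0;
}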

Various systemd tools will natively support this. Terminal emulators which do not support it simply ignore the OSC sequence but those that do support it may provide additional UI to the application.

Lennart discussed this briefly in their ongoing systemd v257 features series on Mastodon and so I took up a quick attempt to implement the sequence parsing for VTE-based terminals.

That has since been iterated upon and landed in VTE. Additionally, Ptyxis now has corresponding code to support it as well.

Once GNOME CI is back up and running smoothly this will be available in the Ptyxis nightly build.

A screenshot of Ptyxis running in a window with two tabs. One of the tabs has a progress indicator icon showing about 75 percent completion of a task.

Profiling w/o Frame Pointers

A couple years ago the Fedora council denied a request by Meta engineers to build the distribution with frame-pointers. Pretty immediately I pushed back by writing a number of articles to inform the council members why frame-pointers were necessary for a good profiling experience.

Profiling is used by developers, system administrators, and when we’re lucky by bug reporters!

Since then, many people have discussed other options. For example, in the not too distant future we’ll probably see SFrame unwinding provide a reasonable way to unwind stacks w/o frame-pointers enabled and, more importantly, without copying the contents of the stack.

Until then, it can be helpful to have a way to unwind stacks even without the presence of frame-pointers. This past week I implemented that for Sysprof based on a prototype put together by Serhei Makarov in the elfutils project called eu-stacktrace.

This prototype works by taking samples of the stack from perf (say 16KB-32KB worth) and resolving enough of the ELF data for DWARF/CFI (Call-frame-information)/etc to unwind the stacks in memory using a copy of the registers. From this you create a callchain (array of instruction pointers) which can be sent to Sysprof for recording.

I say “in memory” because the stack and register content doesn’t hit disk. It only lands inside the mmap()-based ring buffer used to communicate with Linux’s perf event subsystem. The (much smaller) array of instruction pointers eventually lands on disk if you’re not recording to a memfd.

I expanded upon this prototype with a new sysprof-live-unwinder process which does roughly the same thing as eu-stacktrace while fitting into the Sysprof infrastructure a bit more naturally. It consumes a perf data stream directly (eu-stacktrace consumed Sysprof-formatted data) and then provides that to Sysprof to help reduce overhead.

Additionally, eu-stacktrace only unwinds the user-space side of things. On x86_64, at least, you can convince perf to give you both callchains (PERF_SAMPLE_CALLCHAIN) as well as sample stack/registers (PERF_SAMPLE_STACK_USER|PERF_SAMPLE_REGS_USER). If you peek for the location of PERF_CONTEXT_USER to find the context switch, blending them is quite simple. So, naturally, Sysprof does that. The additional overhead for frame-pointer unwinding user-space is negligible when you don’t have frame-pointers to begin with.
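As a rough sketch of the perf setup this implies (the sample size, frequency, and register mask here are illustrative, not Sysprof’s actual configuration):

#include <linux/perf_event.h>
#include <asm/perf_regs.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int
open_sampler (int cpu)
{
  struct perf_event_attr attr;

  memset (&attr, 0, sizeof attr);
  attr.size = sizeof attr;
  attr.type = PERF_TYPE_SOFTWARE;
  attr.config = PERF_COUNT_SW_CPU_CLOCK;
  attr.freq = 1;
  attr.sample_freq = 997;
  attr.sample_type = PERF_SAMPLE_CALLCHAIN    /* kernel-side frames */
                   | PERF_SAMPLE_REGS_USER    /* registers for the unwinder */
                   | PERF_SAMPLE_STACK_USER;  /* a copy of the user stack */
  /* A real unwinder requests more registers; IP/SP/BP is the bare minimum */
  attr.sample_regs_user = (1ULL << PERF_REG_X86_IP)
                        | (1ULL << PERF_REG_X86_SP)
                        | (1ULL << PERF_REG_X86_BP);
  attr.sample_stack_user = 16 * 1024;         /* 16KB of stack per sample */
  attr.exclude_idle = 1;

  /* all processes, one cpu; requires appropriate privileges */
  return syscall (SYS_perf_event_open, &attr, -1, cpu, -1, 0);
}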

I should start by saying that this still has considerable overhead compared to frame-pointers. Locally on my test machine (a Thinkpad X1 Carbon Gen 3 from around 2015, so not super new) the overhead is about 10% of samples. I imagine I can shave a bit of that off by tracking the VMAs differently than libdwfl, so we’ll see.

Here is an example of it working on CentOS Stream 10 which does not have frame-pointers enabled. Additionally, this build is debuginfod-enabled so after recording it will automatically locate enough debug symbols to get appropriate function names for what was captured.

This definitely isn’t the long term answer to unwinding. But if you don’t have frame-pointers on your production operating system of choice, it might just get you by until SFrame comes around.

The code is at wip/chergert/translate but will likely get cleaned up and merged this next week.

debuginfod-enabled Sysprof

Based on some initial work by Barnabás Pőcze, Sysprof gained support for symbolizing stack traces using debuginfod.

If you don’t want to install debuginfo packages for your entire system but still want really useful function names, this is for you. The system-configured debuginfod servers will provide you access to those debuginfo-enabled ELF binaries so that we can discover the appropriate symbol names.

Much like in gdb, you can cancel the downloads if you don’t care, and those symbols will fall back to the “In File lib.so+offset” style you’re used to.

A screenshot showing a popover of symbols being downloaded.

I expect a bit more UI work around this before GNOME 48, such as preferences to disable it or configure alternate debuginfod servers.

Happy profiling!

GCC/Clang Indirect Functions

There is this extremely neat feature in GCC 10+ and Clang 9+ that allows you to determine which function should be patched in at runtime. Process startup code will call your user-defined function to figure out which function should be used for the environment.

For example, take the following:

static gboolean go_fast;  /* e.g. detected from CPU features */

static gboolean fast_version (const char *, gssize, const char **);
static gboolean slow_version (const char *, gssize, const char **);

/* Resolver: the loader calls this once at startup to pick the
 * implementation that g_utf8_validate() will bind to */
static gboolean (*choose_utf8_impl (void)) (const char *, gssize, const char **)
{
  if (go_fast)
    return fast_version;
  else
    return slow_version;
}

gboolean
g_utf8_validate (const char  *str,
                 gssize       len,
                 const char **end)
  __attribute__((ifunc("choose_utf8_impl")));

This can be handy when you’re dealing with complex branching based on the system. For example, imagine you need to check if you’re running under Valgrind as one does occasionally to satisfy an incorrect assumption about accessing past heap boundaries.

The typical way to handle this is to #include "valgrind.h" and branch based on if (RUNNING_ON_VALGRIND) ....

However, that inlines a bunch of assembly you probably don’t want in your ultra-fast new utf8-validation implementation. It might be better to just do that once at process startup and save yourself a bunch of CPU cycles.
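Reworking the resolver from above, the sketch looks like this (whether RUNNING_ON_VALGRIND is safe to call this early is something to verify, since resolvers run before most initialization):

#include "valgrind.h"

/* Decide once at startup instead of inlining the check into the
 * hot path */
static gboolean (*choose_utf8_impl (void)) (const char *, gssize, const char **)
{
  if (RUNNING_ON_VALGRIND)
    return slow_version;   /* keep Valgrind's heap assumptions satisfied */

  return fast_version;
}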

UTF-8 validation performance

One of the things that GLib-based APIs got right from early on is that, generally speaking, our API boundaries expect UTF-8. When transitioning between API boundaries, UTF-8 validation is a common procedure.

Therefore, it is unsurprising that GLib does a lot of UTF-8 string validation, across all sorts of string lengths.

The implementation we’ve had is fairly straightforward to read and reason about. Though it is not winning any performance awards.

UTF-8 validation performance has been on my radar for some time. Back when I was working on VTE performance it was an issue there. Recently with GVariant I had to look for ways to mitigate that.

There are many options out there which can be borrowed-from or inspired-by with varying levels of complexity.

Deterministic Finite Automaton

Björn Höhrmann wrote a branchless UTF-8 validator years ago which is the basis for many UTF-8 validators. Though once you really start making it integrate well with other API designs you’ll often find you need to add branches outside of it.

This DFA approach can be relatively fast, but sometimes slower than the venerable g_utf8_validate().

VTE uses a modified version of this currently for its UTF-8 validation. It allows some nice properties when processing streaming input like that of a PTY character stream.

Chromium uses a modified version of this which removes the ternary for even better performance.
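To show what such a state machine encodes, here is a hand-written sketch of the transitions for validation only. The real Höhrmann design packs these into a small lookup table so the hot loop stays branchless; this spelled-out form is merely easier to read:

#include <glib.h>

typedef enum
{
  ACCEPT,   /* between characters */
  CONT_1,   /* expect 1 continuation byte */
  CONT_2,   /* expect 2 continuation bytes */
  CONT_3,   /* expect 3 continuation bytes */
  E0,       /* after 0xE0: next must be 0xA0..0xBF (no overlongs) */
  ED,       /* after 0xED: next must be 0x80..0x9F (no surrogates) */
  F0,       /* after 0xF0: next must be 0x90..0xBF (no overlongs) */
  F4,       /* after 0xF4: next must be 0x80..0x8F (<= U+10FFFF) */
  REJECT
} Utf8State;

static Utf8State
utf8_next (Utf8State state,
           guint8    byte)
{
  switch (state)
    {
    case ACCEPT:
      if (byte < 0x80) return ACCEPT;
      if (byte >= 0xC2 && byte <= 0xDF) return CONT_1;
      if (byte == 0xE0) return E0;
      if (byte == 0xED) return ED;
      if (byte >= 0xE1 && byte <= 0xEF) return CONT_2;
      if (byte == 0xF0) return F0;
      if (byte == 0xF4) return F4;
      if (byte >= 0xF1 && byte <= 0xF3) return CONT_3;
      return REJECT;
    case CONT_1: return (byte >= 0x80 && byte <= 0xBF) ? ACCEPT : REJECT;
    case CONT_2: return (byte >= 0x80 && byte <= 0xBF) ? CONT_1 : REJECT;
    case CONT_3: return (byte >= 0x80 && byte <= 0xBF) ? CONT_2 : REJECT;
    case E0:     return (byte >= 0xA0 && byte <= 0xBF) ? CONT_1 : REJECT;
    case ED:     return (byte >= 0x80 && byte <= 0x9F) ? CONT_1 : REJECT;
    case F0:     return (byte >= 0x90 && byte <= 0xBF) ? CONT_2 : REJECT;
    case F4:     return (byte >= 0x80 && byte <= 0x8F) ? CONT_2 : REJECT;
    default:     return REJECT;
    }
}

static gboolean
utf8_valid (const guint8 *str,
            gsize         len)
{
  Utf8State state = ACCEPT;

  for (gsize i = 0; i < len && state != REJECT; i++)
    state = utf8_next (state, str[i]);

  return state == ACCEPT;  /* also rejects truncated sequences */
}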

simdutf

One of the more boundary-pushing implementations is simdutf used by the WebKit project. Simdutf is written in C++ and performance-wise it is extremely well regarded.

Being written in C++ does present challenges to integrate into a C abstraction library, though not insurmountable.

The underlying goal of a SIMD (single-instruction, multiple-data) approach is to look at more than a single character at a time using wide-registers and clever math.

c-utf8

Another implementation out there is the c-utf8 project. It is used by dbus-broker and libvarlink.

Like simdutf it attempts to take advantage of SIMD by using some very simple compiler built-ins. It is also simple C code (no assembly) with just a few GCC-isms, all of which can be removed or replaced with alternatives for other compilers such as MSVC.
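The flavor of trick involved looks something like this sketch: scan a word at a time and fall back to byte-wise validation as soon as a high bit shows up (the real code also deals with alignment and the multi-byte cases):

#include <glib.h>
#include <string.h>

static gsize
skip_ascii (const char *str,
            gsize       len)
{
  gsize i = 0;

  /* Eight bytes per iteration: if no byte has the high bit set,
   * the whole word is ASCII */
  while (len - i >= sizeof (guint64))
    {
      guint64 chunk;

      memcpy (&chunk, str + i, sizeof chunk);
      if (chunk & G_GUINT64_CONSTANT (0x8080808080808080))
        break;  /* a non-ASCII byte is somewhere in this word */
      i += sizeof chunk;
    }

  while (i < len && (guchar)str[i] < 0x80)
    i++;

  return i;  /* bytes of leading ASCII; validate the rest byte-wise */
}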

What to Choose

Just to throw up an idea to see if it could be shot down, I suggested we look at these for GLib (issue #3481) to replace what we have.

I haven’t been able to get performance as good with the DFA approach as with the SIMD approach. So I felt I needed to choose between c-utf8 and simdutf.

Given the ease of integration I went with the c-utf8 implementation. It required removing the case 0x01 ... case 0x7F: GCC-ism. Additionally I simplified a number of the macros which are part of sibling c-util projects in favor of GLib equivalents.

I opted to drop __builtin_assume_aligned(expr,n) and use __attribute__((aligned(n))) instead. It generates the same code on GCC but has the benefit of having an equivalent syntax on MSVC in the form of __declspec(align(n)) in the same syntax position (before the declarator). That means a simple preprocessor macro per-compiler gets the same effect.
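One possible spelling of that per-compiler macro (a sketch, not the exact GLib patch):

#if defined(_MSC_VER)
# define ALIGNED(n) __declspec(align(n))
#else
# define ALIGNED(n) __attribute__((aligned(n)))
#endif

/* Works in the same position before the declarator on both compilers */
static ALIGNED(16) unsigned char scratch[64];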

Performance Numbers

The c-utf8 project has a nice benchmark suite available for testing UTF-8 validation performance. It benchmarks against a trivial implementation which matches what GLib has at the time of writing. It also benchmarks against strlen(), which is an extremely optimized piece of the stack.

I modeled the same tests but using g_utf8_validate() with my integration patches to ensure we were getting the same performance as c-utf8, which we are.

For long ASCII strings, you can expect more than 10x speed-ups. (On my Xeon it was 13x and on my Apple Silicon it was 12x). For long multi-byte (2-to-4 byte sequences) you can expect 4x-6x improvements.

I wouldn’t expect too much difference for small strings as you won’t see much of a benefit until you can get to word-size operations. As long as we don’t regress there, I think this is a good direction to go in.

Merge Request !4319.

Messaging Needs

Let’s take a Fedora beta for a spin and see what looks out of place!

This time, I made the mistake of typing. Oops!

Input Performance

That led me to an unreasonable amount of overhead in ibus-daemon. Seems odd that something so critical to our typing latency would require so many CPU samples.

So I dove into the source of ibus to see what was going on and made a few observations:

  • It is doing a lot of D-Bus. That’s okay though, because if it were busy in logic that’d be a different thing to optimize.
  • It uses a GDBusMessageFilterFunction on the GDBusConnection which can run on the D-Bus I/O thread. At this point the GDBusMessages it wants to modify are locked and require a copy.
  • After recording the messages on the private ibus D-Bus socket I can tell it’s doing a lot of communication with GNOME Shell, which implements the Wayland text protocol for applications.
  • There seems to be a huge amount of memory allocations involved, seen when running sysprof-cli --memprof, and that needs investigating.
  • Quite a bit of time spent in GVariantTypeInfo and hashtables.

TL;DR

Before I go any further, there are many merge requests awaiting review now which improve the situation greatly.

Preparing by Prototyping

The first thing I do when diving into a problem like this is to record a few things with Sysprof and just have a look around the flamegraphs. That helps me get a good feel for the code and how the components interact.

The next thing I do after finding what I think is a major culprit, is to sort of rewrite a minimal version of it to make sure I understand the problem with some level of expertise.

Last time, I did this by writing a minimal terminal emulator so I could improve VTE (and apparently that made people … mad?). This time, my eyes were set on GVariantBuilder.

A faster GVariantBuilder

I was surprised to learn that GVariantBuilder does not write to a serialized buffer while building. Instead, it builds a tree of GVariant*.

That sets off my antennae because I know from experience that GVariant uses GBytes and GBytes uses malloc and that right there is 3 separate memory allocations (each aligned to 2*sizeof(void*)) just to maybe store a 4-byte int32.

So for a rectangle of (iiii) serialized to GVariant and built with GVariantBuilder you’re looking at 3 * 4 + 3 minimum allocations, if you ignore the handful of others in the builder structures too.
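For reference, the hot path in question is ordinary builder code like this (x, y, width, and height standing in for real values):

GVariantBuilder builder;
GVariant *rect;

g_variant_builder_init (&builder, G_VARIANT_TYPE ("(iiii)"));
g_variant_builder_add (&builder, "i", x);
g_variant_builder_add (&builder, "i", y);
g_variant_builder_add (&builder, "i", width);
g_variant_builder_add (&builder, "i", height);
rect = g_variant_builder_end (&builder);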

So first step, prototype the same API but writing to a single buffer.

It’s about 3-4x faster in a tight loop. There are things you could do with a different API that would easily get you into the 10x range, but what use is that if no software out there is using it.

So that is what we’re shooting for to guide our changes. I’ll stop when I get in that range as it’s likely diminishing returns.

A faster (slower) GVariantBuilder

I hesitate to just replace something like GVariantBuilder for the same reason we don’t just rewrite the world in a new language. There are so many corner cases and history built up here that if you can get close without a complete rewrite, that is the less risky solution.

Instead, let’s just shake out a bunch of things and see how fast we can make the venerable GVariantBuilder.

GVariantTypeInfo

I saw a lot of time spent in g_variant_type_info_get() specifically inside a GHashTable. A quick printf() later we can see that for some reason the GVariantTypeInfo which contains information about a GVariantType (a glorified type string) is being re-created over and over.

There is a cache, so that cache must not be working very well. Specifically, I see types being created a dozen times or more on every key-press or release.

Caches generally don’t work well if you release items from them when the last reference is dropped. This is because the only way for them to last across uses is for multi-threading to be in play keeping them alive.

So here we patch it to keep a number of them alive to improve the serial use-case.
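The general shape of that fix, sketched generically rather than as the actual GVariantTypeInfo patch:

#include <glib.h>

#define N_PINNED 32

static gpointer pinned[N_PINNED];
static guint    pinned_pos;

/* Hold an extra reference to the most recently used entries so a
 * strictly serial create/lookup/destroy pattern still hits the cache */
static void
cache_pin (gpointer        item,
           gpointer      (*ref_func) (gpointer),
           GDestroyNotify  unref_func)
{
  guint pos = pinned_pos++ % N_PINNED;

  if (pinned[pos] != NULL)
    unref_func (pinned[pos]);

  pinned[pos] = ref_func (item);
}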

Side Quests through HashTables

While looking at that code it was clear we were making copies of strings to do lookups in the hashtable. It’s not great when you have to malloc() to do a hashtable lookup, so maybe we can write custom hash/equal functions to handle that.
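A sketch of the technique (not the actual patch): key the table on a (pointer, length) pair so a type string embedded in a larger string can be looked up without a copy.

#include <glib.h>
#include <string.h>

typedef struct
{
  const char *str;
  gsize       len;
} TypeKey;

static guint
type_key_hash (gconstpointer data)
{
  const TypeKey *key = data;
  guint hash = 5381;

  for (gsize i = 0; i < key->len; i++)
    hash = (hash * 33) + (guchar)key->str[i];

  return hash;
}

static gboolean
type_key_equal (gconstpointer a,
                gconstpointer b)
{
  const TypeKey *ka = a;
  const TypeKey *kb = b;

  return ka->len == kb->len && memcmp (ka->str, kb->str, ka->len) == 0;
}

/* GHashTable *cache = g_hash_table_new (type_key_hash, type_key_equal);
 * TypeKey key = { type_string, type_len };   // lives on the stack
 * gpointer info = g_hash_table_lookup (cache, &key);
 */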

Yup. Few more percent.

Keeping Release Code Fast

We have knobs to keep release builds fast. Sometimes that means compiling out expensive assertions which are there to help us catch bugs during development or to make the next bug hunter’s life easier.

For example, in my career I’ve done that to assert correctness of B-trees on every change, ensure file-system state is correct, validate all sorts of exotic data-structures and more. These are things that are helpful in debug builds that you very much don’t want in release code.

So I made some old-GVariant code which didn’t really follow the standard macros we use elsewhere do what you’d expect. Here and here.

Format String Fast-Paths

Sometimes you have a code-path that is so extremely hot that it warrants some sort of fast-path in and out. The format strings you use for building a GVariant with GVariantBuilder are one such example. We can see that normally that function will malloc(), and it would be a whole lot faster if it didn’t in our fast-path.

So make it do that.

Another fun fact about GVariantType is that a simple definite type is a sub-type of itself. That is fancy lingo for saying "i" is an "i".

Given that all GVariant are eventually made up of these sort of basic definite types, they probably get hammered a lot when type-checking things at runtime.

In fact, yes, yes they do.

Doing Two Things at Once

It is extremely common in the GVariant code-base to walk strings multiple times. Sometimes, those are type strings, sometimes those are user-provided UTF-8 encoded strings.

The API boundary checks for valid UTF-8 which requires walking the string and pulling it into memory. You can see that as basically:

g_return_val_if_fail (g_utf8_validate (string, -1, &endptr), NULL);

Then later, it does strlen() on string. Maybe just skip the strlen() and do (endptr - string).
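In sketch form:

const char *endptr = NULL;
gsize length;

g_return_val_if_fail (g_utf8_validate (string, -1, &endptr), NULL);

/* the validator already walked to the NUL, so the length is free */
length = endptr - string;   /* instead of strlen (string) */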

Playing Nicely with Malloc

GVariantBuilder does this nice thing where, if you’re building an array of children, it will grow your destination array by powers of two so that it doesn’t need to call realloc() each time. That is great for dynamically sized arrays.

At the end, it does this even nicer thing in that it shrinks the allocation to only what you needed by calling realloc() again with the final size. Nice not to waste space, surely!

However, in extremely common cases we know the exact number of children up front so the realloc() is pure overhead. Just check for that condition and save some more cycles.
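The shape of that check, with invented field names since this is only a sketch:

/* Skip the trimming realloc() when the allocation is already exactly
 * the final size (i.e. the child count was known up front) */
if (builder->n_children < builder->allocated_children)
  builder->children = g_renew (GVariant *, builder->children, builder->n_children);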

Locks, Everywhere

I was surprised to find that GVariant uses the g_bit_lock()/g_bit_unlock() API to manage whether or not a GVariant is “floating”. I have things to say about floating references but that is for another day.

What I noticed though is that I see way too many g_bit_lock() samples in my profiles, considering there is virtually never a valid reason to have a race condition when sinking a GVariant’s floating reference. In fact, g_variant_take_ref() doesn’t even bother with it.

So save a heaping amount of time doing nothing by being clever.

More Ref Counting

With g_variant_ref_sink() faster we still spend a lot of samples in g_bytes_ref(). That sounds odd because we really only need a single reference and then ultimately g_bytes_unref() when the variant is released.

A quick look and yup, we’re referencing only to immediately unref afterwards. Pick up a few more percent by transferring ownership to the callee.

Reducing Allocations

We still haven’t gotten rid of all those allocations and that makes me sad. Let’s do something about those 3 allocations for a simple 32-bit integer.

Step One, GBytes

Our GVariant structures are referencing a GBytes which makes buffer management easy. It allows you to slice into smaller and smaller buffers and reference count the same larger data buffer safely.

What if we made it embed small allocations within its own allocation using flexible arrays? We just need to be careful to try to match the whole 2*sizeof(void*) expectations that people get from malloc() so we don’t break any code with that assumption.
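Roughly the idea, with an illustrative struct rather than the real GBytes internals:

#include <glib.h>

typedef struct
{
  gatomicrefcount ref_count;
  gsize           size;
  GDestroyNotify  free_func;   /* NULL when the data is inline */
  gpointer        user_data;
  /* Small payloads land here instead of a second heap block, keeping
   * the 2 * sizeof(void*) alignment callers expect from malloc() */
  guint8          data[] __attribute__((aligned (2 * sizeof (gpointer))));
} BytesInline;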

Finally, from three allocations to two. This one was a whopping 10% off wallclock time alone.

But that of course should give us a good idea of what would happen if we did the same thing to GVariant directly to go from 2 allocations to 1.

Step Two, GVariant

So here we go again and that’s another 10% on top of the previously stacked patches. Not bad.

Bonus Round

Calling g_variant_builder_init() makes a copy of the GVariantType because theoretically you could free that type string before you call g_variant_builder_end(). Realistically, nobody does that, and it probably is just there to deal with bindings that may use g_variant_builder_new().

Either way, we don’t break ABI here so add g_variant_builder_init_static() with different requirements for a few more percent by doing less malloc().

This is the one spot where if you want a few extra cycles, you gotta change your code.
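At the call-site the change is tiny; the only new requirement is that the type must outlive the builder, which is trivially true for a literal:

GVariantBuilder builder;

/* before: copies the GVariantType */
g_variant_builder_init (&builder, G_VARIANT_TYPE ("a{sv}"));

/* after: borrows it, skipping the malloc() */
g_variant_builder_init_static (&builder, G_VARIANT_TYPE ("a{sv}"));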

How’d we do?

For my contrived benchmarks this gets us a 2.5x speedup over what was released in GNOME 47 when everything is built in “release mode fashion”. If you implement the bonus round you can get closer to 2.75x.

If you’d like to recreate the benchmark, just create a loop with a few million iterations generating some rather small/medium sized GVariant with GVariantBuilder.
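Something along these lines works (the loop count and variant shape are arbitrary):

for (guint i = 0; i < 5000000; i++)
  {
    GVariantBuilder builder;
    GVariant *value;

    g_variant_builder_init (&builder, G_VARIANT_TYPE ("a{sv}"));
    g_variant_builder_add (&builder, "{sv}", "position",
                           g_variant_new ("(iiii)", 0, 0, 100, 100));
    g_variant_builder_add (&builder, "{sv}", "title",
                           g_variant_new_string ("benchmark"));
    value = g_variant_builder_end (&builder);
    g_variant_unref (value);
  }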

Still a bit off from that 4x though, so if you like challenges perhaps replacing GVariantBuilder with a serialized writer is just the project for you.