VTE performance improvements

To celebrate every new GNOME release I try to do a little bit of work that would be intrusive to land at the end of the cycle. The 46 cycle is no different and this time I’m making our terminals faster.

The terminal is surely the most used desktop app for developers and things have changed in drawing models over the years. There might be some excellent energy savings to be had! So I made myself a little prototype to see how much faster we might be able to go without drastic design changes, and used that as my guide to improving VTE performance.

VTE has been around since the early days of GNOME. It’s been touched in some manner by many programmers that I consider more talented than myself, but perhaps I can improve things yet!

So far I’ve landed a little over a dozen patches, none of which address drawing (yet). So that means these patches will make both the GTK 3 and GTK 4 versions of VTE faster. Once the last patch lands in this category we will have cut wall clock time down for a number of common scenarios by a solid 40%. That’s a pretty good win!

After these land I have a bunch of patches which introduce native GTK 4 drawing primitives instead of Cairo. Those patches will ultimately reduce draw latency on GTK 4 while not regressing GTK 3 performance. There are still a couple things to figure out around some “minifont” usage, but things are looking good.

I’d also like to find a way to get draw timing driven by the frame clock rather than some internal timeouts. Combining that with the GTK 4 native drawing will certainly make things feel faster on the “butt dyno”.

Anyway, I probably won’t go down the rabbit hole with this. I just want to get things in line with performance expectations.

And to nobody’s surprise, this is the type of stuff that is much easier to do when armed with Sysprof and working frame-pointers.

What have frame-pointers given us anyway?

I obsess over battery life. So having a working Sysprof in Fedora 39 with actually useful frame-pointers has been lovely. I heard it asked at an All Systems Go talk if having frame-pointers enabled has gained any large performance improvements and that probably deserves addressing.

The answer to that is quite simply yes. Sometimes it’s directly a side-effect of me and others sending performance patches (such as Shell search performance or systemd-oomd patches). Sometimes it just prevents the issues from showing up on people’s systems to begin with. Basically all the new code I write now is done in tandem with Sysprof to visualize how things ran. Misguided choices often stick out earlier.

I think it’s also important to recognize that in addition to gaining performance improvements we’ve not seen people complain about performance regressions. That means we can have visibility to improve things without a significant burden in exchange.

Here is a little gem that I would have been unlikely to find without system-wide frame-pointers. Basically API contract validation needs to do a couple lookups for flags on the TypeNode for GTypeInstance. I’ll remind the reader that GTypeInstance is what underlies GObject, GskRenderNode, and is likely to be our “performance escape hatch” from GObject.

Those checks, in particular for G_TYPE_IS_ABSTRACT() and G_TYPE_IS_DEPRECATED(), were easily taking up nearly a percent of samples in some tight loop tests (like creating thousands of GTK render nodes). It turns out that both g_type_create_instance() and g_type_free_instance() were doing these checks. Additionally, g_value_unset() on a GBoxed type can do this too (via g_boxed_free()). That gets used all the time for closure invocations such as through the g_signal_* API.

A quick peek with Sysprof, thanks to those frame-pointers, shows the common code paths which hit this. It looks like the flags for abstract and deprecated are stored on an accessory object for the TypeNode. This is a vestige of a time when we must have thought it prudent to be very tight about memory consumption in TypeNodes. But unfortunately, accessing that accessory data requires acquiring the read side of a GRWLock because the type system is mutable. As it happens, there is space to cache these bits in the TypeNode directly and the patch linked above does just that.

Combining the above patch with this patch from Emmanuele does wonders for the g_type_create_instance() performance. It basically drops things down to the cost of your malloc() implementation, which is much closer to ideal.

All of this was only on my radar because I was fixing up a few performance issues in GTK’s OpenGL renderer. Getting extraneous TypeNode checks out of hot code paths and moving them to consumer API boundaries instead is always a win for performance.
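As a rough illustration of that boundary-versus-hot-path split, here is a small sketch in plain C. The function names are made up for this example and have nothing to do with GTK's or GLib's actual internals: the public entry point validates its arguments once, and the internal loop trusts its caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Internal hot path (hypothetical): no validation, the caller has
 * already checked the arguments. */
static void demo_fill_unchecked(int *slots, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        slots[i] = value;
}

/* Public boundary (hypothetical): validate once, then hand off to the
 * unchecked loop. Returns false when the arguments are rejected. */
bool demo_fill(int *slots, size_t n, int value)
{
    if (slots == NULL && n > 0)
        return false;

    demo_fill_unchecked(slots, n, value);
    return true;
}
```

The check runs once per call rather than once per element, so its cost stays out of the loop that dominates the profile.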

This is just one example of many. And thankfully, many more people are capable of casually improving performance rather than relying on someone like me thanks to Sysprof and frame-pointers on Fedora.