Keeping your fast code fast

Over the past few weeks I’ve been finishing up various projects for 3.36. None of this is surprising for those that follow me on twitter, but sadly I find it hard to blog as often as I should.

One of the projects I completed before the end of the cycle is a memory allocation tracker for Sysprof. It’s basically a modern port of the Memprof code from 20 years ago, but tied into Sysprof and using fancier techniques to move data quickly between processes. It uses an LD_PRELOAD to override many of the weak memory symbols in glibc such as malloc() and free(). When those functions are reached, a stack trace is captured directly into a mmap()‘d ring buffer shared by Sysprof. We create a new one of these per-thread so that no locking is necessary between threads. Sysprof will mux all the data together for us.

Below is a quick example running gtk4-widget-factory. We show similar callgraphs as we do when doing CPU profiling, but ordered by the amount of memory allocated. This simple tool and less than 20 minutes of effort found many allocations we could completely avoid across both GTK and Clutter.

A callgraph of memory allocations

I just want to mention how refreshing it is to have memory allocation tracking while still starting the application in what feels like instantly. It was quite a bit of tweaking to get that level of performance and I’m thrilled with the result.

Additionally, I spent some time looking at what sort of things cause temporary lockups in GNOME Shell during active use. With a fio script in hand, I had the necessary things to cause the buffer cache to be exhausted and force many applications working set out of memory. That usually does the trick to cause short lockups.

But what is going on when things stall? Does the GPU driver get bogged down? Does the Shell get blocked on GC? Is there some sort of blocking API involved?

To answer this I put together a scrappy little LD_PRELOAD tool called “iobt” which will write out a Sysprof capture file when some blocking operations are called. This found a very peculiar bug where GNOME Shell could end up blocking on the compositor thread when it thought it was doing all async I/O operations.

Furthermore, I found a number of other I/O operations happening on the main thread which will easily lock things up under heavy writeback scenarios. Patches for all of these are upstream, half of them are merged at this point, and some even backported to 3.28 for various distros.

There are still some things to do going forward, like use cgroupsv2 to help enforce CPU and Memory availability and other priorities. I’m also looking for pointers from GPU people on how to debug what is going on during long blocking eglSwapBuffers() calls as I’ve seen under memory pressure.

I’m always inspired by what the Shell developers build and I’m honored to get to help polish it even more.

2 thoughts on “Keeping your fast code fast”

  1. Patches for all of these are upstream, half of them are merged at this point, and some even backported to 3.28 for various distros.

    Do you mean some long supported distributions have gnome-shell 3.28 and that’s the component they backported these fixes to?

    1. I back-ported the fixes for RHEL 7 which has 3.28. If other distros pick those up, that would probably be a smart thing to do.

Comments are closed.