Keeping your fast code fast

Over the past few weeks I’ve been finishing up various projects for 3.36. None of this is surprising for those that follow me on twitter, but sadly I find it hard to blog as often as I should.

One of the projects I completed before the end of the cycle is a memory allocation tracker for Sysprof. It’s basically a modern port of the Memprof code from 20 years ago, but tied into Sysprof and using fancier techniques to move data quickly between processes. It uses an LD_PRELOAD to override many of the weak memory symbols in glibc such as malloc() and free(). When those functions are reached, a stack trace is captured directly into a mmap()‘d ring buffer shared by Sysprof. We create a new one of these per-thread so that no locking is necessary between threads. Sysprof will mux all the data together for us.

Below is a quick example running gtk4-widget-factory. We show similar callgraphs as we do when doing CPU profiling, but ordered by the amount of memory allocated. This simple tool and less than 20 minutes of effort found many allocations we could completely avoid across both GTK and Clutter.

A callgraph of memory allocations

I just want to mention how refreshing it is to have memory allocation tracking while still starting the application in what feels like instantly. It was quite a bit of tweaking to get that level of performance and I’m thrilled with the result.

Additionally, I spent some time looking at what sort of things cause temporary lockups in GNOME Shell during active use. With a fio script in hand, I had the necessary things to cause the buffer cache to be exhausted and force many applications working set out of memory. That usually does the trick to cause short lockups.

But what is going on when things stall? Does the GPU driver get bogged down? Does the Shell get blocked on GC? Is there some sort of blocking API involved?

To answer this I put together a scrappy little LD_PRELOAD tool called “iobt” which will write out a Sysprof capture file when some blocking operations are called. This found a very peculiar bug where GNOME Shell could end up blocking on the compositor thread when it thought it was doing all async I/O operations.

Furthermore, I found a number of other I/O operations happening on the main thread which will easily lock things up under heavy writeback scenarios. Patches for all of these are upstream, half of them are merged at this point, and some even backported to 3.28 for various distros.

There are still some things to do going forward, like use cgroupsv2 to help enforce CPU and Memory availability and other priorities. I’m also looking for pointers from GPU people on how to debug what is going on during long blocking eglSwapBuffers() calls as I’ve seen under memory pressure.

I’m always inspired by what the Shell developers build and I’m honored to get to help polish it even more.

GTK 3 Frame Profiler

I back-ported the GTK 4 frame-profiler by Matthias to GTK 3 today. Here is an example of a JavaScript application using GJS and GTK 3. The data contains mixed native and JS stack-traces along with compositor frame information from gnome-shell.

What is going on here is that we have binary streams (although usually captured into a memfd) that come from various systems. GJS is delivered a FD via GJS_TRACE_FD environment variable. GTK is delivered a FD via GTK_TRACE_FD. We get data from GNOME Shell using the org.gnome.Sysprof3.Profiler D-Bus service exported on the org.gnome.Shell peer. Stack-traces come from GJS using SIGPROF and the SpiderMonkey profiler API. Native stack traces come from the Linux kernel’s perf_event_open system. Various data like CPU frequency and usage, memory and such come from /proc.

Muxing all the data streams uses sysprof_capture_writer_cat() which knows how to read data frames from supplemental captures and adjust counter IDs, JIT-mappings, and other file-specific data into a joined capture.

A quick reminder that we have a Platform Profiling Initiative in case you’re interested in helping out on this stuff.

Sysprof Developments

This week I spent a little time fixing up a number of integration points with Sysprof and our tooling.

The libsysprof-capture-3.a static library is now licensed under the BSD 2-clause plus patent to make things easier to consume from all sorts of libraries and applications.

We have a MR for GJS to switch to libsysprof-capture-3.a and improve plumbing so Sysprof can connect automatically.

We also have a number of patches for GLib and GTK that improve the chances we can get useful stack-traces when unwinding from the Linux kernel (which perf_event_open does).

A MR for GNOME Shell automatically connects the GJS profiler which is required as libgjs is being used as a library here. The previous GJS patches only wire things up when the gjs binary is used.

With that stuff in place, you can get quite a bit of data correlated now.

# Logout, Switch to VT2
sysprof-cli -c "gnome-shell --wayland --display-server" --gjs --gnome-shell my-capture.syscap

If you don’t want mixed SpiderMonkey and perf stack-traces, you can use --no-perf. You can’t really rely on sample rates between two systems at the same time anyway.

With that in place, you can start correlating more frame data.

Sysprof Developments

Earlier this month, Matthias and I teamed up to push through some of our profiling tooling for GTK and GNOME. We took the occasional work I had done on Sysprof over the past few years and integrated that into the GTK-4.x tree.

Sysprof uses a binary log file to store information about execution in a manner that is easy to write-buffer and read-back using positioned reads. It helps keep the sampling overhead of sysprof low. But it’s too detail oriented for each application supporting the format to write. To make this stuff reusable I created a libsysprof-capture-3.a static library we embed from various layers of the platform.

GTK-4.x is now using this. Builder itself uses it to log internal statistics, tracing data, and counters for troubleshooting. I’ve also put forward patches for GJS to integrate with it. Georges revamped and pushed forward a prototype by Jonas to integrate with Mutter/Shell and get us frame timings and Cogl pipeline data. With some work we can finish off the i915 data sources that Eric Anholt did to correlate GPU commands too.

What this means for developers is that soon we’ll be able to capture system information from various layers in the stack and correlate them using similar clocks. We’re only scratching the surface right now, but it’s definitely promising. It’s already useful to quantify the true performance improvements of merge-requests in Mutter and Shell.

To help achieve this goal during the 3.34 cycle, I’ve started the GNOME Profiling Initiative to track integration of various problem spaces. If you’re an application developer, you can use sysprof_capture_writer_new_for_env() to create a new SysprofCaptureWriter if Sysprof is profiling you (otherwise you’ll get NULL back). Use that to write marks, logs, metadata, files, or anything else you would like to capture.

If you’re interested in helping to write more data collectors, that would be appreciated. Data sources like battery voltage/wattage consumption seem like good candidates so that we can better optimize our platform for battery-based devices.

I have a Sysprof Copr repository for Rawhide and F30 if you’d like to try stuff out and submit issues.

Many thanks to Red Hat for sponsoring all the work I do on GNOME and my amazing manager Matthias for visiting Portland to make this stuff happen even sooner.

As always, follow my grumpy ramblings on birdsite for early previews of my work.

Flatpak at SCaLE 15x

A decade ago I lived on the west side of Los Angeles. One of my favorite conferences was Southern California Linux Expo. Much like Karen, this is the conference where I performed my first technical talk. It’s also where I met and became friends with great people like Jono, Ted, Jeff, the fantastic organizing staff, and so many more.

I was happy to come back again this year and talk about what I’ve been working on in GNOME. The combination of Flatpak and Builder (and Sysprof).

The event was live streamed on youtube, and you can watch it here. I expect the rooms to be cut and uploaded as individual talks but I don’t know the timeline for that. I’ll update this if/when I discover that so you can youtube-dl if you prefer.

Profiling Flatpak’d applications

One of the great powers of namespace APIs on Linux (mount namespaces, user namespaces, etc) is that you can create a new view into the world of your computer that is very different from the host. This can make traditional profiling tools difficult.

To begin with, we need to ensure that we have access to ptrace or perf infrastructure. Easy enough, just don’t drop those privileges before calling execve(). This is the --allow=devel option to flatpak run. But after that, we need to do the detailed phase of translating instruction pointers to a function name.

Generally, the translation between an instruction pointer and a function name requires looking up what file is mapped over the address. Then you open that file with an ELF reader and locate the file containing debug information (which may be the same file). Then open that file and locate what function contains that instruction pointer. (We subtract the beginning of the map from the instruction pointer to get a relative offset).

Now, here is the trouble with mount namespaces. The “path” of the map might be something like “/newroot/usr/lib/libgtk-3.so“. Of course, “/newroot/” doesn’t actually exist.

So… what to do.

Well, we can find information about the mounts in the process by looking at /proc/$pid/mountinfo. Just look for the longest common prefix to get the translation from the mount namespace into the host. One caveat though. The translated path is relative to the root of the partition mounted. So the next step is to locate the file-system mount on the host (via /proc/mounts).

If you put this all together, you can profile and extract symbols from applications running inside of containers.

Now, most applications aren’t going to ship debug symbols with them. But we’ll need those for proper symbol extraction. So if you install the .Debug variant of a Flatpak runtime (org.gnome.Sdk.Debug for example) then we can setup symbol resolution specifically for /usr/lib/debug. Builder now does this.

I’m not sure, but I think this makes Builder and Sysprof one of the first (if not the first) profiler on Linux to support symbol extraction while profiling containerized applications from outside the container.

I’ve only tested with Flatpak, but I don’t see why this code can’t work with other tooling using mount namespaces.

Sysprof Plans for 3.24

The 3.24 cycle is just getting started, and I have a few plans for Sysprof to give us a more polished profiling experience in Builder. The details can be found on the mailing list.

In particular, I’d love to land support for visualizers. I expect this to happen soon, since there is just a little bit more to work through to make that viable. This will enable us to get a more holistic view of performance and allow us to drill into callgraphs during a certain problematic period of the profile.

Once we have visualizer support, we can start doing really cool things like extracting GPU counters, gdk/clutter frame-clock timing, dbus/cpu/network monitors and whatever else you come up with.

A CPU visualizer displayed above the callgraph to provide additional context to the profiled execution.
A CPU visualizer displayed above the callgraph to provide additional context to the profiled execution.

Additionally we have some work to do around getting access to symbols when we are running in binary stripped environments. This means we can upload a stripped binary to your IoT/low-power device to profile, but have the instruction-pointer-to-symbol resolver happen on the developers workstation.

As I just alluded to, I’d love to see remote profiling happen too. There is some plumbing that needs to occur here, but in general it shouldn’t be terribly complicated.

Builder Nightly Flatpak

First off, I’ll be in Portland at the first ever LAS giving demos and starting on Builder features for 3.24.

For a while now, you’ve been able to get Builder from the gnome-apps-nightly Flatpak repository. Until now, it had a few things that made it difficult to use. We care a whole lot about making our tooling available via Flatpak because it is going to allow us to get new code into users hands quicker, safer, and more stable.

So over the last couple of weeks I’ve dug in and really started polishing things up. A few patches in Flatpak, a few patches in Builder, and a few patches in Sysprof start getting us towards something refreshing.

Python Jedi should be working now. So you can autocomplete in Python (including GObject Introspection) to your hearts delight.

Jhbuild from within the Flatpak works quite well now. So if you have a jhbuild environment on your host, you can use the Flatpak nightly and still target your jhbuild setup.

One of the tricks to being a module maintainer is getting other people to do your work. Thankfully the magnificent Patrick Griffis came to the rescue and got polkit building inside of Flatpak. Combined with some additional Sysprof patches, we have a profiler that can run from Flatpak.

Another pain point was that the terminal was inside of a pid/mount/network namespace different than that of the host. This meant that /usr/lib was actually from the Flatpak runtime, not your system /usr/lib. This has been fixed using one of the new developer features in Flatpak.

Flatpak now supports executing programs on the host (a sandbox breakout) for applications that are marked as developer tools. For those of you building your own Flatpaks, this requires --allow=devel when running the flatpak build-finish command. Of course, one could expect UI/UX flows to make this known to the user so that it doesn’t get abused for nefarious purposes.

Now that we have access to execute a command on the host using the HostCommand method of the org.freedesktop.Flatpak.Development interface, we can piggy back to execute our shell.

The typical pty dance, performed in our program.

/* error handling excluded */
int master_fd = vte_pty_get_fd (pty);
grantpt (master_fd);
unlockpt (master_fd);
char *name = ptsname (master_fd);
int pty_fd = open (name, O_RDWR, 0);

Then when executing the HostCommand method we simply pass pty_fd as a mapping to stdin (0), stdout (1), and stderr (2).

On the Flatpak side, it will check if any of these file-descriptors are a tty (with the convenient isatty() function. If so, it performs the necessary ioctl() to make our spawned process the controller of the pty. Lovely!

So now we have a terminal, whose process is running on the host, using a pty from inside our Flatpak.