With the advent of immutable/re-provisional/read-only operating systems like Fedora’s Silverblue, people will be doing a lot more computing inside of containers on their desktops (as if they’re not already).
When you want to profile an entire system with tools like perf this can be problematic because the files that are mapped into memory could be coming from strange places like FUSE. In particular, fuse-overlayfs.
There doesn’t seem to be a good way to decode all this indirection which means in Sysprof, we’ve had broken ELF symbol decoding for your things running inside of podman containers (such as Fedora’s toolbox). For those of us who have to develop inside those containers, that can really be a drag.
The problem at the core is that Sysprof (and presumably other perf-based tooling) would think a file was mapped from somewhere like /usr/lib64/libglib-2.0.so according to the /proc/$pid/maps. Usually we translate that using /proc/$pid/mountinfo to the real mount or subvolume. But if fuse-overlayfs is in the picture, you don’t get any insight into that. When symbols are decoded, it looks at the host’s /usr/lib/libglib-2.0.so and finds an inode mismatch at which point it will stop trying to decode the instruction address.
But since we still have a limited number of container technologies to deal with today, we can just cheat. If we look at /proc/$pid/cgroup we can extract the libpod container identifier and use that to peek at ~/.local/share/containers/storage/overlay-containers/containers.json to get the overlayfs layer. With that, we can find the actual root for the container which might be something like ~/.local/share/containers/storage/overlay/$layer/diff.
It’s a nasty amount of indirection, and it’s brittle because it only works for the current user, but at least it means we can keep improving GNOME even if we have to do development in containers.
Obligatory screenshot of turtles. gtk4-demo running in jhbuild running in Fedora toolbox (podman) with a Fedora 34 image which uses fuse-overlayfs for file access within the container. Sysprof now can discover this and decode symbols appropriately alongside the rest of the system. Now if only we could get distributions to give up on omitting frame pointers everywhere just so their unjustifiable database benchmarks go up and to the right a pixel.
In the previous article of this series we covered Sysprof basics to help you use the tooling. Now I want to take a moment to show you how to use the command line tooling to profile systems like GNOME Shell.
Record an existing session
The easiest way to get started is to record your existing GNOME Shell session. With sysprof-cli, you can use the --gnome-shell option and it will attempt to connect to your active GNOME Shell instance over D-Bus to stream COGL pipeline information over a private file-descriptor.
This information can be combined with callgraphs to see what is happening during the duration of a COGL mark.
The details page can also provide some quick overview information about the marks and their duration. You will find this helpful when comparing patches to see if they really improved things over time.
Basic Shell Recording
Running something like a desktop session is complex. You have a D-Bus daemon, a compositor, series of background daemons, settings infrastructure, and programs saving to your home directory. For this reason you cannot really run two of them for the same user at the same time, or even nested.
Because of this, it is handy to log out of your desktop session and switch to a VT to profile GNOME Shell. Sysprof provides a sysprof-cli binary you can use to profile in complicated setups like this.
Start by switching to another VT like Control+Shift+3. I recommend stopping the current display server just so that it doesn’t get in the way of profiling, but usually it’s okay to not. Then we can enter our JHBuild environment with a new D-Bus session before we start Sysprof and GNOME Shell.
At this point, GNOME Shell will have spawned and you can exercise it to exhibit the behavior you’d like to improve. When done, open a terminal window to kill GNOME shell so that the profiler can clean up.
Now you’ll have a capture.syscap file in your current directory. Open that up with Sysprof to view the contents of your profiling session. Often I just spawn gnome-shell directly to open the syscap file and explore.
Recording Energy Consumption
On Linux, we have support for tracking energy usage as “Rolling Average Power Limit” or RAPL for short. Sysprof can include this information for you in your capture if you have the turbostat utility available. It provides power information per “package” such as the GPU and CPU.
Keeping power consumption low is an important part of a modern desktop that aims to be useful on laptops and smaller form factors. It’s useful to check in now and again to ensure that we’re keeping things tip top.
You might want to disable sampling while testing power consumption because that could have a larger effect in terms of wattage than the thing you’re profiling.
Reducing Memory Allocations
Plugging memory leaks is a great thing to do. But sometimes it’s better to never allocate things to begin with. The --memprof option can help you find extraneous allocations in your program. For example, I tested the --memprof option on GNOME Shell when writing it and immediately found a way to reduce temporary allocations by hundreds of MiB per minute of use.
This one requires you to build Sysprof until our next release, but you can use the --speedtrack option to find things running on your main loop that may not be a good idea. It will also insert marks for how long the main loop iterations run to find periods of time that you aren’t staying interactive.
Over the past few weeks I’ve been finishing up various projects for 3.36. None of this is surprising for those that follow me on twitter, but sadly I find it hard to blog as often as I should.
One of the projects I completed before the end of the cycle is a memory allocation tracker for Sysprof. It’s basically a modern port of the Memprof code from 20 years ago, but tied into Sysprof and using fancier techniques to move data quickly between processes. It uses an LD_PRELOAD to override many of the weak memory symbols in glibc such as malloc() and free(). When those functions are reached, a stack trace is captured directly into a mmap()‘d ring buffer shared by Sysprof. We create a new one of these per-thread so that no locking is necessary between threads. Sysprof will mux all the data together for us.
Below is a quick example running gtk4-widget-factory. We show similar callgraphs as we do when doing CPU profiling, but ordered by the amount of memory allocated. This simple tool and less than 20 minutes of effort found many allocations we could completely avoid across both GTK and Clutter.
I just want to mention how refreshing it is to have memory allocation tracking while still starting the application in what feels like instantly. It was quite a bit of tweaking to get that level of performance and I’m thrilled with the result.
Additionally, I spent some time looking at what sort of things cause temporary lockups in GNOME Shell during active use. With a fio script in hand, I had the necessary things to cause the buffer cache to be exhausted and force many applications working set out of memory. That usually does the trick to cause short lockups.
But what is going on when things stall? Does the GPU driver get bogged down? Does the Shell get blocked on GC? Is there some sort of blocking API involved?
To answer this I put together a scrappy little LD_PRELOAD tool called “iobt” which will write out a Sysprof capture file when some blocking operations are called. This found a very peculiar bug where GNOME Shell could end up blocking on the compositor thread when it thought it was doing all async I/O operations.
Furthermore, I found a number of other I/O operations happening on the main thread which will easily lock things up under heavy writeback scenarios. Patches for all of these are upstream, half of them are merged at this point, and some even backported to 3.28 for various distros.
There are still some things to do going forward, like use cgroupsv2 to help enforce CPU and Memory availability and other priorities. I’m also looking for pointers from GPU people on how to debug what is going on during long blocking eglSwapBuffers() calls as I’ve seen under memory pressure.
I’m always inspired by what the Shell developers build and I’m honored to get to help polish it even more.
What is going on here is that we have binary streams (although usually captured into a memfd) that come from various systems. GJS is delivered a FD via GJS_TRACE_FD environment variable. GTK is delivered a FD via GTK_TRACE_FD. We get data from GNOME Shell using the org.gnome.Sysprof3.Profiler D-Bus service exported on the org.gnome.Shell peer. Stack-traces come from GJS using SIGPROF and the SpiderMonkey profiler API. Native stack traces come from the Linux kernel’s perf_event_open system. Various data like CPU frequency and usage, memory and such come from /proc.
Muxing all the data streams uses sysprof_capture_writer_cat() which knows how to read data frames from supplemental captures and adjust counter IDs, JIT-mappings, and other file-specific data into a joined capture.
Earlier this month, Matthias and I teamed up to push through some of our profiling tooling for GTK and GNOME. We took the occasional work I had done on Sysprof over the past few years and integrated that into the GTK-4.x tree.
Sysprof uses a binary log file to store information about execution in a manner that is easy to write-buffer and read-back using positioned reads. It helps keep the sampling overhead of sysprof low. But it’s too detail oriented for each application supporting the format to write. To make this stuff reusable I created a libsysprof-capture-3.a static library we embed from various layers of the platform.
GTK-4.x is now using this. Builder itself uses it to log internal statistics, tracing data, and counters for troubleshooting. I’ve also put forward patches for GJS to integrate with it. Georges revamped and pushed forward a prototype by Jonas to integrate with Mutter/Shell and get us frame timings and Cogl pipeline data. With some work we can finish off the i915 data sources that Eric Anholt did to correlate GPU commands too.
What this means for developers is that soon we’ll be able to capture system information from various layers in the stack and correlate them using similar clocks. We’re only scratching the surface right now, but it’s definitely promising. It’s already useful to quantify the true performance improvements of merge-requests in Mutter and Shell.
To help achieve this goal during the 3.34 cycle, I’ve started the GNOME Profiling Initiative to track integration of various problem spaces. If you’re an application developer, you can use sysprof_capture_writer_new_for_env() to create a new SysprofCaptureWriter if Sysprof is profiling you (otherwise you’ll get NULL back). Use that to write marks, logs, metadata, files, or anything else you would like to capture.
If you’re interested in helping to write more data collectors, that would be appreciated. Data sources like battery voltage/wattage consumption seem like good candidates so that we can better optimize our platform for battery-based devices.
A decade ago I lived on the west side of Los Angeles. One of my favorite conferences was Southern California Linux Expo. Much like Karen, this is the conference where I performed my first technical talk. It’s also where I met and became friends with great people like Jono, Ted, Jeff, the fantastic organizing staff, and so many more.
I was happy to come back again this year and talk about what I’ve been working on in GNOME. The combination of Flatpak and Builder (and Sysprof).
The event was live streamed on youtube, and you can watch it here. I expect the rooms to be cut and uploaded as individual talks but I don’t know the timeline for that. I’ll update this if/when I discover that so you can youtube-dl if you prefer.
One of the great powers of namespace APIs on Linux (mount namespaces, user namespaces, etc) is that you can create a new view into the world of your computer that is very different from the host. This can make traditional profiling tools difficult.
To begin with, we need to ensure that we have access to ptrace or perf infrastructure. Easy enough, just don’t drop those privileges before calling execve(). This is the --allow=devel option to flatpak run. But after that, we need to do the detailed phase of translating instruction pointers to a function name.
Generally, the translation between an instruction pointer and a function name requires looking up what file is mapped over the address. Then you open that file with an ELF reader and locate the file containing debug information (which may be the same file). Then open that file and locate what function contains that instruction pointer. (We subtract the beginning of the map from the instruction pointer to get a relative offset).
Now, here is the trouble with mount namespaces. The “path” of the map might be something like “/newroot/usr/lib/libgtk-3.so“. Of course, “/newroot/” doesn’t actually exist.
So… what to do.
Well, we can find information about the mounts in the process by looking at /proc/$pid/mountinfo. Just look for the longest common prefix to get the translation from the mount namespace into the host. One caveat though. The translated path is relative to the root of the partition mounted. So the next step is to locate the file-system mount on the host (via /proc/mounts).
If you put this all together, you can profile and extract symbols from applications running inside of containers.
Now, most applications aren’t going to ship debug symbols with them. But we’ll need those for proper symbol extraction. So if you install the .Debug variant of a Flatpak runtime (org.gnome.Sdk.Debug for example) then we can setup symbol resolution specifically for /usr/lib/debug. Builder now does this.
I’m not sure, but I think this makes Builder and Sysprof one of the first (if not the first) profiler on Linux to support symbol extraction while profiling containerized applications from outside the container.
I’ve only tested with Flatpak, but I don’t see why this code can’t work with other tooling using mount namespaces.
Visualizers are almost ready to land. This week I got smooth scrolling, zoom, and selections working. The handy part about selections is that it allows you to update the callgraph to limit stack samples to those falling within a given time range.