Sysprof – Page 2 – Happenings in GNOME

Sysprof 45

Unfortunately I couldn’t be at GUADEC this year, but that wont stop me from demoing new things!

I’ve been doing a lot of work on Sysprof now that we have semi-reliable frame unwinding on Fedora, Silverblue and GNOME OS. When I have tolling that works on the OS it makes it a lot easier to build profilers and make them useful.

Additionally, we’re at a good point in GTK 4 where you can do really powerful things if you design your data models correctly. So this cycle I’ve spent time redesigning how we record and process our captured data.

There is certainly more work to be done, but the big strokes of the new design are in place. It could really use the benefit of another person joining in to help polish various bits of the apps like scales and legends.

For 45 I decided to remove the tabbed interface and Builder will now just open captures with Sysprof directly. It’s too cumbersome to try to shove all this information into a single view widget just so I can embed it in Builder.

Greeter

The first thing you’ll see is a new greeter. It still has a bit more to finish but my primary goal was to elevate how things work. That was something lacking with just icons like we had previously.

You’ll also notice you can capture either to disk or to memory. Depending on your situation that may be of use. For example, if you’re testing under memory pressure, creating an unbounded memfd may not be what you want. Instead you can capture to disk and the capture will periodically flush when the buffer is full.

Recording Pad

While recording, Sysprof now creates a much smaller recording pad that you can use to stop the recording. The goal here is to further reduce overhead created by Sysprof itself. It still updates once per second to give you an idea of how many data frames have been recorded to the capture.

Exploring Captures

After capturing your system, you’ll be presented with a window to explore the capture.

Things were getting pretty cramped before, so the new sections in the sidebar make it easier for us to put related information together in a way that is understandable.

I tried very hard to keep the callgraph in the three-section format we’ve used for many years. However, it has a nice filter now on the functions list thanks to GtkFilterListModel making it so easy.

Selecting Time Spans

Many parts of the window will automatically filter themselves based on the selected time span. Use the charts at the top of the window to select time ranges that are interesting. You can use the controls in the sidebar to navigate the capture as well.

You can click the + icon within the selection to zoom into that range.

Callgraph Options

There are a number of new callgraph options you can toggle.

Categorized Frames
Hide System Libraries
Include Threads
Bottom Up

They are all pretty standard things in a profiler so I don’t need to dwell on them much. But having a “Bottom Up” option means we have some help when you run into truncated stack traces and still want to get an idea of what’s going on by function fragments. The new “Include Threads” option lets you break up your callgraph by one more level, the thread that was running.

Categorized Stack Traces

While I was working on this I had to add a few things I’ve wanted for a while. One such thing was a utility sidebar that can be shown with additional information relative to the current selection. In this case, you can expand the callgraph and see a list of all the stack traces that contributed to that callgraph frame showing up in the capture. Additionally, we can categorize stack traces based on the libraries and functions contributing to them to give you a high-level overview of where time is being spent.

Logs View

When spawning an application from Sysprof it can write logs by integrating with libsysprof-capture-4.a. That’s not new but what is new is that Sysprof now has a journald collector which can be interposed in your capture.

Marks

Marks have gone through substantial work to be more useful.

A mark is just a data frame in the capture that has a time and duration associated with a category, name, and optional message. These are used by GNOME Shell to annotate what is happening in the compositor as well as by GTK to denote what is happening during the frame cycles. Furthermore, GLib has optional Sysprof support which can annotate your main loop cycles so you can see why applications are waking up and for how long.

Marks Chart

The first new view we have for this is the “Mark Chart”. It contains a breakdown of the selected time span by category and name. The X axis is of course time.

Marks Table

Sysprof now has a long-requested mark table.

Sometimes its easier to look at data in a more raw form. Especially since you can sort by column and dive into what you care about. It doesn’t hurt that this is much more accessibility friendly too.

Marks Waterfall

We still have the old waterfall style display as well so you can see how things naturally depend on one-another.

You can double click on these waterfall entries and the visible time region will update to match that item’s duration.

Marks Summary

It was a bit hidden before, but we still have a mark summary. Although I’ve beefed it up a bit and provide median values in addition to mean. These are also sortable like the other tables you’ll find in Sysprof.

Processes

We now give you a bit more insight into the processes we discovered running during your capture. The new Processes section shows you a timeline of the processes that ran.

Additionally there is a table view, again more accessible and sometimes easier to read, sort, and analyze. If you double click a row you’ll get additional information on that process such as the address layout, mounts, and thread information we have.

This is all information that Sysprof collects to be able to do it’s job as a profiler and we might as well make that available to you too.

D-Bus Messages

You can record D-Bus messages on your session or system bus now. We may end up needing to tweak how we get access to the system bus so that you are more certain to have privileges beyond just listening from your read socket.

There are no fancy viewers like Bustle yet, but you do have a table of messages. Someone could use this as a basis to connect the reply message with the send message so that you can draw proper message durations in a chart.

Counters

Counters have been broken up a bit more so that we can expand on them going forward. Different sections have different additional data to view. For example the CPU section will give you the CPU breakdown we recorded such as processor model and what CPU id maps to what core.

I find it strange that my Xeon skips core 6 and 7.

There are all the same counters we had previously for CPU, Energy (RAPL), Battery Charge, Disk I/O, Network I/O, and GTK counters such as FPS.

Files

Sysprof supports embedding files in chunks within the *.syscap file. The SysprofDocument exports a GListModel of those which can be reconstructed at will. Since we needed that support to be able to model process namespaces, we might as well give the user insight too. Lots of valuable information is stored here, typically compressed, although Sysprof will transparently decompress it for you.

This will hopefully speed up maintainers ability to get necessary system information without back-and-forths with someone filing an issue.

Metadata

A metadata frame is just a key/value pair that you can embed into capture files. Sysprof uses them to store various information about the capture for quick reference later. Since we’re capturing information about a user’s system, we want to put them in control of knowing what is in that capture. But again, this is generally system statistics that help us track down issues without back-and-forths.

Symbolizing

The symbolizing phase of Sysprof has also been redesigned. To effectively handle the changes in how systems are built now from when Sysprof was revamped requires quite a bit of hand-waving. We have containers with multiple and sometimes overlapping storage technologies, varying file-systems used for the operating system including those with subvolumes which might not match a processes, chroots and ostrees.

To make things mostly work across the number of systems I have at my fingertips to test with required quite a bit of iterative tweaking. The end result is that we basically try to model the mount namespace of the target process and the mount namespace of the host and cross-correlate to get a best guess at where to resolve the library path. At that point, we can try to resolve additional paths so that looking at .gnu_debuglink still results in something close to correct.

We also give you more data in the callgraph now so if you do get an inode mismatch or otherwise unresolveable symbol you at least get an offset within the .text section of the ELF you can manually disassemble in your debugger. Few people will likely do this, but I’ve had to a number of times.

To make that stuff fast, Sysprof has a new symbol cache. It is the combination of an augmented Red-Black tree with address ranges (so an interval tree). It’s maintained per-process and can significantly reduce decoding overhead.

PERF_EVENT_MMAP2 and build_id

Sysprof now records mmap2 records from Perf while also requesting build_id for executable pages. The goal here is that we would be able to use the build_id to resolve symbols rather than all the process mount namespace and .gnu_debuglink madness. In practice, I haven’t had too much success getting these values but in time I assume that would allow for symbolizing with tools such as debuginfod.

Writing your own Profiler

You can always write your own profiler using libsysprof and get exactly what you want. The API is significantly reduced and cleaned up for GNOME 45.

SysprofProfiler *profiler = sysprof_profiler_new ();
SysprofCaptureWriter *writer = sysprof_capture_writer_new ("capture.syscap", 0);

sysprof_profiler_add_instrument (profiler, sysprof_sampler_new ());
sysprof_profiler_add_instrument (profiler, sysprof_network_usage_new ());
sysprof_profiler_add_instrument (profiler, sysprof_disk_usage_new ());
sysprof_profiler_add_instrument (profiler, sysprof_energy_usage_new ());
sysprof_profiler_add_instrument (profiler, sysprof_power_profile_new ("performance"));

/* If you want to symbolize at end of capture and attach to the capture,
 * use this. It makes your capture more portable for sharing.
 */
sysprof_profiler_add_instrument (profiler, sysprof_symbols_bundle_new ());

sysprof_profiler_record_async (profiler, writer, record_cb, NULL, NULL);

You get the idea.

Writing your own Analyzer

You can also use libsysprof to analyze an existing capture.

SysprofDocumentLoader *loader = sysprof_document_loader_new ("capture.syscap");

/* there is a sensible default symbolizer, but you can even disable it if you
 * know you just want to look at marks/counters/etc.
 */
sysprof_document_loader_set_symbolizer (loader, sysprof_no_symbolizer_get ());

SysprofDocument *document = sysprof_document_loader_load (loader, NULL, &error);

GListModel *counters = sysprof_document_list_counters (document);
GListModel *samples = sysprof_document_list_samples (document);
GListModel *marks = sysprof_document_list_marks (document);

This stuff is all generally fast because at load time we’ve indexed the whole thing into low-cardinality indexes that can be intersected. The SysprofDocument itself is also a GListModel of every data frame in the capture which makes for fun data-binding opportunities.

Thanks for reading and happy performance hacking!

Sysprof and Podman

With the advent of immutable/re-provisional/read-only operating systems like Fedora’s Silverblue, people will be doing a lot more computing inside of containers on their desktops (as if they’re not already).

When you want to profile an entire system with tools like perf this can be problematic because the files that are mapped into memory could be coming from strange places like FUSE. In particular, fuse-overlayfs.

There doesn’t seem to be a good way to decode all this indirection which means in Sysprof, we’ve had broken ELF symbol decoding for your things running inside of podman containers (such as Fedora’s toolbox). For those of us who have to develop inside those containers, that can really be a drag.

The problem at the core is that Sysprof (and presumably other perf-based tooling) would think a file was mapped from somewhere like /usr/lib64/libglib-2.0.so according to the /proc/$pid/maps. Usually we translate that using /proc/$pid/mountinfo to the real mount or subvolume. But if fuse-overlayfs is in the picture, you don’t get any insight into that. When symbols are decoded, it looks at the host’s /usr/lib/libglib-2.0.so and finds an inode mismatch at which point it will stop trying to decode the instruction address.

But since we still have a limited number of container technologies to deal with today, we can just cheat. If we look at /proc/$pid/cgroup we can extract the libpod container identifier and use that to peek at ~/.local/share/containers/storage/overlay-containers/containers.json to get the overlayfs layer. With that, we can find the actual root for the container which might be something like ~/.local/share/containers/storage/overlay/$layer/diff.

It’s a nasty amount of indirection, and it’s brittle because it only works for the current user, but at least it means we can keep improving GNOME even if we have to do development in containers.

Obligatory screenshot of turtles. gtk4-demo running in jhbuild running in Fedora toolbox (podman) with a Fedora 34 image which uses fuse-overlayfs for file access within the container. Sysprof now can discover this and decode symbols appropriately alongside the rest of the system. Now if only we could get distributions to give up on omitting frame pointers everywhere just so their unjustifiable database benchmarks go up and to the right a pixel.

How to use Sysprof to… Part II

In the previous article of this series we covered Sysprof basics to help you use the tooling. Now I want to take a moment to show you how to use the command line tooling to profile systems like GNOME Shell.

Record an existing session

The easiest way to get started is to record your existing GNOME Shell session. With sysprof-cli, you can use the --gnome-shell option and it will attempt to connect to your active GNOME Shell instance over D-Bus to stream COGL pipeline information over a private file-descriptor.

This information can be combined with callgraphs to see what is happening during the duration of a COGL mark.

The details page can also provide some quick overview information about the marks and their duration. You will find this helpful when comparing patches to see if they really improved things over time.

The details button in the top right will show you information about marks and their min/max/avg duration.

Basic Shell Recording

Running something like a desktop session is complex. You have a D-Bus daemon, a compositor, series of background daemons, settings infrastructure, and programs saving to your home directory. For this reason you cannot really run two of them for the same user at the same time, or even nested.

Because of this, it is handy to log out of your desktop session and switch to a VT to profile GNOME Shell. Sysprof provides a sysprof-cli binary you can use to profile in complicated setups like this.
Start by switching to another VT like Control+Shift+3. I recommend stopping the current display server just so that it doesn’t get in the way of profiling, but usually it’s okay to not. Then we can enter our JHBuild environment with a new D-Bus session before we start Sysprof and GNOME Shell.

Fedora 32 (Workstation Edition)
Kernel 5.6.0-0.rc4.git0.1.fc32.x86_64 (tty3)

startdust login: christian
Password: 
$ sudo service gdm stop
$ dbus-run-session jhbuild shell
$

At this point, we can spawn GNOME Shell with Sysprof to start recording.

You can use -- to specify the command you want sysprof-cli to execute while it records. When that application exits, sysprof-cli will extract all the known symbols and finish it’s recording.

I want to mention briefly that the --gnome-shell option only works with an existing GNOME session. I hope to fix that in the near future though.

$ sysprof-cli -- gnome-shell --wayland --display-server

At this point, GNOME Shell will have spawned and you can exercise it to exhibit the behavior you’d like to improve. When done, open a terminal window to kill GNOME shell so that the profiler can clean up.

kill -9 $(pidof gnome-shell) seems to work well for me

Now you’ll have a capture.syscap file in your current directory. Open that up with Sysprof to view the contents of your profiling session. Often I just spawn gnome-shell directly to open the syscap file and explore.

Recording JavaScript stacks

Sometimes you want to profile JavaScript instead of the C code from Shell, Mutter, and friends. To do this, use the --gjs command line option. Currently, this can give mixed results if you also sample callstacks with the Linux perf support, as the timings are not guaranteed to be equivalent. My recommendation is to disable perf when sampling JavaScript using the --no-perf option.

$ sysprof-cli --gjs --no-perf -- gnome-shell --wayland --display-server

Now when you open the callgraph in Sysprof, you’ll see JavaScript samples.

Recording Energy Consumption

On Linux, we have support for tracking energy usage as “Rolling Average Power Limit” or RAPL for short. Sysprof can include this information for you in your capture if you have the turbostat utility available. It provides power information per “package” such as the GPU and CPU.

Keeping power consumption low is an important part of a modern desktop that aims to be useful on laptops and smaller form factors. It’s useful to check in now and again to ensure that we’re keeping things tip top.

$ sysprof-cli --rapl --no-perf -- gnome-shell --wayland --display-server

You might want to disable sampling while testing power consumption because that could have a larger effect in terms of wattage than the thing you’re profiling.

Don’t forget to check the counter and energy menus for additional graphs.

Reducing Memory Allocations

Plugging memory leaks is a great thing to do. But sometimes it’s better to never allocate things to begin with. The --memprof option can help you find extraneous allocations in your program. For example, I tested the --memprof option on GNOME Shell when writing it and immediately found a way to reduce temporary allocations by hundreds of MiB per minute of use.

$ sysprof-cli --memprof -- gnome-shell --wayland --display-server

Avoiding Main Loop Stalls

This one requires you to build Sysprof until our next release, but you can use the --speedtrack option to find things running on your main loop that may not be a good idea. It will also insert marks for how long the main loop iterations run to find periods of time that you aren’t staying interactive.

$ sysprof-cli --speedtrack -- gnome-shell --wayland --display-server

Anyway, that does it for now! Hope you found this brain dump insightful enough to help us all push forward on the performance curve.

Keeping your fast code fast

Over the past few weeks I’ve been finishing up various projects for 3.36. None of this is surprising for those that follow me on twitter, but sadly I find it hard to blog as often as I should.

One of the projects I completed before the end of the cycle is a memory allocation tracker for Sysprof. It’s basically a modern port of the Memprof code from 20 years ago, but tied into Sysprof and using fancier techniques to move data quickly between processes. It uses an LD_PRELOAD to override many of the weak memory symbols in glibc such as malloc() and free(). When those functions are reached, a stack trace is captured directly into a mmap()‘d ring buffer shared by Sysprof. We create a new one of these per-thread so that no locking is necessary between threads. Sysprof will mux all the data together for us.

Below is a quick example running gtk4-widget-factory. We show similar callgraphs as we do when doing CPU profiling, but ordered by the amount of memory allocated. This simple tool and less than 20 minutes of effort found many allocations we could completely avoid across both GTK and Clutter.

I just want to mention how refreshing it is to have memory allocation tracking while still starting the application in what feels like instantly. It was quite a bit of tweaking to get that level of performance and I’m thrilled with the result.

Additionally, I spent some time looking at what sort of things cause temporary lockups in GNOME Shell during active use. With a fio script in hand, I had the necessary things to cause the buffer cache to be exhausted and force many applications working set out of memory. That usually does the trick to cause short lockups.

But what is going on when things stall? Does the GPU driver get bogged down? Does the Shell get blocked on GC? Is there some sort of blocking API involved?

To answer this I put together a scrappy little LD_PRELOAD tool called “iobt” which will write out a Sysprof capture file when some blocking operations are called. This found a very peculiar bug where GNOME Shell could end up blocking on the compositor thread when it thought it was doing all async I/O operations.

Furthermore, I found a number of other I/O operations happening on the main thread which will easily lock things up under heavy writeback scenarios. Patches for all of these are upstream, half of them are merged at this point, and some even backported to 3.28 for various distros.

There are still some things to do going forward, like use cgroupsv2 to help enforce CPU and Memory availability and other priorities. I’m also looking for pointers from GPU people on how to debug what is going on during long blocking eglSwapBuffers() calls as I’ve seen under memory pressure.

I’m always inspired by what the Shell developers build and I’m honored to get to help polish it even more.

Network and Disk sources

Sysprof has gained network and disk device statistics. You can use the combined graphs for a quick overview, or view them individually.

GTK 3 Frame Profiler

I back-ported the GTK 4 frame-profiler by Matthias to GTK 3 today. Here is an example of a JavaScript application using GJS and GTK 3. The data contains mixed native and JS stack-traces along with compositor frame information from gnome-shell.

What is going on here is that we have binary streams (although usually captured into a memfd) that come from various systems. GJS is delivered a FD via GJS_TRACE_FD environment variable. GTK is delivered a FD via GTK_TRACE_FD. We get data from GNOME Shell using the org.gnome.Sysprof3.Profiler D-Bus service exported on the org.gnome.Shell peer. Stack-traces come from GJS using SIGPROF and the SpiderMonkey profiler API. Native stack traces come from the Linux kernel’s perf_event_open system. Various data like CPU frequency and usage, memory and such come from /proc.

Muxing all the data streams uses sysprof_capture_writer_cat() which knows how to read data frames from supplemental captures and adjust counter IDs, JIT-mappings, and other file-specific data into a joined capture.

A quick reminder that we have a Platform Profiling Initiative in case you’re interested in helping out on this stuff.

Sysprof Developments

This week I spent a little time fixing up a number of integration points with Sysprof and our tooling.

The libsysprof-capture-3.a static library is now licensed under the BSD 2-clause plus patent to make things easier to consume from all sorts of libraries and applications.

We have a MR for GJS to switch to libsysprof-capture-3.a and improve plumbing so Sysprof can connect automatically.

We also have a number of patches for GLib and GTK that improve the chances we can get useful stack-traces when unwinding from the Linux kernel (which perf_event_open does).

A MR for GNOME Shell automatically connects the GJS profiler which is required as libgjs is being used as a library here. The previous GJS patches only wire things up when the gjs binary is used.

With that stuff in place, you can get quite a bit of data correlated now.

# Logout, Switch to VT2
sysprof-cli -c "gnome-shell --wayland --display-server" --gjs --gnome-shell my-capture.syscap

If you don’t want mixed SpiderMonkey and perf stack-traces, you can use --no-perf. You can’t really rely on sample rates between two systems at the same time anyway.

With that in place, you can start correlating more frame data.

Sysprof Developments

Earlier this month, Matthias and I teamed up to push through some of our profiling tooling for GTK and GNOME. We took the occasional work I had done on Sysprof over the past few years and integrated that into the GTK-4.x tree.

Sysprof uses a binary log file to store information about execution in a manner that is easy to write-buffer and read-back using positioned reads. It helps keep the sampling overhead of sysprof low. But it’s too detail oriented for each application supporting the format to write. To make this stuff reusable I created a libsysprof-capture-3.a static library we embed from various layers of the platform.

GTK-4.x is now using this. Builder itself uses it to log internal statistics, tracing data, and counters for troubleshooting. I’ve also put forward patches for GJS to integrate with it. Georges revamped and pushed forward a prototype by Jonas to integrate with Mutter/Shell and get us frame timings and Cogl pipeline data. With some work we can finish off the i915 data sources that Eric Anholt did to correlate GPU commands too.

What this means for developers is that soon we’ll be able to capture system information from various layers in the stack and correlate them using similar clocks. We’re only scratching the surface right now, but it’s definitely promising. It’s already useful to quantify the true performance improvements of merge-requests in Mutter and Shell.

To help achieve this goal during the 3.34 cycle, I’ve started the GNOME Profiling Initiative to track integration of various problem spaces. If you’re an application developer, you can use sysprof_capture_writer_new_for_env() to create a new SysprofCaptureWriter if Sysprof is profiling you (otherwise you’ll get NULL back). Use that to write marks, logs, metadata, files, or anything else you would like to capture.

If you’re interested in helping to write more data collectors, that would be appreciated. Data sources like battery voltage/wattage consumption seem like good candidates so that we can better optimize our platform for battery-based devices.

I have a Sysprof Copr repository for Rawhide and F30 if you’d like to try stuff out and submit issues.

Many thanks to Red Hat for sponsoring all the work I do on GNOME and my amazing manager Matthias for visiting Portland to make this stuff happen even sooner.

As always, follow my grumpy ramblings on birdsite for early previews of my work.

Flatpak at SCaLE 15x

A decade ago I lived on the west side of Los Angeles. One of my favorite conferences was Southern California Linux Expo. Much like Karen, this is the conference where I performed my first technical talk. It’s also where I met and became friends with great people like Jono, Ted, Jeff, the fantastic organizing staff, and so many more.

I was happy to come back again this year and talk about what I’ve been working on in GNOME. The combination of Flatpak and Builder (and Sysprof).

The event was live streamed on youtube, and you can watch it here. I expect the rooms to be cut and uploaded as individual talks but I don’t know the timeline for that. I’ll update this if/when I discover that so you can youtube-dl if you prefer.

Profiling Flatpak’d applications

One of the great powers of namespace APIs on Linux (mount namespaces, user namespaces, etc) is that you can create a new view into the world of your computer that is very different from the host. This can make traditional profiling tools difficult.

To begin with, we need to ensure that we have access to ptrace or perf infrastructure. Easy enough, just don’t drop those privileges before calling execve(). This is the --allow=devel option to flatpak run. But after that, we need to do the detailed phase of translating instruction pointers to a function name.

Generally, the translation between an instruction pointer and a function name requires looking up what file is mapped over the address. Then you open that file with an ELF reader and locate the file containing debug information (which may be the same file). Then open that file and locate what function contains that instruction pointer. (We subtract the beginning of the map from the instruction pointer to get a relative offset).

Now, here is the trouble with mount namespaces. The “path” of the map might be something like “/newroot/usr/lib/libgtk-3.so“. Of course, “/newroot/” doesn’t actually exist.

So… what to do.

Well, we can find information about the mounts in the process by looking at /proc/$pid/mountinfo. Just look for the longest common prefix to get the translation from the mount namespace into the host. One caveat though. The translated path is relative to the root of the partition mounted. So the next step is to locate the file-system mount on the host (via /proc/mounts).

If you put this all together, you can profile and extract symbols from applications running inside of containers.

Now, most applications aren’t going to ship debug symbols with them. But we’ll need those for proper symbol extraction. So if you install the .Debug variant of a Flatpak runtime (org.gnome.Sdk.Debug for example) then we can setup symbol resolution specifically for /usr/lib/debug. Builder now does this.

I’m not sure, but I think this makes Builder and Sysprof one of the first (if not the first) profiler on Linux to support symbol extraction while profiling containerized applications from outside the container.

I’ve only tested with Flatpak, but I don’t see why this code can’t work with other tooling using mount namespaces.