Sysprof in your Mesa

Thanks to the work of Christian Gmeiner, support for annotating time regions using Sysprof marks has landed in Mesa.

That means you’ll be able to open captures with Sysprof and see the data along other useful information including callgraphs and flamegraphs.

I do think there is a lot more we can do around better visualizations in Sysprof. If that is something you’re interested in working on please stop by #gnome-hackers on Libera.chat or drop me an email and I can find things for you to work on.

See the merge request here.

Profiling w/o Frame Pointers

A couple years ago the Fedora council denied a request by Meta engineers to build the distribution with frame-pointers. Pretty immediately I pushed back by writing a number of articles to inform the council members why frame-pointers were necessary for a good profiling experience.

Profiling is used by developers, system administrators, and when we’re lucky by bug reporters!

Since then, many people have discussed other options. For example in the not too distant future we’ll probably see SFrame unwinding provide a reasonable way to unwind stacks w/o frame-pointers enabled and more importantly, without copying the contents of the stack.

Until then, it can be helpful to have a way to unwind stacks even without the presence of frame-pointers. This past week I implemented that for Sysprof based on a prototype put together by Serhei Makarov in the elfutils project called eu-stacktrace.

This prototype works by taking samples of the stack from perf (say 16KB-32KB worth) and resolving enough of the ELF data for DWARF/CFI (Call-frame-information)/etc to unwind the stacks in memory using a copy of the registers. From this you create a callchain (array of instruction pointers) which can be sent to Sysprof for recording.

I say “in memory” because the stack and register content doesn’t hit disk. It only lands inside the mmap()-based ring buffer used to communicate with Linux’s perf event subsystem. The (much smaller) array of instruction pointers eventually lands on disk if you’re not recording to a memfd.

I expanded upon this prototype with a new sysprof-live-unwinder process which does roughly the same thing as eu-stacktrace while fitting into the Sysprof infrastructure a bit more naturally. It consumes a perf data stream directly (eu-stacktrace consumed Sysprof-formatted data) and then provides that to Sysprof to help reduce overhead.

Additionally, eu-stacktrace only unwinds the user-space side of things. On x86_64, at least, you can convince perf to give you both callchains (PERF_SAMPLE_CALLCHAIN) as well as sample stack/registers (PERF_SAMPLE_STACK_USER|PERF_SAMPLE_REGS_USER). If you peek for the location of PERF_CONTEXT_USER to find the context switch, blending them is quite simple. So, naturally, Sysprof does that. The additional overhead for frame-pointer unwinding user-space is negligible when you don’t have frame-pointers to begin with.

I should start by saying that this still has considerable overhead compared to frame-pointers. Locally on my test machine (a Thinkpad X1 Carbon Gen 3 from around 2015, so not super new) that is about 10% of samples. I imagine I can shave a bit of that off by tracking the VMAs differently than libdwfl, so we’ll see.

Here is an example of it working on CentOS Stream 10 which does not have frame-pointers enabled. Additionally, this build is debuginfod-enabled so after recording it will automatically locate enough debug symbols to get appropriate function names for what was captured.

This definitely isn’t the long term answer to unwinding. But if you don’t have frame-pointers on your production operating system of choice, it might just get you by until SFrame comes around.

The code is at wip/chergert/translate but will likely get cleaned up and merged this next week.

System Extensions from Flatpak

I write about Sysprof here quite often. Mostly in hopes of encouraging readers to use it to improve Linux as a whole.

An impediment to that is the intrusiveness to test out new features as they are developed. If only we had a Flatpak which you could install to test things right away.

One major hurdle is how much information a profiler needs to be useful. The first obvious “impossible to sandbox” API you run into is the perf subsystem. It provides information about all processes running on the system and their memory mappings which would make snooping on other processes trivial. Both perf and ptrace are disabled in the Flatpak sandbox.

After that, you still need unredacted access to the kernel symbols and their address mappings (kallsyms). You also need to be in a PID namespaces that allows you to see all the processes running on the system and their associated memory mappings which essentially means CAP_SYS_ADMIN.

Portable Services

Years ago, portable services were introduced into systemd through portablectl. I had high-hopes for this because it meant that I could perhaps ship a squashfs and inject it as a transient service on the host.

However, Sysprof needs more integration than could be provided by this because portable services are still rather isolated from the host. We need to own a D-Bus name, policy-kit action integration, in addition to the systemd service.

Even if that were all possible with portable services it wouldn’t get us access to some of the host information we need to properly decode symbols.

System Extensions

Then came along systemd-sysext. It provides a way to “layer” extensions on top of the host system’s /usr installation rather than in an isolated mount namespace.

This sounds much more promising because it would allow us to install .policy for policy-kit, .service files for Systemd and D-Bus, or even udev rules.

Though, with great power comes excruciating pain, or something like that.

So if you need to provide binaries that run on the host you need to either static-link (rust, go, zig perhaps?) or use something you can reasonably expect to be there (python?).

In the Sysprof case, everything is C so it can statically link almost everything by being clever with how it builds against glibc. Though this still requires glibc and quite frankly I’m fine with that. Potentially, one could use MUSL or ucLibc if they had high enough pain threshold for build tooling.

Bridging Flatpak and System Extensions

The next step would be to find a way to bridge system extensions and Flatpak.

In the wip/chergert/sysext branch of Sysprof I’ve made it build a number of things statically so that I can provide a system extension directory tree at /app/lib/extensions. We can of course choose a different path for this but that seemed similar to /var/lib/extensions.

Here we see the directory tree laid out. To do this right for systemd-sysext we also need to install an extension point file but I’ll save that for another day.

The Directory Tree

$ find /app/lib/extensions -type f
/app/lib/extensions/usr/lib/systemd/system/sysprof3.service
/app/lib/extensions/usr/share/polkit-1/actions/org.gnome.sysprof3.policy
/app/lib/extensions/usr/share/dbus-1/system-services/org.gnome.Sysprof3.service
/app/lib/extensions/usr/share/dbus-1/system.d/org.gnome.Sysprof3.conf
/app/lib/extensions/usr/libexec/sysprofd

Registering the Service

First we need to symlink our system extension into the appropriate place for systemd-sysext to pick it up. Typically /var/lib/extensions is used for transient services so if this were being automated we might use another directory for this.

# mkdir -p /var/lib/extensions
# ln -s /var/lib/flatpak/org.gnome.Sysprof.Devel/current/active/files/lib/extensions/ org.gnome.Sysprof.Devel

Now we need to merge the extension so it overlays into /usr. We must use --force because we didn’t yet provide an appropriate extension point file for systemd.

# systemd-sysext merge --force
Using extensions 'org.gnome.Sysprof.Devel'.
Merged extensions into '/usr'.

And now make sure our service was installed to the approriate location.

# ls /usr/lib/systemd/systemd/sysprof3.service
-rw-r--r-- 2 root root 115 Dec 31 1969 /usr/lib/systemd/system/sysprof3.service

Next we need to reload the systemd daemon, but newer versions of systemd do this automatically.

# systemctl daemon-reload

Here is where things are a bit tricky because they are somewhat specific to the system. I think we should make this better in the appropriate upstream projects to avoid this altogether but also easily handled with a flatpak installation trigger.

First make sure that policy-kit reloads our installed policy file.

# systemctl restart polkit.service

With dbus-broker, we also need to reload configuration to pick up our new service file. I’m not sure if dbus-daemon would require this, I haven’t tested that. Though I wouldn’t be surprised if this is related to inotify file-monitors and introducing a merged /usr.

# gdbus call -y -d org.freedesktop.DBus \
-o /org/freedesktop/DBus \
-m org.freedesktop.DBus.ReloadConfig

At this point, the service should be both systemd and D-Bus activatable. We can verify that with another gdbus call quick.

# gdbus call -y -d org.gnome.Sysprof3 \
-o /org/gnome/Sysprof3 \
-m org.freedesktop.DBus.Peer.Ping
()

Now I can run the Flatpak as normal and it should be able to use the system extension to get profiling and system data from the host as if it were package installed.

$ flatpak run org.gnome.Sysprof.Devel

The following screenshots come from GNOME OS using yesterdays build with what I’ve described in this post. However, it also works on Fedora Rawhide (and probably Fedora 40) if you boot with selinux=0. More on that in the FAQ below.

Flatpak Integration

So obviously nobody would want to do all the work above just to make their Flatpak work. The user-facing goal here would be for the appropriate triggers to be provided by Flatpak to handle this automatically.

Making this happen in an automated fashion from flatpak installation triggers on the --system installation does not seem terribly out-of-scope. It’s possible that we might want to do it from within the flatpak binary itself but I don’t think that is necessary yet.

FAQ

What about non-system installations?

It would be expected that system extensions require installing to a system installation.

It does not make sense to allow for a --user installation, controllable by an unprivileged user or application, to be merged onto the host.

Does SELinux affect this?

In fact it does.

While all of this works out-of-the-box on GNOME OS, systems like Fedora will need work to ensure their SELinux policy to not prevent system extentions from functioning. Of course you can boot with selinux=0 but that is not viable/advised on end-user installations.

In the Sysprof case, AVC denials would occur when trying to exec /usr/libexec/sysprofd.

Does /usr become read-only?

If you have systemd <= 255 then system-sysext will most definitely leave /usr read-only. This is problematic if you want to modify your system after merging but makes sense because sysext was designed for immutable systems.

For example, say you wanted to sudo dnf install a-package on Fedora. That would fail because /usr becomes read-only after systemd-sysext merge.

In systemd >= 256 there is effort underway to make /usr writable by redirecting writes to the top-most writable layer. Though my early testing of Fedora Rawhide with systemd 256~rc1 still shows this is not yet working.

So why not a Portal?

One could write a portal for profilers alone but that portal would essentially be sysprofd and likely to be extremely application specific.

Can I use this for udev rules?

You could.

Though you might be better served by using the new Device and/or USB portals which will both save you code and systems integration hassle.

Can I have different binaries per OS?

Yes.

The systemd-sysext subsystem has a directory layout which allows for matching on some specific information in /etc/os-release. You could, for example, have a different system extension for specific Debian or CentOS Stream versions.

Can they be used at boot?

If we choose to symlink into a persistent systemd-sysext location (perhaps /etc/extensions) then they would be available at boot.

Can services run independent of user app?

Yes.

It would be possible to have a system service that could run independently of the user facing application.

Improving poll() timeout precision

Recently I was looking at a VTE performance issue so I added a bunch of Sysprof timing marks to be picked up by the profiler. I combined that with GTK frame timing information and GNOME Shell timing information because Sysprof will just do that for you. I noticed a curious thing in that almost every ClutterFrameClock.dispatch() callback was rougly 1 millisecond late.

A quick look at the source code shows that ClutterFrameClock uses g_source_set_ready_time() to specify it’s next deadline to awaken. That is in µsec using the synchronized monotonic clock (CLOCK_MONOTONIC).

Except, for various reasons, GLib still uses poll() internally which only provides 1 millisecond timeout resolution. So whatever µsec deadline was requested by the ClutterFrameClock doesn’t really matter if nothing else wakes up around the same time. And since the GLib GSource code will always round up (to avoid spinning the CPU) that means a decent amount late.

With the use of ppoll() out of question, the next thing to use on Linux would be a timerfd(2).

Here is a patch to make GLib do that. I don’t know if that is something we should have there as it will create an extra timerfd for every GMainContext you have, but it doesn’t seem insane to do it there either.

If that isn’t to be, then here is a patch to ClutterFrameClock which does the same thing there.

And finally, here is a graph of how the jitter looks when not using timerfd and when using timerfd.

A graph comparing the use of timerfd in ClutterFrameClock. Before, there is an erratic line jumping many times between 100usec and 1000usec. After, the line is stable at around 10usec.

Performance Profiling for Fedora Magazine

I’ve authored an article recently for Fedora Magazine on Performance Profiling in Fedora.

It covers both the basics on how to get started as well as the nitty-gritty details of how profilers work. I’d love for others to be more informed on that so I’m not the only person maintaining Sysprof.

Hopefully I was able to distill the information down a bit better than my typical blog posts. If you felt like those were maybe too difficult to follow, give this one a read.

Faster Numbers

The venerable GtkSourceView project provides a GtkWidget for various code languages. It has a number of features including the most basic, showing a line number next to your line of text.

A screenshot of GNOME Text Editor with line numbers enabled containing the file gtktextbuffer.c.

It turns out that takes a lot more effort than you might think, particularly when you want to do it at 240hz with kinetic scrolling on crappy hardware that may barely have enough engine for the GL driver.

First, you need to have the line number as a string to be rendered. For a few years now, GtkSourceView has code which will optimizes the translation from number to strings with minimal overhead. If you g_snprintf(), you’re gonna be slow.

After that you need to know the X,Y coordinate of the particular line within the gutter and it’s line height when wrapped. Then you need to know the measured pixel width of the line number string. Further still you need the xalign/yalign and xpad/ypad to apply proper alignments based on application needs. You may even want to align based on first line, last wrapped line, or the entire cell.

In the GtkSourceView 5.x port I created GtkSourceGutterLines which can cache some of that information. It’s still extremely expensive to calculate but at least we only have to do it once per-frame now no matter how many GtkSourceGutterRenderer are packed into the GtkSourceGutter.

After that, we can create (well recycle) a PangoLayout to setup what we want to render. Except, that is also extremely expensive because you need to measure the contents and go through a PangoRenderer for each line you render.

If you are kinetic scrolling through a GtkSourceView with something like a touch pad there is a good chance that a decent chunk of CPU is wasted on line numbers. Nicht gut.

Astute readers will remember that I spent a little time making VTE render faster this cycle and one of the ways to do that was to avoid PangoLayout. We can do the same here as it’s extremely simple and controlled input. Just cache the PangoGlyphInfo for 0..9 and use that to build a suitable PangoGlyphString. Armed with a PangoFont and said string, we can use gsk_text_node_new() and gtk_snapshot_append_node() instead of gtk_snapshot_render_layout().

A quick hour or so later I have given you back double digit CPU percentages but more importantly, smoother and lower latency input.

Sysprof makes it easy to locate, triage, and verify performance fixes.

A flamegraph showing that the line number gutter renderer in GtkSourceView was an extremely complex code path.

A flamegraph showing that line number rendering is now a very simple code path.

That said, in the future, if I were redesigning something to replace all of this I’d probably just use widgets for each line number and recycle them like GtkListView. Then you get GtkWidget render node caching for free. C’est la vie.

Flamegraphs for Sysprof

A long requested feature for Sysprof (and most profiler tools in general) is support for visualizing data as FlameGraphs. They are essentially a different view on the same callgraph data we already generate. So yesterday afternoon I spent a bit of time prototyping them to sneak into GNOME 45.

Many tools out there use the venerable flamegraphs.pl but since we already have all the data conveniently in memory, we just draw it with GtkSnapshot. Colorization comes from the same stacktrace categorization I wrote about previously.

A screenshot of flamegraph visualization of a callgraph in Sysprof.

If you select a new time range using the scrubber at the top, the flamegraph will update to stacktraces limited to that selection.

Selecting frames within the flamegraph will dive into those leaving enough breadcrumbs to work your way back out.

Visualizing Scheduler Details

One thing we’ve wanted for a while in Sysprof is the ability to look at what the process scheduler is doing. It can be handy to see what processes where switched and how they may be dependent on one-another. Previously, I’d fire up kernelshark for that as it’s a pretty invaluable tool. But having scheduler data inline with everything else you capture is too useful to pass up.

So here we have the sched:sched_switch tracepoint integrated into Sysprof marks so you can correlate that with the rest of your recording.

Scheduled processes displayed in a time series, segmented by CPU.

Writing Fast Search

The problem we encountered in my last writing was that gnome-clocks was taking about 300 milliseconds to complete a basic search query. I guess the idea is that if you type “paris” into GNOME Shell you’ll get the time in either Paris, France or one of the Paris’ in the United States. I guess 300 milliseconds wouldn’t be so bad if it didn’t also consume 100% of the CPU during that time.

Thankfully in my career I’ve had plenty of opportunity to work with database search indexes. So I have some practical experience in making that stuff fast(er).

So this morning I put together a small search index which can be generated from the Locations.bin using the libgweather API. That search index contains the serialized document form and a series of trigrams for the GWeatherLocation textual representation. That search index is meant to be static and installed along side Locations.bin.

Then for search, you take your term list and generate another series of trigrams. The SearchIndex provides iterators for each of those trigrams to find documents which contain it. So if you line those up with a sorted document list you can create an O(n*m) worst case iterator across potentially matching documents. In practice you look at a very small subset of the corpus.

As you iterate through those, you do your full termlist matching as you would have previously. Except instead of looking at thousands of entries, you look at just a few.

Long story short, you can go from 100% CPU for 300 milliseconds repeatedly to about 10 milliseconds and it keeps getting faster the more you type.

Once again, without tools like Sysprof and distributions with courage to enable frame-pointers like GNOME OS and Fedora, finding this stuff can be quite nebulous.

How to use Sysprof (again)

Every once in a while I take a moment to test GNOME OS on physical hardware.

The experience today was quite a bit underwhelming. Fresh install, type a few characters into the search box, and things grind to a halt.

Being the system profiler author I am, where would I consider spending time to make this better? Here ya go, and please do help because I can make the tools but I need people like you to help go resolve them.

I had to build Sysprof from source quick on GNOME OS until new GNOME OS builds are out (soon).

$ sysprof-cli --session-bus --gnome-shell capture.syscap 
$ sysprof ./capture.syscap

An overview of time spent in various processes

Interesting, a couple systemd-coredump processes busy doing ztsd compression on Nautilus crashes (in search providers). Issue filed.

Next up, gnome-software clocking in at 23% CPU (and remember, we’re competing against multiple zstd compressors for CPU time) which is busy doing appstream search for Flatpaks. Seems a bit high for something which is pre-compiled into a binary format and mmap()d at runtime to reduce CPU and memory overhead. Issue filed.

A screenshot of gnome-clocks search provider busyin libgweather deserialization.

Next is gnome-clocks at a whopping 15% to show me the time in cities near to whatever I type which is obviously “Riga” given GUADEC. Again, that’s 15% while competing with multiple zstd so in reality it’d be even more. Appears to be busy in libgweather doing deserialization, but specifically in finding the nearest city to a lat/lon position. A quick look at the code shows that this is probably one of the most expensive operations you can do and it’s done for every object deserialized. Probably could use some flags to avoid that from a search provider. Issue filed.

A screenshot of gnome-characters search provider taking 10% of system time in filter_keywords

Lastly in our top-offenders list is gnome-characters search provider. It’s clocking in at roughly 10% of system time (again, would be more if not for zstd) filtering keywords and getting character names. Considering we’re only showing up to maybe 3 of these results that seems significantly high. Issue filed.

So I implore my readers to go and make things fast.

Additionally, to be a good citizen myself, I put together an MR that makes search in Characters much, much faster.

And some fixes to make libxmlb faster (Software) here and here.