BOLT’ing Libraries

I did a little experimenting with BOLT today to optimize libraries post-link.

I’m not an expert on it or anything, but it seems to allow you to reorder functions in your executable/library based on feedback from perf record and some special post-processing. You can merge multiple runs together in case you have different workloads you’d like to optimize for. But in the end, hot functions get placed near each other to reduce instruction cache pressure.

In all, it says you can expect gains of up to about 7%, which is in line with my experiment. For example, I open gnome-text-editor with a large C file, the overview map enabled, and syntax highlighting on. Then I hold down Page Down to the bottom, Page Up to the top, and then Page Down back to the bottom.

The first pass through the source code is usually a little slower because you’re doing the incremental syntax-highlighting process.

After using BOLT on Pango, I saw roughly a 6% reduction in time spent measuring text (which is one of the most expensive parts of the overview map).

To test this out, I did have to adjust my CFLAGS to include the -Wl,--emit-relocs linker option. After that and a meson setup --wipe $SRCDIR, things seem to work as expected.

Trying it For Yourself

sudo dnf install llvm-bolt perf

perf record -e cycles:u -j any,u -o perf.data -- gnome-text-editor

# you can do this for any of the binaries
perf2bolt -p perf.data -o perf.fdata ~/.jhbuild/lib/libpango-1.0.so

llvm-bolt ~/.jhbuild/lib/libpango-1.0.so -o libpango-1.0.so.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats

mv libpango-1.0.so.bolt ~/.jhbuild/lib/libpango-1.0.so

Rinse and repeat.

Frame Pointers in the Media

BPF Performance Tools author and all around profiling expert Brendan Gregg wrote a blog post that sums up what was in my Fedora Magazine article quite well.

Though he has this to say about Fedora, which made this groundbreaking change, and Ubuntu, which followed along afterwards:

The main users of this change are enterprise Linux. Back-end servers.

Which is true in the sense of absolute numbers. But I must say it’s been extremely valuable on the desktop.

I can’t imagine having contributed to making VTE (a code-base I was unfamiliar with) twice as fast without it. Especially when that work happened over the course of about two weeks. It’s so much easier to do performance work when one monitor has usable profiler flamegraphs and the other code.

The wash/rinse/repeat cycle has gotten really good on Fedora and our performance future is bright.

Improving poll() timeout precision

Recently I was looking at a VTE performance issue so I added a bunch of Sysprof timing marks to be picked up by the profiler. I combined that with GTK frame timing information and GNOME Shell timing information because Sysprof will just do that for you. I noticed a curious thing in that almost every ClutterFrameClock.dispatch() callback was roughly 1 millisecond late.

A quick look at the source code shows that ClutterFrameClock uses g_source_set_ready_time() to specify its next deadline to awaken. That is in µsec using the synchronized monotonic clock (CLOCK_MONOTONIC).

Except, for various reasons, GLib still uses poll() internally which only provides 1 millisecond timeout resolution. So whatever µsec deadline was requested by the ClutterFrameClock doesn’t really matter if nothing else wakes up around the same time. And since the GLib GSource code will always round up (to avoid spinning the CPU) that means a decent amount late.
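To make the rounding cost concrete, here is a minimal sketch of the arithmetic, assuming a simple round-up from microseconds to whole milliseconds. The helper names are hypothetical and this is not GLib's actual code; it only illustrates why a µsec deadline can land up to 999 µsec late once it passes through a millisecond poll() timeout.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper (not GLib API): convert a microsecond interval to a
 * poll() timeout in milliseconds, rounding up so we never wake early. */
static int
usec_to_poll_timeout_ms (int64_t usec)
{
  int64_t ms;

  if (usec <= 0)
    return 0;

  ms = usec / 1000 + (usec % 1000 != 0);

  return ms > INT_MAX ? INT_MAX : (int) ms;
}

/* Hypothetical helper: how many microseconds past the requested interval
 * the rounded-up timeout expires. */
static int64_t
rounding_lateness_usec (int64_t usec)
{
  if (usec <= 0)
    return 0;

  return (int64_t) usec_to_poll_timeout_ms (usec) * 1000 - usec;
}
```

For a 60 Hz frame interval of 16667 µsec this yields a 17 msec timeout, which is 333 µsec late before any scheduling jitter is added.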

With the use of ppoll() out of the question, the next thing to use on Linux would be a timerfd(2).
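As a rough illustration of the idea, and not the contents of either patch, the sketch below arms a timerfd with an absolute CLOCK_MONOTONIC deadline in microseconds and then blocks in poll() with no timeout, so the wakeup is driven by the timer rather than by a millisecond value. The function names are hypothetical, and a real main loop would keep one long-lived timerfd in its poll set instead of creating one per wait.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: current CLOCK_MONOTONIC time in microseconds. */
static int64_t
monotonic_now_usec (void)
{
  struct timespec ts;

  clock_gettime (CLOCK_MONOTONIC, &ts);

  return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Hypothetical helper: block until an absolute monotonic deadline given in
 * microseconds. Returns 0 on success and -1 on failure. */
static int
wait_until_monotonic_usec (int64_t deadline_usec)
{
  struct itimerspec its = { { 0, 0 }, { 0, 0 } };
  struct pollfd pfd;
  uint64_t expirations;
  int ret = -1;
  int fd;

  /* An all-zero it_value would disarm the timer, so bail out early. */
  if (deadline_usec <= 0)
    return 0;

  fd = timerfd_create (CLOCK_MONOTONIC, TFD_CLOEXEC);
  if (fd < 0)
    return -1;

  its.it_value.tv_sec = deadline_usec / 1000000;
  its.it_value.tv_nsec = (deadline_usec % 1000000) * 1000;

  /* TFD_TIMER_ABSTIME: it_value is an absolute time on the timer's clock.
   * A deadline already in the past makes the fd readable right away. */
  if (timerfd_settime (fd, TFD_TIMER_ABSTIME, &its, NULL) == 0)
    {
      pfd.fd = fd;
      pfd.events = POLLIN;
      pfd.revents = 0;

      /* No millisecond timeout here, the timerfd provides the wakeup. */
      if (poll (&pfd, 1, -1) == 1 &&
          read (fd, &expirations, sizeof expirations) == (ssize_t) sizeof expirations)
        ret = 0;
    }

  close (fd);

  return ret;
}
```

Calling wait_until_monotonic_usec (monotonic_now_usec () + 2500) should return shortly after 2.5 msec have elapsed, where a plain poll() timeout could only express 3 msec.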

Here is a patch to make GLib do that. I don’t know if that is something we should have there as it will create an extra timerfd for every GMainContext you have, but it doesn’t seem insane to do it there either.

If that isn’t to be, then here is a patch to ClutterFrameClock which does the same thing there.

And finally, here is a graph of how the jitter looks when not using timerfd and when using timerfd.

A graph comparing the use of timerfd in ClutterFrameClock. Before, there is an erratic line jumping many times between 100usec and 1000usec. After, the line is stable at around 10usec.

Accessibility in Ptyxis

First off, what the heck is Ptyxis?

Ptyxis is the new name of what was formerly Prompt. The extremely nice people at Panic let me know they had a product that might be confused with Prompt and I agreed it could be confusing. Thankfully, their office is a few miles from me in Portland so I had a chance to meet the team face to face!

We found a lot to talk about, especially when it comes to text rendering with the GPU and people getting aggressive over their beloved fonts.

Hopefully, you like the new name. If not, take comfort in knowing that the desktop spec has support for GenericName and Keywords, allowing you to type whatever you like to find the application.

The application icon for Ptyxis which contains what looks like a keyboard key covered in leaves and an insertion caret.

In the North, Spring is about to burst which is a great reason to learn what Ptyxis is. You can even find a beautiful example of it on the cover of The Linux Programming Interface which if you’re into terminals and Linux you should already own.

Now, the new bit.

Accessibility is super important for many reasons I shouldn’t need to repeat.

I put together an implementation of the new GtkAccessibleText for VteTerminal and am bundling it with Ptyxis to get more testing.

Since this has the potential to really overload the screen reader as the final interfaces are figured out, it’s gated behind a toggle in preferences. I also am hesitant to enable it by default until we have a way in the a11y stack to be extremely lazy about initialization. I don’t want to waste a lot of CPU cycles tracking changes only for them to be sent to an accessibility D-Bus where nobody is listening.

Hopefully this is one less thing preventing Linux distributions from shipping a GTK 4 based terminal emulator by default.

A screenshot of Ptyxis preferences in the Behavior section. The toggle to enable Screen Reader is on. A terminal window says Computer, read this text.

Performance Profiling for Fedora Magazine

I’ve authored an article recently for Fedora Magazine on Performance Profiling in Fedora.

It covers both the basics on how to get started as well as the nitty-gritty details of how profilers work. I’d love for others to be more informed on that so I’m not the only person maintaining Sysprof.

Hopefully I was able to distill the information down a bit better than my typical blog posts. If you felt like those were maybe too difficult to follow, give this one a read.

Sidebars in Libpanel

One of the more recent design trends in GNOME has been the use of sidebars. They look great, they’re functional, and they give separation of content from hierarchy.

A screenshot showing a number of GNOME applications which contain sidebars including Nautilus, D-Spy, Control Center, and Calendar. The image contains both light and dark variants split by a line from lower left to upper right of the image.

Builder, on the other hand, has been stuck a bit closer to the old-hat design of IDEs where the hierarchy falls strictly from the headerbar. This is simply because libpanel was designed before that design trend. Some attempt was made in Builder to make it look somewhat sidebar’ish, but that was the extent of it given available time.

A screenshot of the GNOME 45 release of Builder where the headerbar is across the top and panels, documents, and project panel below.

Last week I had a moment of inspiration on a novel way we could solve it without uprooting the applications which use libpanel. You can now insert edge widgets in PanelDockChild which are always visible even when the child is not. Combining that with being able to place a headerbar inside a PanelDockChild along with your PanelFrames means you can get something that looks more familiar in modern GNOME.

A screenshot of what will become Builder for GNOME 46 which includes the common sidebar styling.

If you’d like to improve things further, you know where to find the code.

Faster Numbers

The venerable GtkSourceView project provides a GtkWidget for various code languages. It has a number of features including the most basic, showing a line number next to your line of text.

A screenshot of GNOME Text Editor with line numbers enabled containing the file gtktextbuffer.c.

It turns out that takes a lot more effort than you might think, particularly when you want to do it at 240 Hz with kinetic scrolling on crappy hardware that may barely have enough engine for the GL driver.

First, you need to have the line number as a string to be rendered. For a few years now, GtkSourceView has had code which optimizes the translation from numbers to strings with minimal overhead. If you g_snprintf(), you’re gonna be slow.
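To show what skipping g_snprintf() can look like, here is a minimal sketch that converts a line number to decimal by hand. The function name is hypothetical and this is not GtkSourceView's implementation. It only illustrates that the conversion needs no format-string parsing at all.

```c
#include <stdint.h>

/* Hypothetical helper (not GtkSourceView API): write the decimal form of
 * `line` into `out`, NUL-terminate it, and return the number of digits.
 * A uint32_t has at most 10 decimal digits, hence the 11-byte buffer. */
static unsigned
format_line_number (uint32_t line, char out[11])
{
  char reversed[10];
  unsigned n = 0;

  /* Peel off digits from least to most significant. */
  do
    {
      reversed[n++] = (char) ('0' + line % 10);
      line /= 10;
    }
  while (line != 0);

  /* Emit them in display order. */
  for (unsigned i = 0; i < n; i++)
    out[i] = reversed[n - 1 - i];
  out[n] = '\0';

  return n;
}
```

Returning the length means the caller does not need a strlen() pass before handing the string on for measurement.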

After that you need to know the X,Y coordinate of the particular line within the gutter and its line height when wrapped. Then you need to know the measured pixel width of the line number string. Further still you need the xalign/yalign and xpad/ypad to apply proper alignments based on application needs. You may even want to align based on first line, last wrapped line, or the entire cell.

In the GtkSourceView 5.x port I created GtkSourceGutterLines which can cache some of that information. It’s still extremely expensive to calculate but at least we only have to do it once per-frame now no matter how many GtkSourceGutterRenderer are packed into the GtkSourceGutter.

After that, we can create (well, recycle) a PangoLayout to set up what we want to render. Except, that is also extremely expensive because you need to measure the contents and go through a PangoRenderer for each line you render.

If you are kinetic scrolling through a GtkSourceView with something like a touch pad there is a good chance that a decent chunk of CPU is wasted on line numbers. Nicht gut.

Astute readers will remember that I spent a little time making VTE render faster this cycle and one of the ways to do that was to avoid PangoLayout. We can do the same here as it’s extremely simple and controlled input. Just cache the PangoGlyphInfo for 0..9 and use that to build a suitable PangoGlyphString. Armed with a PangoFont and said string, we can use gsk_text_node_new() and gtk_snapshot_append_node() instead of gtk_snapshot_render_layout().
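The digit-cache idea can be sketched without Pango at all. In the example below, DigitGlyph is a hypothetical stand-in for PangoGlyphInfo (a glyph id plus an advance width) and the output array stands in for a PangoGlyphString. None of these names are real Pango or GtkSourceView API. The assumption is that the ten digit glyphs were shaped once up front, so each line number becomes a handful of array copies and one width sum.

```c
#include <stdint.h>

/* Hypothetical stand-in for PangoGlyphInfo. */
typedef struct
{
  uint32_t glyph; /* glyph id in the font */
  int      width; /* advance width in font units */
} DigitGlyph;

/* Hypothetical cache holding the shaped glyphs for '0'..'9'. */
typedef struct
{
  DigitGlyph digits[10];
} DigitCache;

/* Fill `out` with the cached glyphs for `line` in display order and store
 * the summed advance in `total_width`. Returns the glyph count (max 10). */
static int
build_line_number_glyphs (const DigitCache *cache,
                          uint32_t          line,
                          DigitGlyph        out[10],
                          int              *total_width)
{
  uint8_t reversed[10];
  int n = 0;

  do
    {
      reversed[n++] = (uint8_t) (line % 10);
      line /= 10;
    }
  while (line != 0);

  *total_width = 0;

  for (int i = 0; i < n; i++)
    {
      out[i] = cache->digits[reversed[n - 1 - i]];
      *total_width += out[i].width;
    }

  return n;
}
```

The summed width is what an alignment step would need, and it comes out of the same loop without a separate measuring pass.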

A quick hour or so later I have given you back double digit CPU percentages but more importantly, smoother and lower latency input.

Sysprof makes it easy to locate, triage, and verify performance fixes.

A flamegraph showing that the line number gutter renderer in GtkSourceView was an extremely complex code path.

A flamegraph showing that line number rendering is now a very simple code path.

That said, in the future, if I were redesigning something to replace all of this I’d probably just use widgets for each line number and recycle them like GtkListView. Then you get GtkWidget render node caching for free. C’est la vie.

Prompt

Prompt is a terminal that marries the best of GNOME Builder’s seamless container support, the beauty of GNOME Text Editor, and the robustness of VTE. I like to think of it as a companion terminal to Builder.

Though it’s also useful for immutable/container-oriented desktops like Fedora Silverblue or Project Bluefin where containers are front-and-center.

A screenshot of Prompt with a menu open showing a list of available containers to spawn a new terminal shell within.

This came out of a prototype I made for the GNOME Builder IDE nearly a decade ago. We already had all the container abstractions so why not expose them as a Terminal Workspace?

Prompt extracts the container portion of the code-foundry into a standalone program.

My prototype didn’t go anywhere in recent years because I was conflicted. I have a high performance bar for software I ship and VTE wasn’t there yet on Wayland-based compositors which I use. But if you frequent this blog you already know that I reached out to the meticulous VTE maintainers and helped them pick the right battles to nearly double performance this GNOME cycle. I also ported gnome-terminal to GTK 4 which provided me ample opportunity to see where and how containers would integrate from an application perspective.

I designed Prompt to be Flatpak-first. That has design implications if you want a robust feature-set. Typically an application is restricted to the PID and PTY namespace within the Flatpak sandbox even if you’re capable of executing processes on the host. That means using TTY API like tcgetpgrp() becomes utterly useless when the kernel ioctl() returns you a PID of 0 (as it’s in a different namespace). Notably, 0 is the one value tcgetpgrp() is not documented to return. How fun!
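A defensive caller can fold that surprising 0 into the error case. The sketch below uses a hypothetical helper name and is not Prompt's code. It treats both -1 (the documented failure value) and 0 (what you may see across PID namespaces) as "foreground process group unknown".

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the foreground process group of the terminal
 * behind `pty_fd`, or -1 when it cannot be determined. A result of 0 is
 * treated as unknown too, since it cannot name a real process group. */
static pid_t
foreground_pgrp_or_unknown (int pty_fd)
{
  pid_t pgrp = tcgetpgrp (pty_fd);

  if (pgrp <= 0)
    return -1;

  return pgrp;
}
```

When this returns -1 inside a sandbox, the answer has to come from a process that lives in the host's namespaces.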

To give Prompt the best chance at tracking containers and foreground processes, a prompt-agent runs from the host system. It is restricted to very old versions of GLib/GObject/GIO and JSON-GLib because we know that /usr/bin/flatpak will already require them. Using those libraries instead of certain glibc API helps us in situations where glibc is only backward-compatible and not forward-compatible. Combined with point-to-point D-Bus serialization on top of a socketpair() we have a robust way to pass file-descriptors between the agent and the UI process and we’ll use that a bunch.

There are a lot of little tricks in here to keep things fast and avoid energy-drain. For example, process tracking is done with a combination of exponential-backoff and re-triggering based on either new content arriving or certain key-presses. It gives a very low-latency feeling to the sudo/ssh feature I love from Console, albeit with less overhead.
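The backoff-and-retrigger pattern can be sketched in a few lines. The struct, function names, and millisecond values below are hypothetical and are not taken from Prompt. The assumption is a poller that doubles its delay while nothing changes and snaps back to the shortest delay when new content arrives or an interesting key is pressed.

```c
/* Hypothetical polling state: all delays are in milliseconds. */
typedef struct
{
  unsigned min_ms;
  unsigned max_ms;
  unsigned cur_ms;
} Backoff;

static void
backoff_init (Backoff *b, unsigned min_ms, unsigned max_ms)
{
  b->min_ms = min_ms;
  b->max_ms = max_ms;
  b->cur_ms = min_ms;
}

/* Return the delay to wait before the next check, then double it for the
 * following round, capped at max_ms. */
static unsigned
backoff_next (Backoff *b)
{
  unsigned delay = b->cur_ms;

  if (b->cur_ms > b->max_ms / 2)
    b->cur_ms = b->max_ms;
  else
    b->cur_ms *= 2;

  return delay;
}

/* Call when new content arrives or a relevant key is pressed so the next
 * check happens quickly again. */
static void
backoff_retrigger (Backoff *b)
{
  b->cur_ms = b->min_ms;
}
```

An idle terminal settles at the maximum delay and costs almost nothing, while an active one is checked at the minimum delay, which is where the low-latency feel comes from.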

One thing I find helpful with Builder is that when I come back to it my project session is right there. So this has session support too. It will restore your tabs/containers how they were before. So if you have a similar workflow you might find that useful. If not? Just turn it off in Preferences.

I want to have a bit of fun because right now I’m stuck indoors caring for my young, paraplegic dog. So it’s packed full of palettes you can play with. Who doesn’t like a little color!

A screenshot of Prompt with the preferences window open allowing the selection of numerous palettes with diverse color sets. The terminal window is recolored using colors from the palette.

There are some subtle performance choices that make for a better experience in Prompt. For example, I do like having a small amount of padding around the terminal so that rounded corners look nice, which also avoids an extra framebuffer when rendering on the GPU. However, that looks odd with scrollback. So Prompt rewrites the snapshot from VTE to remove the background and extraneous clipping. We already have a background from window recoloring anyway. It’s a small detail that just feels good when using it.

Another subtle improvement is detecting when we are entering the tab overview. Currently, libadwaita uses a GtkWidgetPaintable to represent the tab contents. This works great for the likes of Epiphany where the contents are backed by GL textures. But for a terminal application we have a lot of text and we don’t want to redraw it scaled as would happen in this case. That puts a lot of pressure on the glyph cache. So instead, we create a single texture upfront and scale that texture. Much smoother.

For people writing terminal applications there is a little inspector you can pop up to help you out. It can be difficult to know if you’re doing the right thing or getting the right behavior so this might be something we can extend going forward to make that easier for you. GTK’s inspector already does so much so this is just an extension of what you could do there.

A terminal window open with a secondary "inspector" window open. The inspector shows what column and row the mouse is positioned as well as the cursor and what non-visible OSC hyperlink is under the pointer.

Creating Prompt has elevated problems we should fix.

  • Podman creates an additional PTY which sort of breaks the whole notion of foreground processes. Filed an issue upstream and it seems likely we can get that addressed for personal containers. That will improve what happens when you close your terminal tab with something running or if you SSH’d into another host from the container.
  • Container tracking is currently limited to Fedora hosts because toolbox only emits the container-tracking escape sequences when the host is Fedora. The current goal I’ve discussed with VTE maintainers is that we’ll use a new “termprop” feature in VTE that will be silently dropped on terminal applications not interested in it. That way toolbox and the likes can safely emit the escape sequence.
  • Currently podman will exit if you pass a --user or --workdir that does not exist in the container. That isn’t a problem with toolbox as it is always your user and fails gracefully for directories. So we need a good strategy to see if both of those are available to inherit when creating new tabs.
  • This does have transparency support, but it’s hidden in a GSetting for now. Once we have libadwaita with CSS variable support we can probably make this look better during transitions which is where it falls down now. We also need some work on how AdwTabOverview does snapshots of tabs and inserts a background.
  • I have a .bashrc snippet to treat jhbuild as a container which is helpful for those of us GNOME developers still using it.
  • Accessibility is an extremely important piece of our infrastructure in GNOME. So part of this work will inevitably tie into making sure the a11y portion of VTE works with the soon-to-land a11y improvements in GTK. That has always been missing on GTK 4-based VTE and therefore every terminal based upon it.

$ flatpak install --user --from https://nightly.gnome.org/repo/appstream/org.gnome.Prompt.Devel.flatpakref

If you like software that I write, consider donating to a pet shelter near you this holiday season. We’re so lucky to have great pet care in Oregon but not everywhere is so lucky.

Happy Holiday Hacking!

Toby is Recovering in ER ICU

Normally I’m posting about code here, but for the past two weeks most of my time has been spent taking care of our 4-year-old Australian Shepherd. Toby is very special to me and we even share the same birthday!

Toby recently lost control of his hind legs, which was related to a herniated disk and likely IVDD. For the past two weeks my wife and I have been on full-time care duty. Diapers, sponge baths, the whole gamut.

Previously we had X-rays done but that type of imaging is not all that conclusive for spinal injuries. So yesterday he had a doggy MRI and was then rushed into surgery for his L1/L2 discs and a spinal tap. The spinal tap is to make sure the situation wasn’t caused by meningitis. It seems to have gone well but this morning he still hasn’t regained control of his hind legs (and that’s to be expected so soon after surgery). He does have feeling in them though, so that’s a positive sign.

He’s still a very happy boy and we got to spend a half hour with him this morning in the ICU.

Thanks to everyone who has been supporting us through this, especially with watching our 5-month-old kitten June who can be a bit of a rascal during the day. We wouldn’t be able to get any sleep without y’all.

Toby, a 4-year-old Australian Shepherd with a soft plushy toy, lying on a microfoam bed to ease the pain on his back.

To keep his mind active, Tenzing started to teach him to sing. He’s already got part of La Bouche’s “Be My Lover” down which is just too adorable not to share with you.

VTE performance improvements

To celebrate every new GNOME release I try to do a little bit of work that would be intrusive to land at the end of the cycle. The 46 cycle is no different and this time I’m making our terminals faster.

The terminal is surely the most used desktop app for developers and things have changed in drawing models over the years. There might be some excellent energy savings to be had! So I made myself a little prototype to see how much faster we might be able to go without drastic design changes and use that as my guide to improving VTE performance.

VTE has been around since the early days of GNOME. It’s been touched in some manner by many programmers that I consider more talented than myself, but perhaps I can improve things yet!

So far I’ve landed a little over a dozen patches, none of which address drawing (yet). So that means these patches will make both the GTK 3 and GTK 4 versions of VTE faster. Once the last patch lands in this category we will have cut wall clock time down for a number of common scenarios by a solid 40%. That’s a pretty good win!

After these land I have a bunch of patches which introduce native GTK 4 drawing primitives instead of Cairo. Those patches will ultimately reduce draw latency on GTK 4 while not regressing GTK 3 performance. There are still a couple things to figure out around some “minifont” usage, but things are looking good.

I’d also like to find a way to get draw timing driven by the frame clock rather than some internal timeouts. Combining that with the GTK 4 native drawing will certainly make things feel faster on the “butt dyno”.

Anyway, I probably won’t go down the rabbit hole with this, I just want to get things in line with performance expectations.

And to nobody’s surprise, this is the type of stuff that is much easier to do when armed with Sysprof and working frame-pointers.