GTK 4 NGL Renderer

I spent a lot of time in 2020 working on projects tangential to what I’d consider my “main” projects. GtkSourceView got a port to GTK 4 and a load of new features, GTK 4 got a new macOS backend, and in December I started putting together a revamp of GTK 4’s GL renderer.

The nice thing about having multiple renderer backends in GTK 4 is that we still have Cairo rendering as an option. So while doing bring-up of the new GTK macOS backend I could just use that. Making software rendering fast enough to not be annoying is a good first step because it forces you to shake out performance issues pretty early.

But once that is working, the next step is to address how well the other backends can work there. We had two other backends: OpenGL (requiring 3.2 Core and up) and Vulkan. Right now, the OpenGL renderer is the best-supported renderer for acceleration in terms of low bug count, so that seemed like the right way to go if you want to stay in line with the Linux and Windows backends. Especially after you actually try to use MoltenVK on macOS and realize it’s a giant maze. The more work we can share across platforms (even if temporarily) the better we can make our Linux experience. Personally, that is something I care about.

From what I’ve seen, it looks like OpenGL on the M1 was built on top of Metal, so it seems fine to have chosen that route for now. People seem to think that OpenGL is going to magically go away just because Apple says they’ll remove it. First off, if they did, we’d just fall back to another renderer. Second, it’s likely that Zink will be a viable (and well-funded) alternative soon. Third, they just released a brand new hardware architecture and it still works. That was the best point in time to drop it if there ever was one.

The NGL renderer makes full snapshots of uniforms and attachments while processing render nodes so that we can reorder batches going forward. Currently, we only reorder by render target, but that alone is a useful thing. We can start to do a lot more in the future as we have time. That might include tiling, executing batches on threads, and reordering batches within render targets based on programs so long as vertices do not overlap.
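To make the reordering idea concrete, here is a minimal sketch in plain C. It is not the renderer's actual code: the Batch struct and reorder_batches() are hypothetical names, and the toy ignores any dependencies between render targets that a real renderer has to respect. It does a stable sort by render target, so batches for the same target end up adjacent while keeping their submission order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch record; a real renderer snapshots far more state. */
typedef struct {
  unsigned render_target;  /* which framebuffer the batch draws into */
  unsigned sequence;       /* original submission order */
} Batch;

/* Stable insertion sort: group batches by render target while
 * preserving submission order within each target. */
void
reorder_batches (Batch *batches, size_t n_batches)
{
  assert (batches != NULL || n_batches == 0);

  for (size_t i = 1; i < n_batches; i++)
    {
      Batch tmp = batches[i];
      size_t j = i;

      /* Strict '>' keeps equal targets in their original order. */
      while (j > 0 && batches[j - 1].render_target > tmp.render_target)
        {
          batches[j] = batches[j - 1];
          j--;
        }

      batches[j] = tmp;
    }
}
```

Grouping like this means each render target only has to be bound once per frame in the toy model, which is the kind of saving that reordering by render target is after.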

But anyway, my real motivation for cleaning up the GL renderer was so that someone who is interested in Metal can use it as a template for writing a renderer. Maybe that’s you?

Major shout-out to everyone that worked on the previous GL renderer in GTK. I learned so much from it and it’s really quite amazing to see GTK 4 ship with such interesting designs.

GTK 4 got a new macOS backend (now with OpenGL)

I’ve been busy the past few months writing a new GDK backend for macOS when not maintaining my other projects. Historically our macOS performance wasn’t something to rave about. But it’s getting better in GTK 4.

The new backend can do both software rendering with Cairo and hardware-based OpenGL rendering using the same OpenGL renderer as we use on GNU/Linux.

This was a fairly substantial “greenfield” rewrite of the backend because so much of it had bit-rotted during the development of GTK 4. GDK hardly looks the same as it did in previous releases and that is a good thing. It’s much easier to write a new backend these days.

I tried to polish it off a bit too, teaching it to do CSD edge-snapping and more. If you’re unfortunate enough to be using the software renderer, it does have some tricks to make drawing a bit faster than in the past. We dropped our use of the quartz Cairo backend in favor of the image backend because, well, it’s faster. Additionally we get a bit clever with opaque regions to speed up CSD compositing.

It also uses the CVDisplayLink to get presentation timing information from the display server to drive our frame clock.

A screenshot of the macOS backend

Thanks again to my employer, Red Hat, for funding this work so we can all benefit from having our applications reach more users.

GtkSourceView gets a JIT

I just merged a new regex implementation for GtkSourceView’s language specifications. Previously it used GRegex (based on PCRE) and now it uses PCRE2 directly, similar to what VTE did.

Not only does this get us on a more modern PCRE implementation, but it also allows us to use new features such as a JIT.

JITs are interesting in that you can trade a little bit of memory and time to generate executable code upfront for huge gains in execution time. Given that you only compile language specifications once per regex, but execute them many, many times, it’s a worthwhile feature for GtkSourceView.

With GRegex, the minified HTML of google.com/ won’t highlight at all (due to timeouts). But with PCRE2 and the JIT, it can get by.

In many cases, I found that compiling a regex with the JIT cost about 4x as much as compiling it with PCRE2 without the JIT. Execution time drops by about 4x with the JIT (and is sometimes many times faster than that). When you run these regexes millions of times across an edited file, it can really cut down on the amount of energy consumed as well as time taken away from doing things like rendering GTK’s scene graph.

I should note that it’s about 4x improvement on a per-regex basis, so when you run potentially thousands of those in one main loop cycle, the improvement can be much more drastic in what you can do.

If you have any issues with language specifications please let us know! It’s a very large change, so I wouldn’t be surprised if there is some fallout.

Regex Creation

| Language | Min (msec) | Max (msec) | Average (msec) | # of Calls | Notes |
|---|---|---|---|---|---|
| C | .001 | .104 | .007 | 91 | gtktreeview.c |
| C (with JIT) | .004 | .383 | .031 | 91 | |
| Difference | .003 | .279 | .024 | | |
| CSS | .001 | .711 | .022 | 197 | gtk-contained.css |
| CSS (with JIT) | .004 | 3.147 | .101 | 198 | |
| Difference | .003 | 2.436 | .079 | | |

Regex Execution Loading File

| Language | Min (msec) | Max (msec) | Average (msec) | # of Calls | Notes |
|---|---|---|---|---|---|
| C | .000 | .196 | .003 | 17698 | gtktreeview.c |
| C (with JIT) | .000 | .061 | .001 | 17698 | |
| Difference | -.000 | -.135 | -.002 | | ~35 ms |
| CSS | .000 | .211 | .022 | 84347 | gtk-contained.css |
| CSS (with JIT) | .000 | .061 | .001 | 74812 | |
| Difference | -.000 | -.135 | -.002 | | ~150 ms |
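As a rough check on the trade-off, the C numbers above can be plugged into a tiny calculation. This is only a sketch: the two helper functions are hypothetical names, and it uses the rounded averages from the tables rather than the raw measurements.

```c
#include <assert.h>

/* One-time extra cost (msec) of JIT-compiling n_regexes patterns,
 * given the average creation time with and without the JIT. */
double
jit_extra_compile_ms (double avg_plain_ms,
                      double avg_jit_ms,
                      unsigned n_regexes)
{
  assert (avg_jit_ms >= avg_plain_ms);
  return (avg_jit_ms - avg_plain_ms) * n_regexes;
}

/* Total execution time saved (msec) over n_calls regex executions,
 * given the average execution time with and without the JIT. */
double
jit_saved_exec_ms (double avg_plain_ms,
                   double avg_jit_ms,
                   unsigned long n_calls)
{
  return (avg_plain_ms - avg_jit_ms) * n_calls;
}
```

For gtktreeview.c, jit_extra_compile_ms (.007, .031, 91) comes to a little over 2 msec of extra compile time, while jit_saved_exec_ms (.003, .001, 17698) comes to roughly 35 msec saved on the initial load alone, which lines up with the "~35 ms" note in the table.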

GtkSourceView Next

Earlier this year I started a branch to track GTK 4 development which is targeted for release by end-of-year. I just merged it which means that our recently released gtksourceview-4-8 branch is going to be our LTS for GTK 3. As you might remember from the previous maintainer, GtkSourceView 4.x is the continuation of the GtkSourceView 3.x API with all the deprecated API removed and a number of API improvements.

Currently, GtkSourceView.Next is 5.x targeting the GTK 4.x API. It’s a bit of an unfortunate number clash, but it’s been fine for WebKit so we’ll see how it goes.

It’s really important that we start getting solid testing because GtkSourceView is used all over the place and is one of those “must have” dependencies when moving to a new GTK major ABI.

Preparations in GTK 4

Since I also spend time contributing to GTK, I decided to help revamp GtkTextView for GTK 4. My goal was to move various moving parts into GtkTextView directly so that we could make them more resilient.

Undo Support

One feature was undo support. GTK 4 now has native support for undo by implementing text history in a compact form within GTK itself. You can now set the enable-undo properties to TRUE on GtkTextView, GtkEditable widgets like GtkText or GtkEntry, and others.

GPU Rendered Text (sort of)

Matthias Clasen and I sat down one afternoon last year and wrote a new PangoRenderer for GSK using render nodes and the texture atlas provided by the OpenGL and Vulkan renderers. Since then, GtkTextView gained a GtkTextLineDisplay cache so that we can keep these immutable render nodes around across multiple snapshots.

Text is still rendered on the CPU into a texture atlas, which is uploaded to the GPU and re-used when possible. Maybe someday things like pathfinder will provide a suitable future.

GtkTextView and Widgets

Previously, the gutters for GtkTextView were simply a GdkWindow which could be rendered to with Cairo. This didn’t fit well into the “everything should be a widget” direction for GTK 4. So now you can pack a widget into each of the 4 gutters around the edges of a GtkTextView. This means you can handle input better too using GtkGesture and GtkEventControllers. More importantly, though, it means you can improve performance of gutter rendering using snapshots and cached render nodes when it makes sense to do so.

Changes in GtkSourceView Next

Moving to a new major ABI is a great time to do cleanups too as it will cause the least amount of friction. So I took this opportunity to revamp much of the GtkSourceView code. We follow more modern GObject practices and have bumped our compiler requirements to closely match GTK 4 itself. This still means no g_autoptr() usage from within GtkSourceView, sadly, thanks to MSVC being … well, the worst C compiler still in wide use.

GtkSourceGutterRenderer is now a GtkWidget

Now that we have margins which can contain widgets and contribute to the render node tree, both GtkSourceGutter and GtkSourceGutterRenderer are GtkWidgets. This will mean you need to change custom gutter renderers a bit, but in practice it means a lot less code than they previously contained. It also makes supporting HiDPI much easier.

GtkSourceCompletion Revamp

I spent a lot of time making completion a pleasing experience in GNOME Builder and that work has finally made it upstream. To improve performance and simplicity of implementation, this has changed the GtkSourceCompletionProvider and GtkSourceCompletionProposal interfaces in significant ways.

GtkSourceCompletionProposal is now a mostly superfluous type used to denote a specialized GObject. It doesn’t have any functions in the vtable nor any properties currently and the goal is to avoid adding them. Simply G_IMPLEMENT_INTERFACE (GTK_SOURCE_TYPE_COMPLETION_PROPOSAL, NULL) when defining your proposal object GType.

This is because all of the completion provider implementation can now be performed from GtkSourceCompletionProvider. This interface focuses on using interfaces like GListModel (like the rest of GTK 4) and on how to asynchronously generate and refine the results with additional key-presses.

The completion window has been revamped and now allows proposals to fill a number of columns including an icon, return-type (Left Hand Side), Typed Text, and supplementary text. It resizes with content and ensures that we only inflate the number of GObjects necessary to view the current set. A fixed number of widgets are also created to reduce CSS and measurement costs.

Further, proposals may now have “alternates”, which allow providers to keep all of the DoSomething() proposals with 20 overloaded forms for each base type in whatever language of the day is being used from clogging up the suggestions.

The new GtkSourceCompletionCell widget is a generic container used throughout completion for everything from icons and text to custom widgetry for the completion details popover.

Completion Preview

GtkSourceGutterLines

A new abstraction, GtkSourceGutterLines, was added to help reduce overhead in generation of content in the gutter. The design of gutters led to an exorbitant amount of measurement work on every frame. This was actually the biggest hurdle in making GTK 3 applications scroll smoothly. The new design allows for all the renderers to collect information about lines in one pass (along with row height measurements) and then snapshot in their second pass. Combined with the ability to cache render nodes, gutter renderers should have what they need to remain fast even in HiDPI environments.
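The two-pass idea can be sketched in plain C. This is a toy model, not the GtkSourceGutterLines implementation: the struct layout, the function names, and the fake measure_line_height() are all assumptions made for illustration. The point is that the expensive measurement runs once per visible line, no matter how many renderers draw afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VISIBLE_LINES 64

typedef struct {
  int y;       /* top of the line, relative to the first visible line */
  int height;  /* measured row height */
} LineInfo;

typedef struct {
  unsigned first_line;
  unsigned n_lines;
  unsigned measure_calls;  /* counts how often the expensive path ran */
  LineInfo lines[MAX_VISIBLE_LINES];
} GutterLines;

/* Hypothetical stand-in for an expensive text-layout measurement. */
int
measure_line_height (GutterLines *lines, unsigned line)
{
  lines->measure_calls++;
  return (line % 2 == 0) ? 18 : 20;
}

/* Pass 1: measure every visible line exactly once and cache the result. */
void
gutter_lines_collect (GutterLines *lines, unsigned first, unsigned last)
{
  int y = 0;

  assert (last >= first && last - first < MAX_VISIBLE_LINES);

  lines->first_line = first;
  lines->n_lines = last - first + 1;

  for (unsigned i = 0; i < lines->n_lines; i++)
    {
      lines->lines[i].y = y;
      lines->lines[i].height = measure_line_height (lines, first + i);
      y += lines->lines[i].height;
    }
}

/* Pass 2: a renderer only reads the cached geometry. Here it just
 * returns the bottom edge of the last line it would draw. */
int
renderer_snapshot (const GutterLines *lines)
{
  int bottom = 0;

  for (unsigned i = 0; i < lines->n_lines; i++)
    bottom = lines->lines[i].y + lines->lines[i].height;

  return bottom;
}
```

With five visible lines and two renderers, the measurement count stays at five instead of ten, and it would stay there for any number of additional renderers.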

The implementation of this also has a few nice details to further reduce overhead, but I’ll leave that to those interested in reading the code.

GtkSourceBuffer::cursor-moved

GtkSourceBuffer now has a cursor-moved signal. This seemed to be something implemented all over the place so we might as well have it upstream.

Reduce signal emission overhead

A number of places have had signal emission overhead reduced, especially in property notifications.

Spaces Drawing

The GtkSourceSpaceDrawer now caches render nodes for drawing spaces. This should improve the performance in the vast majority of cases. However, one case still could be improved upon: tabs when the tab width changes (generally when used after text or spaces).

New Features

Snippets

A new snippet engine has landed based on a much improved version from GNOME Builder. You can provide bundles using an XML snippets file. You can also create them dynamically from your application and insert them into the GtkSourceView. In fact, many completion providers are expected to do this.

The snippet language is robust and shares many features and implementation details from GNOME Builder.

Assistants

A new subsystem, GtkSourceAssistant is used to provide accessory information in a GtkSourceView. Currently this type is private and an implementation detail. However, GtkSourceCompletion and GtkSourceSnippet build upon it to provide some of their features. In the long term, we expect hover providers to also take advantage of this subsystem.

Sysprof Support

GtkSourceView now uses the Sysprof collector API just like GTK 4 does (among many other GNOME projects). This means you can get profiling information about renderings right in the Sysprof visualizer alongside other data.

Future Work

PCRE2

With GRegex on the chopping block for deprecation, it’s time to start moving to PCRE2 much like VTE did. Doing so will not only make us more deprecation safe, but ensure that we can actually use the JIT feature of the regex engine. With how much regexes are used by the highlighting engine, this should be a fairly sizable improvement.

This has now been implemented.

Hover Providers

In GNOME Builder, we added an abstraction for “Hover Providers”. This is also a thing in the Language Server Protocol realm. Nothing exists upstream in GtkSourceView for this and that should probably change. Otherwise all the trickiness in making transient popovers work is put on application authors.

Style Schemes

I would like to remove or revamp some of our default style schemes. They do not handle the world of dynamic GTK themes so well and become a constant source of bug reports by applications that want a “one size fits all” style scheme. I’m not sure yet on the complete right answer long term here, but my expectation is that we’d want to move toward a default style scheme that is mostly font changes rather than color changes which eventually fall apart on the more … interesting themes.

Anyway, that’s all for now!

GObject Class Private Data

It can be very handy to store things you might do as meta programming in your GObjectClass’s private data (see G_TYPE_CLASS_GET_PRIVATE()).

Doing so is perfectly fine, but you need to be aware of how GTypeInstance initialization works. Each of your parent classes’ instance init functions is called before your subclass’s instance init (in order of the type hierarchy). What might seem non-obvious, though, is that the GTypeInstance.g_class pointer is updated as each successive _init() function is called.

That means if you have my_widget_init() and your parent class is GtkWidget, gtk_widget_init() does not know it’s instantiating a subclass. Furthermore, GTK_WIDGET_GET_CLASS() called from gtk_widget_init() will get you the base class’s GtkWidgetClass, not the subclass’s GtkWidgetClass.
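The behavior is easier to see in a toy model. The following is plain C that only mimics the order of operations described above; it does not use GObject, and every name in it (TypeClass, Instance, instance_construct(), and so on) is made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct TypeClass TypeClass;
typedef struct Instance Instance;

struct TypeClass {
  const TypeClass *parent;
  void           (*instance_init) (Instance *self);
};

struct Instance {
  const TypeClass *g_class;              /* plays the role of GTypeInstance.g_class */
  const TypeClass *seen_in_parent_init;  /* what the parent's init observed */
  const TypeClass *seen_in_child_init;   /* what the subclass's init observed */
};

/* Each init records the class pointer it can see at that moment. */
void
widget_init (Instance *self)
{
  self->seen_in_parent_init = self->g_class;
}

void
my_widget_init (Instance *self)
{
  self->seen_in_child_init = self->g_class;
}

const TypeClass widget_class = { NULL, widget_init };
const TypeClass my_widget_class = { &widget_class, my_widget_init };

/* Run parent inits first, updating g_class before each init call. */
void
instance_construct (Instance *self, const TypeClass *klass)
{
  assert (self != NULL && klass != NULL);

  if (klass->parent != NULL)
    instance_construct (self, klass->parent);

  self->g_class = klass;
  klass->instance_init (self);
}
```

After constructing an instance of my_widget_class, the parent's init has only ever seen widget_class, even though the finished instance points at my_widget_class. That is the same trap as calling GTK_WIDGET_GET_CLASS() from gtk_widget_init().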

There are ways around this if you don’t use G_DEFINE_TYPE(), but honestly, who wants to do that?

One technique around this, which I used in Bonsai’s DAO, is to use a singly linked list where the head is in each subclass, but the tail exists in each of the parent classes. That way you share all the parent structures, but the subclasses can access all of theirs. You’ll still want to defer most setup work until constructed() though so you can get the full class information of the subclass and hierarchy.
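A minimal sketch of that shared-tail list, in plain C rather than GObject class data. This is not Bonsai's code; FieldNode and the two helpers are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct FieldNode FieldNode;

struct FieldNode {
  const char      *name;
  const FieldNode *next;  /* points toward the parent class's entries */
};

/* Prepend a node owned by a subclass onto its parent's list.
 * The parent's list is never modified, so siblings can share it. */
const FieldNode *
field_list_prepend (FieldNode *node,
                    const char *name,
                    const FieldNode *parent_head)
{
  assert (node != NULL);

  node->name = name;
  node->next = parent_head;

  return node;
}

/* Walk from a class's head through all inherited entries. */
size_t
field_list_length (const FieldNode *head)
{
  size_t n = 0;

  for (; head != NULL; head = head->next)
    n++;

  return n;
}
```

If a base class registers one entry and two sibling subclasses each prepend their own, both subclasses see two entries and both lists end in the very same base node. Nothing is copied, and the base class still sees only its own entry.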

How to use Sysprof to… Part II

In the previous article of this series we covered Sysprof basics to help you use the tooling. Now I want to take a moment to show you how to use the command line tooling to profile systems like GNOME Shell.

Record an existing session

The easiest way to get started is to record your existing GNOME Shell session. With sysprof-cli, you can use the --gnome-shell option and it will attempt to connect to your active GNOME Shell instance over D-Bus to stream COGL pipeline information over a private file-descriptor.

This information can be combined with callgraphs to see what is happening during the duration of a COGL mark.

The details page can also provide some quick overview information about the marks and their duration. You will find this helpful when comparing patches to see if they really improved things over time.

The details button in the top right will show you information about marks and their min/max/avg duration.

Basic Shell Recording

Running something like a desktop session is complex. You have a D-Bus daemon, a compositor, a series of background daemons, settings infrastructure, and programs saving to your home directory. For this reason you cannot really run two of them for the same user at the same time, or even nested.

Because of this, it is handy to log out of your desktop session and switch to a VT to profile GNOME Shell. Sysprof provides a sysprof-cli binary you can use to profile in complicated setups like this.

Start by switching to another VT with something like Control+Alt+F3. I recommend stopping the current display server just so that it doesn’t get in the way of profiling, but usually it’s okay to not. Then we can enter our JHBuild environment with a new D-Bus session before we start Sysprof and GNOME Shell.

Fedora 32 (Workstation Edition)
Kernel 5.6.0-0.rc4.git0.1.fc32.x86_64 (tty3)

startdust login: christian
Password: 
$ sudo service gdm stop
$ dbus-run-session jhbuild shell
$ 

At this point, we can spawn GNOME Shell with Sysprof to start recording.

You can use -- to specify the command you want sysprof-cli to execute while it records. When that application exits, sysprof-cli will extract all the known symbols and finish its recording.

I want to mention briefly that the --gnome-shell option only works with an existing GNOME session. I hope to fix that in the near future though.

$ sysprof-cli -- gnome-shell --wayland --display-server

At this point, GNOME Shell will have spawned and you can exercise it to exhibit the behavior you’d like to improve. When done, open a terminal window to kill GNOME Shell so that the profiler can clean up.

kill -9 $(pidof gnome-shell) seems to work well for me

Now you’ll have a capture.syscap file in your current directory. Open that up with Sysprof to view the contents of your profiling session. Often I just spawn gnome-shell directly to open the syscap file and explore.

Recording JavaScript stacks

Sometimes you want to profile JavaScript instead of the C code from Shell, Mutter, and friends. To do this, use the --gjs command line option. Currently, this can give mixed results if you also sample callstacks with the Linux perf support, as the timings are not guaranteed to be equivalent. My recommendation is to disable perf when sampling JavaScript using the --no-perf option.

$ sysprof-cli --gjs --no-perf -- gnome-shell --wayland --display-server

Now when you open the callgraph in Sysprof, you’ll see JavaScript samples.

JavaScript callgraph example

Recording Energy Consumption

On Linux, we have support for tracking energy usage via “Running Average Power Limit” or RAPL for short. Sysprof can include this information for you in your capture if you have the turbostat utility available. It provides power information per “package” such as the GPU and CPU.

Keeping power consumption low is an important part of a modern desktop that aims to be useful on laptops and smaller form factors. It’s useful to check in now and again to ensure that we’re keeping things tip top.

$ sysprof-cli --rapl --no-perf -- gnome-shell --wayland --display-server

You might want to disable sampling while testing power consumption because that could have a larger effect in terms of wattage than the thing you’re profiling.

Don’t forget to check the counter and energy menus for additional graphs.

Reducing Memory Allocations

Plugging memory leaks is a great thing to do. But sometimes it’s better to never allocate things to begin with. The --memprof option can help you find extraneous allocations in your program. For example, I tested the --memprof option on GNOME Shell when writing it and immediately found a way to reduce temporary allocations by hundreds of MiB per minute of use.

$ sysprof-cli --memprof -- gnome-shell --wayland --display-server

Avoiding Main Loop Stalls

This one requires you to build Sysprof until our next release, but you can use the --speedtrack option to find things running on your main loop that may not be a good idea. It will also insert marks for how long the main loop iterations run to find periods of time that you aren’t staying interactive.

$ sysprof-cli --speedtrack -- gnome-shell --wayland --display-server

Anyway, that does it for now! Hope you found this brain dump insightful enough to help us all push forward on the performance curve.

How to use Sysprof to…

First off, before using Sysprof to improve the performance of a particular piece of software, make sure you’re compiling with flags that allow us to have enough information to unwind stack frames. Sysprof will use libunwind in some cases, but a majority of our stack unwinding is done by the Linux kernel, which can currently only follow frame pointers.

In my ~/.config/jhbuildrc, I have the following:

os.environ['CFLAGS'] = '-ggdb -O2 -fno-omit-frame-pointer'
os.environ['G_SLICE'] = 'always-malloc'

I generally disable the G_SLICE allocator because it isn’t really all that helpful on modern Linux systems using glibc and can also make it more difficult to track down leaks. Furthermore, it can get in the way of releasing memory back to the system in the form of malloc_trim() should we start doing that in the future. (Hint, I’d like to).

Finding code run often on the system

Sysprof, at its core, is a “whole system” profiler. That means it is not designed to profile just your single program, but instead all the processes on the system. This is very useful in a desktop scenario where we have lots of interconnected components.

Ensure the “Callgraph” aid is selected and click “Record”.

At this point, exercise your system to try to bring out the behavior you want to optimize. Then click “Stop” to stop recording and view the results.

You’ll be presented with a callgraph like the following after it has completed recording and loaded the information.

You’ll notice a lot of time in gnome-software there. It turns out I’m on a F32 alpha install and there was a behavior change in libcurl that has screwed up a number of previously valid use cases. But if I didn’t know that already, this would point me where to start looking. You’ll notice that I hadn’t compiled libcurl or gnome-software from source, so the stack traces are not as detailed as they would be otherwise.

On the right side is a callgraph starting from “[Everything]”. It is split out by process and then by the callstack you see in that program. On the top-left side is a list of all functions that were collected (and decoded). On the bottom-left side is a list of callers for the selected function above it. This is useful when you want to backtrack to all the places a function was called. (Note that this is a sampling-based profiler, so there is no guarantee all functions were intercepted.)

Use this information to find the relevant code within a particular project. Tweak some things, try again, test…

Tracking down extraneous allocations

One of the things that can slow down your application is doing memory allocations in the hot paths. Allocating memory is still pretty expensive compared to all of the other things your application could be doing.

In 3.36, Sysprof gained support for tracking memory allocations with an LD_PRELOAD. However, it must spawn the application directly.

Start by toggling “Launch Application” and set arguments for the application you want to profile. Select “Track Allocations”.

At this point run your application to exercise the targeted behavior. Then press “Stop” and you’ll be presented with the recording. Usually the normal callgraph is selected by default. Select the “Memory Allocations” row and you’ll see the memory callgraph.

This time you’ll see memory allocation size next to the function. Explore a bit, and look for things that seem out of place. In the following image, I notice a lot of transforms being allocated. After a quick discussion with Benjamin, he landed a small patch to make those go away. So sometimes you don’t even have to write code yourself!

A variant of this patch went into Mutter’s copy of Clutter for a healthy memory improvement too.

Finding main loop slow downs

In Sysprof master, we have a “Speedtrack” aid that can help you find various long running operations such as fsync(). I used this late in the 3.36 cycle to fix a bunch of I/O happening on GNOME Shell’s compositor thread. Select the “Speedtrack” aid, and disable the “Callgraph” as that will clash with speedtrack currently. This also uses an LD_PRELOAD so you’ll have to spawn the application just like for memory tracking.

The aid will give you callgraphs of various things that happened in your main thread that you might want to avoid doing. Stuff like fsync(), read() and more. It also creates marks for the duration of these calls so you can track down how long they ran for.

Deep in Pango, various files are being loaded on demand which can mean expensive read() during the main loop operations.

You can also see how long some operations have taken. Here we see g_main_context_iteration() took 22 milliseconds. On a 60 Hz system, that can’t be good because we either missed a frame or took too long to do something to be able to submit our frame in time. You can select the time range by activating this row. In the future we want this to play better with callgraphs so you can see what was sampled during that timespan.
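The arithmetic behind that 22 millisecond observation is simple enough to write down. This is a simplified sketch with hypothetical helper names, and it assumes the main loop iteration starts right at a frame boundary.

```c
#include <assert.h>

/* Time available per frame, in milliseconds, at a given refresh rate. */
double
frame_budget_ms (double refresh_hz)
{
  assert (refresh_hz > 0.0);
  return 1000.0 / refresh_hz;
}

/* How many whole frame deadlines pass while one main loop iteration
 * runs, assuming it started exactly at a frame boundary. */
unsigned
frames_missed (double iteration_ms, double refresh_hz)
{
  assert (iteration_ms >= 0.0);
  return (unsigned) (iteration_ms / frame_budget_ms (refresh_hz));
}
```

At 60 Hz the budget is about 16.7 milliseconds, so a 22 millisecond iteration overruns at least one deadline, while a 10 millisecond iteration would still fit.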

Anyway, I hope that gives you some insight into how to use things!

Keeping your fast code fast

Over the past few weeks I’ve been finishing up various projects for 3.36. None of this is surprising for those that follow me on twitter, but sadly I find it hard to blog as often as I should.

One of the projects I completed before the end of the cycle is a memory allocation tracker for Sysprof. It’s basically a modern port of the Memprof code from 20 years ago, but tied into Sysprof and using fancier techniques to move data quickly between processes. It uses an LD_PRELOAD to override many of the weak memory symbols in glibc such as malloc() and free(). When those functions are reached, a stack trace is captured directly into a mmap()’d ring buffer shared by Sysprof. We create a new one of these per-thread so that no locking is necessary between threads. Sysprof will mux all the data together for us.
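To show why one buffer per thread removes the need for locks, here is a minimal single-producer, single-consumer ring buffer using C11 atomics. It is a sketch of the general technique only and not Sysprof's implementation: the Ring type and its functions are hypothetical, the buffer lives in ordinary process memory rather than a shared mmap() region, and it stores plain integers instead of stack traces.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u  /* must be a power of two */

typedef struct {
  _Atomic uint32_t head;          /* only written by the producer */
  _Atomic uint32_t tail;          /* only written by the consumer */
  uintptr_t        slots[RING_SLOTS];
} Ring;

/* Producer side: called by the one thread that owns this ring. */
bool
ring_push (Ring *ring, uintptr_t value)
{
  uint32_t head, tail;

  assert (ring != NULL);

  head = atomic_load_explicit (&ring->head, memory_order_relaxed);
  tail = atomic_load_explicit (&ring->tail, memory_order_acquire);

  if (head - tail == RING_SLOTS)
    return false;  /* full: drop rather than block */

  ring->slots[head % RING_SLOTS] = value;

  /* Publish the slot only after its contents are written. */
  atomic_store_explicit (&ring->head, head + 1, memory_order_release);

  return true;
}

/* Consumer side: called by the one reader draining this ring. */
bool
ring_pop (Ring *ring, uintptr_t *value)
{
  uint32_t head, tail;

  assert (ring != NULL && value != NULL);

  tail = atomic_load_explicit (&ring->tail, memory_order_relaxed);
  head = atomic_load_explicit (&ring->head, memory_order_acquire);

  if (head == tail)
    return false;  /* empty */

  *value = ring->slots[tail % RING_SLOTS];

  /* Hand the slot back to the producer. */
  atomic_store_explicit (&ring->tail, tail + 1, memory_order_release);

  return true;
}
```

Because each ring has exactly one writer and one reader, the two indices never contend, and a full ring simply rejects the push instead of stalling the thread that called malloc().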

Below is a quick example running gtk4-widget-factory. We show similar callgraphs as we do when doing CPU profiling, but ordered by the amount of memory allocated. This simple tool and less than 20 minutes of effort found many allocations we could completely avoid across both GTK and Clutter.

A callgraph of memory allocations

I just want to mention how refreshing it is to have memory allocation tracking while still starting the application in what feels like instantly. It was quite a bit of tweaking to get that level of performance and I’m thrilled with the result.

Additionally, I spent some time looking at what sort of things cause temporary lockups in GNOME Shell during active use. With a fio script in hand, I had the necessary things to cause the buffer cache to be exhausted and force many applications’ working sets out of memory. That usually does the trick to cause short lockups.

But what is going on when things stall? Does the GPU driver get bogged down? Does the Shell get blocked on GC? Is there some sort of blocking API involved?

To answer this I put together a scrappy little LD_PRELOAD tool called “iobt” which will write out a Sysprof capture file when some blocking operations are called. This found a very peculiar bug where GNOME Shell could end up blocking on the compositor thread when it thought it was doing all async I/O operations.

Furthermore, I found a number of other I/O operations happening on the main thread which will easily lock things up under heavy writeback scenarios. Patches for all of these are upstream, half of them are merged at this point, and some even backported to 3.28 for various distros.

There are still some things to do going forward, like use cgroupsv2 to help enforce CPU and Memory availability and other priorities. I’m also looking for pointers from GPU people on how to debug what is going on during long blocking eglSwapBuffers() calls as I’ve seen under memory pressure.

I’m always inspired by what the Shell developers build and I’m honored to get to help polish it even more.

GtkSourceView Branched

I’ve branched GtkSourceView for 4.6 (gtksourceview-4-6) which you should be using instead of master for your application’s Nightly Flatpak builds. I will land the GTK 4 port on master early next week. A message to gnome-announce-list has been sent and will hopefully make it into distribution packagers’ inboxes shortly.

Long story short is that the 4.6 series will be our long-term (and last) series for GTK 3 applications. I expect this to be maintained for many years. Master will become the beginning of our transition to GTK 4 and the place we land lots of upstream features for Next.0.

GtkSourceView Snippets

I’m trying to blog about once a week this year, so here we go again.

The past week I’ve been pushing hard on finishing up the snippets work for the GTK 4 port. It’s always quite a bit more work to push something upstream because you have to be so much more complete while being generic at the same time.

I think at this point though I can move on to other features and projects as the branch seems to be in good shape. I’ve fixed a number of bugs in the GTK 4 port along the way and made tests, documentation, robustness fixes, style-scheme integration, a completion provider, file-format and parser, and support for layering snippet files the same way style-schemes and language-specs work.

As part of the GTK 4 work I’ve spent a great deal of time modernizing the code-base. Now that we can depend on the same things that GTK 4 will depend on, we can use some more modern compiler features. Additionally, GObject has matured so much since most of the library was written and we can use that to our advantage.