Thoughts on employing PGO and BOLT on the GNOME stack

Christian was looking at PGO and BOLT recently, so I figured I’d write down my notes from the discussions we had about how we’d go about making things faster on our stack, since I don’t have the time or resources to pursue those plans myself at the moment.

First off, let’s start with the basics. PGO (profile-guided optimization) and BOLT (Binary Optimization and Layout Tool) work in similar ways. You capture one or more “profiles” of a workload that’s representative of a use case of your code, and then the tools do their magic to make the common hot paths more efficient/cache-friendly/etc. Afterwards they produce a new binary that is hopefully faster than the old one and functionally identical, so you can just replace it.
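
To make that loop concrete, here is a minimal sketch for a single binary, assuming an LLVM toolchain (clang, llvm-profdata, perf2bolt, llvm-bolt) and Linux perf with LBR support; the file names and flags are illustrative, not a recipe for any specific GNOME module:

```python
#!/usr/bin/env python3
# Sketch of the PGO + BOLT loop for a single binary.
# Assumes an LLVM toolchain and Linux perf with LBR support.
import glob
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Build with instrumentation and run a representative workload.
run("clang", "-O2", "-fprofile-generate=./pgo", "app.c", "-o", "app-inst")
run("./app-inst", "--benchmark")                      # writes ./pgo/*.profraw

# 2. Merge the raw profiles and rebuild using them.
run("llvm-profdata", "merge", "-output=app.profdata", *glob.glob("pgo/*.profraw"))
run("clang", "-O2", "-fprofile-use=app.profdata",
    "-Wl,--emit-relocs",                              # BOLT needs relocations
    "app.c", "-o", "app")

# 3. Profile the PGO build with perf and feed that to BOLT.
run("perf", "record", "-e", "cycles:u", "-j", "any,u", "-o", "perf.data",
    "--", "./app", "--benchmark")
run("perf2bolt", "-p", "perf.data", "-o", "app.fdata", "./app")
run("llvm-bolt", "./app", "-o", "app.bolt", "-data=app.fdata",
    "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort")
# app.bolt should be functionally identical to app, just laid out better.
```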

Two issues already arise with this approach:

First of all, we don’t really have any benchmarks in our stack, let alone ones that are rounded enough to account for the majority of use cases. Additionally, we need better instrumentation to capture stats like frame counts and frame times, and to export them both for sysprof and for use by the benchmark runners.
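
As a sketch of what that instrumentation could feed into, here is a hypothetical benchmark-runner helper that collects per-frame times and exports summary stats as JSON. The one-frame-time-per-line output format and the gtk-bench binary are assumptions; a real version would read sysprof’s capture format instead:

```python
#!/usr/bin/env python3
# Hypothetical benchmark-runner helper: runs a benchmark, collects
# per-frame times, and exports summary stats. Assumes the benchmark
# prints one frame time (in ms) per line; a real implementation
# would read sysprof capture data instead.
import json
import statistics
import subprocess

def run_benchmark(cmd):
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [float(line) for line in out.stdout.splitlines() if line.strip()]

def summarize(frame_times_ms):
    frame_times_ms.sort()
    n = len(frame_times_ms)
    return {
        "frames": n,
        "mean_ms": statistics.fmean(frame_times_ms),
        "p50_ms": frame_times_ms[n // 2],
        "p95_ms": frame_times_ms[int(n * 0.95)],
        "p99_ms": frame_times_ms[int(n * 0.99)],
    }

if __name__ == "__main__":
    times = run_benchmark(["./gtk-bench", "--frame-times"])  # hypothetical
    print(json.dumps(summarize(times), indent=2))
```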

Once we have the benchmarks, we can use them to create the profiles for the optimizations and to verify that any changes have the desired effect. We will need multiple profiles covering all the different hardware and software configurations.

For example, for GTK we’d ideally want a matrix of profiles for the different render backends (NGL/Vulkan), along with the Mesa drivers they’d use depending on the hardware (AMD/Intel), and then also different architectures, so additional profiles for the Raspberry Pi 5 and Asahi stacks. We might also want to add a profile captured under QEMU + virtio while we are at it.
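
To give that matrix a concrete shape, a benchmark runner could enumerate it along these lines (the machine and driver names below are illustrative, and Vulkan targets would use radv/anv/etc. rather than the GL drivers shown):

```python
# Illustrative profile matrix for GTK, enumerating the combinations
# mentioned above. Each entry would map to one captured profile.
from itertools import product

renderers = ["ngl", "vulkan"]
targets = [
    ("amd-desktop", "radeonsi"),     # Mesa GL driver per hardware target
    ("intel-laptop", "iris"),
    ("raspberry-pi-5", "v3d"),
    ("asahi-m1", "asahi"),
    ("qemu-virtio", "virgl"),
]

profiles = [
    f"gtk-{renderer}-{machine}-{driver}"
    for renderer, (machine, driver) in product(renderers, targets)
]
print(len(profiles), "profiles:", *profiles, sep="\n")
```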

Maintaining the benchmarks and profiles would be a lot of work and very tailored to each project so they would all have to live in their upstream repositories.

On the other hand, the optimization itself has to be done during the tree/userland/OS composition, and we’d have to aggregate all the profiles from all the projects to apply them. This is easily done when you are in control of the whole deployment, as we are for the GNOME Flatpak Runtime. It’s also easy if you are targeting an embedded deployment, where most of the time you have custom images you are in full control of and know exactly the workload you will be running.
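
A sketch of what that aggregation could look like at composition time, assuming each project ships its merged profile at a known (here hypothetical) location in the tree:

```python
# Hypothetical aggregation step at OS/runtime composition time:
# collect each project's PGO profile from the tree and merge them
# with llvm-profdata so the final relink can consume one file.
import pathlib
import subprocess

tree = pathlib.Path("/run/build/sysroot")            # hypothetical layout
profiles = sorted(tree.glob("usr/lib/pgo/*.profdata"))

if profiles:
    subprocess.run(
        ["llvm-profdata", "merge", "-output=stack.profdata",
         *map(str, profiles)],
        check=True,
    )
    print(f"merged {len(profiles)} profiles into stack.profdata")
```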

If we want distros to also apply these optimizations, and for this to be done at scale, we’d have to make the whole process automatic and part of the usual compilation process, so there would be no room for error during integration. The downside is that we’d have far fewer opportunities for aggregating different use cases/profiles, as projects would either have to own the optimization of the stack beneath them (e.g. GTK being the one relinking Pango) or only relink their own libraries.
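
For the automatic per-project part, Meson already exposes the basic machinery as its builtin b_pgo option, so a wrapper around the usual build could look roughly like this, assuming the project defines its workload as Meson benchmarks (with clang, an llvm-profdata merge step would also be needed between the two phases):

```python
# Rough wrapper showing how a two-phase PGO build could ride on
# Meson's builtin b_pgo option, keeping integration automatic.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

run("meson", "setup", "build", "-Dbuildtype=release", "-Db_pgo=generate")
run("ninja", "-C", "build")
run("meson", "test", "-C", "build", "--benchmark")   # run the workload
run("meson", "configure", "build", "-Db_pgo=use")    # switch to using profiles
run("ninja", "-C", "build")                          # rebuild with PGO applied
```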

To conclude, post-link-time optimization would be a great avenue to explore, as it seems to be one of the lower-hanging fruits when it comes to optimizing the whole stack. But it would also be quite the effort and require a decent amount of committed work. It would be worth it in the long run.

One thought on “Thoughts on employing PGO and BOLT on the GNOME stack”

  1. I saw the BOLT post. While 7% gains in performance are not insignificant (especially if they can be brought to resource-constrained platforms like mobile/embedded), there are bigger barriers than just automating deployment for distros:

    – most distros don’t use clang/llvm, the main ones are gcc/glibc based

    – traditional package based distros are aiming to have their binaries be auditable/reproducible

    – even if they weren’t, using facebook software to scramble all the binaries and their hashes on a system is going to make r/linux and hackernews throw a shitfit

    – binary distros are extremely conservative when it comes to microarchitecture optimization to maintain a long support lifecycle. most distros don’t build their packages with AVX enabled. microarchitecture tuning is a Gentoo thing.

    – x86-64 is a total mess, intel/amd make no guarantees with regard to the cycle time of instructions between mobile/desktop/server variants of the same chip, and between generations the differences are immense.

    – if the profiling is based on sysprof and includes gpu performance then the number of profiles has combinatorial complexity: every desktop is different, cpu/gpu-bound frankenbuilds will have weird performance characteristics, different thermal designs will influence how reference notebook chipsets perform, and every intel macbook is its own quirk: the iconic unibody macbooks had mobile i5/i7 cpus paired with high-bandwidth xeon northbridges.

    – building the profile data implies a networked fleet of reference machines, which has geolocation, procurement, and administration implications, even before integrating them into CI/CD. postmarketOS has such a project, so it’s not impossible.

    I’m not trying to be negative; I just think that post-link optimization may not make sense for the repo build stage of binary package distribution. Facebook has the advantage of knowing what they’re running and what’s in their racks ahead of time.

    If we look at Android’s design, we see that OTA updates trigger ART’s on-device binary cache optimization. If the most authoritative profile data will always be sampled from the user’s machine, then post-link optimization is potentially better run as a post-update hook by the package manager, where it can also be disabled for auditable systems.
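
    A rough sketch of that post-update-hook idea, with hypothetical paths and an opt-out knob; note that BOLT can only rewrite binaries that kept their relocations at link time, which a real hook would need to check for:

```python
# Rough sketch of a package-manager post-update hook that re-runs
# post-link optimization locally. All paths and the opt-out knob are
# hypothetical, and BOLT can only rewrite binaries linked with
# relocations kept (--emit-relocs), which distro packages normally
# aren't; a real hook would have to verify that first.
import pathlib
import subprocess

OPT_OUT = pathlib.Path("/etc/bolt-hook.disabled")    # for auditable systems
FDATA_DIR = pathlib.Path("/var/lib/bolt/profiles")   # locally sampled profiles

def optimize(binary: pathlib.Path) -> None:
    fdata = FDATA_DIR / (binary.name + ".fdata")
    if not binary.is_file() or not fdata.exists():
        return                                       # no local profile yet
    subprocess.run(
        ["llvm-bolt", str(binary), "-o", f"{binary}.bolt", f"-data={fdata}"],
        check=True,
    )
    # A real hook would verify and atomically swap the binary in here.

if __name__ == "__main__" and not OPT_OUT.exists():
    for exe in pathlib.Path("/usr/bin").iterdir():
        optimize(exe)
```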
