The GNOME (and WebKitGTK+) Networking Stack

WebKit currently has four network backends:

  • CoreFoundation (used by macOS and iOS, and thus Safari)
  • CFNet (used by iTunes on Windows… I think only iTunes?)
  • cURL (used by most Windows applications, also PlayStation)
  • libsoup (used by WebKitGTK+ and WPE WebKit)

One guess which of those we’re going to be talking about in this post. Yeah, of course, libsoup! If you’re not familiar with libsoup, it’s the GNOME HTTP library. Why is it called libsoup? Because before it was an HTTP library, it was a SOAP library. And apparently somebody thought that when Mexican people say “soap,” it often sounds like “soup,” and also thought that this was somehow both funny and a good basis for naming a software library. You can’t make this stuff up.

Anyway, libsoup is built on top of GIO’s sockets APIs. Did you know that GIO has Object wrappers for BSD sockets? Well it does. If you fancy lower-level APIs, create a GSocket and have a field day with it. Want something a bit more convenient? Use GSocketClient to create a GSocketConnection connected to a GNetworkAddress. Pretty straightforward. Everything parallels normal BSD sockets, but the API is nice and modern and GObject, and that’s really all there is to know about it. So when you point WebKitGTK+ at an HTTP address, libsoup is using those APIs behind the scenes to handle connection establishment. (We’re glossing over details like “actually implementing HTTP” here. Trust me, libsoup does that too.)

Things get more fun when you want to load an HTTPS address, since we have to add TLS to the picture, and we can’t have TLS code in GIO or GLib due to this little thing called “copyright law.” See, there are basically three major libraries used to implement TLS on Linux, and they all have problems:

  • OpenSSL is by far the most popular, but it’s, hm, shall we say technically non-spectacular. There are forks, but the forks have problems too (ask me about BoringSSL!), so forget about them. The copyright problem here is that the OpenSSL license is incompatible with the GPL. (Boring details: Red Hat waves away this problem by declaring OpenSSL a system library qualifying for the GPL’s system library exception. Debian has declared the opposite, so Red Hat’s choice doesn’t gain you anything if you care about Debian users. The OpenSSL developers are trying to relicense to the Apache license to fix this, but this process is taking forever, and the Apache license is still incompatible with GPLv2, so this would make it impossible to use GPLv2+ software except under the terms of GPLv3+. Yada yada details.) So if you are writing a library that needs to be used by GPL applications, like say GLib or libsoup or WebKit, then it would behoove you to not use OpenSSL.
  • GnuTLS is my favorite from a technical standpoint. Its license is LGPLv2+, which is unproblematic everywhere, but some of its dependencies are licensed LGPLv3+, and that’s uncomfortable for many embedded systems vendors, since LGPLv3+ contains some provisions that make it difficult to deny you your freedom to modify the LGPLv3+ software. So if you rely on embedded systems vendors to fund the development of your library, like say libsoup or WebKit, then you’re really going to want to avoid GnuTLS.
  • NSS is used by Firefox. I don’t know as much about it, because it’s not as popular. I get the impression that it’s more designed for the needs of Firefox than as a Linux system library, but it’s available, and it works, and it has no license problems.

So naturally GLib uses NSS to avoid the license issues of OpenSSL and GnuTLS, right?

Haha no, it uses a dynamically-loadable extension point system to allow you to pick your choice of OpenSSL or GnuTLS! (Support for NSS was started but never finished.) This is OK because embedded systems vendors don’t use GPL applications and have no problems with OpenSSL, while desktop Linux users don’t produce tivoized embedded systems and have no problems with LGPLv3. So if you’re using desktop Linux and point WebKitGTK+ at an HTTPS address, then GLib is going to load a GIO extension point called glib-networking, which implements all of GIO’s TLS APIs — notably GTlsConnection and GTlsCertificate — using GnuTLS. But if you’re building an embedded system, you simply don’t build or install glib-networking, and instead build a different GIO extension point called glib-openssl, and libsoup will create GTlsConnection and GTlsCertificate objects based on OpenSSL instead. Nice! And if you’re Centricular and you’re building GStreamer for Windows, you can use yet another GIO extension point, glib-schannel, for your native Windows TLS goodness, all hidden behind GTlsConnection so that GStreamer (or whatever application you’re writing) doesn’t have to know about SChannel or OpenSSL or GnuTLS or any of that sad complexity.

Now you know why the TLS extension point system exists in GIO. Software licenses! And you should not be surprised to learn that direct use of any of these crypto libraries is banned in libsoup and WebKit: we have to cater to both embedded system developers and to GPL-licensed applications. All TLS library use is hidden behind the GTlsConnection API, which is really quite nice to use because it inherits from GIOStream. You ask for a TLS connection, have it handed to you, and then read and write to it without having to deal with any of the crypto details.

As a recap, the layering here is: WebKit -> libsoup -> GIO (GLib) -> glib-networking (or glib-openssl or glib-schannel).

So when Epiphany fails to load a webpage, and you’re looking at a TLS-related error, glib-networking is probably to blame. If it’s an HTTP-related error, the fault most likely lies in libsoup. Same for any other GNOME applications that are having connectivity troubles: they all use the same network stack. And there you have it!

P.S. The glib-openssl maintainers are helping merge glib-openssl into glib-networking, such that glib-networking will offer a choice of GnuTLS or OpenSSL and obsoleting glib-openssl. This is still a work in progress. glib-schannel will be next!

P.S.S. libcurl also gives you multiple choices of TLS backend, but makes you choose which at build time, whereas with GIO extension points it’s actually possible to choose at runtime from the selection of installed extension points. The libcurl approach is fine in theory, but creates some weird problems, e.g. different backends with different bugs are used on different distributions. On Fedora, it used to use NSS, but now uses OpenSSL, which is fine for Fedora, but would be a license problem elsewhere. Debian actually builds several different backends and gives you a choice, unlike everywhere else. I digress.

Mesa Update Breaks WebKitGTK+ in Fedora 29

If you’re using Fedora and discovered that WebKitGTK+ is displaying blank pages, the cause is a bad mesa update, mesa-18.2.3-1.fc29. This in turn was caused by a GCC bug that resulted in miscompilation of mesa.

To avoid this bug, downgrade to mesa-18.2.2-1.fc29:

$ sudo dnf downgrade mesa*

You can also update to mesa-18.2.4-2.fc29, but this build has not yet reached updates-testing, let alone stable, so downgrading is easier for now. Another workaround is to run your application with accelerated compositing mode disabled, to avoid OpenGL usage:

$ WEBKIT_DISABLE_COMPOSITING_MODE=1 epiphany

On the bright side of things, from all the bug reports I’ve received over the past two days I’ve discovered that lots of people use Epiphany and notice when it’s broken. That’s nice!

Huge thanks to Dave Airlie for quickly preparing the fixed mesa update, and to Jakub Jelenik for handling the same for GCC.

WebKitGTK+ 2.22.2 and 2.22.3, Media Source Extensions, and YouTube

Last month, I attended the Web Engines Hackfest (hosted by Igalia in A Coruña, Spain) and also the WebKit Contributors Meeting (hosted by Apple in San Jose, California). These are easily the two biggest WebKit development events of the year, and it’s always amazing to meet everyone in person yet again. A Coruña is an amazing city, and every browser developer ought to visit at least once. And the Contributors Meeting is a no-brainer event for WebKit developers.

One of the main discussion points this year was Media Source Extensions (MSE). MSE is basically a way for browsers to control how videos are downloaded. Until recently, if you were to play a YouTube video in Epiphany, you’d notice that the video loads way faster than it does in other browsers. This is because WebKitGTK+ — until recently — had no support for MSE. In other browsers, YouTube uses MSE to limit the speed at which video is downloaded, in order to reduce wasted bandwidth in case you stop watching the video before it ends. But with WebKitGTK+, MSE was not available, so videos would load as quickly as possible. MSE also makes it harder for browsers to offer the ability to download the videos; you’ll notice that neither Firefox nor Chrome offer to download the videos in their context menus, a feature that’s been available in Epiphany for as long as I remember.

So that sounds like it’s good to not have MSE. Well, the downside is that YouTube requires it in order to receive HD videos, to avoid that wasted bandwidth and to make it harder for users to download HD videos. And so WebKitGTK+ users have been limited to 720p video with H.264 and 480p video with WebM, where other browsers had access to 1080p and 1440p video. I’d been stuck with 480p video on Fedora for so long, I’d forgotten that internet video could look good.

Unfortunately, WebKitGTK+ was quite late to implement MSE. All other major browsers turned it on several years ago, but WebKitGTK+ dawdled. There was some code to support MSE, but it didn’t really work, and was disabled. And so it came to pass that, in September of this year, YouTube began to require MSE to access any WebM video, and we had a crisis. We don’t normally enable major new features in stable releases, but this was an exceptional situation and users would not be well-served by delaying until the next release cycle. So within a couple weeks, we were able to release WebKitGTK+ 2.22.2 and Epiphany 3.30.1 (both on September 21), and GStreamer 1.14.4 (on October 2, thanks to Tim-Philipp Müller for expediting that release). Collectively, these releases enabled basic video playback with MSE for users of GNOME 3.30. And if you still use of GNOME 3.28, worry not: you are still supported and can get MSE if you update to Epiphany 3.28.5 and also have the aforementioned versions of WebKitGTK+ and GStreamer.

MSE in WebKitGTK+ 2.22.2 had many rough edges because it was a mad rush to get the feature into a minimally-viable state, but those issues have been polished off in 2.22.3, which we released earlier this week on October 29. Be sure you have WebKitGTK+ 2.22.3, plus GStreamer 1.14.4, for a good experience on YouTube. Unfortunately we can’t provide support for older software versions anymore: if you don’t have GStreamer 1.14.4, then you’ll need to configure WebKitGTK+ with -DENABLE_MEDIA_SOURCE=OFF at build time and suffer from lack of MSE.

Epiphany 3.28.1 uses WebKitSettings to turn on the “enable-mediasource” setting. Turn that on if your application wants MSE now (if it’s a web browser, it certainly does). This setting will be enabled by default in WebKitGTK+ 2.24. Huge thanks to the talented developers who made this feature possible! Enjoy your 1080p and 1440p video.

On WebKit Build Options (Also: How to Accidentally Disable Important Security Features!)

When building WebKitGTK+, it’s a good idea to stick to the default values for the build options. If you’re building some sort of embedded system and really know what you’re doing, then OK, it might make sense to change some settings and disable some stuff. But Linux distros are generally well-advised to stick to the defaults to avoid creating problems for users.

One exception is if you need to disable certain features to avoid newer dependencies when building WebKit for older systems. For example, Ubuntu 18.04 disables web fonts (ENABLE_WOFF2=OFF) because it doesn’t have the libbrotli and libwoff2 dependencies that are required for that feature to work, hence some webpages will display using subpar fonts. And distributions shipping older versions of GStreamer will need to disable the ENABLE_MEDIA_SOURCE option (which is missing from the below feature list by mistake), since that requires the very latest GStreamer to work.

Other exceptions are the ENABLE_GTKDOC and ENABLE_MINIBROWSER settings, which distros do want. ENABLE_GTKDOC is disabled by default because it’s slow to build, and ENABLE_MINIBROWSER because, well, actually I don’t know why, you always want that one and it’s just annoying to find it’s not built.

OK, but really now, other than those exceptions, you should probably leave the defaults alone.

The feature list that prints when building WebKitGTK+ looks like this:

--  ENABLE_ACCELERATED_2D_CANVAS .......... OFF
--  ENABLE_DRAG_SUPPORT                     ON
--  ENABLE_GEOLOCATION .................... ON
--  ENABLE_GLES2                            OFF
--  ENABLE_GTKDOC ......................... OFF
--  ENABLE_ICONDATABASE                     ON
--  ENABLE_INTROSPECTION .................. ON
--  ENABLE_JIT                              ON
--  ENABLE_MINIBROWSER .................... OFF
--  ENABLE_OPENGL                           ON
--  ENABLE_PLUGIN_PROCESS_GTK2 ............ ON
--  ENABLE_QUARTZ_TARGET                    OFF
--  ENABLE_SAMPLING_PROFILER .............. ON
--  ENABLE_SPELLCHECK                       ON
--  ENABLE_TOUCH_EVENTS ................... ON
--  ENABLE_VIDEO                            ON
--  ENABLE_WAYLAND_TARGET ................. ON
--  ENABLE_WEBDRIVER                        ON
--  ENABLE_WEB_AUDIO ...................... ON
--  ENABLE_WEB_CRYPTO                       ON
--  ENABLE_X11_TARGET ..................... ON
--  USE_LIBHYPHEN                           ON
--  USE_LIBNOTIFY ......................... ON
--  USE_LIBSECRET                           ON
--  USE_SYSTEM_MALLOC ..................... OFF
--  USE_WOFF2                               ON

And, asides from the exceptions noted above, those are probably the options you want to ship with.

Why are some things disabled by default? ENABLE_ACCELERATED_2D_CANVAS is OFF by default because it is experimental (i.e. not great :) and requires CairoGL, which has been available in most distributions for about half a decade now, but still hasn’t reached Debian yet, because the Debian developers know that the Cairo developers consider CarioGL experimental (i.e. not great!). Many of our developers use Debian, and we’re not keen on having two separate sets of canvas bugs depending on whether you’re using Debian or not, so best keep this off for now. ENABLE_GLES2 switches you from desktop GL to GLES, which is maybe needed for embedded systems with crap proprietary graphics drivers, but certainly not what you want when building for a general-purpose distribution with mesa. Then ENABLE_QUARTZ_TARGET is for building on macOS, not for Linux. And then we come to USE_SYSTEM_MALLOC.

USE_SYSTEM_MALLOC disables WebKit’s bmalloc memory allocator (“fast malloc”) in favor of glibc malloc. bmalloc is performance-optimized for macOS, and I’m uncertain how its performance compares to glibc malloc on Linux. Doesn’t matter really, because bmalloc contains important heap security features that will be disabled if you switch to glibc malloc, and that’s all you need to know to decide which one to use. If you disable bmalloc, you lose the Gigacage, isolated heaps, heap subspaces, etc. I don’t pretend to understand how any of those things work, so I’ll just refer you to this explanation by Sam Brown, who sounds like he knows what he’s talking about. The point is that, if an attacker has found a memory vulnerability in WebKit, these heap security features make it much harder to exploit and take control of users’ computers, and you don’t want them turned off.

USE_SYSTEM_MALLOC is currently enabled (bad!) in openSUSE and SUSE Linux Enterprise 15, presumably because when the Gigacage was originally introduced, it crashed immediately for users who set address space (virtual memory allocation) limits. Gigacage works by allocating a huge address space to reduce the chances that an attacker can find pointers within that space, similar to ASLR, so limiting the size of the address space prevents Gigacage from working. At first we thought it made more sense to crash than to allow a security feature to silently fail, but we got a bunch of complaints from users who use ulimit to limit the address space used by processes, and also from users who disable overcommit (which is required for Gigacage to allocate ludicrous amounts of address space), and so nowadays we just silently disable Gigacage instead if enough address space for it cannot be allocated. So hopefully there’s no longer any reason to disable this important security feature at build time! Distributions should be building with the default USE_SYSTEM_MALLOC=OFF.

The openSUSE CMake line currently looks like this:

%cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DLIBEXEC_INSTALL_DIR=%{_libexecdir}/libwebkit2gtk%{_wk2sover} \
  -DPORT=GTK \
%if 0%{?suse_version} == 1315
  -DCMAKE_C_COMPILER=gcc-7 \
  -DCMAKE_CXX_COMPILER=g++-7 \
  -DENABLE_WEB_CRYPTO=OFF \
  -DUSE_GSTREAMER_GL=false \
%endif
%if 0%{?suse_version} <= 1500
  -DUSE_WOFF2=false \
%endif
  -DENABLE_MINIBROWSER=ON \
%if %{with python3}
  -DPYTHON_EXECUTABLE=%{_bindir}/python3 \
%endif
%if !0%{?is_opensuse}
  -DENABLE_PLUGIN_PROCESS_GTK2=OFF \
%endif
%ifarch armv6hl ppc ppc64 ppc64le riscv64 s390 s390x
  -DENABLE_JIT=OFF \
%endif
  -DUSE_SYSTEM_MALLOC=ON \
  -DCMAKE_EXE_LINKER_FLAGS="-Wl,--as-needed -Wl,-z,now -pthread" \
  -DCMAKE_MODULE_LINKER_FLAGS="-Wl,--as-needed -Wl,-z,now -pthread" \
  -DCMAKE_SHARED_LINKER_FLAGS="-Wl,--as-needed -Wl,-z,now -pthread"

which all looks pretty reasonable to me: certain features that require “newer” dependencies are disabled on the old distros, and NPAPI plugins are not supported in the enterprise distro, and JIT doesn’t work on odd architectures. I would remove the ENABLE_JIT=OFF lines only because WebKit’s build system should be smart enough nowadays to disable it automatically to save you the trouble of thinking about which architectures the JIT works on. And I would also remove the -DUSE_SYSTEM_MALLOC=ON line to ensure users are properly protected.

On Flatpak Nightlies

Here’s a little timeline of some fun we had with the GNOME master Flatpak runtime last week:

  • Tuesday, July 10: a bad runtime build is published.  Trying to start any application results in error while loading shared libraries: libdw.so.1: cannot open shared object file: No such file or directory. Problem is the library is present in org.gnome.Sdk instead of org.gnome.Platform, where it is required.
  • Thursday, July 12:  the bug is reported on WebKit Bugzilla (since it broke Epiphany Technology Preview)
  • Saturday, July 14: having returned from GUADEC, I notice the bug report and bisect the issue to a particular runtime build. Mathieu Bridon fixes the issue in the freedesktop SDK and opens a merge request.
  • Monday, July 16: Mathieu’s fix is committed. We now have to wait until Tuesday for the next build.
  • Tuesday, Wednesday, and Thursday: we deal with various runtime build failures. Each day, we get a new build log and try to fix whatever build failure is reported. Then, we wait until the next day and see what the next failure is. (I’m not aware of any way to build the runtime locally. No doubt it’s possible somehow, but there are no instructions for doing so.)
  • Friday, July 20: we wait. The build has succeeded and the log indicates the build has been published, but it’s not yet available via flatpak update
  • Saturday, July 21: the successful build is now available. The problem is fixed.

As far as I know, it was not possible to run any nightly applications during this two week period, except developer applications like Builder that depend on org.gnome.Sdk instead of the normal org.gnome.Platform. If you used Epiphany Technology Preview and wanted a functioning web browser, you had to run arcane commands to revert to the last good runtime version.

This multi-week response time is fairly typical for us. We need to improve our workflow somehow. It would be nice to be able to immediately revert to the last good build once a problem has been identified, for instance.

Meanwhile, even when the runtime is working fine, some apps have been broken for months without anyone noticing or caring. Perhaps it’s time for a rethink on how we handle nightly apps. It seems likely that only a few apps, like Builder and Epiphany, are actually being regularly used. The release team has some hazy future plans to take over responsibility for the nightly apps (but we have to take over the runtimes first, since those are more important), and we’ll need to somehow avoid these issues when we do so. Having some form of notifications for failed builds would be a good first step.

P.S. To avoid any possible misunderstandings: the client-side Flatpak technology itself is very good. It’s only the server-side infrastructure that is problematic here. Clearly we have a lot to fix, but it won’t require any changes in Flatpak.

Thank you, address sanitizer developers

I don’t often write useless blog posts, but today will be an exception. The address sanitizer (asan) is a ludicrously good tool. The developers deserve a huge thank you.

There are other good sanitizers too, but asan is what I have been using recently.

Thank you!

Security vulnerability in Epiphany Technology Preview

If you use Epiphany Technology Preview, please update immediately and ensure you have revision 3.29.2-26 or newer. We discovered and resolved a vulnerability that allowed websites to access internal Epiphany features and thereby exfiltrate passwords from the password manager. We apologize for this oversight.

The unstable Epiphany 3.29.2 release is the only affected release. Epiphany 3.29.1 is not affected. Stable releases, including Epiphany 3.28, are also not affected.

There is no reason to believe that the issue was discovered or exploited by any attackers, but you might wish to change your passwords if you are concerned.

Thoughts on Flatpak after four months of Epiphany Technology Preview

It’s been four months since I announced Epiphany Technology Preview — which I’ve been using as my main browser ever since — and five months since I announced the availability of a stable channel via Flatpak. For the most part, it’s been a good experience. Having the latest upstream development code for everything is wonderful and makes testing very easy. Any user can painlessly download and install either the latest stable version or the bleeding-edge development version on any Linux system, regardless of host dependencies, either via a couple clicks in GNOME Software or one command in the terminal. GNOME Software keeps it updated, so I always have a recent version. Thanks to this, I’m often noticing problems shortly after they’re introduced, rather than six months later, as was so often the case for me in the past. Plus, other developers can no longer complain that there’s a problem with my local environment when I report a bug they can’t reproduce, because Epiphany Technology Preview is a canonical distribution environment, a ground truth of sorts.

There have been some rough patches where Epiphany Technology Preview was not working properly — sometimes for several days — due to various breaking changes, and the long time required to get a successful SDK build when it’s failing. For example, multimedia playback was broken for all of last week, due to changes in how the runtime is built. H.264 video is still broken, since the relevant Flatpak extension is only compatible with the 3.28 runtime, not with master. Opening files was broken for a while due to what turned out to be a bug in mutter that was causing the OpenURI portal to crash. I just today found another bug where closing a portal while visiting Slack triggered a gnome-shell crash. For the most part, these sorts of problems are expected by testers of unstable nightly software, though I’m concerned about the portal bugs because these affect stable users too. Anyway, these are just bugs, and all software has bugs: they get fixed, nothing special.

So my impression of Flatpak is still largely positive. Flatpak does not magically make our software work properly in all host environments, but it hugely reduces the number of things that can go wrong on the host system. In recent years, I’ve seen users badly break Epiphany in various ways, e.g. by installing custom mimeinfo or replacing the network backend. With Flatpak, either of these would require an incredible amount of dedicated effort. Without a doubt, Flatpak distribution is more robust to user error. Another advantage is that we get the latest versions of OS dependencies, like GStreamer, libsoup, and glib-networking, so we can avoid the many bugs in these components that have been fixed in the years since our users’ LTS distros froze the package versions. I appreciate the desire of LTS distros to provide stability for users, but at the same time, I’m not impressed when users report issues with the browser that we fixed two years ago in one dependency or another. Flatpak is an excellent compromise solution to this problem: the LTS distro retains an LTS core, but specific applications can use newer dependencies from the Flatpak runtime.

But there is one huge downside to using Flatpak: we lose crash reports. It’s at best very difficult — and often totally impossible — to investigate crashes when using Flatpak, and that’s frankly more important than any of the gains I mention above. For example, today Epiphany Technology Preview is crashing pretty much constantly. It’s surely a bug in WebKit, but that’s all I can figure out. The way to get a backtrace from a crashing app in flatpak is to use coredumpctl to manually dump the core dump to disk, then launch a bash shell in the flatpak environment and manually load it up in gdb. The process is manual, old-fashioned, primitive, and too frustrating for me by a lot, so I wrote a little pyexpect script to automate this process for Epiphany, thinking I could eventually generalize it into a tool that would be useful for other developers. It’s a horrible hack, but it worked pretty well the day I tested it. I haven’t seen it work since. Debuginfo seems to be constantly broken, so I only see a bunch of ???s in my backtraces, and how are we supposed to figure out how to debug that? So I have no way to debug or fix the WebKit bug, because I can’t get a backtrace. The broken, inconsistent, or otherwise-unreliable debuginfo is probably just some bug that will be fixed eventually (and which I half suspect may be related to our recent freedesktop SDK upgrade. Update: Alex has debugged the debuginfo problem and it looks like that’s on track to be solved), but even once it is, we’re back to square one: it’s still too much effort to get the backtrace, relative to developing on the host system, and that’s a hard problem to solve. It requires tools that do not exist, and for which we have no plans to create, or even any idea of how to create them.

This isn’t working. I need to be able to effortlessly get a backtrace out of my application, with no or little more effort than running coredumpctl gdb as I would without Flatpak in the way. Every time I see Epiphany or WebKit crash, knowing I can do nothing to debug or investigate, I’m very sorely tempted to switch back to using Fedora’s Epiphany, or good old JHBuild. (I can’t promote BuildStream here, because BuildStream has the same problem.)

So the developer experience is just not good, but set that aside: the main benefits of Flatpak are for users, not developers, after all. Now, what if users run into a crash, how can they report the bug? Crash reports are next to useless without a backtrace, and wise developers refuse to look at crash reports until a quality backtrace has been posted. So first we need to fix the developer experience to work properly, but even then, it’s not enough: we need an automatic crash reporter, along the lines of ABRT or apport, to make reporting crashes realistically-achievable for users, as it already is for distro-packaged apps. But this is a much harder problem to solve. Such a tool will require integration with coredumpctl, and I have not the faintest clue how we could go about making coredumpctl support container environments. Yet without this, we’re asking application developers to give up their most valuable data — crash reports — in order to use Flatpak.

Eventually, if we don’t solve crash reporting, Epiphany’s experiment with Flatpak will have to come to an end, because that’s more important to me than the (admittedly-tremendous) benefits of Flatpak. I’m still hopeful that the ingenuity of the Flatpak community will find some solutions. We’ll see how this goes.

On setenv() and Explosions

Skim over this abbreviated and badly-formatted function definition, from glibpacrunner.c:

static void
handle_method_call (GDBusConnection *connection, /* ... */)
{
/* ... */

if (/* ... */)
{
gchar *libproxy_url = g_strdup_printf ("pac+%s", pac_url);
g_setenv ("http_proxy", libproxy_url, TRUE);
g_free (libproxy_url);
}
else
g_setenv ("http_proxy", "wpad://", TRUE);

g_proxy_resolver_lookup_async (/* ... */);
}

Note:

  1. This is a GDBus method call handler, so you know GDBus is currently running, and you know this is a multithreaded application.
  2. The code calls setenv().

Because you’re a careful programmer, you check the manpage before using a function, and read all the way to the bottom, so you know that setenv() is not threadsafe. Unless your application is trivial and links to no libraries, you can be pretty confident that it is running secondary threads, and that, somewhere, those threads call getenv(), because getenv() is used pervasively in libraries. (A frequent culprit for internationalized applications is gettext(), conventionally defined as the underscore function _().)

So, what does this mean in practice? It means that if any other thread is calling getenv() at the same time that you call setenv(), one or the other call will blow up. This is a timing-dependent race that will only happen occasionally. Thanks to Murphy, you can expect it to only occur on the hardware that your important customer is running, and never for your developers when they attempt to reproduce the issue.

The safest solution is to ensure your application only ever calls setenv() at the very top of main(), before your first secondary thread is initialized. It can be quite hard to know which library calls will initialize secondary threads. (What about if you use GOptionContext? Or GApplication? If you need to check a setting before changing the environment, how many threads does a GSettings object spawn when you create it? Even if it was zero — and it’s not — what if a custom GIO module was in use?) Your best bet is to restrict yourself to using setenv() at the very, very top of main(), where you know it’s safe. This applies to unsetenv(), as well.

This is not some theoretical problem. In practice, we have documented:

  • Unsafe use of setenv() in gnome-session would cause gnome-session to crash, bringing down the entire desktop session and returning the user to the login screen.
  • In a downstream product using a modified version of gnome-initial-setup, unsafe use of setenv() to change LANG on the language selection page would cause the first boot experience to crash, leaving the sufficiently-unlucky user staring at an empty unusable desktop after turning on her new computer.

Both of these issues were rare races that were not possible to reproduce.

In a sense, setenv() is not really special, in that lots of POSIX functions modify global state and are unsafe to use across threads. The problem with setenv() is that, once such an issue is identified, it can be very difficult to fix. Sometimes there is an alternative to setting an environment variable, but often environment variables are the only way to configure important settings. If you have no other choice but to set an environment variable, then there is not really much you can do other than risk randomly crashing. Lesson: never make an environment variable the only way for an application to configure an important setting at runtime. Your inconsiderate mandatory environment variable might not be a problem for your own code, but it will be a problem for unforeseen future code that needs to use your code. Mandatory environment variables have caused many problems in GNOME, some unfixable on platforms without library constructors, and some frankly ludicrous. (libproxy, you are not batting very well right now.) The Linux desktop is already stuck with many mandatory environment variables that we cannot get rid of; we do not need to keep adding more. Please, please, please: learn from our mistakes.

I have no clue how to fix glib-pacrunner. It’s already a separate D-Bus service just to allow it to use unsetenv() (which it does safely) to trick libproxy. I suppose fixing this would require spawning a second helper process in order to run setenv() before the proxy lookup. More likely, we’ll just leave it broken, since to my knowledge, nobody has reported a crash here yet.

Software is wonderful!

On Compiling WebKit (now twice as fast!)

Are you tired of waiting for ages to build large C++ projects like WebKit? Slow headers are generally the problem. Your C++ source code file #includes a few headers, all those headers #include more, and those headers #include more, and more, and more, and since it’s C++ a bunch of these headers contain lots of complex templates to slow down things even more. Not fun.

It turns out that much of the time spent building large C++ projects is effectively spent parsing the same headers again and again, over, and over, and over, and over, and over….

There are three possible solutions to this problem:

  • Shred your CPU and buy a new one that’s twice as fast.
  • Use C++ modules: import instead of #include. This will soon become the best solution, but it’s not standardized yet. For WebKit’s purposes, we can’t use it until it works the same in MSVCC, Clang, and three-year-old versions of GCC. So it’ll be quite a while before we’re able to take advantage of modules.
  • Use unified builds (sometimes called unity builds).

WebKit has adopted unified builds. This work was done by Keith Miller, from Apple. Thanks, Keith! (If you’ve built WebKit before, you’ll probably want to say that again: thanks, Keith!)

For a release build of WebKitGTK+, on my desktop, our build times used to look like this:

real 62m49.535s
user 407m56.558s
sys 62m17.166s

That was taken using WebKitGTK+ 2.17.90; build times with any 2.18 release would be similar. Now, with trunk (or WebKitGTK+ 2.20, which will be very similar), our build times look like this:

real 33m36.435s
user 214m9.971s
sys 29m55.811s

Twice as fast.

The approach is pretty simple: instead of telling the compiler to build the original C++ source code files that developers see, we instead tell the compiler to build unified source files that look like this:

// UnifiedSource1.cpp
#include "CSSValueKeywords.cpp"
#include "ColorData.cpp"
#include "HTMLElementFactory.cpp"
#include "HTMLEntityTable.cpp"
#include "JSANGLEInstancedArrays.cpp"
#include "JSAbortController.cpp"
#include "JSAbortSignal.cpp"
#include "JSAbstractWorker.cpp"

Since files are included only once per translation unit, we now have to parse the same headers only once for each unified source file, rather than for each individual original source file, and we get a dramatic build speedup. It’s pretty terrible, yet extremely effective.

Now, how many original C++ files should you #include in each unified source file? To get the fastest clean build time, you would want to #include all of your C++ source files in one, that way the compiler sees each header only once. (Meson can do this for you automatically!) But that causes two problems. First, you have to make sure none of the files throughout your entire codebase use conflicting variable names, since the static keyword and anonymous namespaces no longer work to restrict your definitions to a single file. That’s impractical in a large project like WebKit. Second, because there’s now only one file passed to the compiler, incremental builds now take as long as clean builds, which is not fun if you are a WebKit developer and actually need to make changes to it. Unifying more files together will always make incremental builds slower. After some experimentation, Apple determined that, for WebKit, the optimal number of files to include together is roughly eight. At this point, there’s not yet much negative impact on incremental builds, and past here there are diminishing returns in clean build improvement.

In WebKit’s implementation, the files to bundle together are computed automatically at build time using CMake black magic. Adding a new file to the build can change how the files are bundled together, potentially causing build errors in different files if there are symbol clashes. But this is usually easy to fix, because only files from the same directory are bundled together, so random unrelated files will never be built together. The bundles are always the same for everyone building the same version of WebKit, so you won’t see random build failures; only developers who are adding new files will ever have to deal with name conflicts.

To significantly reduce name conflicts, we now limit the scope of using statements. That is, stuff like this:

using namespace JavaScriptCore;
namespace WebCore {
//...
}

Has been changed to this:

namespace WebCore {
using namespace JavaScriptCore;
// ...
}

Some files need to be excluded due to unsolvable name clashes. For example, files that include X11 headers, which contain lots of unnamespaced symbols that conflict with WebCore symbols, don’t really have any chance. But only a few files should need to be excluded, so this does not have much impact on build time. We’ve also opted to not unify most of the GLib API layer, so that we can continue to use conventional GObject names in our implementation, but again, the impact of not unifying a few files is minimal.

We still have some room for further performance improvement, because some significant parts of the build are still not unified, including most of the WebKit layer on top. But I suspect developers who have to regularly build WebKit will already be quite pleased.