Common GLib Programming Errors, Part Two: Weak Pointers

This post is a sequel to Common GLib Programming Errors, where I covered four common errors: failure to disconnect a signal handler, misuse of a GSource handle ID, failure to cancel an asynchronous function, and misuse of main contexts in library or threaded code. Although there are many ways to mess up when writing programs that use GLib, I believe the first post covered the most likely and most pernicious… except I missed weak pointers. Sébastien pointed out that these should be covered too, so here we are.

Mistake #5: Failure to Disconnect Weak Pointer

In object-oriented languages, weak pointers are a safety improvement. The idea is to hold a non-owning pointer to an object that gets automatically set to NULL when that object is destroyed to prevent use-after-free vulnerabilities. However, this only works well because object-oriented languages have destructors. Without destructors, we have to deregister the weak pointer manually, and failure to do so is a disaster that will result in memory corruption that’s extremely difficult to track down. For example:

static void
a_start_watching_b (A *self,
                    B *b)
{
  // Keep a weak reference to b. When b is destroyed,
  // self->b will automatically be set to NULL.
  self->b = b;
  g_object_add_weak_pointer (b, &self->b);
}

static void
a_do_something_with_b (Foo *self)
{
  if (self->b) {
    // Do something safely here, knowing that b
    // is assuredly still alive. This avoids a
    // use-after-free vulnerability if b is destroyed,
    // i.e. self->b cannot be dangling.
  }
}

Let’s say that the Bar in this example outlives the Foo, but Foo failed to call g_object_remove_weak_pointer() . Then when Bar is destroyed later, the memory that used to be occupied by self->bar will get clobbered with NULL. Hopefully that will result in an immediate crash. If not, good luck trying to debug what’s going wrong when some innocent variable elsewhere in your program gets randomly clobbered. This is often results in a frustrating wild goose chase when trying to track down what is going wrong (example).

The solution is to always disconnect your weak pointer. In most cases, your dispose function is the best place to do this:

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;
  g_clear_weak_pointer (&a->b);
  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Note that g_clear_weak_pointer() is equivalent to:

if (a->b) {
  g_object_remove_weak_pointer (a->b, &a->b);
  a->b = NULL;
}

but you probably guessed that, because it follows the same pattern as the other clear functions that we’ve used so far.

Common GLib Programming Errors

Let’s examine four mistakes to avoid when writing programs that use GLib, or, alternatively, four mistakes to look for when reviewing code that uses GLib. Experienced GNOME developers will find the first three mistakes pretty simple and basic, but nevertheless they still cause too many crashes. The fourth mistake is more complicated.

These examples will use C, but the mistakes can happen in any language. In unsafe languages like C, C++, and Vala, these mistakes usually result in security issues, specifically use-after-free vulnerabilities.

Mistake #1: Failure to Disconnect Signal Handler

Every time you connect to a signal handler, you must think about when it should be disconnected to prevent the handler from running at an incorrect time. Let’s look at a contrived but very common example. Say you have an object A and wish to connect to a signal of object B. Your code might look like this:

static void
some_signal_cb (gpointer user_data)
{
  A *self = user_data;
  a_do_something (self);
}

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  g_signal_connect (b, "some-signal", (GCallback)some_signal_cb, a);
}

Very simple. Now, consider what happens if the object B outlives object A, and Object B emits some-signal after object A has been destroyed. Then the line a_do_something (self) is a use-after-free, a serious security vulnerability. Drat!

If you think about when the signal should be disconnected, you won’t make this mistake. In many cases, you are implementing an object and just want to disconnect the signal when your object is disposed. If so, you can use g_signal_connect_object() instead of the vanilla g_signal_connect(). For example, this code is not vulnerable:

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  g_signal_connect_object (b, "some-signal", (GCallback)some_signal_cb, a, 0);
}

g_signal_connect_object() will disconnect the signal handler whenever object A is destroyed, so there’s no longer any problem if object B outlives object A. This simple change is usually all it takes to avoid disaster. Use g_signal_connect_object() whenever the user data you wish to pass to the signal handler is a GObject. This will usually be true in object implementation code.

Sometimes you need to pass a data struct as your user data instead. If so, g_signal_connect_object() is not an option, and you will need to disconnect manually. If you’re implementing an object, this is normally done in the dispose function:

// Object A instance struct (or priv struct)
struct _A {
  B *b;
  gulong some_signal_id;
};

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  g_assert (a->some_signal_id == 0);
  a->some_signal_id = g_signal_connect (b, "some-signal", (GCallback)some_signal_cb, a, 0);
}

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;
  g_clear_signal_handler (&a->some_signal_id, a->b);
  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Here, g_clear_signal_handler() first checks if &a->some_signal_id is 0. If not, it disconnects and sets &a->some_signal_id to 0. Setting your stored signal ID to 0 and checking whether it is 0 before disconnecting is important because dispose may run multiple times to break reference cycles. Attempting to disconnect the signal multiple times is another common programmer error!

Instead of calling g_clear_signal_handler(), you could equivalently write:

if (a->some_signal_id != 0) {
  g_signal_handler_disconnect (a->b, a->some_signal_id);
  a->some_signal_id = 0;
}

But writing that manually is no fun.

Yet another way to mess up would be to use the wrong integer type to store the signal ID, like guint instead of gulong.

There are other disconnect functions you can use to avoid the need to store the signal handler ID, like g_signal_handlers_disconnect_by_data(), but I’ve shown the most general case.

Sometimes, object implementation code will intentionally not disconnect signals if the programmer believes that the object that emits the signal will never outlive the object that is connecting to it. This assumption may usually be correct, but since GObjects are refcounted, they may be reffed in unexpected places, leading to use-after-free vulnerabilities if this assumption is ever incorrect. Your code will be safer and more robust if you disconnect always.

Mistake #2: Misuse of GSource Handler ID

Mistake #2 is basically the same as Mistake #1, but using GSource rather than signal handlers. For simplicity, my examples here will use the default main context, so I don’t have to show code to manually create, attach, and destroy the GSource. The default main context is what you’ll want to use if (a) you are writing application code, not library code, and (b) you want your callbacks to execute on the main thread. (If either (a) or (b) does not apply, then you need to carefully study GMainContext to ensure you do not mess up; see Mistake #4.)

Let’s use the example of a timeout source, although the same style of bug can happen with an idle source or any other type of source that you create:

static gboolean
my_timeout_cb (gpointer user_data)
{
  A *self = user_data;
  a_do_something (self);
  return G_SOURCE_REMOVE;
}

static void
some_method_of_a (A *self)
{
  g_timeout_add (42, (GSourceFunc)my_timeout_cb, a);
}

You’ve probably guessed the flaw already: if object A is destroyed before the timeout fires, then the call to a_do_something() is a use-after-free, just like when we were working with signals. The fix is very similar: store the source ID and remove it in dispose:

// Object A instance struct (or priv struct)
struct _A {
  gulong my_timeout_id;
};

static gboolean
my_timeout_cb (gpointer user_data)
{
  A *self = user_data;
  a_do_something (self);
  a->my_timeout_id = 0;
  return G_SOURCE_REMOVE;
}

static void
some_method_of_a (A *self)
{
  g_assert (a->my_timeout_id == 0);
  a->my_timeout_id = g_timeout_add (42, (GSourceFunc)my_timeout_cb, a);
}

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;
  g_clear_handle_id (&a->my_timeout_id, g_source_remove);
  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Much better: now we’re not vulnerable to the use-after-free issue.

As before, we must be careful to ensure the source is removed exactly once. If we remove the source multiple times by mistake, GLib will usually emit a critical warning, but if you’re sufficiently unlucky you could remove an innocent unrelated source by mistake, leading to unpredictable misbehavior. This is why we need to write a->my_timeout_id = 0; before returning from the timeout function, and why we need to use g_clear_handle_id() instead of g_source_remove() on its own. Do not forget that dispose may run multiple times!

We also have to be careful to return G_SOURCE_REMOVE unless we want the callback to execute again, in which case we would return G_SOURCE_CONTINUE. Do not return TRUE or FALSE, as that is harder to read and will obscure your intent.

Mistake #3: Failure to Cancel Asynchronous Function

When working with asynchronous functions, you must think about when it should be canceled to prevent the callback from executing too late. Because passing a GCancellable to asynchronous function calls is optional, it’s common to see code omit the cancellable. Be suspicious when you see this. The cancellable is optional because sometimes it is really not needed, and when this is true, it would be annoying to require it. But omitting it will usually lead to use-after-free vulnerabilities. Here is an example of what not to do:

static void
something_finished_cb (GObject      *source_object,
                       GAsyncResult *result,
                       gpointer      user_data)
{
  A *self = user_data;
  B *b = (B *)source_object;
  g_autoptr (GError) error = NULL;

  if (!b_do_something_finish (b, result, &error)) {
    g_warning ("Failed to do something: %s", error->message);
    return;
  }

  a_do_something_else (self);
}

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  b_do_something_async (b, NULL /* cancellable */, a);
}

This should feel familiar by now. If we did not use A inside the callback, then we would have been able to safely omit the cancellable here without harmful effects. But instead, this example calls a_do_something_else(). If object A is destroyed before the asynchronous function completes, then the call to a_do_something_else() will be a use-after-free.

We can fix this by storing a cancellable in our instance struct, and canceling it in dispose:

// Object A instance struct (or priv struct)
struct _A {
  GCancellable *cancellable;
};

static void
something_finished_cb (GObject      *source_object,
                       GAsyncResult *result,
                       gpointer      user_data)
{
  B *b = (B *)source_object;
  A *self = user_data;
  g_autoptr (GError) error = NULL;

  if (!b_do_something_finish (b, result, &error)) {
    if (!g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED))
      g_warning ("Failed to do something: %s", error->message);
    return;
  }
  a_do_something_else (self);
}

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  b_do_something_async (b, a->cancellable, a);
}

static void
a_init (A *self)
{
  self->cancellable = g_cancellable_new ();
}

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;

  g_cancellable_cancel (a->cancellable);
  g_clear_object (&a->cancellable);

  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Now the code is not vulnerable. Note that, since you usually do not want to print a warning message when the operation is canceled, there’s a new check for G_IO_ERROR_CANCELLED in the callback.

Update #1: I managed to mess up this example in the first version of my blog post. The example above is now correct, but what I wrote originally was:

if (!b_do_something_finish (b, result, &error) &&
    !g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) {
  g_warning ("Failed to do something: %s", error->message);
  return;
}
a_do_something_else (self);

Do you see the bug in this version? Cancellation causes the asynchronous function call to complete the next time the application returns control to the main context. It does not complete immediately. So when the function is canceled, A is already destroyed, the error will be G_IO_ERROR_CANCELLED, and we’ll skip the return and execute a_do_something_else() anyway, triggering the use-after-free that the example was intended to avoid. Yes, my attempt to show you how to avoid a use-after-free itself contained a use-after-free. You might decide this means I’m incompetent, or you might decide that it means it’s too hard to safely use unsafe languages. Or perhaps both!

Update #2:  My original example had an unnecessary explicit check for NULL in the dispose function. Since g_cancellable_cancel() is NULL-safe, the dispose function will cancel only once even if dispose runs multiple times, because g_clear_object() will set a->cancellable = NULL. Thanks to Guido for suggesting this improvement in the comments.

Mistake #4: Incorrect Use of GMainContext in Library or Threaded Code

My fourth common mistake is really a catch-all mistake for the various other ways you can mess up with GMainContext. These errors can be very subtle and will cause functions to execute at unexpected times. Read this main context tutorial several times. Always think about which main context you want callbacks to be invoked on.

Library developers should pay special attention to the section “Using GMainContext in a Library.” It documents several security-relevant rules:

  • Never iterate a context created outside the library.
  • Always remove sources from a main context before dropping the library’s last reference to the context.
  • Always document which context each callback will be dispatched in.
  • Always store and explicitly use a specific GMainContext, even if it often points to some default context.
  • Always match pushes and pops of the thread-default main context.

If you fail to follow all of these rules, functions will be invoked at the wrong time, or on the wrong thread, or won’t be called at all. The tutorial covers GMainContext in much more detail than I possibly can here. Study it carefully. I like to review it every few years to refresh my knowledge. (Thanks Philip Withnall for writing it!)

Properly-designed libraries follow one of two conventions for which main context to invoke callbacks on: they may use the main context that was thread-default at the time the asynchronous operation started, or, for method calls on an object, they may use the main context that was thread-default at the time the object was created. Hopefully the library explicitly documents which convention it follows; if not, you must look at the source code to figure out how it works, which is not fun. If the library documentation does not indicate that it follows either convention, it is probably unsafe to use in threaded code.

Conclusion

All four mistakes are variants on the same pattern: failure to prevent a function from being unexpectedly called at the wrong time. The first three mistakes commonly lead to use-after-free vulnerabilities, which attackers abuse to hack users. The fourth mistake can cause more unpredictable effects. Sadly, today’s static analyzers are probably not smart enough to catch these mistakes. You could catch them if you write tests that trigger them and run them with an address sanitizer build, but that’s rarely realistic. In short, you need to be especially careful whenever you see signals, asynchronous function calls, or main context sources.

Sequel

The adventure continues in Common GLib Programming Errors, Part Two: Weak Pointers.

Best Practices for Build Options

Build options are sometimes tricky to get right. Here’s my take on best practices. The golden rule is to set good upstream defaults. Everything else follows from this.

Rule #1: Choose Good Upstream Defaults

Occasionally I see upstream developers complain that a downstream operating system has built their software “incorrectly,” generally because some important dependency or feature has been disabled. Sometimes downstreams really do mess up, but more often poor upstream defaults are to blame. Upstreams must set good defaults because upstream software developers know far more about their projects than downstream packagers do. Upstreams generally have a good idea of how they expect software to be built by downstreams, whereas downstreams generally do not. Accordingly, do the thinking upstream whenever possible. When you set good defaults, it becomes easier for downstreams to build your software the way you expect, because active effort is required for downstreams to mess things up.

For example, say a project has the following two build options:

Option Name Default Value
--enable-thing-you-usually-want-enabled false
--disable-thing-you-rarely-want-disabled true

The thing you usually want enabled is not enabled by default, and the thing you rarely want disabled is disabled by default. Sad. Unfortunately, this pattern used to be extremely common with Autotools build systems, because in the real world, the names of the options are more subtle than this, and also because nobody likes squinting at configure.ac files to audit whether the options make sense. Meson build systems tend to be somewhat better because meson_options.txt is separate from the rest of the build definitions, making it easier to review all your options and check to ensure their defaults are set appropriately. However, there are still a few more subtle ways you can mess up your Meson build system, which I’ll discuss below.

Rule #2: Prefer Upstream Defaults Downstream

Conversely, downstreams should not second-guess upstream defaults unless you have a good reason to do so and really know what you’re doing.

For example, glib-networking’s Meson build system provides you with two different TLS backend options: OpenSSL or GnuTLS. The GnuTLS backend is enabled by default (well, sort of, see the next section on auto dependencies) while the OpenSSL backend is disabled by default. There’s a good reason for this: the OpenSSL backend is half-baked, partially due to bugs in glib-networking, and partially because OpenSSL just cannot do certain things that GnuTLS can. The OpenSSL backend is provided because some environments really do require it for license reasons, but it’s not the right choice for general-purpose operating systems. It may be tempting to think that you can pick whichever library you prefer, but you really should not.

Another example: WebKitGTK’s CMake build system provides a USE_WPE_RENDERER build option, which is enabled by default. This option controls which graphics rendering stack is used: if enabled, rendering uses libwpe and wpebackend-fdo, whereas if disabled, rendering uses a legacy internal Wayland compositor. The option is provided because libwpe and wpebackend-fdo are newer dependencies that are expected to be unavailable on older (pre-2020) operating systems, so older operating systems legitimately need to be able to disable it. But this configuration receives little serious testing and the upstream developers do not notice when it breaks, so you really should not be using it unless you have to. This recently caused rendering bugs that appeared to be distribution-specific, which upstream developers were not willing to investigate because upstream developers could not reproduce the issue.

Sticking with upstream defaults is generally safest. Sometimes you really need to override them. If so, go ahead. Just be careful.

Rule #3: Handle Auto Dependencies and Features with Care

The worst default ever is “build with feature enabled only if dependency xyz is installed; otherwise, disable it.” This is called an auto dependency. If using CMake or Autotools, auto dependencies are almost never permissible, and in this case “handle with care” means repent and fix it. Auto dependencies are acceptable only if you are using the Meson build system.

The theory behind auto dependencies is that it’s convenient for people casually building the software to do so with the fewest number of build errors possible, which is true. Problem is, this screws over serious production builds of your software by requiring your downstreams to possess magical knowledge of what dependencies are required to build your software properly. Users generally expect most features to be enabled at build time, but if upstream uses auto dependencies, the result is a build dependencies lottery: your feature will be enabled or disabled due to luck, based on which downstream build dependencies transitively depend on which other build dependencies. Even if it’s built properly today, that could easily change tomorrow when some other dependency changes in some other package. Just say no. Do not expect downstreams to look at your build system at all, let alone study the possible build options and make accurate judgments about which build dependencies are required to enable them. Avoiding auto dependencies is part of setting good upstream defaults.

Look at this example from WebKit’s OptionsGTK.cmake:

if (ENABLE_SPELLCHECK)
    find_package(Enchant)
    if (NOT PC_ENCHANT_FOUND)
        message(FATAL_ERROR "Enchant is needed for ENABLE_SPELLCHECK")
    endif ()
endif ()

ENABLE_SPELLCHECK is ON by default. If you don’t have enchant installed, the build will fail unless you manually disable it by passing -DENABLE_SPELLCHECK=OFF". This makes it hard to mess up: downstreams have to make an intentional choice to build with spellchecking disabled. It cannot happen by accident.

Many projects would instead write it like this:

if (ENABLE_SPELLCHECK)
    find_package(Enchant)
    if (NOT PC_ENCHANT_FOUND)
        set(ENABLE_SPELLCHECK OFF)
    endif ()
endif ()

But this is an auto dependency, which results in downstream build dependency lottery. If you write your build system like this, you cannot complain when the feature winds up disabled by mistake in downstream builds. Don’t do this.

Exception: if you use Meson, auto dependencies are acceptable if you use the feature option type and set the default to auto. Although auto features are silently enabled or disabled by default depending on whether the required dependency is present, you can easily override this behavior for serious production builds by passing -Dauto_features=enabled, which enables all the auto features and will result in build failures if dependencies are missing. All major Linux operating systems do this when building Meson packages, so Meson’s auto features should not cause problems.

Rule #4: Be Very Careful with Meson’s Build Types

Let’s categorize software builds into production builds or non-production builds. A production build is intended to be either distributed to end users or else run production workloads, whereas a non-production build is intended for testing or development and might have extra debug features enabled, like slow assertions. (These are more commonly called release builds or debug builds, but that terminology would be very confusing in the context of this discussion, as you’re about to see.)

The CMake and Meson build systems give us more than just two build types. Compare CMake build types to the corresponding Meson build types:

CMake Build Type Meson Build Type Meson debug Option Production Build?  (excludes Windows)
Release release false Yes
Debug debug true No
RelWithDebInfo debugoptimized true Yes, be careful!
MinSizeRel minsize true Yes, be careful!
N/A plain false Yes

To simplify, let’s exclude Windows from the discussion for now. (We’ll come back to Windows in a bit.) Now, notice the nomenclature difference between CMake’s RelWithDebInfo (“release with debuginfo”) build type versus Meson’s debugoptimized build type. This build type functions exactly the same for both Meson and CMake, but CMake’s name is better because it clearly indicates that this is a release or production build type, whereas the Meson name seems to indicate it is a debug or non-production build type, and Meson’s debug option is set to true. In fact, it is an optimized production build with debuginfo enabled, the same style of build that almost all Linux operating systems use for their packages (although operating systems use the plain build type instead). The same problem exists for Meson’s minsize build type. This is another production build type where debug is true.

The Meson build type name accurately reflects that the debug option is enabled, but this is very confusing because for most platforms, that option only controls whether debuginfo is generated. Looking at the table above, you can see that you must never use the debug option alone to decide whether you have a production build or a non-production build. As the table indicates, the only non-production build type is the vanilla debug build type, which you can detect by checking the combination of the debug and optimization options. You have a non-production (debug) build if debug is true and if optimization is 0 or g; otherwise, you have a production build.  I wrote this in bold because it is important and not at all obvious. (However, before applying this rule in a cross-platform project, keep reading below to see the huge caveat regarding Windows.)

Here’s an example of what not to do in your meson.build:

# Use debug/optimization flags to determine whether to enable debug or disable
# cast checks
gtk_debug_cflags = []
debug = get_option('debug')
optimization = get_option('optimization')
if debug
  gtk_debug_cflags += '-DG_ENABLE_DEBUG'
  if optimization in ['0', 'g']
    gtk_debug_cflags += '-DG_ENABLE_CONSISTENCY_CHECKS'
  endif
elif optimization in ['2', '3', 's']
  gtk_debug_cflags += ['-DG_DISABLE_CAST_CHECKS', '-DG_DISABLE_ASSERT']
endif

This is from GTK’s meson.build. The code based only on the optimization option is OK, but the code that sets -DG_ENABLE_DEBUG is looking only at the debug option. What the code really wants to do is set G_ENABLE_DEBUG if this is a non-production build, but instead it is tied to debuginfo, which is not the desired result. Downstreams are forced to scratch their heads as to what they should do. Impassioned build engineers have held spirited debates about this particular meson.build snippet. Don’t do this! (I will submit a merge request to improve this.)

Here’s a much better, although still not perfect, example of how to do the same thing, this time from GLib’s meson.build:

# Use debug/optimization flags to determine whether to enable debug or disable
# cast checks
glib_debug_cflags = []
glib_debug = get_option('glib_debug')
if glib_debug.enabled() or (glib_debug.auto() and get_option('debug'))
  glib_debug_cflags += ['-DG_ENABLE_DEBUG']
  message('Enabling various debug infrastructure')
elif get_option('optimization') in ['2', '3', 's']
  glib_debug_cflags += ['-DG_DISABLE_CAST_CHECKS']
  message('Disabling cast checks')
endif

if not get_option('glib_assert')
  glib_debug_cflags += ['-DG_DISABLE_ASSERT']
  message('Disabling GLib asserts')
endif

if not get_option('glib_checks')
  glib_debug_cflags += ['-DG_DISABLE_CHECKS']
  message('Disabling GLib checks')
endif

Notice how GLib provides explicit build options that allow downstreams to decide whether debug should be enabled or not. Using explicit build options here was a good idea! The defaults for glib_assert and glib_checks are intentionally set to true to encourage their use in production builds, while G_DISABLE_CAST_CHECKS is based only on the optimization level. But sadly, if not explicitly configured, GLib sets the value of the glib_debug_cflags option automatically, based on only the value of the debug option. This is actually an OK use of an auto feature, because it is a carefully-considered attempt to provide good default behavior for downstreams, but it fails here because it assumes that debug means “non-production build,” which we have previously established cannot be determined without checking optimization as well. (I will submit a merge request to improve this.)

Here’s another helpful table that shows how the various build types correspond to CFLAGS:

CMake/Meson Build Type CMake CFLAGS Meson CFLAGS
Release/release -O3 -DNDEBUG -O3
Debug/debug -g -O0 -g
RelWithDebInfo/debugoptimized -O2 -g -DNDEBUG -O2 -g
MinSizeRel/minsize -Os -DNDEBUG -Os -g

Notice Meson’s minsize build type includes debuginfo, while CMake’s does not. Since debuginfo requires a huge amount of space, CMake’s behavior seems better here. We’ll discuss NDEBUG momentarily.

OK, so that all makes sense, right? Well I thought so too, until I ran a draft of this blog post past Jussi, who pointed out that the Meson build types function completely differently on Windows than they do on other platforms. Unfortunately, whereas on most platforms the debug option only controls debuginfo generation, on Windows it instead controls whether the C library enables extra runtime debugging checks. So while debugoptimized and minsize are production build types on Linux and have nice corresponding CMake build types, they are non-production build types on Windows. This is a Meson defect. The point to remember is that the debug option is completely different on Windows than it is on other platforms, so my otherwise-nice rule for detecting production builds does not work properly on Windows. Cross-platform projects need to be especially careful with the debug option. There are various ways this could be fixed in Meson in the future: a nice simple proposal would be to add a new debuginfo option separate from debug, then deprecate the debugoptimized build type and replace it with releasewithdebuginfo.

CMake dodges all these problems and avoids any ambiguity because its build types are named differently: “RelWithDebInfo” and “MinSizeRel” leave no doubt that you are dealing with a release (production) build.

Rule #5: Think about NDEBUG

The other behavior difference visible in the table above is that CMake defines NDEBUG for its production build types, whereas Meson has a separate option bn_debug that controls whether to define NDEBUG. NDEBUG controls whether the C and C++ assert() macro is enabled: if this value is defined, asserts are disabled. CMake is the only build system that defines NDEBUG for you automatically. You really need to think about this: if your software is performance-sensitive and contains slow assertions, the consequences of messing this up are severe, e.g. see this historical mesa bug where Fedora’s mesa package suffered a 10x slowdown because mesa upstream accidentally enabled assertions by default. Again, please, do not blame downstreams for bad upstream defaults: downstreams are (usually) not experts on upstream software, and cannot possibly be expected to pick better defaults than upstream’s.

Meson allows developers to explicitly choose whether to enable assertions in production builds. Assertions are enabled in production by default, the opposite of CMake’s behavior. Some developers prefer that all asserts be disabled in production builds to optimize speed as far as possible, but this is usually not the best choice: having assertions enabled in production provides valuable confidence that your code is actually functioning as intended, and often improves security by converting many code execution exploits into denial of service. Most assertions do not have noticeable performance impact, so I prefer to leave most assertions enabled by default in production, and disable only asserts that are slow. Hence, I like Meson’s default behavior. But many engineers disagree with me, and some projects really do need assertions disabled in production; in particular, everyone agrees that performance-sensitive assertions should not be running in production builds. If you’re using Meson and want assertions disabled in production builds, you’re in theory supposed to use b_ndebug=if-release, but it doesn’t actually work because it only disables assertions if your build type is release or plain, while leaving assertions enabled for debugoptimized and minsize builds. We’ve already established that these are both production build types, so sadly that behavior is broken. Instead, it’s better to manually define NDEBUG except in non-production builds. Again, you have a non-production (debug) build when debug is true and if optimization is 0 or g; otherwise, you have a production build (except on Windows).

Rule #6: plain Means “Production Build,” Not “No Flags”

The GNOME release team recently had an exciting debate about the meaning of Meson’s plain build type. It is impressive how build engineers can be so enthusiastic about build options!

I asked Jussi to explain the plain build type. His response was: “Meson does not, by itself, add compiler flags,” emphasis mine. It does not mean your project should not add its own compiler flags, and it certainly does not mean it’s OK to set bad defaults as long as they are vanilla-flavored. It is a production build type, and you should ensure that it receives defaults in line with the other production build types. You’ll be fine if you follow the same rule we already established: you have a non-production (debug) build if debug is true and if optimization is 0 or g; otherwise, you have a production build (except on Windows).

The plain build type exists because it makes it easier for downstreams to implement their own compiler flags. Downstreams have to pass -O2 -g via CFLAGS because CMake and Meson are the only build systems that can do this automatically, and it’s easier to let downstreams disable this functionality than to force downstreams to set different CFLAGS separately for each supported build system.

Rule #7: Don’t Forget Hardening Flags

Sadly, by default all build systems generate insecure, unhardened binaries that should never be used in production. This is true of Autotools, CMake, Meson, and likely also whatever other build system you are thinking of. You must manually add your own hardening flags or your builds will be insecure. Unfortunately this is a little complicated to do. Fedora and RHEL’s recommended compiler flags are documented here. The freedesktop-sdk and GNOME Flatpak runtimes use these recommendations as the basis for their compiler flags, and by default, so do Flatpak applications based on these runtimes. It’s actually not very easy to replicate the same hardening flags since libraries and executables require different flags, so naively setting CFLAGS is not possible. Fedora and RHEL use GCC spec files to achieve this, whereas freedesktop-sdk relies on building GCC itself with a non-default configuration (yes, second-guessing upstream defaults). The good news is that all major downstreams have figured this out one way or another, so you only need to worry about it if you’re doing your own production builds.

Conclusion

That’s all you need to know. Remember, upstreams know their software better than downstreams do, so the hard thinking should happen upstream. We can minimize mistakes and trouble if upstreams carefully set good defaults, and if downstreams deviate from those defaults only when truly necessary. Keep these rules in mind to avoid unnecessary bug reports from dissatisfied users.

History

I updated this blog post on August 3, 2022 to change the primary guidance to “You have a non-production (debug) build if debug is true and if optimization is 0 or g; otherwise, you have a production build.” Originally, I failed to consider g. -Og means “optimize debugging experience” and it is supposedly a better choice than -O0 for debugging according to gcc(1). It’s definitely not actually, but at least that’s the intent.

Jussi responded to this blog post on August 13, 2022 to discuss why Meson’s build types don’t work so well. Read his response.

Creating Quality Backtraces for Crash Reports

Hello Linux users! Help developers help you: include a quality backtrace taken with gdb each and every time you create an issue report for a crash. If you don’t, most developers will request that you provide a backtrace, then ignore your issue until you manage to figure out how to do so. Save us the trouble and just provide the backtrace with your initial report, so everything goes smoother. (Backtraces are often called “stack traces.” They are the same thing.)

Don’t just copy the lower-quality backtrace you see in your system journal into your issue report. That’s a lot better than nothing, but if you really want the crash to be fixed, you should provide the developers with a higher-quality backtrace from gdb. Don’t know how to get a quality backtrace with gdb? Read on.

(Note: this blog post is occasionally updated to maintain relevance and remove historical information. Last update: May 2022)

Modern Crash Reporting

Here are instructions for getting a quality backtrace for a crashing process on Fedora 35, or any other Linux-based OS that enables coredumpctl and debuginfod:

$ coredumpctl gdb
(gdb) bt full

Enter ‘c’ (continue) when required. Enter ‘y’ when prompted to enable debuginfod. When it’s done printing, press ‘q’ to quit. That’s it! That’s all you need to know. You’re done. Two points of note:

  • When a process crashes, a core dump is caught by systemd-coredump and stored for future use. The coredumpctl gdb command opens the most recent core dump in gdb. systemd-coredump has been enabled by default in Fedora since Fedora 26. (It’s also enabled by default in RHEL 8.)
  • After opening the core dump, gdb uses debuginfod to automatically download all required debuginfo packages, ensuring the generated backtrace is useful. debuginfod has been enabled by default in Fedora since Fedora 35.

Quality Linux operating systems ought to configure both debuginfod and systemd-coredump for you, so that they are running out-of-the-box. If you’re missing debuginfod or systemd-coredump, then read on to learn how to take a backtrace without these tools. It will be more complicated, of course.

systemd-coredump

If your operating system enables systemd-coredump by default, then congratulations! This makes reporting crashes much easier because you can easily retrieve a core dump for any recent crash using the coredumpctl command. For example, coredumpctl alone will list all available core dumps. coredumpctl gdb will open the core dump of the most recent crash in gdb. coredumpctl gdb 1234 will open the core dump corresponding to the most recent crash of a process with pid 1234. It doesn’t get easier than this.

Core dumps are stored under /var/lib/systemd/coredump. systemd-coredump will automatically delete core dumps that exceed configurable size limits (2 GB by default). It also deletes core dumps if your free disk space falls below a configurable threshold (15% free by default). Additionally, systemd-tmpfiles will delete core dumps automatically after some time has passed (three days by default). This ensures your disk doesn’t fill up with old core dumps. Although most of these settings seem good to me, the default 2 GB size limit is way too low in my opinion, as it causes systemd to immediately discard crashes of any application that uses WebKit. I recommend raising this limit to 20 GB by creating an /etc/systemd/coredump.conf.d/50-coredump.conf drop-in containing the following:

[Coredump]
ProcessSizeMax=20G
ExternalSizeMax=20G

The other settings are likely sufficient to prevent your disk from filling up with core dumps.

Sadly, although systemd-coredump has been around for a good while now and many Linux operating systems have it enabled by default, many still do not. Most notably, the Debian ecosystem is still not yet on board. To check if systemd-coredump is enabled on your system:

$ cat /proc/sys/kernel/core_pattern

If you see systemd-coredump, then you’re good.

To enable it in Debian or Ubuntu, just install it:

# apt install systemd-coredump

Ubuntu users, note this will cause apport to be uninstalled, since it is currently incompatible. Also note that I switched from $ (which indicates a normal prompt) to # (which indicates a root prompt).

In other operating systems, you may have to manually enable it:

# echo "kernel.core_pattern=|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" > /etc/sysctl.d/50-coredump.conf
# /usr/lib/systemd/systemd-sysctl --prefix kernel.core_pattern

Note the exact core pattern to use changes occasionally in newer versions of systemd, so these instructions may not work everywhere.

Detour: Manual Core Dump Handling

If you don’t want to enable systemd-coredump, life is harder and you should probably reconsider, but it’s still possible to debug most crashes. First, enable core dump creation by removing the default 0-byte size limit on core files:

$ ulimit -c unlimited

This change is temporary and only affects the current instance of your shell. For example, if you open a new tab in your terminal, you will need to set the ulimit again in the new tab.

Next, run your program in the terminal and try to make it crash. A core file will be generated in the current directory. Open it by starting the program that crashed in gdb and passing the filename of the core file that was created. For example:

$ gdb gnome-chess ./core

This is downright primitive, though:

  • You’re going to have a hard time getting backtraces for services that are crashing, for starters. If starting the service normally, how do you set the ulimit? I’m sure there’s a way to do it, but I don’t know how! It’s probably easier to start the service manually, but then what command line flags are needed to properly do so? It will be different for each service, and you have to figure this all out for yourself.
  • Special situations become very difficult. For example, if a service is crashing only when run early during boot, or only during an initial setup session, you are going to have an especially hard time.
  • If you don’t know how to reproduce a crash that occurs only rarely, it’s inevitably going to crash when you’re not prepared to manually catch the core dump. Sadly, not all crashes will occur on demand when you happen to be running the software from a terminal with the right ulimit configured.
  • Lastly, you have to remember to delete that core file when you’re done, because otherwise it will take up space on your disk space until you do. You’ll probably notice if you leave core files scattered in ~/home, but you might not notice if you’re working someplace else.

Seriously, just enable systemd-coredump. It solves all of these problems and guarantees you will always have easy access to a core dump when something crashes, even for crashes that occur only rarely.

Debuginfo Installation

Now that we know how to open a core dump in gdb, let’s talk about debuginfo. When you don’t have the right debuginfo packages installed, the backtrace generated by gdb will be low-quality. Almost all Linux software developers deal with low-quality backtraces on a regular basis, because most users are not very good at installing debuginfo. Again, if you’re using Fedora 35 or newer, you don’t have to worry about this anymore because debuginfod will take care of everything for you. I would be thrilled if other Linux operating systems would quickly adopt debuginfod so we can put the era of low-quality crash reports behind us. But if you’re using an operating system that does not provide a debuginfod server, you’ll need to learn how to install debuginfo manually.

As an example, I decided to force gnome-chess to crash using the command killall -SEGV gnome-chess, then I ran coredumpctl gdb to open the resulting core dump in gdb. After a bunch of spam, I saw this:

Missing separate debuginfos, use: dnf debuginfo-install gnome-chess-40.1-1.fc34.x86_64
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/bin/gnome-chess --gapplication-service'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
29  return SYSCALL_CANCEL (poll, fds, nfds, timeout);
[Current thread is 1 (Thread 0x7fa23ca0cd00 (LWP 140177))]
(gdb)

If you are using Fedora, RHEL, or related operating systems, the line “missing separate debuginfos” is a good hint that debuginfo is missing. It even tells you exactly which dnf debuginfo-install command to run to remedy this problem! But this is a Fedora ecosystem feature, and you won’t see this on most other operating systems. Usually, you’ll need to manually locate the right debuginfo packages to install. Debian and Ubuntu users can do this by searching for and installing -dbg or -dbgsym packages until each frame in the backtrace looks good. You’ll just have to manually guess the names of which debuginfo packages you need to install based on the names of the libraries in the backtrace. Look here for instructions for popular operating systems.

How do you know when the backtrace looks good? When each frame has file names, line numbers, function parameters, and local variables! Here is an example of a bad backtrace, if I continue the gnome-chess example above without properly installing the required debuginfo:

(gdb) bt full
#0 0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1 0x00007fa23eee648c in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
#2 0x00007fa23ee8fc03 in g_main_context_iteration () at /lib64/libglib-2.0.so.0
#3 0x00007fa23e4b599d in g_application_run () at /lib64/libgio-2.0.so.0
#4 0x00005636dd7b79a2 in chess_application_main ()
#5 0x00007fa23d7e7b75 in __libc_start_main (main=0x5636dd7aaa50 <main>, argc=2, argv=0x7fff827b6438, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff827b6428)
    at ../csu/libc-start.c:332
        self = <optimized out>
        result = <optimized out>
        unwind_buf = 
              {cancel_jmp_buf = {{jmp_buf = {94793644186304, 829313697107602221, 94793644026480, 0, 0, 0, -829413713854928083, -808912263273321683}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x2, 0x7fff827b6438}, data = {prev = 0x0, cleanup = 0x0, canceltype = 2}}}
        not_first_call = <optimized out>
#6 0x00005636dd7aaa9e in _start ()

This backtrace has six frames, which shows where the code was during program execution when the crash occurred. You can see line numbers for frame #0 (poll.c:29) and #5 (libc-start.c:332), and these frames also show the values of function parameters and variables on the stack, which are often useful for figuring out what went wrong. These frames have good debuginfo because I already had debuginfo installed for glibc. But frames #1 through #4 do not look so useful, showing only function names and the library and nothing else. This is because I’m using Fedora 34 rather than Fedora 35, so I don’t have debuginfod yet, and I did not install proper debuginfo for libgio, libglib, and gnome-chess. (The function names are actually only there because programs in Fedora include some limited debuginfo by default. In many operating systems, you will see ??? instead of function names.) A developer looking at this backtrace is not going to know what went wrong.

Now, let’s run the recommended debuginfo-install command:

# dnf debuginfo-install gnome-chess-40.1-1.fc34.x86_64

When the command finishes, we’ll start gdb again, using coredumpctl gdb just like before. This time, we see this:

Missing separate debuginfos, use: dnf debuginfo-install avahi-libs-0.8-14.fc34.x86_64 colord-libs-1.4.5-2.fc34.x86_64 cups-libs-2.3.3op2-7.fc34.x86_64 fontconfig-2.13.94-2.fc34.x86_64 glib2-2.68.4-1.fc34.x86_64 graphene-1.10.6-2.fc34.x86_64 gstreamer1-1.19.1-2.1.18.4.fc34.x86_64 gstreamer1-plugins-bad-free-1.19.1-3.1.18.4.fc34.x86_64 gstreamer1-plugins-base-1.19.1-2.1.18.4.fc34.x86_64 gtk4-4.2.1-1.fc34.x86_64 json-glib-1.6.6-1.fc34.x86_64 krb5-libs-1.19.2-2.fc34.x86_64 libX11-1.7.2-3.fc34.x86_64 libX11-xcb-1.7.2-3.fc34.x86_64 libXfixes-6.0.0-1.fc34.x86_64 libdrm-2.4.107-1.fc34.x86_64 libedit-3.1-38.20210714cvs.fc34.x86_64 libepoxy-1.5.9-1.fc34.x86_64 libgcc-11.2.1-1.fc34.x86_64 libidn2-2.3.2-1.fc34.x86_64 librsvg2-2.50.7-1.fc34.x86_64 libstdc++-11.2.1-1.fc34.x86_64 libxcrypt-4.4.25-1.fc34.x86_64 llvm-libs-12.0.1-1.fc34.x86_64 mesa-dri-drivers-21.1.8-1.fc34.x86_64 mesa-libEGL-21.1.8-1.fc34.x86_64 mesa-libgbm-21.1.8-1.fc34.x86_64 mesa-libglapi-21.1.8-1.fc34.x86_64 nettle-3.7.3-1.fc34.x86_64 openldap-2.4.57-5.fc34.x86_64 openssl-libs-1.1.1l-1.fc34.x86_64 pango-1.48.9-2.fc34.x86_64

Yup, Fedora ecosystem users will need to run dnf debuginfo-install twice to install everything required, because gdb doesn’t list all required packages until the second time. Next, we’ll run coredumpctl gdb one last time. There will usually be a few more debuginfo packages that are still missing because they’re not available in the Fedora repositories, but now you’ll probably have enough to get a quality backtrace:

(gdb) bt full
#0  0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1  0x00007fa23eee648c in g_main_context_poll
    (priority=, n_fds=2, fds=0x5636deb06930, timeout=, context=0x5636de7b24a0)
    at ../glib/gmain.c:4434
        ret = 
        errsv = 
        poll_func = 0x7fa23ee97c90 
        max_priority = 2147483647
        timeout = 2830
        some_ready = 
        nfds = 2
        allocated_nfds = 2
        fds = 0x5636deb06930
        begin_time_nsec = 30619110638882
#2  g_main_context_iterate.constprop.0
    (context=context@entry=0x5636de7b24a0, block=block@entry=1, dispatch=dispatch@entry=1, self=)
    at ../glib/gmain.c:4126
        max_priority = 2147483647
        timeout = 2830
        some_ready = 
        nfds = 2
        allocated_nfds = 2
        fds = 0x5636deb06930
        begin_time_nsec = 30619110638882
#3  0x00007fa23ee8fc03 in g_main_context_iteration
    (context=context@entry=0x5636de7b24a0, may_block=may_block@entry=1) at ../glib/gmain.c:4196
        retval = 
#4  0x00007fa23e4b599d in g_application_run
    (application=0x5636de7ae260 [ChessApplication], argc=-2105843004, argv=)
    at ../gio/gapplication.c:2560
        arguments = 0x5636de7b2400
        status = 0
        context = 0x5636de7b24a0
        acquired_context = 
        __func__ = "g_application_run"
#5  0x00005636dd7b79a2 in chess_application_main (args=0x7fff827b6438, args_length1=2)
    at src/gnome-chess.p/gnome-chess.c:5623
        _tmp0_ = 0x5636de7ae260 [ChessApplication]
        _tmp1_ = 0x5636de7ae260 [ChessApplication]
        _tmp2_ = 
        result = 0
...

I removed the last two frames because they are triggering a strange WordPress bug, but that’s enough to get the point. It looks much better! Now the developer can see exactly where the program crashed, including filenames, line numbers, and the values of function parameters and variables on the stack. This is as good as a crash report is normally going to get. In this case, it crashed when running poll() because gnome-chess was not actually doing anything at the time of the crash, since we crashed it by manually sending a SIGSEGV signal. Normally the backtrace will look more interesting.

debuginfod for Debian Users

Debian users can use debuginfod, but it has to be enabled manually:

$ DEBUGINFOD_URLS=https://debuginfod.debian.net/ gdb

See here for more information. This requires Debian 11 “bullseye” or newer. If you’re using Ubuntu or other operating systems derived from Debian, you’ll need to wait until a debuginfod server for your operating system is available.

Flatpak

If your application uses Flatpak, you can use the flatpak-coredumpctl script to open core dumps in gdb. For most runtimes, including those distributed by GNOME or Flathub, you will need to manually install (a) the debug extension for your app, (b) the SDK runtime corresponding to the platform runtime that you are using, and (c) the debug extension for the SDK runtime. For example, to install everything required to debug Epiphany 40 from Flathub, you would run:

$ flatpak install org.gnome.Epiphany.Debug//stable
$ flatpak install org.gnome.Sdk//40
$ flatpak install org.gnome.Sdk.Debug//40

(flatpak-coredumpctl will fail to start if you don’t have the correct SDK runtime installed, but it will not fail if you’re missing the debug extensions. You’ll just wind up with a bad backtrace.)

The debug extensions need to exactly match the versions of the app and runtime that crashed, so backtrace generation may be unreliable after you install them for the very first time, because you would have installed the latest versions of the extensions, but your core dump might correspond to an older app or runtime version. If the crash is reproducible, it’s a good idea to run flatpak update after installing to ensure you have the latest version of everything, then reproduce the crash again.

Once your debuginfo is installed, you can open the backtrace in gdb using flatpak-coredumpctl. You just have to tell flatpak-coredumpctl the app ID to use:

$ flatpak-coredumpctl org.gnome.Epiphany

You can pass matches to coredumpctl using -m. For example, to open the core dump corresponding to a crashed process with pid 1234:

$ flatpak-coredumpctl -m 1234 org.gnome.Epiphany

Thibault Saunier wrote flatpak-coredumpctl because I complained about how hard it used to be to debug crashed Flatpak applications. Clearly it is no longer hard. Thanks Thibault!

On newer versions of Debian and Ubuntu, flatpak-coredumpctl is included in the libflatpak-dev subpackage rather than the base flatpak package, so you’ll have to install libflatpak-dev first. But on older OS versions, including Debian 10 “buster” and Ubuntu 20.04, it is unfortunately installed as /usr/share/doc/flatpak/examples/flatpak-coredumpctl rather than /usr/bin/flatpak-coredumpctl due to a regrettable packaging choice that has been corrected in newer package versions. As a workaround, you can simply copy it to /usr/local/bin. Don’t forget to delete your copy after upgrading to a newer OS version, or it will shadow the packaged version.

Fedora Flatpaks

Flatpaks distributed by Fedora are different than those distributed by GNOME or by Flathub because they do not have debug extensions. Historically, this has meant that debugging crashes was impractical. The best solution was to give up.

Good news! Fedora’s Flatpaks are compatible with debuginfod, which means debug extensions will no longer be missed. You do still need to manually install the org.fedoraproject.Sdk runtime corresponding to the version of the org.fedoraproject.Platform runtime that the application uses, because this is required for flatpak-coredumpctl to work, but nothing else is required. For example, to get a backtrace for Fedora’s Epiphany Flatpak using a Fedora 35 host system, I ran:

$ flatpak install org.fedoraproject.Sdk//f34
$ flatpak-coredumpctl org.gnome.Epiphany
(gdb) bt full

(The f34 is not a typo. Epiphany currently uses the Fedora 34 runtime regardless of what host system you are using.)

That’s it!

Miscellany

At this point, you should know enough to obtain a high-quality backtrace on most Linux systems. That will usually be all you really need, but it never hurts to know a little more, right?

Alternative Types of Backtraces

At the top of this blog post, I suggested using bt full to take the backtrace because this type of backtrace is the most useful to most developers. But there are other types of backtraces you might occasionally want to collect:

  • bt on its own without full prints a much shorter backtrace without stack variables or function parameters. This form of the backtrace is more useful for getting a quick feel for where the bug is occurring, because it is much shorter and easier to read than a full backtrace. But because there are no stack variables or function parameters, it might not contain enough information to solve the crash. I sometimes like to paste the first few lines of a bt backtrace directly into an issue report, then submit the bt full version of the backtrace as an attachment, since an entire bt full backtrace can be long and inconvenient if pasted directly into an issue report.
  • thread apply all bt prints a backtrace for every thread. Normally these backtraces are very long and noisy, so I don’t collect them very often, but when a threadsafety issue is suspected, this form of backtrace will sometimes be required.
  • thread apply all bt full prints a full backtrace for every thread. This is what automated bug report tools generally collect, because it provides the most information. But these backtraces are usually huge, and this level of detail is rarely needed, so I normally recommend starting with a normal bt full.

If in doubt, just use bt full like I showed at the top of this blog post. Developers will let you know if they want you to provide the backtrace in a different form.

gdb Logging

You can make gdb print your session to a file. For longer backtraces, this may be easier than copying the backtrace from a terminal:

(gdb) set logging on

Memory Corruption

While a backtrace taken with gdb is usually enough information for developers to debug crashes, memory corruption is an exception. Memory corruption is the absolute worst. When memory corruption occurs, the code will crash in a location that may be far removed from where the actual bug occurred, rendering gdb backtraces useless for tracking down the bug. As a general rule, if you see a crash inside a memory allocation routine like malloc() or g_slice_alloc(), you probably have memory corruption. If you see magazine_chain_pop_head(), that’s called by g_slice_alloc() and is a sure sign of memory corruption. Similarly, crashes in GTK’s CSS machinery are almost always caused by memory corruption somewhere else.

Memory corruption is generally impossible to debug unless you are able to reproduce the issue under valgrind. valgrind is extremely slow, so it’s impractical to use it on a regular basis, but it will get to the root of the problem where gdb cannot. As a general rule, you want to run valgrind with --track-origins=yes so that it shows you exactly what went wrong:

$ valgrind --track-origins=yes my_app

If you cannot reproduce the issue under valgrind, you’re usually totally out of luck. Memory corruption that only occurs rarely or under unknown conditions will lurk in your code indefinitely and cause occasional crashes that are effectively impossible to fix.

Another good tool for debugging memory corruption is address sanitizer (asan), but this is more complicated to use. Experienced users who are comfortable with rebuilding applications using special compiler flags may find asan very useful. However, because it can be very difficult to use,  I recommend sticking with valgrind if you’re just trying to report a bug.

Apport and ABRT

There are two popular downstream bug reporting tools: Ubuntu has Apport, and Fedora has ABRT. These tools are relatively easy to use — no command line knowledge required — and produce quality crash reports. Unfortunately, while the tools are promising, the crash reports go to downstream packagers who are generally either not watching bug reports, or else not interested or capable of fixing upstream software problems. Since downstream reports are very often ignored, it’s better to report crashes directly to upstream if you want your issue to be seen by the right developers and actually fixed. Of course, only report issues upstream if you’re using a recent software version. Fedora and Arch users can pretty much always safely report directly to upstream, as can Ubuntu users who are using the very latest version of Ubuntu. If you are an Ubuntu LTS user, you should stick with reporting issues to downstream only, or at least take the time to verify that the issue still occurs with a more recent software version.

There are a couple more problems with these tools. As previously mentioned, Ubuntu’s apport is incompatible with systemd-coredump. If you’ve read this far, you know you really want systemd-coredump enabled, so I recommend disabling apport until it learns to play ball with systemd-coredump.

The technical design of Fedora’s ABRT is currently better because it actually retrieves your core dumps from systemd-coredump, so you don’t have to choose between one or the other. Unfortunately, ABRT has many serious user experience bugs and warts. I can’t recommend it for this reason, but it if it works well enough for you, it does create some great downstream crash reports. Whether a downstream package maintainer will look at those reports is hit or miss, though.

What is a crash, really?

Most developers consider crashes on Unix systems to be program termination via a Unix signal that triggers creation of a core dump. The most common of these are SIGSEGV (segmentation fault, “invalid memory reference”) or SIBABRT (usually an intentional crash due to an assertion failure). Less-common signals are SIGBUS (“bad memory access”) or SIGILL (“illegal instruction”). Sandboxed applications might occasionally see SIGSYS (“bad system call”). See the manpage signal(7) for a full list. These are cases where you can get a backtrace to help with tracking down the issues.

What is not a crash? If your application is hanging or just not behaving properly, that is not a crash. If your application is killed using SIGTERM or SIGKILL — this can happen when systemd-oomd determines you are low on memory,  or when a service is taking too long to stop — this is also not a crash in the usual sense of the word, because you’re not going to be able to get a backtrace for it. If a website is slow or unavailable, the news might say that it “crashed,” but it’s obviously not the same thing as what we’re talking about here. The techniques in this blog post are no use for these sorts of “crashes.”

Conclusion

If you have systemd-coredump enabled and debuginfod installed and working, most crash reports will be simple.  Memory corruption is a frustrating exception. Encourage your operating system to enable systemd-coredump and debuginfod if it doesn’t already.  Happy crash reporting!

Reminder: SoupSessionSync and SoupSessionAsync default to no TLS certificate verification

This is a public service announcement! The modern SoupSession class is secure by default, but the older, deprecated SoupSessionSync and SoupSessionAsync subclasses of SoupSession are not. If your code uses SoupSessionSync or SoupSessionAsync and does not set SoupSession:tls-database, SoupSession:ssl-use-system-ca-file, or SoupSession:ssl-ca-file, then you get no TLS certificate verification. This is almost always worth requesting a CVE.

SoupSessionSync and SoupSessionAsync have both been deprecated since libsoup 2.42 was released back in 2013, so surely they are no longer commonly used, right? Right? Of course not, so please check your applications and make sure they’re actually doing TLS certificate verification. Thanks!

Update: we decided to change this behavior.

Free Software Charities

I believe we have reached a point where it is time to discontinue donations to the Free Software Foundation, in light of the outrageously poor judgment shown by its board of directors in reinstating Richard Stallman to the board. I haven’t seen other free software community members calling for cutting off donations yet. Even the open letter doesn’t call for this. Edit: I’ve been pointed to the line “We urge those in a position to do so to stop supporting the Free Software Foundation,” which surely implies a call to stop donating, so I was wrong about that.

I have no doubt there will be follow-up blog posts explaining why cutting off donations is harmful to the community and the FSF’s staff, and will hurt the FSF in the long-term… but seriously, enough is enough. If we don’t draw the line here, there will never be any line anywhere. Continued support for the FSF is continued complicity, and is harming rather than helping advance the ideals of the free software community. So, where should you donate instead?

The Software Freedom Conservancy is a great choice. It hosts many member projects you’re probably familiar with, including Outreachy and git, among many others. It does some GPL compliance work and takes a strict view on software freedom. For US donors, it is a 501(c)(3), so your donations may be tax-deductible. This is where I send my free software donations not directed to GNOME. Read its statement on recent events at the FSF.

Another good option, especially for people in the EU, is the Free Software Foundation Europe, an independently-run sister organization of the FSF. If you’re not familiar with FSFE, think of it as a more moderate version of the FSF, with a special focus on Europe. It shares the same goals as the FSF, but with more reasonable leadership and much less popcorn. Read its statement on recent events at the FSF.

Most people reading this blog are likely GNOME users. Contributing to GNOME is a great way to support the desktop you use to get things done. Monthly sustaining donations are especially appreciated. Despite the historical association between GNOME and GNU, GNOME has had little to do with GNU for a long time now. The GNOME Foundation is run independently and has formally signed the open letter calling for the resignation of the FSF’s board of directors. For US donors, it is a 501(c)(3). (If you use KDE, donate to KDE here.)

It should be obvious, but this is a personal blog. None of the above organizations have endorsed cutting off donations to the FSF, to my knowledge. They would probably find it to be in poor taste to abuse a crisis to solicit funds. But I doubt they’d object if you send some money their way.

Understanding systemd-resolved, Split DNS, and VPN Configuration

So, systemd-resolved is enabled by default in Fedora 33. Most users won’t notice the difference, but if you use VPNs — or depend on DNSSEC, more on that at the bottom of this post — then systemd-resolved might be big deal for you. When testing Fedora 33, we found one bug report where a user discovered that systemd-resolved broke his VPN configuration. After this bug was fixed, and nobody reported any further issues, I was pretty confident that migration to systemd-resolved would go smoothly. Then Fedora 33 was released, and I noticed a significant number of users on Ask Fedora and Reddit asking for help with broken VPNs, problems that Fedora 33 beta testers had failed to detect. This was especially surprising to me because Ubuntu has enabled systemd-resolved by default since Ubuntu 16.10, so we were four full years behind Ubuntu here, which should have been plenty of time for any problems to be ironed out. So what went wrong?

First, let’s talk about how things worked before systemd-resolved, so we can see what was wrong and why we needed change. We’ll see how split DNS with systemd-resolved is different than traditional DNS. Finally, we’ll learn how custom VPN software must configure systemd-resolved to avoid problems that result in broken DNS.

I want to note that, although I wrote the Fedora change proposal and have done some evangelism on behalf of systemd-resolved, I’m not a systemd developer and haven’t contributed any code to systemd-resolved.

Traditional DNS with nss-dns

Let’s first see how things worked before systemd-resolved. There are two important configuration files to discuss. The first is /etc/nsswitch.conf, which controls which NSS modules are invoked by glibc when performing name resolution. Note these are glibc Name Service Switch modules, which are totally unrelated to Firefox’s NSS, Network Security Services, which unfortunately uses the same acronym. Also note that, in Fedora (and also Red Hat Enterprise Linux), /etc/nsswitch.conf is managed by authselect and must not be edited directly. If you want to change it, you need to edit /etc/authselect/user-nsswitch.conf instead, then run sudo authselect apply-changes.

Anyway, in Fedora 32, the hosts line in /etc/nsswitch.conf looked like this:

hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname

That means: first invoke nss-files, which looks at /etc/hosts to see if the hostname is hardcoded there. If it’s not, then invoke nss-mdns4_minimal, which uses avahi to implement mDNS resolution. [NOTFOUND=return] means it’s OK for avahi to not be installed; in that case, it just gets ignored. (Edit: this was wrong. Mantas mentioned in the comment below that this is to allow returning early for queries to .local domains, which should never go to the remaining nss modules.) Then most DNS resolution is performed by nss-dns. And finally, we have nss-myhostname, which is just there to guarantee that your own local hostname is always resolvable. Anyway, nss-dns is the key part here. nss-dns is what reads /etc/resolv.conf.

Next, let’s look at /etc/resolv.conf. This file contains a list of up to three DNS servers to use. The servers are attempted in order. If the first server in the list is broken, then the second server will be used. If the second server is broken, the third server will be used. If the third server is also broken, then everything fails, because no matter how many servers you list here, all except the first three are ignored. In Fedora 32, /etc/resolv.conf was, by default, a plain file managed by NetworkManager. It might look like this:

# Generated by NetworkManager
nameserver 192.168.122.1

That’s a pretty common example. It means that all DNS requests should be sent to my router. My router must have configured this via DHCP, causing NetworkManager to dutifully add it to /etc/resolv.conf.

Traditional DNS Problems

Traditional DNS is all well and good for a simple case like we had above, but turns out it’s really broken once you start adding VPNs to the mix. Let’s consider two types of VPNs: a privacy VPN that is always enabled and which is the default route for all web traffic, and a corporate VPN that only receives traffic for internal company resources. (To switch between these two different types of VPN configuration, use the checkbox “Use this connection only for resources on its network” at the bottom of the IPv4 and IPv6 tabs of your VPN’s configuration in System Settings.)

Now, what happens if we connect to both VPNs? The VPN that you connect to first gets listed first in /etc/resolv.conf, followed by the VPN that you connect to second, followed by your local DNS server. Assuming the DNS servers are all working properly, that means:

  • If you connect to your privacy VPN first and your corporate VPN second, all DNS requests will be sent to your privacy VPN, and you won’t be able to visit internal corporate websites. (This scenario is exactly why I become interested in systemd-resolved. After joining Red Hat, I discovered that I couldn’t access various redhat.com websites if I connected to my VPNs in the wrong order.)
  • If you connect to your corporate VPN first and your privacy VPN second, then all your DNS goes to your corporate VPN, and none to your privacy VPN. As that defeats the point of using the privacy VPN, we can be confident it’s not what users expect to happen.
  • If you ever connect the VPNs in the opposite order — say, if your connection to one temporarily drops, and you need to reconnect — then you’ll get the opposite behavior. If you don’t notice this pattern behind the failures, it can make problems difficult to reproduce.

You don’t need two VPNs for this to be a problem, of course. Let’s say you have no privacy VPN, only a corporate VPN.  Well, your employer may fire you if it notices DNS requests it doesn’t like. If you’re making 30 requests per hour to facebook.com, youtube.com, or more salacious websites, that sure looks like you’re not doing very much work. It’s really never in the employee’s best interests to send more DNS than necessary to an employer.

If you use only a privacy VPN, the failure case is arguably even more severe. Let’s say your privacy VPN’s DNS server temporarily goes offline. Then, because /etc/resolv.conf is a list, glibc will fall back to using your normal DNS, probably either your ISP’s DNS server, or your router that forwards everything to your ISP. And now your DNS query has gone to your ISP. If you’re making the wrong sort of DNS requests in the wrong sort of countries — say, if you’re visiting websites opposed to your government — this could get you imprisoned or executed.

Finally, either type of VPN will break resolution of local domains, e.g. fritz.box, because only your router can resolve that properly, but you’re sending your DNS query to your VPN’s DNS server. So local resources will be broken for as long as you’re connected to a VPN.

All things considered, the status quo prior to systemd-resolved was pretty terrible. The need for something better should be clear. Now let’s look at how systemd-resolved fixes this.

Modern DNS with nss-resolve

First, let’s look at /etc/nsswitch.conf, which looks a bit different in Fedora 33:

hosts: files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] myhostname dns

nss-myhostname and nss-dns have switched places, but that’s just a minor change that ensures your local hostname is always local even if your DNS server thinks otherwise. (March 2021 Update: nss-myhostname has been moved before nss-mdns4_minimal for Fedora 34, so our new configuration is files myhostname mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] dns.)

The important change here is the addition of resolve [!UNAVAIL=return]. nss-resolve uses systemd-resolved to resolve hostnames, via either its varlink API (with systemd 247) or its D-Bus API (with older versions of systemd). If systemd-resolved is running, glibc will stop there, and refuse to continue on to nss-myhostname or nss-dns even if nss-resolve doesn’t return a result, since both nss-myhostname and nss-dns are obsoleted by nss-resolve. But if systemd-resolved is not running, then it continues on (and, if resolving something other than the local hostname, will up using nss-dns and reading /etc/resolv.conf, as before).

Importantly, when nss-resolve is used, glibc does not read /etc/resolv.conf when performing name resolution, so any configuration that you put there is totally ignored. That means any script or program that writes to /etc/resolv.conf is probably broken. /etc/resolv.conf still exists, though: it’s managed by systemd-resolved to maintain compatibility with programs that manually read /etc/resolv.confand do their own name resolution, bypassing glibc. Although systemd-resolved supports several different modes for managing /etc/resolv.conf, the default mode, and the mode used in both Fedora and Ubuntu, is for /etc/resolv.conf to be a symlink to /run/systemd/resolve/stub-resolv.conf, which now looks like this:

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search redhat.com lan

The redhat.com search domain is coming from my corporate VPN, but the rest of this /etc/resolv.conf should look like yours. Notably, 127.0.0.53 is systemd-resolved’s local stub responder. This allows programs that manually read /etc/resolv.conf to continue to work without changes: they will just wind up talking to systemd-resolved on 127.0.0.53 rather than directly connecting to your real DNS server, as before.

A Word about Ubuntu

Although Ubuntu has used systemd-resolved for four years now, it has not switched from nss-dns to nss-resolve, contrary to upstream recommendations. This means that on Ubuntu, glibc still reads /etc/resolv.conf, finds 127.0.0.53 listed there, and then makes an IP connection to systemd-resolved rather than talking to it via varlink or D-Bus, as occurs on Fedora. The practical effect is that, on Ubuntu, you can still manually edit /etc/resolv.conf and applications will respond to those changes, unlike Fedora. Of course, that would be a disaster, since it would cause all of your DNS configuration in systemd-resolved to be completely ignored. But it’s still possible on Ubuntu. On Fedora, that won’t work at all.

If you’re using custom VPN software that doesn’t work with systemd-resolved, chances are it probably tries to write to /etc/resolv.conf.

Split DNS with systemd-resolved

OK, so now we’ve looked at how /etc/nsswitch.conf and /etc/resolve.conf have changed, but we haven’t actually explained how split DNS is configured. Instead of sending all your DNS requests to the first server listed in /etc/resolv.conf, systemd-resolved is able to split your DNS on the basis of DNS routing domains.

IP Routing Domains, DNS Routing Domains, and DNS Search Domains: Oh My!

systemd-resolved works with DNS routing domains and DNS search domains. A DNS routing domain determines only which DNS server your DNS query goes to.  It doesn’t determine where IP traffic goes to: that would be an IP routing domain. Normally, when people talk about “routing domains,” they probably mean IP routing domains, not DNS routing domains, so be careful not to confuse these two concepts. For the rest of this article, I will use “routing domain” or “DNS domain” to mean DNS routing domain.

A DNS search domain is also different. When you query a name that is only a single label — a domain without any dots — a search domain gets appended to your query. For example, because I’m currently connected to my Red Hat VPN, I have a search domain configured for redhat.com. This means that if I make a query to a domain that is only a single label, redhat.com will be appended to the query. For example, I can query bugzilla and this will be treated as a query for bugzilla.redhat.com. This probably won’t work in your web browser, because web browsers like to convert single-label domains into web searches, but it does work at the DNS level.

In systemd-resolved, each DNS routing domain may or may not be used as a search domain. By default, systemd-resolved will add search domains for every configured routing domain that is not prefixed by a tilde. For example, ~example.com is a routing domain only, while example.com is both a routing domain and a search domain. There is also a global routing domain,  ~.

Example Split DNS Configurations

Let’s look at a complex example with three network interfaces:

$ resolvectl
Global
Protocols: LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (enp4s0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6 
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.1.1 
DNS Servers: 192.168.1.1 
DNS Domain: lan

Link 5 (tun0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6 
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.8.0.1 
DNS Servers: 10.8.0.1 
DNS Domain: ~.

Link 9 (tun1)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6 
Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.9.0.1 
DNS Servers: 10.9.0.1 10.9.0.2
DNS Domain: example.com

To simplify this example, I’ve removed several uninteresting network interfaces from the output above: my unused second Ethernet interface, my unused Wi-Fi interface wlp5s0, and two virtual network interfaces that I presume are used by libvirt. This means we only have three interfaces to consider: normal Ethernet enp4s0, the privacy VPN tun0, and the corporate VPN tun1. I’m currently running NetworkManager 1.26.4, so I have also fudged the output a bit to make it look like it would if I were using NetworkManager 1.26.6 — I’ll discuss the difference below — so that this example will be good for the future. Let’s look at a few points of note:

  • enp4s0 is configured with +DefaultRoute and no routing domains.
  • tun0 is configured with +DefaultRoute and a global routing domain, ~.
  • tun1 is configured with -DefaultRoute and a routing domain for example.com. (It also has a search domain for example.com, because it doesn’t start with a tilde.)

systemd-resolved first decides which network interface is most appropriate for your DNS query based on the domain name you are querying, then sends your query to the DNS server associated with that interface. In this case, queries for example.com, foo.example.com, etc. will be sent to 10.9.0.1, since that is the DNS server configured for tun1, which is associated with the domain example.com. All other requests go to 10.8.0.1, since tun0 has the global domain ~. Nothing ever goes to 192.168.1.1, because a privacy VPN is enabled, and that would be a privacy disaster. Very simple, right?

If you do not use a privacy VPN, you will not have any ~. domain configured. In this case, your query will go to all interfaces that have +DefaultRoute. For example, if tun0 were removed from the above configuration, then queries not for example.com would be sent to 192.168.1.1, my router, which is good because tun1 is my corporate VPN and should only receive DNS queries corresponding to its own DNS domains.

Enter NetworkManager

How does systemd-resolved come up with the above configuration? It doesn’t. Everything I wrote in the previous section assumes that you are using NetworkManager, because systemd-resolved doesn’t actually make any decisions about where to send your DNS. That is all the responsibility of higher-level network management software, typically NetworkManager. If you use custom VPN software — anything that’s not a NetworkManager VPN plugin — then that software is also responsible for configuring systemd-resolved and playing nice with NetworkManager.

NetworkManager normally does a very good job of configuring systemd-resolved to work as you would expect, so most users should not need to make any changes. But if your DNS isn’t working as you expect, and you run resolvectl and find that systemd-resolved’s configuration is not what you want, do not report a bug against systemd-resolved! Report a bug against NetworkManager instead (if you’re confident there is a real bug).

If you don’t use NetworkManager, you can still make systemd-resolved do what you want, but you’re on your own. It will not configure itself for you.

NetworkManager 1.26.6

If you’re reading this in December 2020, you’re probably using NetworkManager 1.26.4 or earlier. Things are slightly different here, because NetworkManager recently landed a major behavior change. Previously, NetworkManager would always configure a ~. domain for exactly one network interface. This means that the value of systemd-resolved’s DefaultRoute settings was always ignored, since ~. takes precedence. Accordingly, NetworkManager did not bother to configure DefaultRoute at all. I told you that I fudged the output of the example above a little. In actuality, NetworkManager 1.26.4 has configured +DefaultRoute on my tun1 corporate VPN. That doesn’t make sense, because it should only receive DNS for example.com, but it previously did not matter, because there was previously always a ~. domain on some interface. If you’re not using any VPNs, then your Ethernet or Wi-Fi interface would receive the ~. domain. But since 1.26.6, NetworkManager now only ever configures a ~. domain when you are using a privacy VPN, so the DefaultRoute setting now matters.

Prior to NetworkManager 1.26.6, you could rely on resolvectl domain alone to see where your DNS goes, because there was always a ~. domain. Since NetworkManager 1.26.6 no longer always creates a ~. domain, that no longer works. You’ll need to use look at the full output of resolvectl instead, since that will show you the DefaultRoute settings, which are now important.

My Corporate VPN is Missing a Routing Domain, What Should I Do?

Say your corporate VPN is example.com. You want all requests for example.com to be resolved by the VPN, and they are, because NetworkManager creates an appropriate routing domain for it. But you also want requests for some other domain, say example.org, to be resolved by the VPN as well. What do you do?

Most VPN protocols allow the VPN to tell NetworkManager which domains should be resolved by the VPN. Others allow specifying this in the connection profile that you import into NetworkManager. Sadly, not all VPNs actually do this properly, since it doesn’t matter for traditional non-split DNS. Worse, there is no graphical configuration in GNOME System Settings to fix this. There really should be. But for now, you’ll have to use nmcli to set the ipv4.dns-search and ipv6.dns-search properties of your VPN connection profile. Confusingly, even though that setting says “search,” it also creates a routing domain. Hopefully you never have to mess with this. If you do this, consider contacting your IT department to ask them to fix your VPN configuration to properly declare its DNS routing domains, so you don’t have to fix it manually. (This actually sometimes works!) You might have to do this more than once, if you discover additional domains that need to be resolved by the corporate VPN.

Custom VPN Software

By “custom VPN software,” I mean any VPN that is not a NetworkManager plugin. That includes proprietary VPN applications offered by VPN services, and also packaged software like openvpn or wg-quick, when invoked by something other than NetworkManager.

If your custom VPN software is broken, you could report a bug against your VPN software to ask for support for systemd-resolved, but it’s really best to ditch your custom software and configure your VPN using NetworkManager instead, if possible. There are really only two good reasons to use custom VPN software: if NetworkManager doesn’t have a plugin appropriate for your corporate VPN, or if you need to use Wireguard and your desktop doesn’t support Wireguard yet. (NetworkManager itself supports Wireguard, but GNOME does not yet, because Wireguard is special and not treated the same as other VPNs. Help welcome.)

If you use NetworkManager to configure your VPN, as desktop developers intend for you to do, then NetworkManager will take care of configuring systemd-resolved appropriately. Fedora ships with several NetworkManager VPN plugins installed by default, so the vast majority of VPN users should be able to configure your VPN directly in System Settings. This also allows you to control your VPN using your desktop environment’s VPN integration, rather than using the command line or a custom proprietary application.

OpenVPN users will want to look into using the unofficial update-systemd-resolved script. However, NetworkManager has good support for OpenVPN, and this is totally unnecessary if you configure your VPN with NetworkManager. So it’s probably better to use NetworkManager instead.

Now, what if you maintain custom VPN software and want it to work properly with systemd-resolved, or what if you can’t use NetworkManager for whatever reason? First, stop trying to write to /etc/resolv.conf, at least if it’s managed by systemd-resolved. You’ll instead want to use the systemd-resolved D-Bus API to configure an appropriate routing domain for your VPN interface. Read this documentation. You could also shell out to resolvectl, but it’s probably better to use the D-Bus API unless your VPN is managed by a shell script. Privacy VPNs (or corporate VPNs that wish to eschew split DNS and hijack all the user’s DNS) can also use the resolvconf compatibility script, but note this will only work properly with NetworkManager 1.26.6 and newer, because the best you can do with it is add a global routing domain to a network interface, but that’s not going to work as expected if another network interface already has a global routing domain. Did I mention that you might want to use the D-Bus API instead? With the D-Bus API, you can remove the global routing domain from any other network interfaces, to ensure only your VPN’s interface gets a global routing domain.

Split DNS Without systemd-resolved

Quick tangent: systemd-resolved is not the only software available that implements split DNS. Previously, the most popular solution for this was to use dnsmasq. This has always been available in Fedora, but you had to go out of your way to install and configure it, so almost nobody did. Other custom solutions were possible too — I know one developer who runs Unbound locally — but systemd-resolved and dnsmasq are the only options supported by NetworkManager.

One significant difference between systemd-resolved and dnsmasq is that systemd-resolved, as a system daemon, allows for multiple sources of configuration. In contrast, NetworkManager runs dnsmasq as a subprocess, so only NetworkManager itself is allowed to configure dnsmasq. For most users, this distinction will not matter, but it’s important for custom VPN software.

Servers and DNSSEC

You might have noticed that the rest of this blog post focused pretty much exclusively on desktop use cases. Your server is probably not using a VPN. It’s probably not using mDNS. It’s probably not expected to be able to resolve local hostnames. Conclusion: most servers don’t need split DNS! Servers do benefit from systemd-resolved’s systemwide DNS cache, so running systemd-resolved on servers is still a good idea. But it’s not nearly as important for servers as it is for desktops.

There are some disadvantages for servers as well. First, systemd-resolved is not intended to be used on DNS servers. If you’re running a DNS server, you’ll need to disable systemd-resolved before setting up BIND or Unbound instead. That is one extra step to get your DNS server working relative to before, so enabling systemd-resolved by default is an inconvenience here, but that’s hardly difficult to do, so not a big deal.

However, systemd-resolved currently has several bugs in how it handles DNSSEC, and this is potentially a big deal if you depend on that. If you’re a desktop user, you’ll probably never notice, because DNSSEC on desktops is a total failure. Due to widespread and unfixable compatibility issues, it’s very unlikely that we would be able to enable DNSSEC validation by default in the next 10-15 years. If you have a desktop computer that never leaves your home and a good ISP, or a server sitting in a data center, then you can probably safely turn it on manually in /etc/systemd/resolved.conf, but this is highly inadvisable for laptops. So DNSSEC is currently useful for securing DNS between DNS servers, but not for securing DNS between you devices and your DNS server.  (For that, we plan to use DNS over TLS instead.) And we’ve already established that DNS servers should not use systemd-resolved. So what’s the problem?

Well, it turns out DNS servers are not the only server software that expects DNSSEC to work properly. In particular, broken DNSSEC can result in broken mail servers. Other stuff might break too. If you’re running a server that needs functional DNSSEC, you’re going to need to disable systemd-resolved for now. These problems with DNSSEC resulted in some extremely vocal opposition to the Fedora 33 systemd-resolved change proposal, which unfortunately we didn’t properly appreciate until too late in the Fedora 33 development cycle. The good news is that these problems are being treated as bugs to be fixed. In particular, I am keeping an eye on this bug and this bug. Development is currently very active, so I’m hopeful that systemd-resolved’s DNSSEC support will look much better in time for Fedora 34.

Tell Me More!

Wow, you made it to the end of a long blog post, and you still want to know more? Next step is to read my colleague Zbigniew’s Fedora Magazine article, which describes some of the concepts I’ve already mentioned in greater detail. (However, when reading that article, be aware of the NetworkManager 1.26.6 changes I mentioned above. The article predates NetworkManager 1.26.6, so you will see in the examples that a ~. global routing domain is assigned to non-VPN interfaces. That will no longer happen.)

Conclusion

Split DNS is designed to just work, like the rest of the modern Linux desktop, and it should for everyone not using custom VPN software. If you do run into trouble with custom VPN software, the bottom line is to try using a NetworkManager VPN plugin instead, if possible. In the short term, you will also need to disable systemd-resolved if you depend on DNSSEC, but hopefully that won’t be necessary for much longer. Everyone else should hopefully never notice that systemd-resolved is there.

Happy resolving!

Accurate Conclusions from Bogus Data: Methodological Issues in “Collaboration in the open-source arena: The WebKit case”

Nearly five years ago, when I was in grad school, I stumbled across the paper Collaboration in the open-source arena: The WebKit case when trying to figure out what I would do for a course project in network theory (i.e. graph theory, not computer networking; I’ll use the words “graph” and “network” interchangeably). The paper evaluates collaboration networks, which are graphs where collaborators are represented by nodes and relationships between collaborators are represented by edges. Our professor had used collaboration networks as examples during lecture, so it seemed at least mildly relevant to our class, and I wound up writing a critique on this paper for the class project. In this paper, the authors construct collaboration networks for WebKit by examining the project’s changelog files to define relationships between developers. They perform “community detection” to visually group developers who work closely together into separate clusters in the graphs. Then, the authors use those graphs to arrive at various conclusions about WebKit (e.g. “[e]ven if Samsung and Apple are involved in expensive patent wars in the courts and stopped collaborating on hardware components, their contributions remained strong and central within the WebKit open source project,” regarding the period from 2008 to 2013).

At the time, I contacted the authors to let them know about some serious problems I found with their work. Then I left the paper sitting in a short-term to-do pile on my desk, where it has been sitting since Obama was president, waiting for me to finally write this blog post. Unfortunately, nearly five years later, the authors’ email addresses no longer work, which is not very surprising after so long — since I’m no longer a student, the email I originally used to contact them doesn’t work anymore either — so I was unable to contact them again to let them know that I was finally going to publish this blog post. Anyway, suffice to say that the conclusions of the paper were all correct; however, the networks used to arrive at those conclusions suffered from three different mistakes, each of which was, on its own, serious enough to invalidate the entire work.

So if the analysis of the networks was bogus, how did the authors arrive at correct conclusions anyway? The answer is confirmation bias. The study was performed by visually looking at networks and then coming to non-rigorous conclusions about the networks, and by researching the WebKit community to learn what is going on with the major companies involved in the project. The authors arrived at correct conclusions because they did a good job at the latter, then saw what they wanted to see in the graphs.

I don’t want to be too harsh on the authors of this paper, though, because they decided to publish their raw data and methodology on the internet. They even published the python scripts they used to convert WebKit changelogs into collaboration graphs. Had they not done so, there is no way I would have noticed the third (and most important) mistake that I’ll discuss below, and I wouldn’t have been able to confirm my suspicions about the second mistake. You would not be reading this right now, and likely nobody would ever have realized the problems with the paper. The authors of most scientific papers are not nearly so transparent: many researchers today consider their source code and raw data to be either proprietary secrets to be guarded, or simply not important enough to merit publication. The authors of this paper deserve to be commended, not penalized, for their openness. Mistakes are normal in research papers, and open data is by far the best way for us to be able to detect mistakes when they happen.

Collaboration network
A collaboration network from the paper. The paper reports that this network represents collaboration between September 2008 (when Google began contributing to WebKit) and February 2011 (the departure of Nokia from the project). Because the authors posted their data online, I noticed that this was a mistake in the paper: the graph actually represents the period between February 2011 and July 2012. The paper’s analysis of this graph is therefore questionable, but note this was only a minor mistake compared to the three major mistakes that impact this network. Note the suspiciously-high number of unaffiliated (“Other”) contributors in a corporate-dominated project.

The rest of this blog post is a simplified version of my original school paper from 2016. I’ve removed maybe half the original content, including some flowery academic language and unnecessary references to class material (e.g. “community detection was performed using fast modularity maximization to generate an alternate visualization of the network,” good for high scores on class papers, not so good for blog posts). But rewriting everything to be informal takes a long time, and I want to finish this today so it’s not still on my desk five more years from now, so the rest of this blog post is still going to be much more formal than normal. Oh well. Tone shift now!

We (“we” means “I”) examine various methodological issues discovered by analyzing the paper. The first section discusses the effects on the collaboration network of choosing a poor definition of collaboration. The second section discusses a major source of error in detecting the company affiliation of many contributors. The third section describes a serious mistake in the data collection process. Each of these issues is quite severe, and any one alone calls into question the validity of the entire study. It must be noted that such issues are not necessarily unique to this paper, and must be kept in mind for all future studies that utilize collaboration networks.

Mistake #1: Poorly-defined Collaboration

The precise definition used to model collaboration has tremendous impact on the usefulness of the resultant collaboration network. Many collaboration networks are built using definitions of collaboration that are self-evidently useful, where there is little doubt that edges in the network represent real-world collaboration. The paper adopts an approach to building collaboration networks where developers are represented by nodes, and an edge exists between two nodes if the corresponding developers modified the same source code file during the time period under consideration for the construction of the network. However, it is not clear that this definition of collaboration is actually useful. Consider that it is a regular occurrence for developers who do not know each other and may have never communicated to modify the same files. Consider also that modifying a common file does not necessarily reflect any shared interest in a particular portion of the software project. For instance, a file might be modified when making an interface change in another file, or when fixing a build error occurring on a particular platform. Such occurrences are, in fact, extremely common in the WebKit project. Additionally, consider that there exist particular source code files that are unusually central to the project, and must be modified more frequently than other files. It is highly likely that almost all developers will at one point or another make some change in such a file, and therefore be connected via a collaboration edge to all other developers who have ever modified that file. (My original critique shows a screenshot of the revision history of WebPageProxy.cpp, to demonstrate that the developers modifying this file were working on unrelated projects.)

It is true, as assumed by the paper, that particular developers work on different portions of the WebKit source code, and collaborate more with particular other developers. For instance, developers who work for the same company typically, though not always, collaborate most with other developers from that same company. However, the paper’s naive definition of collaboration should ensure that most developers will be considered to have collaborated equally with most other developers, regardless of the actual degree of collaboration. For instance, consider developers A and B who regularly collaborate on a particular source file. Now, developer C, who works on a platform that does not use this file and would not ordinarily need to modify it, makes a change to some cross-platform interface in another file that requires updating this file. Developer C is now considered to have collaborated with developers A and B on this file! Clearly, this is not a desirable result, as developers A and B have collaborated far more on the development of the file. Moreover, consider that an edge exists between two developers in the collaboration network if they have ever both modified any file anywhere in WebKit during the time period under review; then we can expect to form a network that is almost complete (a “full” graph where edges exists between most nodes). It is evident that some method of weighting collaboration between different contributors would be desirable, as the unweighted collaboration network does not seem useful.

One might argue that the networks presented in the paper clearly show developers exist in subcommunities on the peripheries of the network, that the network is clearly not complete, and that therefore this definition of collaboration sufficed, at least to some extent. However, this is only due to another methodological error in the study. Mistake #3, discussed later, explains how the study managed to produce collaboration networks with noticeable subcommunities despite these issues.

We note that the authors chose this same definition of collaboration in their more recent work on OpenStack, so there exist multiple studies using this same flawed definition of collaboration. We speculate that this definition of collaboration is unlikely to be more suitable for OpenStack or for other software projects than it is for WebKit. The software engineering research community must explore alternative models of collaboration when undertaking future studies of software development collaboration networks in order to more accurately reflect collaboration.

Mistake #2: Misdetected Contributor Affiliation

One difficulty when building collaboration networks is the need to correctly match each contributor with the correct company affiliation. Although many free software projects are dominated by unaffiliated contributors, others, like WebKit, are primarily developed by paid contributors. Looking at the number of times a particular email domain appears in WebKit changelog entries made during 2015, most contributors commit using corporate emails, but many developers commit to WebKit using personal email accounts, such as Gmail accounts; additionally, many developers use generic webkit.org email aliases, which were previously available to active WebKit contributors. These developers may or may not be affiliated with companies that contribute to the project. Use of personal email addresses is a source of inaccuracy when constructing collaboration networks, as it results in an undercount of corporate contributions.  We can expect this issue has led to serious inaccuracies in the reported collaboration networks.

This substantial source of error is neither mentioned nor accounted for; all contributors using such email accounts were therefore miscategorized as unaffiliated. However, the authors clearly recognized this issue, as it has been accounted for in their more recent work covering OpenStack by cross-referencing email addresses from git revision history with a database containing corporate affiliations maintained by the OpenStack Foundation. Unfortunately, no such effort was made for the WebKit data set.

The WebKit project was previously dominated by contributors with chromium.org email domains. This domain is equivalent to webkit.org in that it can be used by contributors to the Chromium project regardless of corporate affiliation; however, most contributors with Chromium emails are actually Google employees. The high use of Chromium emails by Google employees appears to have led to a dramatic — by roughly an entire order of magnitude — undercount of Google’s contributors to the WebKit project, as only contributors with google.com emails were considered to be Google employees. The vast majority of Google employees used chromium.org emails, and so were counted as unaffiliated developers. This explains the extraordinarily high number of unaffiliated developers in the networks presented by the paper, despite the fact that WebKit development is, in reality, dominated by corporate contributors.

Mistake #3: Missing Most Changelog Data

The paper incorrectly claims to have gathered its data from both WebKit’s Subversion revision history and from its changelog files. We must draw a distinction between changelog entries and Subversion revision history. Changelog entries are inserted into changelog files that are committed into the Subversion repository; they are completely separate from the Subversion history. Each subproject within the WebKit project has its own set of changelog files used to record changes under the corresponding directory.

In fact, the paper processed only the changelog files. This was actually a good choice, as WebKit’s changelog files are much more accurate than the Subversion history, for two reasons. Firstly, it is easy for a contributor to change the email address entered into a changelog file, e.g. after a change in company affiliation. However, it is difficult to change the email address used to commit to Subversion, as this requires requesting a new Subversion account from the Subversion administrator; accordingly, contributors are more likely to use older email addresses, lacking accurate company affiliation, in Subversion revisions than in changelog files.  Secondly, many Subversion revisions are not directly committed by contributors, but rather are actually committed by the commit queue bot, which runs various tests before committing the revision. Subversion revisions are also, more rarely, committed by a completely different contributor than the patch author. In both cases, the proper contributor’s name will appear in only the changelog file, and not the Subversion data. Some developers are dramatically more likely to use the commit queue than others. Various other reviews of WebKit contribution history that examine data from Subversion history rather than from changelog files are flawed for this reason. Fortunately, by relying on changelog files rather than Subversion metadata, the authors avoid this problem.

Unfortunately, a serious error was made in processing the changelog data. WebKit has many different sets of changelog files, stored in various project subdirectories (JavaScriptCore, WebCore, WebKit, etc.), as well as toplevel changelogs stored in the root directory of the project. Regrettably, the authors were unaware of the changelogs in subdirectories, and based their analysis only on the toplevel changelogs, which contain only changes that occurred in subdirectories that lack their own changelog files. In practice, this inadvertently restricted the scope of the analysis to a very small minority of changes, primarily to build system files, manual tests, and the WebKit website. That is, the reported collaboration networks do not reflect collaboration on any actual source code files. All source code files are contained in subdirectories with their own changelog files, and therefore no source code files were actually considered in the analysis of collaboration on source code changes.

We speculate that the analysis’s focus on build system files likely exaggerates the effects of clustering in the network, as different companies used different build systems and thus were less likely to edit the build systems used by other companies, and that an analysis based on the correct data would display less of a clustering effect. Certainly, there would be dramatically more edges in the already-dense networks, because an edge exists between two developers if there exists any one file in WebKit that both developers have modified. Omitting all of the source code files from the analysis therefore dramatically reduces the likelihood of edges existing between nodes in the network.

Conclusion

We found that the original study was impacted by an unsuitable definition of collaboration used to build the collaboration networks, severe errors in counting contributor affiliation (including the classification of most Google employees as unaffiliated developers), and the omission of almost all the required data from the analysis, including all data on modifications to source code files. The authors constructed and studied essentially meaningless networks. Nevertheless, the authors were able to derive many accurate conclusions about the WebKit project from their inaccurate collaboration networks. Such conclusions illustrate the dangers of seeking to find particular meanings or explanations through visual inspection of collaboration networks. Researchers must work forwards from the collaboration networks to arrive at their conclusions, rather than backwards by attempting to match the networks to conclusions gained from prior knowledge.

Original Report

Wow, OK, you actually read this far? Since this blog post criticizes an academic paper, and since this blog post does not include various tables and examples that support my arguments, I’ve attached my original analysis in full. It is a boring, student-quality grad school project written with the objective of scoring the highest-possible grade in a class rather than for clarity, and you probably don’t want to look at it unless you are investigating the paper in detail. (If you download that, note that I no longer work for Igalia, and the paper was not authorized by Igalia either; I used my company email to disclose my affiliation and maybe impress my professor a little.) Phew, now I can finally remove this from my desk!

Epiphany 3.38 and WebKitGTK 2.30

It’s that time of year again: a new GNOME release, and with it, a new Epiphany. The pace of Epiphany development has increased significantly over the last few years thanks to an increase in the number of active contributors. Most notably, Jan-Michael Brummer has solved dozens of bugs and landed many new enhancements, Alexander Mikhaylenko has polished numerous rough edges throughout the browser, and Andrei Lisita has landed several significant improvements to various Epiphany dialogs. That doesn’t count the work that Igalia is doing to maintain WebKitGTK, the WPE graphics stack, and libsoup, all of which is essential to delivering quality Epiphany releases, nor the work of the GNOME localization teams to translate it to your native language. Even if Epiphany itself is only the topmost layer of this technology stack, having more developers working on Epiphany itself allows us to deliver increased polish throughout the user interface layer, and I’m pretty happy with the result. Let’s take a look at what’s new.

Intelligent Tracking Prevention

Intelligent Tracking Prevention (ITP) is the headline feature of this release. Safari has had ITP for several years now, so if you’re familiar with how ITP works to prevent cross-site tracking on macOS or iOS, then you already know what to expect here.  If you’re more familiar with Firefox’s Enhanced Tracking Protection, or Chrome’s nothing (crickets: chirp, chirp!), then WebKit’s ITP is a little different from what you’re used to. ITP relies on heuristics that apply the same to all domains, so there are no blocklists of naughty domains that should be targeted for content blocking like you see in Firefox. Instead, a set of innovative restrictions is applied globally to all web content, and a separate set of stricter restrictions is applied to domains classified as “prevalent” based on your browsing history. Domains are classified as prevalent if ITP decides the domain is capable of tracking your browsing across the web, or non-prevalent otherwise. (The public-friendly terminology for this is “Classification as Having Cross-Site Tracking Capabilities,” but that is a mouthful, so I’ll stick with “prevalent.” It makes sense: domains that are common across many websites can track you across many websites, and domains that are not common cannot.)

ITP is enabled by default in Epiphany 3.38, as it has been for several years now in Safari, because otherwise only a small minority of users would turn it on. ITP protections are designed to be effective without breaking too many websites, so it’s fairly safe to enable by default. (You may encounter a few broken websites that have not been updated to use the Storage Access API to store third-party cookies. If so, you can choose to turn off ITP in the preferences dialog.)

For a detailed discussion covering ITP’s tracking mitigations, see Tracking Prevention in WebKit. I’m not an expert myself, but the short version is this: full third-party cookie blocking across all websites (to store a third-party cookie, websites must use the Storage Access API to prompt the user for permission); cookie-blocking latch mode (“once a request is blocked from using cookies, all redirects of that request are also blocked from using cookies”); downgraded third-party referrers (“all third-party referrers are downgraded to their origins by default”) to avoid exposing the path component of the URL in the referrer; blocked third-party HSTS (“HSTS […] can only be set by the first-party website […]”) to stop abuse by tracker scripts; detection of cross-site tracking via link decoration and 24-hour expiration time for all cookies created by JavaScript on the landing page when detected; a 7-day expiration time for all other cookies created by JavaScript (yes, this applies to first-party cookies); and a 7-day extendable lifetime for all other script-writable storage, extended whenever the user interacts with the website (necessary because tracking companies began using first-party scripts to evade the above restrictions). Additionally, for prevalent domains only, domains engaging in bounce tracking may have cookies forced to SameSite=strict, and Verified Partitioned Cache is enabled (cached resources are re-downloaded after seven days and deleted if they fail certain privacy tests). Whew!

WebKit has many additional privacy protections not tied to the ITP setting and therefore not discussed here — did you know that cached resources are partioned based on the first-party domain? — and there’s more that’s not very well documented which I don’t understand and haven’t mentioned (tracker collusion!), but that should give you the general idea of how sophisticated this is relative to, say, Chrome (chirp!). Thanks to John Wilander from Apple for his work developing and maintaining ITP, and to Carlos Garcia for getting it working on Linux. If you’re interested in the full history of how ITP has evolved over the years to respond to the changing threat landscape (e.g. tracking prevention tracking), see John’s WebKit blog posts. You might also be interested in WebKit’s Tracking Prevention Policy, which I believe is the strictest anti-tracking stance of any major web engine. TL;DR: “we treat circumvention of shipping anti-tracking measures with the same seriousness as exploitation of security vulnerabilities. If a party attempts to circumvent our tracking prevention methods, we may add additional restrictions without prior notice.” No exceptions.

Updated Website Data Preferences

As part of the work on ITP, you’ll notice that Epiphany’s cookie storage preferences have changed a bit. Since ITP enforces full third-party cookie blocking, it no longer makes sense to have a separate cookie storage preference for that, so I replaced the old tri-state cookie storage setting (always accept cookies, block third-party cookies, block all cookies) with two switches: one to toggle ITP, and one to toggle all website data storage.

Previously, it was only possible to block cookies, but this new setting will additionally block localStorage and IndexedDB, web features that allow websites to store arbitrary data in your browser, similar to cookies. It doesn’t really make much sense to block cookies but allow other types of data storage, so the new preferences should better enforce the user’s intent behind disabling cookies. (This preference does not yet block media keys, service workers, or legacy offline web application cache, but it probably should.) I don’t really recommend disabling website data storage, since it will cause compatibility issues on many websites, but this option is there for those who want it. Disabling ITP is also not something I want to recommend, but it might be necessary to access certain broken websites that have not yet been updated to use the Storage Access API.

Accordingly, Andrei has removed the old cookies dialog and moved cookie management into the Clear Personal Data dialog, which is a better place because anyone clearing cookies for a particular website is likely to also want to clear other personal data. (If you want to delete a website’s cookies, then you probably don’t want to leave its SQL databases intact, right?) He had to remove the ability to clear data from a particular point in time, because WebKit doesn’t support this operation for cookies, but that function is probably  rarely-used and I think the benefit of the change should outweigh the cost. (We could bring it back in the future if somebody wants to try implementing that feature in WebKit, but I suspect not many users will notice.) Treating cookies as separate and different from other forms of website data storage no longer makes sense in 2020, and it’s good to have finally moved on from that antiquated practice.

New HTML Theme

Carlos Garcia has added a new Adwaita-based HTML theme to WebKitGTK 2.30, and removed support for rendering HTML elements using the GTK theme (except for scrollbars). Trying to use the GTK theme to render web content was fragile and caused many web compatibility problems that nobody ever managed to solve. The GTK developers were never very fond of us doing this in the first place, and the foreign drawing API required to do so has been removed from GTK 4, so this was also good preparation for getting WebKitGTK ready for GTK 4. Carlos’s new theme is similar to Adwaita, but gradients have been toned down or removed in order to give a flatter, neutral look that should blend in nicely with all pages while still feeling modern.

This should be a fairly minor style change for Adwaita users, but a very large change for anyone using custom themes. I don’t expect everyone will be happy, but please trust that this will at least result in better web compatibility and fewer tricky theme-related bug reports.

Screenshot demonstrating new HTML theme vs. GTK theme
Left: Adwaita GTK theme controls rendered by WebKitGTK 2.28. Right: hardcoded Adwaita-based HTML theme with toned down gradients.

Although scrollbars will still use the GTK theme as of WebKitGTK 2.30, that will no longer be possible to do in GTK 4, so themed scrollbars are almost certain to be removed in the future. That will be a noticeable disappointment in every app that uses WebKitGTK, but I don’t see any likely solution to this.

Media Permissions

Jan-Michael added new API in WebKitGTK 2.30 to allow muting individual browser tabs, and hooked it up in Epiphany. This is good when you want to silence just one annoying tab without silencing everything.

Meanwhile, Charlie Turner added WebKitGTK API for managing autoplay policies. Videos with sound are now blocked from autoplaying by default, while videos with no sound are still allowed. Charlie hooked this up to Epiphany’s existing permission manager popover, so you can change the behavior for websites you care about without affecting other websites.

Screenshot displaying new media autoplay permission settings
Configure your preferred media autoplay policy for a website near you today!

Improved Dialogs

In addition to his work on the Clear Data dialog, Andrei has also implemented many improvements and squashed bugs throughout each view of the preferences dialog, the passwords dialog, and the history dialog, and refactored the code to be much more maintainable. Head over to his blog to learn more about his accomplishments. (Thanks to Google for sponsoring Andrei’s work via Google Summer of Code, and to Alexander for help mentoring.)

Additionally, Adrien Plazas has ported the preferences dialog to use HdyPreferencesWindow, bringing a pretty major design change to the view switcher:

Screenshot showing changes to the preferences dialog
Left: Epiphany 3.36 preferences dialog. Right: Epiphany 3.38. Note the download settings are present in the left screenshot but missing from the right screenshot because the right window is using flatpak, and the download settings are unavailable in flatpak.

User Scripts

User scripts (like Greasemonkey) allow you to run custom JavaScript on websites. WebKit has long offered user script functionality alongside user CSS, but previous versions of Epiphany only exposed user CSS. Jan-Michael has added the ability to configure a user script as well. To enable, visit the Appearance tab in the preferences dialog (a somewhat odd place, but it really needs to be located next to user CSS due to the tight relationship there). Besides allowing you to do, well, basically anything, this also significantly enhances the usability of user CSS, since now you can apply certain styles only to particular websites. The UI is a little primitive — your script (like your CSS) has to be one file that will be run on every website, so don’t try to design a complex codebase using your user script — but you can use conditional statements to limit execution to specific websites as you please, so it should work fairly well for anyone who has need of it. I fully expect 99.9% of users will never touch user scripts or user styles, but it’s nice for power users to have these features available if needed.

HTTP Authentication Password Storage

Jan-Michael and Carlos Garcia have worked to ensure HTTP authentication passwords are now stored in Epiphany’s password manager rather than by WebKit, so they can now be viewed and deleted from Epiphany, which required some new WebKitGTK API to do properly. Unfortunately, WebKitGTK saves network passwords using the default network secret schema, meaning its passwords (saved by older versions of Epiphany) are all leaked: we have no way to know which application owns those passwords, so we don’t have any way to know which passwords were stored by WebKit and which can be safely managed by Epiphany going forward. Accordingly, all previously-stored HTTP authentication passwords are no longer accessible; you’ll have to use seahorse to look them up manually if you need to recover them. HTTP authentication is not very commonly-used nowadays except for internal corporate domains, so hopefully this one-time migration snafu will not be a major inconvenience to most users.

New Tab Animation

Jan-Michael has added a new animation when you open a new tab. If the newly-created tab is not visible in the tab bar, then the right arrow will flash to indicate success, letting you know that you actually managed to open the page. Opening tabs out of view happens too often currently, but at least it’s a nice improvement over not knowing whether you actually managed to open the tab or not. This will be improved further next year, because Alexander is working on a completely new tab widget to replace GtkNotebook.

New View Source Theme

Jim Mason changed view source mode to use a highlight.js theme designed to mimic Firefox’s syntax highlighting, and added dark mode support.

Image showing dark mode support in view source mode
Embrace the dark.

And More…

  • WebKitGTK 2.30 now supports video formats in image elements, thanks to Philippe Normand. You’ll notice that short GIF-style videos will now work on several major websites where they previously didn’t.
  • I added a new WebKitGTK 2.30 API to expose the paste as plaintext editor command, which was previously internal but fully-functional. I’ve hooked it up in Epiphany’s context menu as “Paste Text Only.” This is nice when you want to discard markup when pasting into a rich text editor (such as the WordPress editor I’m using to write this post).
  • Jan-Michael has implemented support for reordering pinned tabs. You can now drag to reorder pinned tabs any way you please, subject to the constraint that all pinned tabs stay left of all unpinned tabs.
  • Jan-Michael added a new import/export menu, and the bookmarks import/export features have moved there. He also added a new feature to import passwords from Chrome. Meanwhile, ignapk added support for importing bookmarks from HTML (compatible with Firefox).
  • Jan-Michael added a new preference to web apps to allow running them in the background. When enabled, closing the window will only hide the the window: everything will continue running. This is useful for mail apps, music players, and similar applications.
  • Continuing Jan-Michael’s list of accomplishments, he removed Epiphany’s previous hidden setting to set a mobile user agent header after discovering that it did not work properly, and replaced it by adding support in WebKitGTK 2.30 for automatically setting a mobile user agent header depending on the chassis type detected by logind. This results in a major user experience improvement when using Epiphany as a mobile browser. Beware: this functionality currently does not work in flatpak because it requires the creation of a new desktop portal.
  • Stephan Verbücheln has landed multiple fixes to improve display of favicons on hidpi displays.
  • Zach Harbort fixed a rounding error that caused the zoom level to display oddly when changing zoom levels.
  • Vanadiae landed some improvements to the search engine configuration dialog (with more to come) and helped investigate a crash that occurs when using the “Set as Wallpaper” function under Flatpak. The crash is pretty tricky, so we wound up disabling that function under Flatpak for now. He also updated screenshots throughout the  user help.
  • Sabri Ünal continued his effort to document and standardize keyboard shortcuts throughout GNOME, adding a few missing shortcuts to the keyboard shortcuts dialog.

Epiphany 3.38 will be the final Epiphany 3 release, concluding a decade of releases that start with 3. We will match GNOME in following a new version scheme going forward, dropping the leading 3 and the confusing even/odd versioning. Onward to Epiphany 40!

Disrupted CVE Assignment Process

Due to an invalid TLS certificate on MITRE’s CVE request form, I have — ironically — been unable to request a new CVE for a TLS certificate verification vulnerability for a couple weeks now. (Note: this vulnerability does not affect WebKit and I’m only aware of one vulnerable application, so impact is limited; follow the link if you’re curious.) MITRE, if you’re reading my blog, your website’s contact form promises a two-day response, but it’s been almost three weeks now, still waiting.

Update May 29: I received a response today stating my request has been forwarded to MITRE’s IT department, and less than an hour later the issue is now fixed. I guess that’s score +1 for blog posts. Thanks for fixing this, MITRE.

Browser security warning on MITRE's CVE request form

Of course, the issue is exactly the same as it was five years ago, the server is misconfigured to send only the final server certificate with no chain of trust, guaranteeing failure in Epiphany or with command-line tools. But the site does work in Chrome, and sometimes works in Firefox… what’s going on? Again, same old story. Firefox is accepting incomplete certificate chains based on which websites you’ve visited in the past, so you might be able to get to the CVE request form or not depending on which websites you’ve previously visited in Firefox, but a fresh profile won’t work. Chrome has started downloading the missing intermediate certificate automatically from the issuer, which Firefox refuses to implement for fear of allowing the certificate authority to track which websites you’re visiting. Eventually, we’ll hopefully have this feature in GnuTLS, because Firefox-style nondeterministic certificate verification is nuts and we have to do one or the other to be web-compatible, but for now that is not supported and we reject the certificate. (I fear I may have delayed others from implementing the GnuTLS support by promising to implement it myself and then never delivering… sorry.)

We could have a debate on TLS certificate verification and the various benefits or costs of the Firefox vs. Chrome approach, but in the end it’s an obvious misconfiguration and there will be no further CVE requests from me until it’s fixed. (Update May 29: the issue is now fixed. :) No, I’m not bypassing the browser security warning, even though I know exactly what’s wrong. We can’t expect users to take these seriously if we skip them ourselves.