Creating Quality Backtraces for Crash Reports

Hello Linux users! Help developers help you: include a quality backtraces taken with gdb each and every time you create an issue report for a crash. If you don’t, most developers will request that you provide a backtrace, then ignore your issue until you manage to figure out how to do so. Save us the trouble and just provide the backtrace with your initial report, so everything goes smoother. (Backtraces are often called “stack traces.” They are the same thing.)

Don’t just copy the lower-quality backtrace you see in your system journal into your issue report. That’s a lot better than nothing, but if you really want the crash to be fixed, you should provide the developers with a higher-quality backtrace from gdb. Don’t know how to get a quality backtrace with gdb? Read on.

Future of Crash Reporting

Here are instructions for getting a quality backtrace for a crashing process on Fedora 35, which is scheduled to be released in October:

$ coredumpctl gdb
(gdb) bt full

Press ‘c’ (continue) when required. When it’s done printing, press ‘q’ to quit. That’s it! That’s all you need to know. You’re done. Two points of note:

  • When a process crashes, a core dump is caught by systemd-coredump and stored for future use. The coredumpctl gdb command opens the most recent core dump in gdb. systemd-coredump has been enabled by default in Fedora since Fedora 26, which was released four years ago. (It’s also enabled by default in RHEL 8.)
  • After opening the core dump, gdb uses debuginfod to automatically download all required debuginfo packages, ensuring the generated backtrace is useful. debuginfod is a new feature in Fedora 35.

If you’re not an inhabitant of the future, you are probably missing at least debuginfod today, and possibly also systemd-coredump depending on which operating system you are using, so we will also learn how to take a backtrace without these tools. It will be more complicated, of course.

systemd-coredump

If your operating system enables systemd-coredump by default, then congratulations! This makes reporting crashes much easier because you can easily retrieve a core dump for any recent crash using the coredumpctl command. For example, coredumpctl alone will list all available core dumps. coredumpctl gdb will open the core dump of the most recent crash in gdb. coredumpctl gdb 1234 will open the core dump corresponding to the most recent crash of a process with pid 1234. It doesn’t get easier than this.

Core dumps are stored under /var/lib/systemd/coredump. systemd-coredump will automatically delete core dumps that exceed configurable size limits (2 GB by default). It also deletes core dumps if your free disk space falls below a configurable threshold (15% free by default). Additionally, systemd-tmpfiles will delete core dumps automatically after some time has passed (three days by default). This ensures your disk doesn’t fill up with old core dumps. Although most of these settings seem good to me, the default 2 GB size limit is way too low in my opinion, as it causes systemd to immediately discard crashes of any application that uses WebKit. I recommend raising this limit to 20 GB by creating an /etc/systemd/systemd.conf.d/50-coredump.conf drop-in containing the following:

[Coredump]
ProcessSizeMax=20G
ExternalSizeMax=20G

The other settings are likely sufficient to prevent your disk from filling up with core dumps.

Sadly, although systemd-coredump has been around for a good while now and many Linux operating systems have it enabled by default, many still do not. Most notably, the Debian ecosystem is still not yet on board. To check if systemd-coredump is enabled on your system:

$ cat /proc/sys/kernel/core_pattern

If you see systemd-coredump, then you’re good.

To enable it in Debian or Ubuntu, just install it:

# apt install systemd-coredump

Ubuntu users, note this will cause apport to be uninstalled, since it is currently incompatible. Also note that I switched from $ (which indicates a normal prompt) to # (which indicates a root prompt).

In other operating systems, you may have to manually enable it:

# echo "kernel.core_pattern=|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" > /etc/sysctl.d/50-coredump.conf
# /usr/lib/systemd/systemd-sysctl --prefix kernel.core_pattern

Note the exact core pattern to use changes occasionally in newer versions of systemd, so these instructions may not work everywhere.

Detour: Manual Core Dump Handling

If you don’t want to enable systemd-coredump, life is harder and you should probably reconsider, but it’s still possible to debug most crashes. First, enable core dump creation by removing the default 0-byte size limit on core files:

$ ulimit -c unlimited

This change is temporary and only affects the current instance of your shell. For example, if you open a new tab in your terminal, you will need to set the ulimit again in the new tab.

Next, run your program in the terminal and try to make it crash. A core file will be generated in the current directory. Open it by starting the program that crashed in gdb and passing the filename of the core file that was created. For example:

$ gdb gnome-chess ./core

This is downright primitive, though:

  • You’re going to have a hard time getting backtraces for services that are crashing, for starters. If starting the service normally, how do you set the ulimit? I’m sure there’s a way to do it, but I don’t know how! It’s probably easier to start the service manually, but then what command line flags are needed to properly do so? It will be different for each service, and you have to figure this all out for yourself.
  • Special situations become very difficult. For example, if a service is crashing only when run early during boot, or only during an initial setup session, you are going to have an especially hard time.
  • If you don’t know how to reproduce a crash that occurs only rarely, it’s inevitably going to crash when you’re not prepared to manually catch the core dump. Sadly, not all crashes will occur on demand when you happen to be running the software from a terminal with the right ulimit configured.
  • Lastly, you have to remember to delete that core file when you’re done, because otherwise it will take up space on your disk space until you do. You’ll probably notice if you leave core files scattered in ~/home, but you might not notice if you’re working someplace else.

Seriously, just enable systemd-coredump. It solves all of these problems and guarantees you will always have easy access to a core dump when something crashes, even for crashes that occur only rarely.

Debuginfo Installation

Now that we know how to open a core dump in gdb, let’s talk about debuginfo. When you don’t have the right debuginfo packages installed, the backtrace generated by gdb will be low-quality. Almost all Linux software developers deal with low-quality backtraces on a regular basis, because most users are not very good at installing debuginfo. Again, if you’re reading this in the future using Fedora 35 or newer, you don’t have to worry about this anymore because debuginfod will take care of everything for you. I would be thrilled if other Linux operating systems would quickly adopt debuginfod so we can put the era of low-quality crash reports behind us. But since most readers of this blog today will not yet have debuginfod enabled, let’s learn how to install debuginfo manually.

As an example, I decided to force gnome-chess to crash using the command killall -SEGV gnome-chess, then I ran coredumpctl gdb to open the resulting core dump in gdb. After a bunch of spam, I saw this:

Missing separate debuginfos, use: dnf debuginfo-install gnome-chess-40.1-1.fc34.x86_64
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/bin/gnome-chess --gapplication-service'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
29  return SYSCALL_CANCEL (poll, fds, nfds, timeout);
[Current thread is 1 (Thread 0x7fa23ca0cd00 (LWP 140177))]
(gdb)

If you are using Fedora, RHEL, or related operating systems, the line “missing separate debuginfos” is a good hint that debuginfo is missing. It even tells you exactly which dnf debuginfo-install command to run to remedy this problem! But this is a Fedora ecosystem feature, and you won’t see this on most other operating systems. Usually, you’ll need to manually locate the right debuginfo packages to install. Debian and Ubuntu users can do this by searching for and installing -dbg or -dbgsym packages until each frame in the backtrace looks good. You’ll just have to manually guess the names of which debuginfo packages you need to install based on the names of the libraries in the backtrace. Look here for instructions for popular operating systems.

How do you know when the backtrace looks good? When each frame has file names, line numbers, function parameters, and local variables! Here is an example of a bad backtrace, if I continue the gnome-chess example above without properly installing the required debuginfo:

(gdb) bt full
#0 0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1 0x00007fa23eee648c in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
#2 0x00007fa23ee8fc03 in g_main_context_iteration () at /lib64/libglib-2.0.so.0
#3 0x00007fa23e4b599d in g_application_run () at /lib64/libgio-2.0.so.0
#4 0x00005636dd7b79a2 in chess_application_main ()
#5 0x00007fa23d7e7b75 in __libc_start_main (main=0x5636dd7aaa50 <main>, argc=2, argv=0x7fff827b6438, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff827b6428)
    at ../csu/libc-start.c:332
        self = <optimized out>
        result = <optimized out>
        unwind_buf = 
              {cancel_jmp_buf = {{jmp_buf = {94793644186304, 829313697107602221, 94793644026480, 0, 0, 0, -829413713854928083, -808912263273321683}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x2, 0x7fff827b6438}, data = {prev = 0x0, cleanup = 0x0, canceltype = 2}}}
        not_first_call = <optimized out>
#6 0x00005636dd7aaa9e in _start ()

This backtrace has six frames, which shows where the code was during program execution when the crash occurred. You can see line numbers for frame #0 (poll.c:29) and #5 (libc-start.c:332), and these frames also show the values of function parameters and variables on the stack, which are often useful for figuring out what went wrong. These frames have good debuginfo because I already had debuginfo installed for glibc. But frames #1 through #4 do not look so useful, showing only function names and the library and nothing else. This is because I’m using Fedora 34 rather than Fedora 35, so I don’t have debuginfod yet, and I did not install proper debuginfo for libgio, libglib, and gnome-chess. (The function names are actually only there because programs in Fedora include some limited debuginfo by default. In many operating systems, you will see ??? instead of function names.) A developer looking at this backtrace is not going to know what went wrong.

Now, let’s run the recommended debuginfo-install command:

# dnf debuginfo-install gnome-chess-40.1-1.fc34.x86_64

When the command finishes, we’ll start gdb again, using coredumpctl gdb just like before. This time, we see this:

Missing separate debuginfos, use: dnf debuginfo-install avahi-libs-0.8-14.fc34.x86_64 colord-libs-1.4.5-2.fc34.x86_64 cups-libs-2.3.3op2-7.fc34.x86_64 fontconfig-2.13.94-2.fc34.x86_64 glib2-2.68.4-1.fc34.x86_64 graphene-1.10.6-2.fc34.x86_64 gstreamer1-1.19.1-2.1.18.4.fc34.x86_64 gstreamer1-plugins-bad-free-1.19.1-3.1.18.4.fc34.x86_64 gstreamer1-plugins-base-1.19.1-2.1.18.4.fc34.x86_64 gtk4-4.2.1-1.fc34.x86_64 json-glib-1.6.6-1.fc34.x86_64 krb5-libs-1.19.2-2.fc34.x86_64 libX11-1.7.2-3.fc34.x86_64 libX11-xcb-1.7.2-3.fc34.x86_64 libXfixes-6.0.0-1.fc34.x86_64 libdrm-2.4.107-1.fc34.x86_64 libedit-3.1-38.20210714cvs.fc34.x86_64 libepoxy-1.5.9-1.fc34.x86_64 libgcc-11.2.1-1.fc34.x86_64 libidn2-2.3.2-1.fc34.x86_64 librsvg2-2.50.7-1.fc34.x86_64 libstdc++-11.2.1-1.fc34.x86_64 libxcrypt-4.4.25-1.fc34.x86_64 llvm-libs-12.0.1-1.fc34.x86_64 mesa-dri-drivers-21.1.8-1.fc34.x86_64 mesa-libEGL-21.1.8-1.fc34.x86_64 mesa-libgbm-21.1.8-1.fc34.x86_64 mesa-libglapi-21.1.8-1.fc34.x86_64 nettle-3.7.3-1.fc34.x86_64 openldap-2.4.57-5.fc34.x86_64 openssl-libs-1.1.1l-1.fc34.x86_64 pango-1.48.9-2.fc34.x86_64

Yup, Fedora ecosystem users will need to run dnf debuginfo-install twice to install everything required, because gdb doesn’t list all required packages until the second time. Next, we’ll run coredumpctl gdb one last time. There will usually be a few more debuginfo packages that are still missing because they’re not available in the Fedora repositories, but now you’ll probably have enough to get a quality backtrace:

(gdb) bt full
#0  0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1  0x00007fa23eee648c in g_main_context_poll
    (priority=, n_fds=2, fds=0x5636deb06930, timeout=, context=0x5636de7b24a0)
    at ../glib/gmain.c:4434
        ret = 
        errsv = 
        poll_func = 0x7fa23ee97c90 
        max_priority = 2147483647
        timeout = 2830
        some_ready = 
        nfds = 2
        allocated_nfds = 2
        fds = 0x5636deb06930
        begin_time_nsec = 30619110638882
#2  g_main_context_iterate.constprop.0
    (context=context@entry=0x5636de7b24a0, block=block@entry=1, dispatch=dispatch@entry=1, self=)
    at ../glib/gmain.c:4126
        max_priority = 2147483647
        timeout = 2830
        some_ready = 
        nfds = 2
        allocated_nfds = 2
        fds = 0x5636deb06930
        begin_time_nsec = 30619110638882
#3  0x00007fa23ee8fc03 in g_main_context_iteration
    (context=context@entry=0x5636de7b24a0, may_block=may_block@entry=1) at ../glib/gmain.c:4196
        retval = 
#4  0x00007fa23e4b599d in g_application_run
    (application=0x5636de7ae260 [ChessApplication], argc=-2105843004, argv=)
    at ../gio/gapplication.c:2560
        arguments = 0x5636de7b2400
        status = 0
        context = 0x5636de7b24a0
        acquired_context = 
        __func__ = "g_application_run"
#5  0x00005636dd7b79a2 in chess_application_main (args=0x7fff827b6438, args_length1=2)
    at src/gnome-chess.p/gnome-chess.c:5623
        _tmp0_ = 0x5636de7ae260 [ChessApplication]
        _tmp1_ = 0x5636de7ae260 [ChessApplication]
        _tmp2_ = 
        result = 0
...

I removed the last two frames because they are triggering a strange WordPress bug, but that’s enough to get the point. It looks much better! Now the developer can see exactly where the program crashed, including filenames, line numbers, and the values of function parameters and variables on the stack. This is as good as a crash report is normally going to get. In this case, it crashed when running poll() because gnome-chess was not actually doing anything at the time of the crash, since we crashed it by manually sending a SIGSEGV signal. Normally the backtrace will look more interesting.

Special Note for Arch Linux Users

Arch does not ship debuginfo packages, btw. To prepare a quality backtrace, you need to rebuild each package yourself with debuginfo enabled. This is a chore, to say the least. You will probably need to rebuild several system libraries in addition to the application itself if you want to get a quality backtrace. This is hardly impossible, but it’s a lot of work, too much to expect users to do for a typical bug report. You might want to consider attempting to reproduce your crash on another operating system in order to make it easier to get the backtrace. What a shame!

debuginfod for Fedora 34

Again, all of that manual debuginfo installation is no longer required as of Fedora 35, where debuginfod is enabled by default. It’s actually all ready to go in Fedora 34, just not enabled by default yet. You can try it early using the DEBUGINFOD_URLS environment variable:

$ DEBUGINFOD_URLS=https://debuginfod.fedoraproject.org/ coredumpctl gdb

Then you can watch gdb download the required debuginfo for you! Again, this environment variable will no longer be necessary in Fedora 35. (Technically, it will still be needed, but it will be configured by default.)

debuginfod for Debian Users

Debian users can use debuginfod, but it has to be enabled manually:

$ DEBUGINFOD_URLS=https://debuginfod.debian.net/ gdb

See here for more information. This requires Debian 11 “bullseye” or newer. If you’re using Ubuntu or other operating systems derived from Debian, you’ll need to wait until a debuginfod server for your operating system is available.

Flatpak

If your application uses Flatpak, you can use the flatpak-coredumpctl script to open core dumps in gdb. For most runtimes, including those distributed by GNOME or Flathub, you will need to manually install (a) the debug extension for your app, (b) the SDK runtime corresponding to the platform runtime that you are using, and (c) the debug extension for the SDK runtime. For example, to install everything required to debug Epiphany 40 from Flathub, you would run:

$ flatpak install org.gnome.Epiphany.Debug//stable
$ flatpak install org.gnome.Sdk//40
$ flatpak install org.gnome.Sdk.Debug//40

(flatpak-coredumpctl will fail to start if you don’t have the correct SDK runtime installed, but it will not fail if you’re missing the debug extensions. You’ll just wind up with a bad backtrace.)

The debug extensions need to exactly match the versions of the app and runtime that crashed, so backtrace generation may be unreliable after you install them for the very first time, because you would have installed the latest versions of the extensions, but your core dump might correspond to an older app or runtime version. If the crash is reproducible, it’s a good idea to run flatpak update after installing to ensure you have the latest version of everything, then reproduce the crash again.

Once your debuginfo is installed, you can open the backtrace in gdb using flatpak-coredumpctl. You just have to tell flatpak-coredumpctl the app ID to use:

$ flatpak-coredumpctl org.gnome.Epiphany

You can pass matches to coredumpctl using -m. For example, to open the core dump corresponding to a crashed process with pid 1234:

$ flatpak-coredumpctl -m 1234 org.gnome.Epiphany

Thibault Saunier wrote flatpak-coredumpctl because I complained about how hard it used to be to debug crashed Flatpak applications. Clearly it is no longer hard. Thanks Thibault!

Update: On newer versions of Debian and Ubuntu, flatpak-coredumpctl is included in the libflatpak-dev subpackage rather than the base flatpak package, so you’ll have to install libflatpak-dev first. But on older OS versions, including Debian 10 “buster” and Ubuntu 20.04, it is unfortunately installed as /usr/share/doc/flatpak/examples/flatpak-coredumpctl rather than /usr/bin/flatpak-coredumpctl due to a regrettable packaging choice that has been corrected in newer package versions. As a workaround, you can simply copy it to /usr/local/bin. Don’t forget to delete your copy after upgrading to a newer OS version, or it will shadow the packaged version.

Fedora Flatpaks

Flatpaks distributed by Fedora are different than those distributed by GNOME or by Flathub because they do not have debug extensions. Historically, this has meant that debugging crashes was impractical. The best solution was to give up.

Good news! Fedora’s Flatpaks are compatible with debuginfod, which means debug extensions will no longer be missed. You do still need to manually install the org.fedoraproject.Sdk runtime corresponding to the version of the org.fedoraproject.Platform runtime that the application uses, because this is required for flatpak-coredumpctl to work, but nothing else is required. For example, to get a backtrace for Fedora’s Epiphany Flatpak using a Fedora 35 host system, I ran:

$ flatpak install org.fedoraproject.Sdk//f34
$ flatpak-coredumpctl org.gnome.Epiphany
(gdb) bt full

(The f34 is not a typo. Epiphany currently uses the Fedora 34 runtime regardless of what host system you are using.)

That’s it!

Miscellany

At this point, you should know enough to obtain a high-quality backtrace on most Linux systems. That will usually be all you really need, but it never hurts to know a little more, right?

Alternative Types of Backtraces

At the top of this blog post, I suggested using bt full to take the backtrace because this type of backtrace is the most useful to most developers. But there are other types of backtraces you might occasionally want to collect:

  • bt on its own without full prints a much shorter backtrace without stack variables or function parameters. This form of the backtrace is more useful for getting a quick feel for where the bug is occurring, because it is much shorter and easier to read than a full backtrace. But because there are no stack variables or function parameters, it might not contain enough information to solve the crash. I sometimes like to paste the first few lines of a bt backtrace directly into an issue report, then submit the bt full version of the backtrace as an attachment, since an entire bt full backtrace can be long and inconvenient if pasted directly into an issue report.
  • thread apply all bt prints a backtrace for every thread. Normally these backtraces are very long and noisy, so I don’t collect them very often, but when a threadsafety issue is suspected, this form of backtrace will sometimes be required.
  • thread apply all bt full prints a full backtrace for every thread. This is what automated bug report tools generally collect, because it provides the most information. But these backtraces are usually huge, and this level of detail is rarely needed, so I normally recommend starting with a normal bt full.

If in doubt, just use bt full like I showed at the top of this blog post. Developers will let you know if they want you to provide the backtrace in a different form.

gdb Logging

You can make gdb print your session to a file. For longer backtraces, this may be easier than copying the backtrace from a terminal:

(gdb) set logging on

Memory Corruption

While a backtrace taken with gdb is usually enough information for developers to debug crashes, memory corruption is an exception. Memory corruption is the absolute worst. When memory corruption occurs, the code will crash in a location that may be far removed from where the actual bug occurred, rendering gdb backtraces useless for tracking down the bug. As a general rule, if you see a crash inside a memory allocation routine like malloc() or g_slice_alloc(), you probably have memory corruption. If you see magazine_chain_pop_head(), that’s called by g_slice_alloc() and is a sure sign of memory corruption. Similarly, crashes in GTK’s CSS machinery are almost always caused by memory corruption somewhere else.

Memory corruption is generally impossible to debug unless you are able to reproduce the issue under valgrind. valgrind is extremely slow, so it’s impractical to use it on a regular basis, but it will get to the root of the problem where gdb cannot. As a general rule, you want to run valgrind with --track-origins=yes so that it shows you exactly what went wrong:

$ valgrind --track-origins=yes my_app

When valgrinding a GLib application, including all GNOME applications, always use the G_SLICE=always-malloc environment variable to disable GLib’s slice allocator, to ensure the highest-quality diagnostics. Correction: the slice allocator can now detect valgrind and disable itself automatically.

When valgrinding a WebKit application, there are some WebKit-specific environment variables to use. Malloc=1 will disable the bmalloc allocator, GIGACAGE_ENABLED=0 will disable JavaScriptCore’s Gigacage feature (Update: turns out this is actually implied by Malloc=1), and WEBKIT_FORCE_SANDBOX=0 will disable the web process sandbox used by WebKitGTK or WPE WebKit:

$ Malloc=1 WEBKIT_FORCE_SANDBOX=0 valgrind --track-origins=yes epiphany

If you cannot reproduce the issue under valgrind, you’re usually totally out of luck. Memory corruption that only occurs rarely or under unknown conditions will lurk in your code indefinitely and cause occasional crashes that are effectively impossible to fix.

Another good tool for debugging memory corruption is address sanitizer (asan), but this is more complicated to use. Experienced users who are comfortable with rebuilding applications using special compiler flags may find asan very useful. However, because it can be very difficult to use,  I recommend sticking with valgrind if you’re just trying to report a bug.

Apport and ABRT

There are two popular downstream bug reporting tools: Ubuntu has Apport, and Fedora has ABRT. These tools are relatively easy to use — no command line knowledge required — and produce quality crash reports. Unfortunately, while the tools are promising, the crash reports go to downstream packagers who are generally either not watching bug reports, or else not interested or capable of fixing upstream software problems. Since downstream reports are very often ignored, it’s better to report crashes directly to upstream if you want your issue to be seen by the right developers and actually fixed. Of course, only report issues upstream if you’re using a recent software version. Fedora and Arch users can pretty much always safely report directly to upstream, as can Ubuntu users who are using the very latest version of Ubuntu. If you are an Ubuntu LTS user, you should stick with reporting issues to downstream only, or at least take the time to verify that the issue still occurs with a more recent software version.

There are a couple more problems with these tools. As previously mentioned, Ubuntu’s apport is incompatible with systemd-coredump. If you’ve read this far, you know you really want systemd-coredump enabled, so I recommend disabling apport until it learns to play ball with systemd-coredump.

The technical design of Fedora’s ABRT is currently better because it actually retrieves your core dumps from systemd-coredump, so you don’t have to choose between one or the other. Unfortunately, ABRT has many serious user experience bugs and warts. I can’t recommend it for this reason, but it if it works well enough for you, it does create some great downstream crash reports. Whether a downstream package maintainer will look at those reports is hit or miss, though.

What is a crash, really?

Most developers consider crashes on Unix systems to be program termination via a Unix signal that triggers creation of a core dump. The most common of these are SIGSEGV (segmentation fault, “invalid memory reference”) or SIBABRT (usually an intentional crash due to an assertion failure). Less-common signals are SIGBUS (“bad memory access”) or SIGILL (“illegal instruction”). Sandboxed applications might occasionally see SIGSYS (“bad system call”). See the manpage signal(7) for a full list. These are cases where you can get a backtrace to help with tracking down the issues.

What is not a crash? If your application is hanging or just not behaving properly, that is not a crash. If your application is killed using SIGTERM or SIGKILL — this can happen when systemd-oomd determines you are low on memory,  or when a service is taking too long to stop — this is also not a crash in the usual sense of the word, because you’re not going to be able to get a backtrace for it. If a website is slow or unavailable, the news might say that it “crashed,” but it’s obviously not the same thing as what we’re talking about here. The techniques in this blog post are no use for these sorts of “crashes.”

Conclusion

If you have systemd-coredump enabled and debuginfod installed and working, most crash reports will be simple.  Memory corruption is a frustrating exception. Encourage your operating system to enable systemd-coredump and debuginfod if it doesn’t already.  Happy crash reporting!

Understanding systemd-resolved, Split DNS, and VPN Configuration

So, systemd-resolved is enabled by default in Fedora 33. Most users won’t notice the difference, but if you use VPNs — or depend on DNSSEC, more on that at the bottom of this post — then systemd-resolved might be big deal for you. When testing Fedora 33, we found one bug report where a user discovered that systemd-resolved broke his VPN configuration. After this bug was fixed, and nobody reported any further issues, I was pretty confident that migration to systemd-resolved would go smoothly. Then Fedora 33 was released, and I noticed a significant number of users on Ask Fedora and Reddit asking for help with broken VPNs, problems that Fedora 33 beta testers had failed to detect. This was especially surprising to me because Ubuntu has enabled systemd-resolved by default since Ubuntu 16.10, so we were four full years behind Ubuntu here, which should have been plenty of time for any problems to be ironed out. So what went wrong?

First, let’s talk about how things worked before systemd-resolved, so we can see what was wrong and why we needed change. We’ll see how split DNS with systemd-resolved is different than traditional DNS. Finally, we’ll learn how custom VPN software must configure systemd-resolved to avoid problems that result in broken DNS.

I want to note that, although I wrote the Fedora change proposal and have done some evangelism on behalf of systemd-resolved, I’m not a systemd developer and haven’t contributed any code to systemd-resolved.

Traditional DNS with nss-dns

Let’s first see how things worked before systemd-resolved. There are two important configuration files to discuss. The first is /etc/nsswitch.conf, which controls which NSS modules are invoked by glibc when performing name resolution. Note these are glibc Name Service Switch modules, which are totally unrelated to Firefox’s NSS, Network Security Services, which unfortunately uses the same acronym. Also note that, in Fedora (and also Red Hat Enterprise Linux), /etc/nsswitch.conf is managed by authselect and must not be edited directly. If you want to change it, you need to edit /etc/authselect/user-nsswitch.conf instead, then run sudo authselect apply-changes.

Anyway, in Fedora 32, the hosts line in /etc/nsswitch.conf looked like this:

hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname

That means: first invoke nss-files, which looks at /etc/hosts to see if the hostname is hardcoded there. If it’s not, then invoke nss-mdns4_minimal, which uses avahi to implement mDNS resolution. [NOTFOUND=return] means it’s OK for avahi to not be installed; in that case, it just gets ignored. (Edit: this was wrong. Mantas mentioned in the comment below that this is to allow returning early for queries to .local domains, which should never go to the remaining nss modules.) Then most DNS resolution is performed by nss-dns. And finally, we have nss-myhostname, which is just there to guarantee that your own local hostname is always resolvable. Anyway, nss-dns is the key part here. nss-dns is what reads /etc/resolv.conf.

Next, let’s look at /etc/resolv.conf. This file contains a list of up to three DNS servers to use. The servers are attempted in order. If the first server in the list is broken, then the second server will be used. If the second server is broken, the third server will be used. If the third server is also broken, then everything fails, because no matter how many servers you list here, all except the first three are ignored. In Fedora 32, /etc/resolv.conf was, by default, a plain file managed by NetworkManager. It might look like this:

# Generated by NetworkManager
nameserver 192.168.122.1

That’s a pretty common example. It means that all DNS requests should be sent to my router. My router must have configured this via DHCP, causing NetworkManager to dutifully add it to /etc/resolv.conf.

Traditional DNS Problems

Traditional DNS is all well and good for a simple case like we had above, but turns out it’s really broken once you start adding VPNs to the mix. Let’s consider two types of VPNs: a privacy VPN that is always enabled and which is the default route for all web traffic, and a corporate VPN that only receives traffic for internal company resources. (To switch between these two different types of VPN configuration, use the checkbox “Use this connection only for resources on its network” at the bottom of the IPv4 and IPv6 tabs of your VPN’s configuration in System Settings.)

Now, what happens if we connect to both VPNs? The VPN that you connect to first gets listed first in /etc/resolv.conf, followed by the VPN that you connect to second, followed by your local DNS server. Assuming the DNS servers are all working properly, that means:

  • If you connect to your privacy VPN first and your corporate VPN second, all DNS requests will be sent to your privacy VPN, and you won’t be able to visit internal corporate websites. (This scenario is exactly why I become interested in systemd-resolved. After joining Red Hat, I discovered that I couldn’t access various redhat.com websites if I connected to my VPNs in the wrong order.)
  • If you connect to your corporate VPN first and your privacy VPN second, then all your DNS goes to your corporate VPN, and none to your privacy VPN. As that defeats the point of using the privacy VPN, we can be confident it’s not what users expect to happen.
  • If you ever connect the VPNs in the opposite order — say, if your connection to one temporarily drops, and you need to reconnect — then you’ll get the opposite behavior. If you don’t notice this pattern behind the failures, it can make problems difficult to reproduce.

You don’t need two VPNs for this to be a problem, of course. Let’s say you have no privacy VPN, only a corporate VPN.  Well, your employer may fire you if it notices DNS requests it doesn’t like. If you’re making 30 requests per hour to facebook.com, youtube.com, or more salacious websites, that sure looks like you’re not doing very much work. It’s really never in the employee’s best interests to send more DNS than necessary to an employer.

If you use only a privacy VPN, the failure case is arguably even more severe. Let’s say your privacy VPN’s DNS server temporarily goes offline. Then, because /etc/resolv.conf is a list, glibc will fall back to using your normal DNS, probably either your ISP’s DNS server, or your router that forwards everything to your ISP. And now your DNS query has gone to your ISP. If you’re making the wrong sort of DNS requests in the wrong sort of countries — say, if you’re visiting websites opposed to your government — this could get you imprisoned or executed.

Finally, either type of VPN will break resolution of local domains, e.g. fritz.box, because only your router can resolve that properly, but you’re sending your DNS query to your VPN’s DNS server. So local resources will be broken for as long as you’re connected to a VPN.

All things considered, the status quo prior to systemd-resolved was pretty terrible. The need for something better should be clear. Now let’s look at how systemd-resolved fixes this.

Modern DNS with nss-resolve

First, let’s look at /etc/nsswitch.conf, which looks a bit different in Fedora 33:

hosts: files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] myhostname dns

nss-myhostname and nss-dns have switched places, but that’s just a minor change that ensures your local hostname is always local even if your DNS server thinks otherwise. (March 2021 Update: nss-myhostname has been moved before nss-mdns4_minimal for Fedora 34, so our new configuration is files myhostname mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] dns.)

The important change here is the addition of resolve [!UNAVAIL=return]. nss-resolve uses systemd-resolved to resolve hostnames, via either its varlink API (with systemd 247) or its D-Bus API (with older versions of systemd). If systemd-resolved is running, glibc will stop there, and refuse to continue on to nss-myhostname or nss-dns even if nss-resolve doesn’t return a result, since both nss-myhostname and nss-dns are obsoleted by nss-resolve. But if systemd-resolved is not running, then it continues on (and, if resolving something other than the local hostname, will up using nss-dns and reading /etc/resolv.conf, as before).

Importantly, when nss-resolve is used, glibc does not read /etc/resolv.conf when performing name resolution, so any configuration that you put there is totally ignored. That means any script or program that writes to /etc/resolv.conf is probably broken. /etc/resolv.conf still exists, though: it’s managed by systemd-resolved to maintain compatibility with programs that manually read /etc/resolv.confand do their own name resolution, bypassing glibc. Although systemd-resolved supports several different modes for managing /etc/resolv.conf, the default mode, and the mode used in both Fedora and Ubuntu, is for /etc/resolv.conf to be a symlink to /run/systemd/resolve/stub-resolv.conf, which now looks like this:

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search redhat.com lan

The redhat.com search domain is coming from my corporate VPN, but the rest of this /etc/resolv.conf should look like yours. Notably, 127.0.0.53 is systemd-resolved’s local stub responder. This allows programs that manually read /etc/resolv.conf to continue to work without changes: they will just wind up talking to systemd-resolved on 127.0.0.53 rather than directly connecting to your real DNS server, as before.

A Word about Ubuntu

Although Ubuntu has used systemd-resolved for four years now, it has not switched from nss-dns to nss-resolve, contrary to upstream recommendations. This means that on Ubuntu, glibc still reads /etc/resolv.conf, finds 127.0.0.53 listed there, and then makes an IP connection to systemd-resolved rather than talking to it via varlink or D-Bus, as occurs on Fedora. The practical effect is that, on Ubuntu, you can still manually edit /etc/resolv.conf and applications will respond to those changes, unlike Fedora. Of course, that would be a disaster, since it would cause all of your DNS configuration in systemd-resolved to be completely ignored. But it’s still possible on Ubuntu. On Fedora, that won’t work at all.

If you’re using custom VPN software that doesn’t work with systemd-resolved, chances are it probably tries to write to /etc/resolv.conf.

Split DNS with systemd-resolved

OK, so now we’ve looked at how /etc/nsswitch.conf and /etc/resolve.conf have changed, but we haven’t actually explained how split DNS is configured. Instead of sending all your DNS requests to the first server listed in /etc/resolv.conf, systemd-resolved is able to split your DNS on the basis of DNS routing domains.

IP Routing Domains, DNS Routing Domains, and DNS Search Domains: Oh My!

systemd-resolved works with DNS routing domains and DNS search domains. A DNS routing domain determines only which DNS server your DNS query goes to.  It doesn’t determine where IP traffic goes to: that would be an IP routing domain. Normally, when people talk about “routing domains,” they probably mean IP routing domains, not DNS routing domains, so be careful not to confuse these two concepts. For the rest of this article, I will use “routing domain” or “DNS domain” to mean DNS routing domain.

A DNS search domain is also different. When you query a name that is only a single label — a domain without any dots — a search domain gets appended to your query. For example, because I’m currently connected to my Red Hat VPN, I have a search domain configured for redhat.com. This means that if I make a query to a domain that is only a single label, redhat.com will be appended to the query. For example, I can query bugzilla and this will be treated as a query for bugzilla.redhat.com. This probably won’t work in your web browser, because web browsers like to convert single-label domains into web searches, but it does work at the DNS level.

In systemd-resolved, each DNS routing domain may or may not be used as a search domain. By default, systemd-resolved will add search domains for every configured routing domain that is not prefixed by a tilde. For example, ~example.com is a routing domain only, while example.com is both a routing domain and a search domain. There is also a global routing domain,  ~.

Example Split DNS Configurations

Let’s look at a complex example with three network interfaces:

$ resolvectl
Global
Protocols: LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (enp4s0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6 
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.1.1 
DNS Servers: 192.168.1.1 
DNS Domain: lan

Link 5 (tun0)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6 
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.8.0.1 
DNS Servers: 10.8.0.1 
DNS Domain: ~.

Link 9 (tun1)
Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6 
Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.9.0.1 
DNS Servers: 10.9.0.1 10.9.0.2
DNS Domain: example.com

To simplify this example, I’ve removed several uninteresting network interfaces from the output above: my unused second Ethernet interface, my unused Wi-Fi interface wlp5s0, and two virtual network interfaces that I presume are used by libvirt. This means we only have three interfaces to consider: normal Ethernet enp4s0, the privacy VPN tun0, and the corporate VPN tun1. I’m currently running NetworkManager 1.26.4, so I have also fudged the output a bit to make it look like it would if I were using NetworkManager 1.26.6 — I’ll discuss the difference below — so that this example will be good for the future. Let’s look at a few points of note:

  • enp4s0 is configured with +DefaultRoute and no routing domains.
  • tun0 is configured with +DefaultRoute and a global routing domain, ~.
  • tun1 is configured with -DefaultRoute and a routing domain for example.com. (It also has a search domain for example.com, because it doesn’t start with a tilde.)

systemd-resolved first decides which network interface is most appropriate for your DNS query based on the domain name you are querying, then sends your query to the DNS server associated with that interface. In this case, queries for example.com, foo.example.com, etc. will be sent to 10.9.0.1, since that is the DNS server configured for tun1, which is associated with the domain example.com. All other requests go to 10.8.0.1, since tun0 has the global domain ~. Nothing ever goes to 192.168.1.1, because a privacy VPN is enabled, and that would be a privacy disaster. Very simple, right?

If you do not use a privacy VPN, you will not have any ~. domain configured. In this case, your query will go to all interfaces that have +DefaultRoute. For example, if tun0 were removed from the above configuration, then queries not for example.com would be sent to 192.168.1.1, my router, which is good because tun1 is my corporate VPN and should only receive DNS queries corresponding to its own DNS domains.

Enter NetworkManager

How does systemd-resolved come up with the above configuration? It doesn’t. Everything I wrote in the previous section assumes that you are using NetworkManager, because systemd-resolved doesn’t actually make any decisions about where to send your DNS. That is all the responsibility of higher-level network management software, typically NetworkManager. If you use custom VPN software — anything that’s not a NetworkManager VPN plugin — then that software is also responsible for configuring systemd-resolved and playing nice with NetworkManager.

NetworkManager normally does a very good job of configuring systemd-resolved to work as you would expect, so most users should not need to make any changes. But if your DNS isn’t working as you expect, and you run resolvectl and find that systemd-resolved’s configuration is not what you want, do not report a bug against systemd-resolved! Report a bug against NetworkManager instead (if you’re confident there is a real bug).

If you don’t use NetworkManager, you can still make systemd-resolved do what you want, but you’re on your own. It will not configure itself for you.

NetworkManager 1.26.6

If you’re reading this in December 2020, you’re probably using NetworkManager 1.26.4 or earlier. Things are slightly different here, because NetworkManager recently landed a major behavior change. Previously, NetworkManager would always configure a ~. domain for exactly one network interface. This means that the value of systemd-resolved’s DefaultRoute settings was always ignored, since ~. takes precedence. Accordingly, NetworkManager did not bother to configure DefaultRoute at all. I told you that I fudged the output of the example above a little. In actuality, NetworkManager 1.26.4 has configured +DefaultRoute on my tun1 corporate VPN. That doesn’t make sense, because it should only receive DNS for example.com, but it previously did not matter, because there was previously always a ~. domain on some interface. If you’re not using any VPNs, then your Ethernet or Wi-Fi interface would receive the ~. domain. But since 1.26.6, NetworkManager now only ever configures a ~. domain when you are using a privacy VPN, so the DefaultRoute setting now matters.

Prior to NetworkManager 1.26.6, you could rely on resolvectl domain alone to see where your DNS goes, because there was always a ~. domain. Since NetworkManager 1.26.6 no longer always creates a ~. domain, that no longer works. You’ll need to use look at the full output of resolvectl instead, since that will show you the DefaultRoute settings, which are now important.

My Corporate VPN is Missing a Routing Domain, What Should I Do?

Say your corporate VPN is example.com. You want all requests for example.com to be resolved by the VPN, and they are, because NetworkManager creates an appropriate routing domain for it. But you also want requests for some other domain, say example.org, to be resolved by the VPN as well. What do you do?

Most VPN protocols allow the VPN to tell NetworkManager which domains should be resolved by the VPN. Others allow specifying this in the connection profile that you import into NetworkManager. Sadly, not all VPNs actually do this properly, since it doesn’t matter for traditional non-split DNS. Worse, there is no graphical configuration in GNOME System Settings to fix this. There really should be. But for now, you’ll have to use nmcli to set the ipv4.dns-search and ipv6.dns-search properties of your VPN connection profile. Confusingly, even though that setting says “search,” it also creates a routing domain. Hopefully you never have to mess with this. If you do this, consider contacting your IT department to ask them to fix your VPN configuration to properly declare its DNS routing domains, so you don’t have to fix it manually. (This actually sometimes works!) You might have to do this more than once, if you discover additional domains that need to be resolved by the corporate VPN.

Custom VPN Software

By “custom VPN software,” I mean any VPN that is not a NetworkManager plugin. That includes proprietary VPN applications offered by VPN services, and also packaged software like openvpn or wg-quick, when invoked by something other than NetworkManager.

If your custom VPN software is broken, you could report a bug against your VPN software to ask for support for systemd-resolved, but it’s really best to ditch your custom software and configure your VPN using NetworkManager instead, if possible. There are really only two good reasons to use custom VPN software: if NetworkManager doesn’t have a plugin appropriate for your corporate VPN, or if you need to use Wireguard and your desktop doesn’t support Wireguard yet. (NetworkManager itself supports Wireguard, but GNOME does not yet, because Wireguard is special and not treated the same as other VPNs. Help welcome.)

If you use NetworkManager to configure your VPN, as desktop developers intend for you to do, then NetworkManager will take care of configuring systemd-resolved appropriately. Fedora ships with several NetworkManager VPN plugins installed by default, so the vast majority of VPN users should be able to configure your VPN directly in System Settings. This also allows you to control your VPN using your desktop environment’s VPN integration, rather than using the command line or a custom proprietary application.

OpenVPN users will want to look into using the unofficial update-systemd-resolved script. However, NetworkManager has good support for OpenVPN, and this is totally unnecessary if you configure your VPN with NetworkManager. So it’s probably better to use NetworkManager instead.

Now, what if you maintain custom VPN software and want it to work properly with systemd-resolved, or what if you can’t use NetworkManager for whatever reason? First, stop trying to write to /etc/resolv.conf, at least if it’s managed by systemd-resolved. You’ll instead want to use the systemd-resolved D-Bus API to configure an appropriate routing domain for your VPN interface. Read this documentation. You could also shell out to resolvectl, but it’s probably better to use the D-Bus API unless your VPN is managed by a shell script. Privacy VPNs (or corporate VPNs that wish to eschew split DNS and hijack all the user’s DNS) can also use the resolvconf compatibility script, but note this will only work properly with NetworkManager 1.26.6 and newer, because the best you can do with it is add a global routing domain to a network interface, but that’s not going to work as expected if another network interface already has a global routing domain. Did I mention that you might want to use the D-Bus API instead? With the D-Bus API, you can remove the global routing domain from any other network interfaces, to ensure only your VPN’s interface gets a global routing domain.

Split DNS Without systemd-resolved

Quick tangent: systemd-resolved is not the only software available that implements split DNS. Previously, the most popular solution for this was to use dnsmasq. This has always been available in Fedora, but you had to go out of your way to install and configure it, so almost nobody did. Other custom solutions were possible too — I know one developer who runs Unbound locally — but systemd-resolved and dnsmasq are the only options supported by NetworkManager.

One significant difference between systemd-resolved and dnsmasq is that systemd-resolved, as a system daemon, allows for multiple sources of configuration. In contrast, NetworkManager runs dnsmasq as a subprocess, so only NetworkManager itself is allowed to configure dnsmasq. For most users, this distinction will not matter, but it’s important for custom VPN software.

Servers and DNSSEC

You might have noticed that the rest of this blog post focused pretty much exclusively on desktop use cases. Your server is probably not using a VPN. It’s probably not using mDNS. It’s probably not expected to be able to resolve local hostnames. Conclusion: most servers don’t need split DNS! Servers do benefit from systemd-resolved’s systemwide DNS cache, so running systemd-resolved on servers is still a good idea. But it’s not nearly as important for servers as it is for desktops.

There are some disadvantages for servers as well. First, systemd-resolved is not intended to be used on DNS servers. If you’re running a DNS server, you’ll need to disable systemd-resolved before setting up BIND or Unbound instead. That is one extra step to get your DNS server working relative to before, so enabling systemd-resolved by default is an inconvenience here, but that’s hardly difficult to do, so not a big deal.

However, systemd-resolved currently has several bugs in how it handles DNSSEC, and this is potentially a big deal if you depend on that. If you’re a desktop user, you’ll probably never notice, because DNSSEC on desktops is a total failure. Due to widespread and unfixable compatibility issues, it’s very unlikely that we would be able to enable DNSSEC validation by default in the next 10-15 years. If you have a desktop computer that never leaves your home and a good ISP, or a server sitting in a data center, then you can probably safely turn it on manually in /etc/systemd/resolved.conf, but this is highly inadvisable for laptops. So DNSSEC is currently useful for securing DNS between DNS servers, but not for securing DNS between you devices and your DNS server.  (For that, we plan to use DNS over TLS instead.) And we’ve already established that DNS servers should not use systemd-resolved. So what’s the problem?

Well, it turns out DNS servers are not the only server software that expects DNSSEC to work properly. In particular, broken DNSSEC can result in broken mail servers. Other stuff might break too. If you’re running a server that needs functional DNSSEC, you’re going to need to disable systemd-resolved for now. These problems with DNSSEC resulted in some extremely vocal opposition to the Fedora 33 systemd-resolved change proposal, which unfortunately we didn’t properly appreciate until too late in the Fedora 33 development cycle. The good news is that these problems are being treated as bugs to be fixed. In particular, I am keeping an eye on this bug and this bug. Development is currently very active, so I’m hopeful that systemd-resolved’s DNSSEC support will look much better in time for Fedora 34.

Tell Me More!

Wow, you made it to the end of a long blog post, and you still want to know more? Next step is to read my colleague Zbigniew’s Fedora Magazine article, which describes some of the concepts I’ve already mentioned in greater detail. (However, when reading that article, be aware of the NetworkManager 1.26.6 changes I mentioned above. The article predates NetworkManager 1.26.6, so you will see in the examples that a ~. global routing domain is assigned to non-VPN interfaces. That will no longer happen.)

Conclusion

Split DNS is designed to just work, like the rest of the modern Linux desktop, and it should for everyone not using custom VPN software. If you do run into trouble with custom VPN software, the bottom line is to try using a NetworkManager VPN plugin instead, if possible. In the short term, you will also need to disable systemd-resolved if you depend on DNSSEC, but hopefully that won’t be necessary for much longer. Everyone else should hopefully never notice that systemd-resolved is there.

Happy resolving!

Patching Vendored Rust Dependencies

Recently I had a difficult time trying to patch a CVE in librsvg. The issue itself was simple to patch because Federico kindly backported the series of commits required to fix it to the branch we are using downstream. Problem was, one of the vendored deps in the old librsvg tarball did not build with our modern rustc, because the code contained a borrow error that was not caught by older versions of rustc. After finding the appropriate upstream fix, I tried naively patching the vendored dep, but that failed because cargo tries very hard to prevent you from patching its dependencies, and complains if the dependency does not match its checksum in Cargo.lock. I tried modifying the checksum in Cargo.lock, but then it complains that you modified the Cargo.lock. It seems cargo is designed to make patching dependencies as difficult as possible, and that not much thought was put into how cargo would be used from rpmbuild with no network access.

Anyway, it seems the kosher way to patch Rust dependencies is to add a [patch] section to librsvg’s Cargo.toml, but I could not figure out how to make that work. Eventually, I got some help: you can edit the .cargo-checksum.json of the vendored dependency and change “files” to an empty array, like so:

diff --git a/vendor/cssparser/.cargo-checksum.json b/vendor/cssparser/.cargo-checksum.json
index 246bb70..713372d 100644
--- a/vendor/cssparser/.cargo-checksum.json
+++ b/vendor/cssparser/.cargo-checksum.json
@@ -1 +1 @@
-{"files":{".cargo-ok":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",".travis.yml":"f1fb4b65964c81bc1240544267ea334f554ca38ae7a74d57066f4d47d2b5d568","Cargo.toml":"7807f16d417eb1a6ede56cd4ba2da6c5c63e4530289b3f0848f4b154e18eba02","LICENSE":"fab3dd6bdab226f1c08630b1dd917e11fcb4ec5e1e020e2c16f83a0a13863e85","README.md":"c5781e673335f37ed3d7acb119f8ed33efdf6eb75a7094b7da2abe0c3230adb8","build.rs":"b29fc57747f79914d1c2fb541e2bb15a003028bb62751dcb901081ccc174b119","build/match_byte.rs":"2c84b8ca5884347d2007f49aecbd85b4c7582085526e2704399817249996e19b","docs/.nojekyll":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","docs/404.html":"025861f76f8d1f6d67c20ab624c6e418f4f824385e2dd8ad8732c4ea563c6a2e","docs/index.html":"025861f76f8d1f6d67c20ab624c6e418f4f824385e2dd8ad8732c4ea563c6a2e","src/color.rs":"c60f1b0ab7a2a6213e434604ee33f78e7ef74347f325d86d0b9192d8225ae1cc","src/cow_rc_str.rs":"541216f8ef74ee3cc5cbbc1347e5f32ed66588c401851c9a7d68b867aede1de0","src/from_bytes.rs":"331fe63af2123ae3675b61928a69461b5ac77799fff3ce9978c55cf2c558f4ff","src/lib.rs":"46c377e0c9a75780d5cb0bcf4dfb960f0fb2a996a13e7349bb111b9082252233","src/macros.rs":"adb9773c157890381556ea83d7942dcc676f99eea71abbb6afeffee1e3f28960","src/nth.rs":"5c70fb542d1376cddab69922eeb4c05e4fcf8f413f27563a2af50f72a47c8f8c","src/parser.rs":"9ed4aec998221eb2d2ba99db2f9f82a02399fb0c3b8500627f68f5aab872adde","src/rules_and_declarations.rs":"be2c4f3f3bb673d866575b6cb6084f1879dff07356d583ca9a3595f63b7f916f","src/serializer.rs":"4ccfc9b4fe994aab3803662bbf31cc25052a6a39531073a867b14b224afe42dd","src/size_of_tests.rs":"e5f63c8c18721cc3ff7a5407e84f9889ffa10e66da96e8510a696c3e00ad72d5","src/tests.rs":"80b02c80ab0fd580dad9206615c918e0db7dff63dfed0feeedb66f317d24b24b","src/tokenizer.rs":"429b2cba419cf8b923fbcc32d3bd34c0b39284ebfcb9fc29b8eb8643d8d5f312","src/unicode_range.rs":"c1c4ed2493e09d248c526ce1ef8575a5f8258da3962b64ffc814ef3bdf9780d0"},"package":"8a807ac3ab7a217829c2a3b65732b926b2befe6a35f33b4bf8b503692430f223"}
\ No newline at end of file
+{"files":{},"package":"8a807ac3ab7a217829c2a3b65732b926b2befe6a35f33b4bf8b503692430f223"}

Then cargo will stop complaining and you can patch the dependency. Success!

Enable Git Commit Message Syntax Highlighting in Vim on Fedora

Were you looking forward to reading an exciting blog post about substantive technical issues affecting GNOME or the Linux desktop community? Sorry, not today.

When setting up new machines, I’m often frustrated by lack of syntax highlighting for git commit messages in vim. On my main workstation, vim uses comforting yellow letters for the first line of my commit message to let me know I’m good on line length, or red background to let me know my first line is too long, and after the first line it automatically inserts a new line break whenever I’ve typed past 72 characters. It’s pretty nice. I can never remember how I get it working in the end, and I spent too long today trying to figure it out yet again. Eventually I realized there was another difference besides the missing syntax highlighting: I couldn’t see the current line or column number, and I couldn’t see the mode indicator either. Now you might be able to guess my mistake: git was not using /usr/bin/vim at all! Because Fedora doesn’t have a default $EDITOR, git defaults to using /usr/bin/vi, which is basically sad trap vim. Solution:

$ git config --global core.editor vim

You also have to install the vim-enhanced package to get /usr/bin/vim, but that’s a lot harder to forget to do.

You’re welcome, Internet!

Let’s Learn Spelling!

Were you looking forward to reading an exciting blog post about substantive technical issues affecting GNOME or the Linux desktop community? Sorry, not today.

GNOME

It used to be an acronym, so it’s all uppercase. Write “GNOME,” never “Gnome.” Please stop writing “Gnome.”

Would it help if you imagine an adorable little garden gnome dying each time you get it wrong?

If you’re lazy and hate capital letters, or for technical contexts like package or project names, then all-lowercase “gnome” might be appropriate, but “Gnome” certainly never is.

Red Hat

This one’s not that hard. Why are some people writing “RedHat” without any space? It doesn’t make sense. Red Hat. Easy!

SUSE and openSUSE

S.u.S.E. and SuSE are both older spellings for the company currently called SUSE. Apparently at some point in the past they realized that the lowercase u was stupid and causes readers’ eyes to bleed.  Can we please let it die?

Similarly, openSUSE is spelled “openSUSE,” not “OpenSUSE.” Do not capitalize the o, even if it’s the first word in a sentence. Do not write “openSuSE” or “OpenSuSE” (which people somehow manage to do even when they’re not trolling) or anything at all other than “openSUSE.” I know this is probably too much to ask, but once you get the hang of it, it’s not so hard.

elementary OS

I don’t often see this one messed up. If you can write elementary OS, you can probably write openSUSE properly too! They’re basically the same structure, right? All lowercase, then all caps. I have faith in you, dear reader! Don’t let me down!

GTK and WebKitGTK

We removed the + from the end of both of these, because it was awful. You’re welcome!

Again, all lowercase is probably OK in technical contexts. “gtk-webkit” is not. WebKitGTK.

Mesa Update Breaks WebKitGTK+ in Fedora 29

If you’re using Fedora and discovered that WebKitGTK+ is displaying blank pages, the cause is a bad mesa update, mesa-18.2.3-1.fc29. This in turn was caused by a GCC bug that resulted in miscompilation of mesa.

To avoid this bug, downgrade to mesa-18.2.2-1.fc29:

$ sudo dnf downgrade mesa*

You can also update to mesa-18.2.4-2.fc29, but this build has not yet reached updates-testing, let alone stable, so downgrading is easier for now. Another workaround is to run your application with accelerated compositing mode disabled, to avoid OpenGL usage:

$ WEBKIT_DISABLE_COMPOSITING_MODE=1 epiphany

On the bright side of things, from all the bug reports I’ve received over the past two days I’ve discovered that lots of people use Epiphany and notice when it’s broken. That’s nice!

Huge thanks to Dave Airlie for quickly preparing the fixed mesa update, and to Jakub Jelenik for handling the same for GCC.

On Python Shebangs

So, how do you write a shebang for a Python program? Let’s first set aside the python2/python3 issue and focus on whether to use env. Which of the following is correct?

#!/usr/bin/env python
#!/usr/bin/python

The first option seems to work in all environments, but it is banned in popular distros like Fedora (and I believe also Debian, but I can’t find a reference for this). Using env in shebangs is dangerous because it can result in system packages using non-system versions of python. python is used in so many places throughout modern systems, it’s not hard to see how using #!/usr/bin/env in an important package could badly bork users’ operating systems if they install a custom version of python in /usr/local. Don’t do this.

The second option is broken too, because it doesn’t work in BSD environments. E.g. in FreeBSD, python is installed in /usr/local/bin. So FreeBSD contributors have been upstreaming patches to convert #!/usr/bin/python shebangs to #!/usr/bin/env python. Meanwhile, Fedora has begun automatically rewriting #!/usr/bin/env python to #!/usr/bin/python, but with a warning that this is temporary and that use of #!/usr/bin/env python will eventually become a fatal error causing package builds to fail.

So obviously there’s no way to write a shebang that will work for both major Linux distros and major BSDs. #!/usr/bin/env python seems to work today, but it’s subtly very dangerous. Lovely. I don’t even know what to recommend to upstream projects.

Next problem: python2 versus python3. By now, we should all be well-aware of PEP 394. PEP 394 says you should never write a shebang like this:

#!/usr/bin/env python
#!/usr/bin/python

unless your python script is compatible with both python2 and python3, because you don’t know what version you’re getting. Your python script is almost certainly not compatible with both python2 and python3 (and if you think it is, it’s probably somehow broken, because I doubt you regularly test it with both). Instead, you should write the shebang like this:

#!/usr/bin/env python2
#!/usr/bin/python2
#!/usr/bin/env python3
#!/usr/bin/python3

This works as long as you only care about Linux and BSDs. It doesn’t work on macOS, which provides /usr/bin/python and /usr/bin/python2.7, but still no /usr/bin/python2 symlink, even though it’s now been six years since PEP 394. It’s hard to understate how frustrating this is.

So let’s say you are WebKit, and need to write a python script that will be truly cross-platform. How do you do it? WebKit’s scripts are only needed (a) during the build process or (b) by developers, so we get a pass on the first problem: using /usr/bin/env should be OK, because the scripts should never be installed as part of the OS. Using #!/usr/bin/env python — which is actually what we currently do — is unacceptable, because our scripts are python2 and that’s broken on Arch, and some of our developers use that. Using #!/usr/bin/env python2 would be dead on arrival, because that doesn’t work on macOS. Seems like the option that works for everyone is #!/usr/bin/env python2.7. Then we just have to hope that the Python community sticks to its promise to never release a python2.8 (which seems likely).

…yay?

Endgame for WebKit Woes

In my original blog post On WebKit Security Updates, I identified three separate problems affecting WebKit users on Linux:

  • Distributions were not providing updates for WebKitGTK+. This was the main focus of that post.
  • Distributions were shipping a insecure compatibility package for old, unmaintained WebKitGTK+ 2.4 (“WebKit1”).
  • Distributions were shipping QtWebKit, which was also unmaintained and insecure.

Let’s review these problems one at a time.

Distributions Are Updating WebKitGTK+

Nowadays, most major community distributions are providing regular WebKitGTK+ updates, so this is no longer a problem for the vast majority of Linux users. If you’re using a supported version of Ubuntu (except Ubuntu 14.04), Fedora, or most other mainstream distributions, then you are good to go.

My main concern here is still Debian, but there are reasons to be optimistic. It’s too soon to say what Debian’s policy will be going forward, but I am encouraged that it broke freeze just before the Stretch release to update from WebKitGTK+ 2.14 to 2.16.3. Debian is slow and conservative and so has not yet updated to 2.16.6, which is sad because 2.16.3 is affected by a bug that causes crashes on a huge number of websites, but my understanding is it is likely to be updated in the near future. I’m not sure if Debian will update to 2.18 or not. We’ll have to wait and see.

openSUSE is another holdout. The latest stable version of openSUSE Leap, 42.3, is currently shipping WebKitGTK+ 2.12.5. That is disappointing.

Most other major distributions seem to be current.

Distributions Are Removing WebKitGTK+ 2.4

WebKitGTK+ 2.4 (often informally referred to as “WebKit1”) was the next problem. Tons of desktop applications depended on this old, insecure version of WebKitGTK+, and due to large API changes, upgrading applications was not going to be easy. But this transition is going much smoother and much faster than I expected. Several distributions, including Debian, Fedora, and Arch, have recently removed their compatibility packages. There will be no WebKitGTK+ 2.4 in Debian 10 (Buster) or Fedora 27 (scheduled for release this October). Most noteworthy applications have either ported to modern WebKitGTK+, or have configure flags to disable use of WebKitGTK+. In some cases, such as GnuCash in Fedora, WebKitGTK+ 2.4 is being bundled as part of the application build process. But more often, applications that have not yet ported simply no longer work or have been removed from these distributions.

Soon, users will no longer need to worry that a huge amount of WebKitGTK+ applications are not receiving security updates. That leaves one more problem….

QtWebKit is Back

Upstream QtWebKit has not been receiving security updates for the past four years or thereabouts, since it was abandoned by the Qt project. That is still the status quo for most distributions, but Arch and Fedora have recently switched to Konstantin Tokarev’s fork of QtWebKit, which is based on WebKitGTK+ 2.12. (Thank you Konstantin!) If you are using any supported version of Fedora, you should already have been switched to this fork. I am hopeful that the fork will be rebased on WebKitGTK+ 2.16 or 2.18 in the near future, to bring it current on security updates, but in the meantime, being a year and a half behind is an awful lot better than being four years behind. Now that Arch and Fedora have led the way, other distributions should find little trouble in making the switch to Konstantin’s QtWebKit. It would be a disservice to users to continue shipping the upstream version.

So That’s Cool

Things are better. Some distributions, notably Arch and Fedora, have resolved all of the above problems (or will in the very near future). Yay!

More On Private Internet Access

A few quick follow-up thoughts from my original review. First, problems I haven’t solved yet:

  • I forgot an important problem in my first blog: email. Evolution is borderline unusable with PIA. My personal GMail account usually works reliably, but my Google Apps school GMail account (which you’d think would function the same) and my Igalia email both time out with the error “Source doesn’t support prompt for credentials”. That’s Evolution’s generic error that it throws up whenever the mail server is taking too long to respond. So what’s going on here? I can check my email via webmail as a workaround in the meantime, but this is really terrible.
  • Still no solution for the first attempt to connect always failing. That’s really annoying! I was expecting some insight (or at least guesses) as to what might be going wrong here, but nobody has suggested anything about this yet. Update: The problem is that I had selected “Make available to other users” but “Store the password only for this user”, which results in the first attempt to connect always failing, because it’s performed by the gdm user. The fix is to store the password for all users.

Some solutions and answers to problems from my original post:

  • Jonh Wendell suggested using TCP instead of UDP to connect to PIA. I’ve been trying this and so far have not noticed a single instance of connection loss. So I think my biggest problem has been solved. Yay!
  • Dan LaManna posted a link to vpnfailsafe. I’m probably not going to use this since it’s a long shell script that I don’t understand, and since my connection drop problems seem to be solved now that I’ve switched to TCP, but it looks like it’d probably be a good solution to its problem. Real shame this is not built in to NetworkManager already.
  • Christel Dahlskjaer has confirmed that freenode requires NickServ/SASL authentication to use via PIA. This isn’t acceptable for me, since Empathy can’t handle it well, so I’m probably just going to stop using freenode for the most part. The only room I was ever really active in was #webkitgtk+, but in practice our use of that room is basically redundant with #epiphany on GIMPNet (where you’ll still find me, and which would be a better location for a WebKitGTK+ channel anyway), so I don’t think I’ll miss it. I’ve been looking to reduce the number of IRC rooms I join for a long time anyway. The only thing I really need freenode for is Fedora Workstation meetings, which I can attend via a web gateway. (Update: I realized that I am going to miss #webkit as well. Hmm, this could be a problem….)

So my biggest issue now is that I can’t use my email. That’s pretty surprising, as I wouldn’t think using a VPN would make any difference for that. I don’t actually care about my Google Apps account, but I need to be able to read my Igalia mail in Evolution. (Note: My actual IP seems to leak in my email headers, but I don’t care. My name is on my emails anyway. I just care that it works.)

On Private Internet Access

I’m soon going to be moving to Charter Communications territory, but I don’t trust Charter and don’t want it to keep records of all the websites that I visit.  The natural solution is to use a VPN, and the natural first choice is Private Internet Access, since it’s a huge financial supporter of GNOME, and I haven’t heard anybody complain about problems with using it. This will be a short review of my experience.

The service is not free. That’s actually good: it means I’m the customer, not the product. Cost is $40 per year if you pay a year in advance, but you should probably start with the $7/month plan until you’re sure you’re happy with the service and will be keeping it long-term. Anyway, this is a pretty reasonable price that I’m happy to pay.

The website is fairly good. It makes it easy to buy or discontinue service, so there are no pricing surprises, and there’s a pretty good library of support documentation. Unfortunately some of the claims on the website seem to be — arguably — borderline deceptive. A VPN service provides excellent anonymity against your ISP, but relying on a VPN would be a pretty bad idea if your adversary is the government (it can perform a traffic correlation attack) or advertising companies (they know your screen resolution, the performance characteristics of your graphics card, and until recently the rate your battery drains…). But my adversary is going to be Charter Communications, so a VPN is the perfect solution for me. If you need real anonymity, you absolutely must use the Tor Browser Bundle, but that’s going to make your life harder, and I don’t want my life to be harder, so I’ll stick with a VPN.

Private Internet Access provides an Ubuntu app, but I’m going to ignore that because (a) I use Fedora, not Ubuntu, and (b) why on Earth would you want a separate desktop app for your VPN when OpenVPN integration is already built-in on Ubuntu and all modern Linux desktops? Unfortunately the documentation provided by Private Internet Access is not really sufficient — they have a script to set it up automatically, but it’s really designed for Ubuntu and doesn’t work on Fedora — so configuration was slightly challenging.  I wound up following instructions on some third-party website, which I have long since forgotten. There are many third-party resources for how to configure PIA on Linux, which you might think is good but actually indicates a problem with the official documentation in my opinion. So there is some room for improvement here. PIA should ditch the pointless desktop app and improve its documentation for configuring OpenVPN via NetworkManager. (Update: After publishing this post, I discovered this article. Seems the installation script now supports for Fedora/RHEL and Arch Linux. So my claim that it only works on Ubuntu is outdated.) But anyway, once you get it configured properly with NetworkManager, it works: no need to install anything (besides the OpenVPN certificate, of course).

Well, it mostly works. Now, I have two main requirements to ensure that Charter can’t keep records of the websites I’m visiting:

  • NetworkManager must autoconnect to the VPN, so I don’t have to do it manually.
  • NetworkManager must reconnect to the VPN service if connection drops, and must never send any data if the VPN is off.

The first requirement was hard to solve, and I still don’t have it working perfectly. There is no GUI configuration option for this in gnome-control-center, but I eventually found it in nm-connection-editor: you have to edit your normal non-VPN connection, which has a preference to select a VPN to connect to automatically. So we should improve that in gnome-control-center. Unfortunately, it doesn’t work at all the first time your computer connects to the internet after it’s booted. Each time I boot my computer, I’m greeted with a Connection Failed notification on the login screen. This is probably a NetworkManager bug. Anyway, after logging in, I just have to manually connect once, then it works.

As for the next requirement, I’ve given up. My PIA connection is routinely lost about once every 30-45 minutes, usually when watching YouTube or otherwise using a lot of data. This is most likely a problem with PIA’s service, but I don’t know that: it could just as well be my current ISP cutting the connection, or maybe even some client-side NetworkManager bug. Anyway, I could live with brief connection interruptions, but when this happens, I lose connection entirely for about a minute — too long — and then the VPN times out and NetworkManager switches back to sending all the data outside the VPN. That’s totally unacceptable. To be clear, sending data outside the VPN is surely a NetworkManager problem, not a PIA problem, but it needs to be fixed for me to be comfortable using PIA. I see some discussion about that on this third-party GitHub issue, but the “solution” there is to stop using NetworkManager, which I’m not going to do. This is probably one of the reasons why PIA provides a desktop app — I think the PIA app doesn’t suffer from this issue? — but like I said, I’m not going to use a third-party OpenVPN app instead of the undoubtedly-nicer support that’s built in to GNOME.

Another problem is that I can’t connect to Freenode when I’m using the VPN. GIMPNet works fine, so it’s not a problem with IRC in general: Freenode is specifically blocking Private Internet Access users. This seems very strange, since Freenode has a bunch of prominent advertising for PIA all over its website. I could understand blocking PIA if there are too many users abusing it, but not if you’re going to simultaneously advertise it.

I also cannot access Igalia’s SIP service when using PIA. I need that too, but that’s probably something we have to fix on our end.

So I’m not sure what to do now. We have two NetworkManager bugs and a problem with Freenode. Eventually I’ll drop Empathy in favor of Matrix or some other IRC client where registering with NickServ is not a terrible mistake (presumably they’re only blocking unregistered users?), so the Freenode issue seems less-important. I think I’d be willing to just stop visiting Freenode if required to use PIA, anyway. But those NetworkManager issues are blockers to me. With those unfixed, I’m not sure if I’m going to renew my PIA subscription or not. I would definitely renew if someone were to fix those two issues. The ideal solution would be for PIA to adopt NetworkManager’s OpenVPN plugin and ensure it gets cared for, but if not, maybe someone else will fix it?

Update: See part two for how to solve some of these problems.