September 2021 – Michael Catanzaro's Blog

Hello Linux users! Help developers help you: include a quality backtrace taken with gdb each and every time you create an issue report for a crash. If you don’t, most developers will request that you provide a backtrace, then ignore your issue until you manage to figure out how to do so. Save us the trouble and just provide the backtrace with your initial report, so everything goes smoother. (Backtraces are often called “stack traces.” They are the same thing.)

Don’t just copy the lower-quality backtrace you see in your system journal into your issue report. That’s a lot better than nothing, but if you really want the crash to be fixed, you should provide the developers with a higher-quality backtrace from gdb. Don’t know how to get a quality backtrace with gdb? Read on.

(Note: this blog post is occasionally updated to maintain relevance and remove historical information. Last update: October 2023)

Modern Crash Reporting

Here are instructions for getting a quality backtrace for a crashing process on Fedora 35, or any other Linux-based OS that enables coredumpctl and debuginfod:

$ coredumpctl gdb
(gdb) bt full

Enter ‘c’ (continue) when required. Enter ‘y’ when prompted to enable debuginfod. When it’s done printing, press ‘q’ to quit. That’s it! That’s all you need to know. You’re done. Two points of note:

When a process crashes, a core dump is caught by systemd-coredump and stored for future use. The coredumpctl gdb command opens the most recent core dump in gdb. systemd-coredump has been enabled by default in Fedora since Fedora 26.
After opening the core dump, gdb uses debuginfod to automatically download all required debuginfo packages, ensuring the generated backtrace is useful. debuginfod has been enabled by default in Fedora since Fedora 35.

Quality Linux operating systems ought to configure both debuginfod and systemd-coredump for you, so that they are running out-of-the-box. If you’re missing debuginfod or systemd-coredump, then read on to learn how to take a backtrace without these tools. It will be more complicated, of course.

The steps above will not work if the crashing application uses Flatpak. If you’re trying to take a backtrace for an application that uses Flatpak and already have systemd-coredump working, then go ahead and skip ahead to the section below on Flatpak. If you don’t have systemd-coredump working yet, read on.

systemd-coredump

If your operating system enables systemd-coredump by default, then congratulations! This makes reporting crashes much easier because you can easily retrieve a core dump for any recent crash using the coredumpctl command. For example, coredumpctl alone will list all available core dumps. coredumpctl gdb will open the core dump of the most recent crash in gdb. coredumpctl gdb 1234 will open the core dump corresponding to the most recent crash of a process with pid 1234. It doesn’t get easier than this.

Core dumps are stored under /var/lib/systemd/coredump. systemd-coredump will automatically delete core dumps that exceed configurable size limits (2 GB by default). It also deletes core dumps if your free disk space falls below a configurable threshold (15% free by default). Additionally, systemd-tmpfiles will delete core dumps automatically after some time has passed (three days by default). This ensures your disk doesn’t fill up with old core dumps. Although most of these settings seem good to me, the default 2 GB size limit is way too low in my opinion, as it causes systemd to immediately discard crashes of any application that uses WebKit. I recommend raising this limit to 20 GB by creating an /etc/systemd/coredump.conf.d/50-coredump.conf drop-in containing the following:

[Coredump]
ProcessSizeMax=20G
ExternalSizeMax=20G

The other settings are likely sufficient to prevent your disk from filling up with core dumps.

Sadly, although systemd-coredump has been around for a good while now and many Linux operating systems have it enabled by default, many still do not. Most notably, the Debian and Ubuntu ecosystems are still not yet on board. To check if systemd-coredump is enabled on your system:

$ cat /proc/sys/kernel/core_pattern

If you see systemd-coredump, then you’re good.

To enable it in Debian or Ubuntu, just install it:

# apt install systemd-coredump

Ubuntu users, note this will cause apport to be uninstalled, since it is currently incompatible. Also note that I switched from $ (which indicates a normal prompt) to # (which indicates a root prompt).

In other operating systems, you may have to manually enable it:

# echo "kernel.core_pattern=|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" > /etc/sysctl.d/50-coredump.conf
# /usr/lib/systemd/systemd-sysctl --prefix kernel.core_pattern

Note the exact core pattern to use changes occasionally in newer versions of systemd, so these instructions may not work everywhere.

Detour: Manual Core Dump Handling

If you don’t want to enable systemd-coredump, life is harder and you should probably reconsider, but it’s still possible to debug most crashes. First, enable core dump creation by removing the default 0-byte size limit on core files:

$ ulimit -c unlimited

This change is temporary and only affects the current instance of your shell. For example, if you open a new tab in your terminal, you will need to set the ulimit again in the new tab.

Next, run your program in the terminal and try to make it crash. A core file will be generated in the current directory. Open it by starting the program that crashed in gdb and passing the filename of the core file that was created. For example:

$ gdb gnome-chess ./core

This is downright primitive, though:

You’re going to have a hard time getting backtraces for services that are crashing, for starters. If starting the service normally, how do you set the ulimit? I’m sure there’s a way to do it, but I don’t know how! It’s probably easier to start the service manually, but then what command line flags are needed to properly do so? It will be different for each service, and you have to figure this all out for yourself.
Special situations become very difficult. For example, if a service is crashing only when run early during boot, or only during an initial setup session, you are going to have an especially hard time.
If you don’t know how to reproduce a crash that occurs only rarely, it’s inevitably going to crash when you’re not prepared to manually catch the core dump. Sadly, not all crashes will occur on demand when you happen to be running the software from a terminal with the right ulimit configured.
Lastly, you have to remember to delete that core file when you’re done, because otherwise it will take up space on your disk space until you do. You’ll probably notice if you leave core files scattered in your home directory, but you might not notice if you’re working someplace else.

Seriously, just enable systemd-coredump. It solves all of these problems and guarantees you will always have easy access to a core dump when something crashes, even for crashes that occur only rarely.

Debuginfo Installation

Now that we know how to open a core dump in gdb, let’s talk about debuginfo. When you don’t have the right debuginfo packages installed, the backtrace generated by gdb will be low-quality. Almost all Linux software developers deal with low-quality backtraces on a regular basis, because most users are not very good at installing debuginfo. Again, if you’re using Fedora 35 or newer, you don’t have to worry about this anymore because debuginfod will take care of everything for you. I would be thrilled if other Linux operating systems would quickly adopt debuginfod so we can put the era of low-quality crash reports behind us. But if you’re using an operating system that does not provide a debuginfod server, you’ll need to learn how to install debuginfo manually.

As an example, I decided to force gnome-chess to crash using the command killall -SEGV gnome-chess, then I ran coredumpctl gdb to open the resulting core dump in gdb. After a bunch of spam, I saw this:

Missing separate debuginfos, use: dnf debuginfo-install gnome-chess-40.1-1.fc34.x86_64
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/bin/gnome-chess --gapplication-service'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
29  return SYSCALL_CANCEL (poll, fds, nfds, timeout);
[Current thread is 1 (Thread 0x7fa23ca0cd00 (LWP 140177))]
(gdb)

If you are using Fedora, RHEL, or related operating systems, the line “missing separate debuginfos” is a good hint that debuginfo is missing. It even tells you exactly which dnf debuginfo-install command to run to remedy this problem! But this is a Fedora ecosystem feature, and you won’t see this on most other operating systems. Usually, you’ll need to manually locate the right debuginfo packages to install. Debian and Ubuntu users can do this by searching for and installing -dbg or -dbgsym packages until each frame in the backtrace looks good. You’ll just have to manually guess the names of which debuginfo packages you need to install based on the names of the libraries in the backtrace. Look here for instructions for popular operating systems.

How do you know when the backtrace looks good? When each frame has file names, line numbers, function parameters, and local variables! Here is an example of a bad backtrace, if I continue the gnome-chess example above without properly installing the required debuginfo:

(gdb) bt full
#0 0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1 0x00007fa23eee648c in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
#2 0x00007fa23ee8fc03 in g_main_context_iteration () at /lib64/libglib-2.0.so.0
#3 0x00007fa23e4b599d in g_application_run () at /lib64/libgio-2.0.so.0
#4 0x00005636dd7b79a2 in chess_application_main ()
#5 0x00007fa23d7e7b75 in __libc_start_main (main=0x5636dd7aaa50 <main>, argc=2, argv=0x7fff827b6438, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff827b6428)
    at ../csu/libc-start.c:332
        self = <optimized out>
        result = <optimized out>
        unwind_buf = 
              {cancel_jmp_buf = {{jmp_buf = {94793644186304, 829313697107602221, 94793644026480, 0, 0, 0, -829413713854928083, -808912263273321683}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x2, 0x7fff827b6438}, data = {prev = 0x0, cleanup = 0x0, canceltype = 2}}}
        not_first_call = <optimized out>
#6 0x00005636dd7aaa9e in _start ()

This backtrace has six frames, which shows where the code was during program execution when the crash occurred. You can see line numbers for frame #0 (poll.c:29) and #5 (libc-start.c:332), and these frames also show the values of function parameters and variables on the stack, which are often useful for figuring out what went wrong. These frames have good debuginfo because I already had debuginfo installed for glibc. But frames #1 through #4 do not look so useful, showing only function names and the library and nothing else. This is because I’m using Fedora 34 rather than Fedora 35, so I don’t have debuginfod yet, and I did not install proper debuginfo for libgio, libglib, and gnome-chess. (The function names are actually only there because programs in Fedora include some limited debuginfo by default. In many operating systems, you will see ??? instead of function names.) A developer looking at this backtrace is not going to know what went wrong.

Now, let’s run the recommended debuginfo-install command:

# dnf debuginfo-install gnome-chess-40.1-1.fc34.x86_64

When the command finishes, we’ll start gdb again, using coredumpctl gdb just like before. This time, we see this:

Missing separate debuginfos, use: dnf debuginfo-install avahi-libs-0.8-14.fc34.x86_64 colord-libs-1.4.5-2.fc34.x86_64 cups-libs-2.3.3op2-7.fc34.x86_64 fontconfig-2.13.94-2.fc34.x86_64 glib2-2.68.4-1.fc34.x86_64 graphene-1.10.6-2.fc34.x86_64 gstreamer1-1.19.1-2.1.18.4.fc34.x86_64 gstreamer1-plugins-bad-free-1.19.1-3.1.18.4.fc34.x86_64 gstreamer1-plugins-base-1.19.1-2.1.18.4.fc34.x86_64 gtk4-4.2.1-1.fc34.x86_64 json-glib-1.6.6-1.fc34.x86_64 krb5-libs-1.19.2-2.fc34.x86_64 libX11-1.7.2-3.fc34.x86_64 libX11-xcb-1.7.2-3.fc34.x86_64 libXfixes-6.0.0-1.fc34.x86_64 libdrm-2.4.107-1.fc34.x86_64 libedit-3.1-38.20210714cvs.fc34.x86_64 libepoxy-1.5.9-1.fc34.x86_64 libgcc-11.2.1-1.fc34.x86_64 libidn2-2.3.2-1.fc34.x86_64 librsvg2-2.50.7-1.fc34.x86_64 libstdc++-11.2.1-1.fc34.x86_64 libxcrypt-4.4.25-1.fc34.x86_64 llvm-libs-12.0.1-1.fc34.x86_64 mesa-dri-drivers-21.1.8-1.fc34.x86_64 mesa-libEGL-21.1.8-1.fc34.x86_64 mesa-libgbm-21.1.8-1.fc34.x86_64 mesa-libglapi-21.1.8-1.fc34.x86_64 nettle-3.7.3-1.fc34.x86_64 openldap-2.4.57-5.fc34.x86_64 openssl-libs-1.1.1l-1.fc34.x86_64 pango-1.48.9-2.fc34.x86_64

Yup, Fedora ecosystem users will need to run dnf debuginfo-install twice to install everything required, because gdb doesn’t list all required packages until the second time. Next, we’ll run coredumpctl gdb one last time. There will usually be a few more debuginfo packages that are still missing because they’re not available in the Fedora repositories, but now you’ll probably have enough to get a quality backtrace:

(gdb) bt full
#0  0x00007fa23d8b55bf in __GI___poll (fds=0x5636deb06930, nfds=2, timeout=2830)
    at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1  0x00007fa23eee648c in g_main_context_poll
    (priority=, n_fds=2, fds=0x5636deb06930, timeout=, context=0x5636de7b24a0)
    at ../glib/gmain.c:4434
        ret = 
        errsv = 
        poll_func = 0x7fa23ee97c90 
        max_priority = 2147483647
        timeout = 2830
        some_ready = 
        nfds = 2
        allocated_nfds = 2
        fds = 0x5636deb06930
        begin_time_nsec = 30619110638882
#2  g_main_context_iterate.constprop.0
    (context=context@entry=0x5636de7b24a0, block=block@entry=1, dispatch=dispatch@entry=1, self=)
    at ../glib/gmain.c:4126
        max_priority = 2147483647
        timeout = 2830
        some_ready = 
        nfds = 2
        allocated_nfds = 2
        fds = 0x5636deb06930
        begin_time_nsec = 30619110638882
#3  0x00007fa23ee8fc03 in g_main_context_iteration
    (context=context@entry=0x5636de7b24a0, may_block=may_block@entry=1) at ../glib/gmain.c:4196
        retval = 
#4  0x00007fa23e4b599d in g_application_run
    (application=0x5636de7ae260 [ChessApplication], argc=-2105843004, argv=)
    at ../gio/gapplication.c:2560
        arguments = 0x5636de7b2400
        status = 0
        context = 0x5636de7b24a0
        acquired_context = 
        __func__ = "g_application_run"
#5  0x00005636dd7b79a2 in chess_application_main (args=0x7fff827b6438, args_length1=2)
    at src/gnome-chess.p/gnome-chess.c:5623
        _tmp0_ = 0x5636de7ae260 [ChessApplication]
        _tmp1_ = 0x5636de7ae260 [ChessApplication]
        _tmp2_ = 
        result = 0
...

I removed the last two frames because they are triggering a strange WordPress bug, but that’s enough to get the point. It looks much better! Now the developer can see exactly where the program crashed, including filenames, line numbers, and the values of function parameters and variables on the stack. This is as good as a crash report is normally going to get. In this case, it crashed when running poll() because gnome-chess was not actually doing anything at the time of the crash, since we crashed it by manually sending a SIGSEGV signal. Normally the backtrace will look more interesting.

debuginfod for Debian Users

Debian users can use debuginfod, but it has to be enabled manually:

$ DEBUGINFOD_URLS=https://debuginfod.debian.net/ gdb

See here for more information. This requires Debian 11 “bullseye” or newer. If you’re using Ubuntu or other operating systems derived from Debian, you’ll need to wait until a debuginfod server for your operating system is available.

Flatpak

If your application uses Flatpak, you can use the flatpak-coredumpctl script to open core dumps in gdb. For most runtimes, including those distributed by GNOME or Flathub, you will need to manually install (a) the debug extension for your app, (b) the SDK runtime corresponding to the platform runtime that you are using, and (c) the debug extension for the SDK runtime. For example, to install everything required to debug Epiphany 40 from Flathub, you would run:

$ flatpak install org.gnome.Epiphany.Debug//stable
$ flatpak install org.gnome.Sdk//40
$ flatpak install org.gnome.Sdk.Debug//40

(flatpak-coredumpctl will fail to start if you don’t have the correct SDK runtime installed, but it will not fail if you’re missing the debug extensions. You’ll just wind up with a bad backtrace.)

The debug extensions need to exactly match the versions of the app and runtime that crashed, so backtrace generation may be unreliable after you install them for the very first time, because you would have installed the latest versions of the extensions, but your core dump might correspond to an older app or runtime version. If the crash is reproducible, it’s a good idea to run flatpak update after installing to ensure you have the latest version of everything, then reproduce the crash again.

Once your debuginfo is installed, you can open the backtrace in gdb using flatpak-coredumpctl. You just have to tell flatpak-coredumpctl the app ID to use:

$ flatpak-coredumpctl org.gnome.Epiphany

You can pass matches to coredumpctl using -m. For example, to open the core dump corresponding to a crashed process with pid 1234:

$ flatpak-coredumpctl -m 1234 org.gnome.Epiphany

Thibault Saunier wrote flatpak-coredumpctl because I complained about how hard it used to be to debug crashed Flatpak applications. Clearly it is no longer hard. Thanks Thibault!

On newer versions of Debian and Ubuntu, flatpak-coredumpctl is included in the libflatpak-dev subpackage rather than the base flatpak package, so you’ll have to install libflatpak-dev first.

Fedora Flatpaks

Flatpaks distributed by Fedora are different than those distributed by GNOME or by Flathub because they do not have debug extensions. Historically, this has meant that debugging crashes was impractical. The best solution was to give up.

Good news! Fedora’s Flatpaks are compatible with debuginfod, which means debug extensions will no longer be missed. You do still need to manually install the org.fedoraproject.Sdk runtime corresponding to the version of the org.fedoraproject.Platform runtime that the application uses, because this is required for flatpak-coredumpctl to work, but nothing else is required. For example, to get a backtrace for Fedora’s Epiphany Flatpak using a Fedora 35 host system, I ran:

$ flatpak install org.fedoraproject.Sdk//f34
$ flatpak-coredumpctl org.gnome.Epiphany
(gdb) bt full

(The f34 is not a typo. Epiphany currently uses the Fedora 34 runtime regardless of what host system you are using.)

That’s it!

Miscellany

At this point, you should know enough to obtain a high-quality backtrace on most Linux systems. That will usually be all you really need, but it never hurts to know a little more, right?

Alternative Types of Backtraces

At the top of this blog post, I suggested using bt full to take the backtrace because this type of backtrace is the most useful to most developers. But there are other types of backtraces you might occasionally want to collect:

bt on its own without full prints a much shorter backtrace without stack variables or function parameters. This form of the backtrace is more useful for getting a quick feel for where the bug is occurring, because it is much shorter and easier to read than a full backtrace. But because there are no stack variables or function parameters, it might not contain enough information to solve the crash. I sometimes like to paste the first few lines of a bt backtrace directly into an issue report, then submit the bt full version of the backtrace as an attachment, since an entire bt full backtrace can be long and inconvenient if pasted directly into an issue report.
thread apply all bt prints a backtrace for every thread. Normally these backtraces are very long and noisy, so I don’t collect them very often, but when a threadsafety issue is suspected, this form of backtrace will sometimes be required.
thread apply all bt full prints a full backtrace for every thread. This is what automated bug report tools generally collect, because it provides the most information. But these backtraces are usually huge, and this level of detail is rarely needed, so I normally recommend starting with a normal bt full.

If in doubt, just use bt full like I showed at the top of this blog post. Developers will let you know if they want you to provide the backtrace in a different form.

gdb Logging

You can make gdb print your session to a file. For longer backtraces, this may be easier than copying the backtrace from a terminal:

(gdb) set logging enabled

Memory Corruption

While a backtrace taken with gdb is usually enough information for developers to debug crashes, memory corruption is an exception. Memory corruption is the absolute worst. When memory corruption occurs, the code will crash in a location that may be far removed from where the actual bug occurred, rendering gdb backtraces useless for tracking down the bug. As a general rule, if you see a crash inside a memory allocation routine like malloc() or g_slice_alloc(), you probably have memory corruption. If you see magazine_chain_pop_head(), that’s called by g_slice_alloc() and is a sure sign of memory corruption. Similarly, crashes in GTK’s CSS machinery are almost always caused by memory corruption somewhere else.

Memory corruption is generally impossible to debug unless you are able to reproduce the issue under valgrind. valgrind is extremely slow, so it’s impractical to use it on a regular basis, but it will get to the root of the problem where gdb cannot. As a general rule, you want to run valgrind with --track-origins=yes so that it shows you exactly what went wrong:

$ valgrind --track-origins=yes my_app

If you cannot reproduce the issue under valgrind, you’re usually totally out of luck. Memory corruption that only occurs rarely or under unknown conditions will lurk in your code indefinitely and cause occasional crashes that are effectively impossible to fix.

Another good tool for debugging memory corruption is address sanitizer (asan), but this is more complicated to use. Experienced users who are comfortable with rebuilding applications using special compiler flags may find asan very useful. However, because it can be very difficult to use, I recommend sticking with valgrind if you’re just trying to report a bug.

Apport and ABRT

There are two popular downstream bug reporting tools: Ubuntu has Apport, and Fedora has ABRT. These tools are relatively easy to use — no command line knowledge required — and produce quality crash reports. Unfortunately, while the tools are promising, the crash reports go to downstream packagers who are generally either not watching bug reports, or else not interested or capable of fixing upstream software problems. Since downstream reports are very often ignored, it’s better to report crashes directly to upstream if you want your issue to be seen by the right developers and actually fixed. Of course, only report issues upstream if you’re using a recent software version. Fedora and Arch users can pretty much always safely report directly to upstream, as can Ubuntu users who are using the very latest version of Ubuntu. If you are an Ubuntu LTS user, you should stick with reporting issues to downstream only, or at least take the time to verify that the issue still occurs with a more recent software version.

There are a couple more problems with these tools. As previously mentioned, Ubuntu’s apport is incompatible with systemd-coredump. If you’ve read this far, you know you really want systemd-coredump enabled, so I recommend disabling apport until it learns to play ball with systemd-coredump.

The technical design of Fedora’s ABRT is currently better because it actually retrieves your core dumps from systemd-coredump, so you don’t have to choose between one or the other. Unfortunately, ABRT has many serious user experience bugs and warts. I can’t recommend it for this reason, but it if it works well enough for you, it does create some great downstream crash reports. Whether a downstream package maintainer will look at those reports is hit or miss, though.

What is a crash, really?

Most developers consider crashes on Unix systems to be program termination via a Unix signal that triggers creation of a core dump. The most common of these are SIGSEGV (segmentation fault, “invalid memory reference”) or SIBABRT (usually an intentional crash due to an assertion failure). Less-common signals are SIGBUS (“bad memory access”) or SIGILL (“illegal instruction”). Sandboxed applications might occasionally see SIGSYS (“bad system call”). See the manpage signal(7) for a full list. These are cases where you can get a backtrace to help with tracking down the issues.

What is not a crash? If your application is hanging or just not behaving properly, that is not a crash. If your application is killed using SIGTERM or SIGKILL — this can happen when systemd-oomd determines you are low on memory, or when a service is taking too long to stop — this is also not a crash in the usual sense of the word, because you’re not going to be able to get a backtrace for it. If a website is slow or unavailable, the news might say that it “crashed,” but it’s obviously not the same thing as what we’re talking about here. The techniques in this blog post are no use for these sorts of “crashes.”

Conclusion

If you have systemd-coredump enabled and debuginfod installed and working, most crash reports will be simple. Memory corruption is a frustrating exception. Encourage your operating system to enable systemd-coredump and debuginfod if it doesn’t already. Happy crash reporting!

Month: September 2021

Creating Quality Backtraces for Crash Reports