The art of decoding backtraces without debug info

I’ve been debugging some Nautilus crashers today. It involved decoding
backtraces, and since this is a useful thing to be able to do, I
decided to write it up:

Many bugs that get reported contain backtraces, mostly thanks to
bug-buddy. However, many of these reports were made on a system where
the programs and libraries involved didn’t have debug
information. Having the reporter retry with a build that has debug
info (manually built or with debuginfo packages) helps tremendously
with debugging the problem.

However, it’s often hard to get this, as bug-buddy reports are rarely
followed up by the reporter, and the bug might be hard to
reproduce. Thus, it’s important to learn to read backtraces without
debug info. Such backtraces have several issues:

  • They contain no line number information, so you don’t know where in
    a function something happened.

  • You cannot see the values of arguments and local variables.

  • You cannot trust the function names given in the backtrace, since
    the debugger doesn’t know about static functions.

The first two issues you just have to accept, as there is no way to
extract such information. However, the third issue can often be worked
around. This means you can get a mostly accurate trace of what
happened before the crash, which can help you figure out the
problem.

To decode such a backtrace you have to know how the debugger generates
the backtrace. The debugger locates the active stack frame on the top
of the stack by looking at a register. Each such frame contains a
pointer to the invoking frame, plus the address where execution should
continue when that frame returns. Using these addresses, plus the current
instruction pointer, the debugger can figure out which function was
executing. There are two problems though:

  • If the last thing function foo() does is call function bar() and
    return its return value (or bar() returns void), the compiler can
    optimize this tail call so that the return from bar() immediately
    returns to the function that called foo(). This means functions like
    foo() will not be visible in the backtrace.

  • The way gdb figures out what function is executing is by looking at
    the program/library symbol tables, combined with knowledge about where
    in memory the code was loaded. The last function symbol before the
    executing address is selected. However, in our case the static
    functions are not in the symbol table, so the result is the nearest
    non-static function before the actual function.
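
The symbol lookup in the second point can be sketched in a few lines
of Python. This is only an illustration of the idea, not how gdb is
actually implemented: the addresses, the function names (public_a,
static_helper, static_callback, public_b) and the resolve() helper are
all made up for the example.

```python
import bisect


def resolve(address, symbols):
    """Return the name of the last symbol that starts at or before address.

    symbols is a list of (start_address, name) pairs sorted by address.
    Returns None if the address lies before the first symbol.
    """
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    return symbols[index][1] if index >= 0 else None


# Hypothetical code layout: (start address, name, is_static)
layout = [
    (0x1000, "public_a", False),
    (0x1040, "static_helper", True),
    (0x1090, "static_callback", True),
    (0x10F0, "public_b", False),
]

# With debug info the debugger knows about every function.
full = sorted((start, name) for start, name, _ in layout)
# Without it, only the non-static functions are in the symbol table.
stripped = sorted((start, name) for start, name, static in layout if not static)

pc = 0x10A4  # an address inside static_callback
print(resolve(pc, full))      # static_callback
print(resolve(pc, stripped))  # public_a
```

With the static functions missing from the table, the address inside
static_callback is attributed to public_a, the nearest non-static
function before it.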

Armed with this knowledge and the code for the application you can often
figure out what functions were actually called. It’s important that the
code you look at is about the same version as the reporter’s, since
changes to the code affect the result you get.

As an example, let me take bug 302096, a nautilus crasher bug that was
recently reported. There are multiple duplicates, all without debug
info, and with very vague reports of how this actually happened.

Here is the backtrace from the bug:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7c0148b in __waitpid_nocancel () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb7dddd97 in libgnomeui_module_info_get () from /usr/lib/libgnomeui-2.so.0
#3  <signal handler called>
#4  0xb77a2a8f in g_type_check_instance_is_a () from /usr/lib/libgobject-2.0.so.0
#5  0x08088666 in nautilus_window_open_location_full ()
#6  0xb7fb9785 in nautilus_window_info_open_location () from /usr/lib/libnautilus-private.so.2
#7  0x08092423 in fm_directory_view_confirm_multiple_windows ()
#8  0x0809db34 in fm_directory_view_notify_selection_changed ()
#9  0xb7f6fcbb in nautilus_directory_add_file_monitors () from /usr/lib/libnautilus-private.so.2
#10 0xb7f70c27 in nautilus_async_destroying_file () from /usr/lib/libnautilus-private.so.2
#11 0xb7f7347f in nautilus_directory_async_state_changed () from /usr/lib/libnautilus-private.so.2
#12 0xb7f7243a in nautilus_directory_force_reload_internal () from /usr/lib/libnautilus-private.so.2
#13 0xb774fb67 in _gnome_vfs_job_complete () from /usr/lib/libgnomevfs-2.so.0
#14 0xb77500a2 in _gnome_vfs_job_complete () from /usr/lib/libgnomevfs-2.so.0
#15 0xb75d1a03 in g_child_watch_add () from /usr/lib/libglib-2.0.so.0
#16 0xb75ced0f in g_main_depth () from /usr/lib/libglib-2.0.so.0
#17 0xb75cfcb5 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#18 0xb75cffd7 in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
#19 0xb75d051e in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#20 0xb79bd10f in gtk_main () from /usr/lib/libgtk-x11-2.0.so.0
#21 0x0807878a in main ()

Frames 0-3 are just the crash and bug-buddy handling it, so we ignore
those. Frame 4 tells us the crash was likely a NULL pointer or an
invalid pointer passed to some gobject type check. The interesting
parts start at frame 5.

Looking at the code we see that both
nautilus_window_open_location_full() and
nautilus_window_info_open_location() are followed immediately by
non-static functions. Also, nautilus_window_info_open_location() calls
nautilus_window_open_location_full(), so these are probably right.
However, fm_directory_view_confirm_multiple_windows() is followed by
multiple static functions, and it doesn’t call
nautilus_window_info_open_location(). We then search for a
nautilus_window_info_open_location() call below it, but before the next
non-static function. Fortunately we only get one hit, open_location().
Doing the same with #8, fm_directory_view_notify_selection_changed(),
shows that this must be activate_callback().
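
The search I just described can be written down as a small Python
sketch. It assumes the compiler emitted the functions in the same
order as they appear in the source file, which is not guaranteed. The
candidates() helper and the file layout below are hypothetical (only
the three function names taken from the text are real), so treat it as
a model of the reasoning rather than a tool.

```python
def candidates(functions, reported, callee):
    """Find functions that may hide behind a reported frame name.

    functions is a list of (name, is_static, callees) in source order.
    The search starts at the reported function and stops at the next
    non-static function, since the debugger would have named that one.
    """
    names = [name for name, _, _ in functions]
    start = names.index(reported)
    hits = []
    for offset, (name, is_static, callees) in enumerate(functions[start:]):
        if offset > 0 and not is_static:
            break  # reached the next symbol the debugger knows about
        if callee in callees:
            hits.append(name)
    return hits


# Hypothetical layout of the source file around frame 7.
source_order = [
    ("fm_directory_view_confirm_multiple_windows", False, set()),
    ("hypothetical_static_helper", True, set()),
    ("open_location", True, {"nautilus_window_info_open_location"}),
    ("hypothetical_next_public_function", False,
     {"nautilus_window_info_open_location"}),
]

print(candidates(source_order,
                 "fm_directory_view_confirm_multiple_windows",
                 "nautilus_window_info_open_location"))
# ['open_location']
```

The reported function itself is checked too, because the name in the
backtrace may simply be right. More than one hit means you have to
fall back on guessing and knowledge of the code.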

#9 is a bit trickier since activate_callback() is a callback
function and won’t be called immediately. However, it’s only used in
one place, where it’s passed as a callback to
nautilus_file_call_when_ready(). So, we start from
nautilus_directory_add_file_monitors() and look for a callback that
would result from such a call. There are not many functions to choose
from, and obviously the call must be from ready_callback_call().
#10 is found out to be call_ready_callbacks() by a simple search. #11
has no non-statics, so it must be right.

#12 is harder. It could be right, since
nautilus_directory_force_reload_internal() does call
nautilus_directory_async_state_changed(), but there are no fewer than
11 other such calls before the next non-static function. Here we have
to use our knowledge of the code, and the other information that the
bug reporters gave about what they were doing at the time of the
crash. One way forward is to just guess which call was right and work
from that. If you then get a backtrace that makes no sense you know
you picked wrong.

In the bug you can see that I initially guessed that #12 was
nautilus_directory_force_reload_internal() (although I now believe
this to be wrong). #13 is in gnome-vfs, which doesn’t call
nautilus_directory_force_reload_internal(), but there could be one or
more hidden stack frames here, so I grepped the code for calls and
found nautilus-vfs-directory.c::vfs_force_reload() as the only
caller. This function ends with a call to the other function and
returns void, so it’s a likely candidate for the return optimization,
meaning it makes sense that it’s not visible in the backtrace. I
continued a bit after that, but wasn’t able to follow the trace very
long, since there were too many possibilities.
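
To see why a function like vfs_force_reload() can vanish, here is a
toy Python model of a call stack where a tail call reuses the caller’s
frame. It is a simulation of the effect only, not of what the compiler
really emits. The backtrace() helper and the outermost caller name are
hypothetical, and the call chain reflects my initial guess for #12,
which may well be wrong.

```python
def backtrace(calls):
    """Return the visible frames, innermost first, like a debugger would.

    calls is a list of (function_name, entered_by_tail_call) pairs in
    call order, outermost first. A function entered by a tail call
    replaces its caller's frame instead of getting a new one.
    """
    stack = []
    for name, entered_by_tail_call in calls:
        if entered_by_tail_call and stack:
            stack.pop()  # the caller's frame is reused by the callee
        stack.append(name)
    return list(reversed(stack))


calls = [
    ("hypothetical_gnome_vfs_caller", False),
    ("vfs_force_reload", False),
    # Last statement of vfs_force_reload(), so assumed to be a tail call.
    ("nautilus_directory_force_reload_internal", True),
    ("nautilus_directory_async_state_changed", False),
]

for depth, name in enumerate(backtrace(calls)):
    print(f"#{depth} {name}")
```

The printed trace goes straight from
nautilus_directory_force_reload_internal() to the gnome-vfs caller,
with no frame for vfs_force_reload() in between.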

When you’ve finally decoded the backtrace, or at least parts of it, you
need to figure out how this set of calls could have resulted in a
crash. For that, you’re on your own. But at least now you have a bit
more information that can help you.