How to waste an evening debugging the internals of glib’s qsort implementation

Working on a personal itch-scratching project, I found myself wanting a sortable treeview. So I stuffed my treemodel into a TreeModelSort and set up my sort functions, and everything was happy. Except, when running my application, I kept getting strange segfaults within glib’s qsort implementation. I had set the following function as my sort_func (yes, it’s C++, but it should be fairly self-explanatory):

static int
compare_foo (const Gtk::TreeModel::iterator& a, const Gtk::TreeModel::iterator& b)
{
    std::tr1::shared_ptr<Foo> l1, l2;
    l1 = a->get_value (COLUMN_FOO);
    l2 = b->get_value (COLUMN_FOO);

    if (!l1)
        return -1;

    if (!l2)
        return 1;

    return l1->get_id () - l2->get_id ();
}

Can you spot the error? Running it under valgrind indicated that I was reading data from *before* the start of the array that was being sorted. In fact, the following lines of code in g_qsort_with_data() were causing tmp_ptr to be decreased back past the start of the array:

        while ((*compare_func) ((void *) run_ptr, (void *) tmp_ptr, user_data) < 0)
          tmp_ptr -= size;

Hmm. Why is there no guard to prevent tmp_ptr from being decreased past the start? Then I noticed the following comment up a few lines:

    /* Find smallest element in first threshold and place it at the
       array's beginning.  This is the smallest array element,
       and the operation speeds up insertion sort's inner loop. */

So at this point, the code is assuming that the very first element of the array is the smallest, so compare_func should always return >= 0 when it reaches the start of the array. Aha! So the problem is my compare_func. It turns out I forgot to handle the single case of both values being NULL.

    l1 = a->get_value (COLUMN_FOO);
    l2 = b->get_value (COLUMN_FOO);
+
+   if (!l1 && !l2)
+       return 0;

    if (!l1)
        return -1;

So there’s not really a big lesson to be learned here, but if you ever hit strange segfaults deep inside glib’s qsort while sorting a treemodel, do yourself a favor and double-check that your sort function is sane first. Also, do yourself a favor and run it under valgrind as soon as you get a strange segfault. Knowing the exact point of the invalid read is infinitely more helpful than waiting until that invalid read causes a segfault.

7 Comments

  1. Anders Brander
    Posted December 8, 2009 at 6:38 am | Permalink | Reply

    Thank you! This has haunted me for 3 years. Damned. That was a quick fix.

  2. Posted December 8, 2009 at 8:42 am | Permalink | Reply

    Wouldn’t it be even nicer to just do:

    if (l1 == l2) return 0;

    You’ll save yourself resolving some pointers and calling some functions both in the identity case (which is a superset of two NULLs).

  3. Posted December 8, 2009 at 8:44 am | Permalink | Reply

    Stupid touchpad, I meant “both resolving and …” ;)

  4. Posted December 8, 2009 at 4:07 pm | Permalink | Reply

    Sure, that might be a nicer solution. But that wasn’t really the point of the post ;)

  5. Posted December 9, 2009 at 2:08 am | Permalink | Reply

    When you hit this issue once, you never forget again. In general you MUST ensure the following properties on your cmp function:

    1) cmp(a,b)==-cmp(b,a)
    2) cmp(a,a)==0
    3) cmp(a,b)==0 && cmp(b,c)==0 => cmp(a,c)==0
    4) cmp(a,b)<0 && cmp(b,c) cmp(a,c)<0

    behdad

  6. Posted December 9, 2009 at 3:16 am | Permalink | Reply

    Thanks for the helpful list Behdad. I think something may have gotten eaten on #4 above due to the angle brackets ?

  7. Posted December 9, 2009 at 4:53 am | Permalink | Reply

    You’re right. Lets try again:

    4) cmp(a,b)<0 && cmp(b,c)<0 => cmp(a,c)<0