The gospel of git: Interactive Rebase

Today I saw this in my scrollback buffer:

<believer1> so, I changed the patch to not break dbus proto, tested it ... problem is,
            that it is followed by the sftp bit ... what is the best practice? 
            cherry-pick to the different branch apply and amend? or reset apply amend?
<believer1> any special tricks?
<believer2> checkout the commit you wanna edit into a different branch, change it, then
            rebase the original branch on top of it
<believer2> i think that's what i always do
<believer1> sounds reasonable

And it was good. These disciples are trying to follow the holy commandments of git, including this important one:

Thou shalt keep a clean version history

However, while they are good of heart, doing the best they can, they are not properly enlightened on the book of rebase in the gospel of git. Fortunately such enlightenment is easily come by with some reading and training.

The first rule when it comes to rebases is:

Don’t fear the rebase

Rebasing is a history-rewriting operation, and most fear of the rebase comes from the fear of losing one's history. This is only natural; after all, those who lose their history are doomed to rewrite it. However, once a change is committed git practically never loses it: commits that are no longer on any branch are only pruned by garbage collection after their reflog entries have expired, which by default takes weeks (and I recommend never forcing this with a manual gc). If you rewrite the history of a branch and the branch now points to this new history, the old commits and history are still stored in your .git directory. All you need to know is the sha1 id of the last commit before your rewrite and you can always access it. For instance, if you're on a branch that you completely messed up, you can get everything back using:

git reset --hard <old-sha1>

This will reset the current branch so that it points to the old history.

If you're uncertain of how rewriting works and fear something will go wrong, just note down the sha1 of the current commit before you start; you can easily get it using e.g. gitk (or “git rev-parse HEAD” if you want to be hardcore). And if you forget to do this, nothing is lost: you can use “git reflog” to find the old history.

So, having overcome our fear of the rebase, how does one actually use interactive rebasing in this case? The background is that you have a bunch of local commits on your branch, each implementing a small independent change (so-called microcommits), but before this is applied or merged into a public repo you discover a bug in one of the changes in the middle of the series. It would be nice if such a bug was never visible in the final version history. What to do?

Typically what I do is commit the fix as usual, with a short commit message like “Fixed up the foobar change”. Then start the interactive rebase using “git rebase -i origin”. This will bring up your chosen editor with a bunch of lines in it, looking something like this:

pick fa1afe1 Implement the foobar function
pick 124efd3 Use the foobar function
pick cafe123 Fixed up the foobar change

Each line corresponds to one commit on your branch that will be applied during the rebase. You can change the order of the lines to change the history order, or you can delete lines to drop certain commits. What we want to do here is move the fixup to just after the right commit and then change “pick” to “squash” (or just “s” for short) which will merge the two changes into one. So change it to this, save and exit:

pick fa1afe1 Implement the foobar function
s cafe123 Fixed up the foobar change
pick 124efd3 Use the foobar function

This will bring up an editor with the commit messages of the two squashed commits, which you edit (often just removing the second one), then save and exit. Then the rest of the rebase is done and we end up with a clean history.
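The todo-list edit itself is mechanical, so here is a minimal Python sketch of it, applied to the todo lines from the example above. The function names squash_into and _sha are hypothetical helpers for illustration, not part of git, and the sketch assumes the commit ids you pass match the abbreviated ids exactly as they appear in the todo list.

```python
def _sha(line):
    """Return the commit id of a todo line, or None for blank/comment lines."""
    parts = line.split()
    if len(parts) < 2 or line.lstrip().startswith("#"):
        return None
    return parts[1]


def squash_into(todo_lines, fixup_sha, target_sha):
    """Move commit `fixup_sha` directly after `target_sha` and turn it into a
    "squash". Blank and comment lines are dropped; other lines keep their order."""
    lines = [line for line in todo_lines if _sha(line) is not None]
    fixups = [line for line in lines if _sha(line) == fixup_sha]
    if not fixups:
        raise ValueError("fixup commit %s not in todo list" % fixup_sha)
    lines.remove(fixups[0])
    # Swap the command word ("pick") for "squash"; keep the sha and subject.
    squash_line = "squash " + fixups[0].split(None, 1)[1]
    for i, line in enumerate(lines):
        if _sha(line) == target_sha:
            lines.insert(i + 1, squash_line)
            return lines
    raise ValueError("target commit %s not in todo list" % target_sha)
```

Newer versions of git can run such a script instead of your editor via the GIT_SEQUENCE_EDITOR environment variable, in which case a small wrapper would read the todo file whose path git passes as the last argument, apply the function and write the result back.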

The true disciple of git always uses a local branch for each feature he works on, and when it is done, it is merged into master and then pushed. I must confess that I sometimes sin here, doing minor feature development directly on master, rebasing to clean up and then just pushing that. Sometimes when I do this it turns out that the feature was more complicated than I expected and I didn't finish it before I had to work on something else. If you end up in this situation, it's very nice to know that you can create a branch afterwards by just doing:

git branch <branchname>
git reset --hard origin/master

This (if run on the master branch with “origin” being the upstream repo) will create a new branch with the given name that contains your local changes, then change the master branch so that it points to what upstream points to, allowing you to do other work and return to the branch later. Note that reset --hard also throws away any uncommitted changes in your working tree, so commit or stash them first.

gdb over irc

Python scripting in gdb 7.0 kicks total ass. Today I played around with one cool use case for it: remote debugging.

I don't know how many times I have tried to help someone debug over irc, with the person cutting and pasting gdb commands and results into xchat. Well, no more! Today I hacked up a gdb python script and an xchat python script that let you export a gdb session over irc:

From gdb:

(gdb) source 
gdb irc server, Waiting for connection on /tmp/gdb-socket-500

In xchat:

> /load ~/
Loaded xchat-gdb
> /join #test
> /gdb connect
connecting to gdb...

Logging in as another user on irc:

> /join #test
> alex: gdb print "yey"
<alex> $6 = "yey"
> alex: gdb bt 2
<alex> #0  0x00000035010d50d3 in *__GI___poll (fds=<value optimized out>, nfds=<value optimized out>, timeout=-1) at ..
<alex> #1  0x00007ffff52f96cc in g_main_context_poll (n_fds=<value optimized out>, fds=<value optimized out>, priority=<value optimized out>, timeout=<value optimized out>, context=<value optimized out>) at gmain.c:2904

Obviously this is kinda unsafe, as you can do all sorts of things via gdb. It needs a way to limit who can control your gdb instance, and you should only give such permissions to people you trust.
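Such a permission check could look roughly like the following. This is a hypothetical Python sketch, not code from the repo: the function name handle_irc_line, the allowlist of trusted nicks and the refusal message are all illustrative assumptions. It only acts on lines addressed to our nick and starting with "gdb ", like "alex: gdb bt 2". The execute callback is also an assumption; inside gdb it could wrap gdb.execute(cmd, to_string=True), but the to_string argument may not exist in older gdb releases.

```python
def handle_irc_line(sender, text, my_nick, trusted, execute):
    """Return the lines to send back to the channel, or None if the message
    is not a gdb command addressed to us.

    sender:  nick of the user who wrote the line
    trusted: set of nicks allowed to drive gdb (hypothetical allowlist)
    execute: callable running a gdb command and returning its output text
    """
    prefix = my_nick + ": gdb "
    if not text.startswith(prefix):
        return None
    if sender not in trusted:
        # Refuse instead of silently running arbitrary commands.
        return ["%s: sorry, you are not allowed to control this gdb" % sender]
    command = text[len(prefix):].strip()
    if not command:
        return None
    output = execute(command)
    # IRC messages are single lines, so split multi-line gdb output.
    return [line for line in output.splitlines() if line]
```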

Git repo here, have fun with it.

archer gdb macros for glib

The Archer project is working on modernizing gdb. One aspect of this is support for scripting gdb using python. Today I landed some python macros for gdb that make debugger integration with glib/gobject much nicer.

This is best shown by a “screencast” showing the new features. If you’ve ever debugged a glib program, read the below carefully. If not, it won’t make much sense to you, sorry.

(gdb) # Welcome to the glib macros demo
(gdb) # We support pretty printing of lists:
(gdb) l toplevel_list
296    static GtkKeyHash *gtk_window_get_key_hash        (GtkWindow   *window);
297    static void        gtk_window_free_key_hash       (GtkWindow   *window);
298    static void       gtk_window_on_composited_changed (GdkScreen *screen,
299                                 GtkWindow *window);
301    static GSList      *toplevel_list = NULL;
302    static guint        window_signals[LAST_SIGNAL] = { 0 };
303    static GList       *default_icon_list = NULL;
304    static gchar       *default_icon_name = NULL;
305    static guint        default_icon_serial = 0;
(gdb) p toplevel_list
$1 = 0x84d7218 = {0x842f468, 0x842f3b8, 0x842f308, 0x842f258, 0x842f1a8, 0x842f0f8,
 0x842f048, 0x8370710, 0x8370660, 0x83705b0, 0x8370500, 0x8370450, 0x83703a0, 
0x83702f0, 0x8370240, 0x8370190, 0x83700e0, 0x8370030, 0x8349088}
(gdb) # And hashtables:
(gdb) l g_quark_ht
88    G_LOCK_DEFINE_STATIC (g_dataset_global);
89    static GHashTable   *g_dataset_location_ht = NULL;
90    static GDataset     *g_dataset_cached = NULL; /* should this be
91                             threadspecific? */
92    G_LOCK_DEFINE_STATIC (g_quark_global);
93    static GHashTable   *g_quark_ht = NULL;
94    static gchar       **g_quarks = NULL;
95    static GQuark        g_quark_seq_id = 0;
97    /* --- functions --- */
(gdb) p g_quark_ht
$2 = 0x8220a30 = {
 [0x824db08 "draw-border"] = 0x97,
 [0x83bdc38 "<Actions>/IconViewActions/Sort by Size"] = 0x962,
 [0x3970905 "custom-success"] = 0x414,
 [0x83c0878 "<Actions>/DirViewActions/OpenFolderWindow"] = 0x8df,
 [0x2a9d6bb "tooltip-text"] = 0x107,
 [0x82c0648 "audio_%d"] = 0x50c,
 [0x83af728 "zoom_level_changed"] = 0x7e6,
 [0x83c1210 "Empty Trash"] = 0x8e6,
 [0x83584a0 "GdkPixmap"] = 0x76c,
 [0x83298d0 "soup"] = 0x61f,
 [0x82e4a58 "gio"] = 0x568,
 [0x83c2600 "<Actions>/DirViewActions/Create Link"] = 0x8f9,
 [0x835bca0 "MimeTypes"] = 0x6ed,
 [0x81c221d "NautilusIconInfo"] = 0x7bb,
 [0x2b2ff13 "gtk-file-chooser-backend"] = 0x84,
 [0x82edee8 "audio/x-shorten"] = 0x5a6,
 [0x82b9e08 "GstAllocTraceFlags"] = 0x498,
 [0x8307798 "videotestsrc"] = 0x5ee,
 [0x82cc6c0 "tcp"] = 0x529,
 [0x81a3dda "NautilusBookmarkList"] = 0x733,
 [0xc76e26 "dirname"] = 0x350,
 [0x2b1f004 "gtk-menu-bar-accel"] = 0x66,
 [0x8267da8 "GstObjectFlags"] = 0x461,
 [0x82bea88 ""] = 0x500,
 [0x835bbf8 "FileSystems"] = 0x6ec,
 [0x82b8bd0 "GstDebugColorFlags"] = 0x481,
 [0x81c564d "EggSMClient"] = 0x2b3,
 [0x83b6528 "band-select-ended"] = 0x84f,
 [0xc685fd "GFile"] = 0x318,
 [0x83bcb80 "row-reordered"] = 0x8ac,
 [0x8320628 "liveadder"] = 0x60f,
 [0x825c870 "drag_failed"] = 0x171,
 [0x825d8e0 "gnome_disable_sound_events"] = 0xd6,
 [0x2b4f420 "grab-focus"] = 0x129,
 [0x83c4c20 "<Actions>/DirViewActions/Self Format Volume"] = 0x91d,
 [0x2b3889d "x"] = 0x817,
 [0x2a92cc0 "y"] = 0x818,
 [0x2b2a535 "color-hash"] = 0x83,
 [0x2b3c0e6 "mask"] = 0x76f,
 [0x8343360 "<Actions>/ShellActions/Zoom Normal"] = 0x711,
 [0x83cd8b0 "<Actions>/IconViewActions/Tighter Layout"] = 0x956,
 [0x2b4f28a "gtk-pango-context"] = 0xea,
 [0x82ec370 "application/x-gzip"] = 0x58f,
---Type <return> to continue, or q <return> to quit---q
(gdb) # We also have a cool gforeach command:
(gdb) gforeach i in toplevel_list: print ((GtkWindow *)$i)
$3 = 0x842f468 [GtkWindow]
$4 = 0x842f3b8 [GtkWindow]
$5 = 0x842f308 [GtkWindow]
$6 = 0x842f258 [GtkWindow]
$7 = 0x842f1a8 [GtkWindow]
$8 = 0x842f0f8 [GtkWindow]
$9 = 0x842f048 [GtkWindow]
$10 = 0x8370710 [GtkWindow]
$11 = 0x8370660 [GtkWindow]
$12 = 0x83705b0 [GtkWindow]
$13 = 0x8370500 [GtkWindow]
$14 = 0x8370450 [GtkWindow]
$15 = 0x83703a0 [GtkWindow]
$16 = 0x83702f0 [GtkWindow]
$17 = 0x8370240 [GtkWindow]
$18 = 0x8370190 [GtkWindow]
$19 = 0x83700e0 [GtkWindow]
$20 = 0x8370030 [GtkWindow]
$21 = 0x8349088 [NautilusSpatialWindow]
(gdb) gforeach i in toplevel_list: print ((GtkWindow *)$i)->title
$22 = (gchar *) 0x0
$23 = (gchar *) 0x0
$24 = (gchar *) 0x0
$25 = (gchar *) 0x0
$26 = (gchar *) 0x0
$27 = (gchar *) 0x0
$28 = (gchar *) 0x0
$29 = (gchar *) 0x0
$30 = (gchar *) 0x0
$31 = (gchar *) 0x0
$32 = (gchar *) 0x0
$33 = (gchar *) 0x0
$34 = (gchar *) 0x0
$35 = (gchar *) 0x0
$36 = (gchar *) 0x0
$37 = (gchar *) 0x0
$38 = (gchar *) 0x0
$39 = (gchar *) 0x0
$40 = (gchar *) 0x83d0200 "alex"
(gdb) # There is also some nice GObject integration features:
(gdb) break gtk_widget_size_allocate
Breakpoint 1 at 0x29ff265: file gtkwidget.c, line 3821.
(gdb) c
[Thread 0xb71ffb70 (LWP 2599) exited]
Breakpoint 1, IA__gtk_widget_size_allocate (widget=0x8349088 [NautilusSpatialWindow], allocation=0xbfffe914) at gtkwidget.c:3821
3821    {
(gdb) # Notice the runtime detected type of the instance
(gdb) c
Breakpoint 1, IA__gtk_widget_size_allocate (widget=0x8313d38 [GtkTable], allocation=0xbfffe460) at gtkwidget.c:3821
3821    {
(gdb) c
Breakpoint 1, IA__gtk_widget_size_allocate (widget=0x8365c70 [GtkVBox], allocation=0xbfffe048) at gtkwidget.c:3821
3821    {
(gdb) c
Breakpoint 1, IA__gtk_widget_size_allocate (widget=0x8365d20 [GtkVBox], allocation=0xbfffdc08) at gtkwidget.c:3821
3821    {
(gdb) # This is gonna be a long hairy backtrace, right? NO!
(gdb) new-backtrace
#0  gtk_widget_size_allocate (widget=0x8365d20 [GtkVBox], allocation=0xbfffdc08) at gtkwidget.c:3821
#1  0x027dcb19 in gtk_box_size_allocate (widget=<value optimized out>, allocation=<value optimized out>) at gtkbox.c:500
#2  0x00a180ac in g_cclosure_marshal_VOID__BOXED (closure=0x8249ad0, return_value=0x0, n_param_values=2, param_values=
 0x8535e78, invocation_hint=0xbfffddc0, marshal_data=0x27dc850) at gmarshal.c:566
#3  <...>
#4  <...>
#5  <...>
#6  <...>
#7  <emit signal size-allocate on instance 0x8365c70 [GtkVBox]>
#8  0x029ff3da in gtk_widget_size_allocate (widget=<value optimized out>, allocation=<value optimized out>) at gtkwidget.c:3887
#9  0x0295b625 in gtk_table_size_allocate_pass2 (table=<value optimized out>) at gtktable.c:1610
#10 0x0295b625 in gtk_table_size_allocate (table=<value optimized out>) at gtktable.c:849
#11 0x00a180ac in g_cclosure_marshal_VOID__BOXED (closure=0x8249ad0, return_value=0x0, n_param_values=2, param_values=
 0x85956f0, invocation_hint=0xbfffe200, marshal_data=0x295abd0) at gmarshal.c:566
#12 <...>
#13 <...>
#14 <...>
#15 <...>
#16 <emit signal size-allocate on instance 0x8313d38 [GtkTable]>
#17 0x029ff3da in gtk_widget_size_allocate (widget=<value optimized out>, allocation=<value optimized out>) at gtkwidget.c:3887
#18 0x02a14b39 in gtk_window_size_allocate (widget=<value optimized out>, allocation=<value optimized out>) at gtkwindow.c:4941
#19 0x00a180ac in g_cclosure_marshal_VOID__BOXED (closure=0x8249ad0, return_value=0x0, n_param_values=2, param_values=
 0x859bc28, invocation_hint=0xbfffe610, marshal_data=0x2a149f0) at gmarshal.c:566
#20 <...>
#21 <...>
#22 <...>
#23 <...>
#24 <emit signal size-allocate on instance 0x8349088 [NautilusSpatialWindow]>
#25 0x029ff3da in gtk_widget_size_allocate (widget=<value optimized out>, allocation=<value optimized out>) at gtkwidget.c:3887
#26 0x02a15044 in gtk_window_move_resize (window=<value optimized out>) at gtkwindow.c:6186
#27 0x02a15044 in gtk_window_check_resize (window=<value optimized out>) at gtkwindow.c:5358
#28 0x00a17994 in g_cclosure_marshal_VOID__VOID (closure=0x8262b68, return_value=0x0, n_param_values=1, param_values=
 0x8535700, invocation_hint=0xbfffeae0, marshal_data=0x2a14b50) at gmarshal.c:77
#29 <...>
#30 <...>
#31 <...>
#32 <...>
#33 <emit signal check-resize on instance 0x8349088 [NautilusSpatialWindow]>
#34 0x028176ba in gtk_container_check_resize (container=<value optimized out>) at gtkcontainer.c:1424
#35 0x028179f2 in gtk_container_idle_sizer (data=<value optimized out>) at gtkcontainer.c:1350
#36 0x002c3588 in gdk_threads_dispatch (data=<value optimized out>) at gdk.c:506
#37 0x00929382 in g_idle_dispatch (source=0x8557f20, callback=0x8365d20, user_data=0x83548c0) at gmain.c:4065
#38 0x0092b198 in g_main_dispatch (context=<value optimized out>) at gmain.c:1960
#39 0x0092b198 in g_main_context_dispatch (context=<value optimized out>) at gmain.c:2513
#40 0x0092eac8 in g_main_context_iterate (context=0x8247f90, block=<value optimized out>, dispatch=1, self=
 0x821f018) at gmain.c:2591
#41 0x0092ef3f in g_main_loop_run (loop=0x8291758) at gmain.c:2799
#42 0x028b7129 in gtk_main () at gtkmain.c:1205
#43 0x0807e923 in g_themed_icon_append_name () at gthemedicon.c:378
#44 0x0071ab36 in __libc_start_main (main=0x807e2c0 <g_themed_icon_append_name+86084>, argc=1, ubp_av=0xbffff154, init=
 0x81a3180, fini=0x81a3170, rtld_fini=0x6efd00 <_dl_fini>, stack_end=0xbffff14c) at libc-start.c:220
#45 0x080692b1 in g_themed_icon_append_name () at gthemedicon.c:378

The above was recorded using a stock rawhide (to be Fedora 12) gdb and glib (including debuginfo packages), with just the python macros added. It also works with the gdb shipping in Fedora 11, but there is some issue there that makes the gforeach macro crash gdb. Additionally, the VTA work that landed in the GCC in Fedora 12 makes what gdb reports for optimized code much more reliable, which is very nice for e.g. the backtrace filtering.
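For the curious, list printing like “p toplevel_list” above can be done with a gdb pretty-printer. The following is a minimal sketch written against gdb's Python API (gdb.pretty_printers, Value.dereference(), Type.target(), Type.tag), not the Archer macros themselves; the class and function names are hypothetical, and it assumes int() works on pointer values, which may not hold on very old gdb versions.

```python
class GListPrinter:
    """Sketch of a pretty-printer for GSList* / GList* values."""

    def __init__(self, val):
        self.val = val  # a gdb.Value holding a GSList* or GList*

    def to_string(self):
        return "0x%x" % int(self.val)

    def display_hint(self):
        # "array" asks gdb to print the children as {a, b, c}.
        return "array"

    def children(self):
        node = self.val
        index = 0
        while int(node) != 0:  # walk until the NULL terminator
            yield ("[%d]" % index, node.dereference()["data"])
            node = node.dereference()["next"]
            index += 1


def lookup_glist_printer(val):
    """Return a printer for GSList*/GList* values, or None for other types."""
    import gdb  # only importable when running inside gdb

    t = val.type.strip_typedefs()
    if t.code != gdb.TYPE_CODE_PTR:
        return None
    if t.target().strip_typedefs().tag in ("_GSList", "_GList"):
        return GListPrinter(val)
    return None

# Inside gdb, register it with: gdb.pretty_printers.append(lookup_glist_printer)
```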

GObject performance work

I spent some time last week and this week on fixing some performance issues in gobject. It started out with the patches in bug 557100, which seemed very useful. I cleaned up those patches a bit, wrote a serious performance test and did some additional optimizations.

These changes focus on speeding up creation of “simple” gobjects, i.e. objects that have no properties, don't implement any interfaces, etc. Such objects still matter, because being able to use gobject gives us lots of advantages like threadsafe refcounting, runtime type introspection, user data, etc. Sometimes people avoid using gobjects for small things just because they are a bit more expensive than some homebrew struct, which is very sad. With these fixes we can get rid of some of that.

Another thing about gobject that has bothered me for some time is the handling of interfaces. GIO and other modern APIs are starting to use interfaces more and more, so it's important that they work well. However, interfaces in gobject have a feature that most people are unaware of, namely that you can add interfaces to a class after the class/type has been initialized. This means that the list of interfaces a class implements must be protected by a lock, and this lock must be taken each time we e.g. check if an object implements an interface or cast to the interface to do a method call on it.

Additionally, the interface lookup algorithm used in gobject does a binary search on the sorted list of interfaces a class implements, so each lookup costs O(log n) in the number of interfaces. Better approaches are possible, like the one used in gcj (described here), which allows constant time (O(1)) interface lookup.
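To make the difference between the two lookup strategies concrete, here is a simplified Python model. It is neither gobject's nor gcj's actual code, and all class names are hypothetical: the first class mirrors the binary search described above, while the constant-time variant assumes each interface is given a small dense index when it is registered and each class keeps a table indexed by it. gcj's real scheme is more elaborate than this, in particular about keeping the tables compact.

```python
import bisect


class ClassWithSortedIfaces:
    """Binary search over the sorted list of implemented interface type ids."""

    def __init__(self, iface_ids):
        self.iface_ids = sorted(iface_ids)

    def implements(self, iface_id):
        i = bisect.bisect_left(self.iface_ids, iface_id)  # O(log n)
        return i < len(self.iface_ids) and self.iface_ids[i] == iface_id


class InterfaceRegistry:
    """Hands out small, dense indices to interfaces as they are registered."""

    def __init__(self):
        self.index_of = {}

    def register(self, iface_id):
        return self.index_of.setdefault(iface_id, len(self.index_of))


class ClassWithIfaceTable:
    """Constant-time lookup through a per-class table indexed by interface index."""

    def __init__(self, registry, iface_ids):
        self.registry = registry
        self.table = []
        for iface_id in iface_ids:
            self.add_interface(iface_id)

    def add_interface(self, iface_id):
        # Interfaces may be added late, so the table can grow.
        idx = self.registry.register(iface_id)
        if idx >= len(self.table):
            self.table.extend([None] * (idx + 1 - len(self.table)))
        # In C the slot would point at the interface vtable; a marker suffices here.
        self.table[idx] = iface_id

    def implements(self, iface_id):
        # In C the index would be a field of the interface type, i.e. a plain
        # memory read; the dict lookup here just stands in for that.
        idx = self.registry.index_of.get(iface_id)
        return idx is not None and idx < len(self.table) and self.table[idx] is not None
```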

In bugs 594525 and 594650 I described these issues and posted patches that fix them.

I added all these patches to the gobject-performance branch in glib git, including the performance test I wrote. The performance improvements are pretty good:

  • Construction speed for simple objects more than doubled, while the construction speed for complex objects is not much affected (within one percent).
  • Interface typechecks go from 52 to 95 million per second in the non-threaded case, and from 12 to 95 million per second if g_thread_init() has been called.
  • Additionally, the contention for typechecks in multiple threads goes to zero, as you can see in the tests done by Benjamin in bug 594525.

Data about Data

Warning: Long, technical post

One of the few remaining icky areas of the Nautilus codebase is the metadata store. It has a weird, inefficient XML file format, the code is pretty nasty, and the data is not accessible to other apps. It's been on my list of things to replace for quite some time, and yesterday I finally got rid of it.

The new system is actually pretty cool, both in the API used to access it and in how it works internally. So, I'm gonna spend a few bits explaining how it works.

Let's start with the requirements and then we can see how to fulfil them. We want:

  • A generic per-file key-value store with string and string list values. (String lists are required by Nautilus for e.g. emblems)
  • All apps should be able to access the store for both writing and reading.
  • Access, in particular read access, needs to be very efficient, even when used in typical I/O fashion (lots of small calls intermixed with other file I/O). Getting the metadata for a file should not be significantly more expensive than a stat syscall.
  • Removable media should be handled in a “sane” way, even if multiple volumes may be mounted in the same place.
  • We don’t require transactional semantics for the database (i.e. no need to guarantee that a returned metadata set is written to stable storage). What we want is something I call “desktop transaction semantics”.
    By this I mean that in case of a crash, it's fine to lose what you changed in the recent past. However, things that were written a long time ago (even if recently overwritten) should not get lost. You either get the “old” value or the “new” value, but you never ever get neither or a broken database.
  • Homedirs on NFS should work, without risking database corruption if two logins with the same homedir write concurrently. It is fine if doing so may lose some of these writes, as long as the database is not corrupted. (NFS is still used in a lot of places like universities and enterprise corporations.)

Seems like a pretty tall order. How would you do something like that?


For performance reasons it's not a good idea to require IPC for reading data, as doing so can block things for a long time (especially when the data is contended; compare e.g. how gconf reads are a performance issue on login). To avoid this we steal an idea from dconf: all reads go through mmaped files.

These are opened once, and the file format in them is designed to allow very fast lookups with a minimal number of page faults. This means that once things are in a steady state, lookups are done without any syscalls at all, and are very fast.
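The actual file format isn't described here, so the following is only a toy Python illustration of the idea, using a made-up layout: a header with the entry count, a sorted table of fixed-size (key offset, value offset) records, then NUL-terminated strings. The names write_table and MappedTable are hypothetical. A lookup is a binary search directly on the mapped pages, so after the initial open and mmap no further syscalls are needed.

```python
import mmap
import struct

HEADER = struct.Struct("<I")   # number of entries
ENTRY = struct.Struct("<II")   # (key offset, value offset) from file start


def write_table(path, items):
    """Write a dict of str -> str in the toy format, sorted by key."""
    keys = sorted(items)  # code point order == UTF-8 byte order
    base = HEADER.size + ENTRY.size * len(keys)
    strings = bytearray()
    offsets = []
    for key in keys:
        key_off = base + len(strings)
        strings += key.encode() + b"\0"
        value_off = base + len(strings)
        strings += items[key].encode() + b"\0"
        offsets.append((key_off, value_off))
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(keys)))
        for key_off, value_off in offsets:
            f.write(ENTRY.pack(key_off, value_off))
        f.write(strings)


class MappedTable:
    def __init__(self, path):
        with open(path, "rb") as f:
            # The mapping stays valid after the file object is closed.
            self.buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        (self.count,) = HEADER.unpack_from(self.buf, 0)

    def _string_at(self, off):
        return self.buf[off:self.buf.find(b"\0", off)]

    def lookup(self, key):
        """Binary search on the mapped entry table; returns None if missing."""
        want = key.encode()
        lo, hi = 0, self.count
        while lo < hi:
            mid = (lo + hi) // 2
            key_off, value_off = ENTRY.unpack_from(self.buf, HEADER.size + mid * ENTRY.size)
            found = self._string_at(key_off)
            if found == want:
                return self._string_at(value_off).decode()
            if found < want:
                lo = mid + 1
            else:
                hi = mid
        return None

    def close(self):
        self.buf.close()
```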


Metadata writes are handled by a single process that ensures that concurrent writes are serialized when writing to disk.

Clients talk to the metadata daemon via dbus. The daemon is started automatically by dbus when first used, and it may exit when idle.

Desktop Transaction semantics

In order to give any consistency guarantees for file writes, fsync() is normally used. However, this is overkill and in some cases a serious system performance problem (see the recent ext3/4 fsync discussion). Even without the ext3 problem, an fsync means spinning up the disk and waiting for it to rotate so that the data is guaranteed to be on disk before we can return from a metadata write call, which is quite costly (on the order of several milliseconds at least).

In order to solve this I've split each database into two files. One file is the “tree”, which contains a static, read-only metadata tree. This file is replaced using the standard atomic replace model (write to temp, fsync, rename over).

However, we rarely change this file; instead all writes go to another file, the “journal”. As the name implies, this is a journal-oriented format where each new operation gets written at the end of the journal. Each entry has a checksum so that we can validate the journal on read (in case of a crash), and the journal is never fsynced.
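As an illustration of the journal idea, here is a hypothetical Python sketch; the record layout and the function names are invented for the example and are not the real GVfs format. Each record carries its payload length and a CRC32 checksum, writes only append (without fsync), and on read we replay records until the first one that is truncated or fails its checksum, which is what a crash in the middle of a write leaves behind. Replaying into a dict also shows the state a rotation would fold into a new tree.

```python
import struct
import zlib

RECORD_HEADER = struct.Struct("<II")  # (payload length, crc32 of payload)


def append_set(path, key, value):
    """Append a 'set key to value' operation; deliberately no fsync."""
    payload = ("%s\0%s" % (key, value)).encode()
    with open(path, "ab") as f:
        f.write(RECORD_HEADER.pack(len(payload), zlib.crc32(payload)))
        f.write(payload)


def replay(path):
    """Return the key/value state recorded in the journal, ignoring any
    truncated or corrupt tail left behind by a crash."""
    with open(path, "rb") as f:
        data = f.read()
    state = {}
    pos = 0
    while pos + RECORD_HEADER.size <= len(data):
        length, crc = RECORD_HEADER.unpack_from(data, pos)
        start = pos + RECORD_HEADER.size
        payload = data[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # damaged record: keep everything before it
        key, value = payload.decode().split("\0", 1)
        state[key] = value  # later writes override earlier ones
        pos = start + length
    return state
```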

After a timeout (or when full) the journal is “rotated”, i.e. we create a new “tree” file containing all the info from the journal and a new empty journal. Once something is rotated into the “tree” it is generally safe for long term storage, but this slow operation happens rarely and not when a client is blocking for the result.

NFS homedirs

It turns out that this setup is mostly OK for the NFS homedir case too. All we have to do is put the journal file in a non-NFS location like /tmp so that multiple clients won't scribble over each other. Once a client rotates the journal, its changes become visible to every client in a safe fashion (although some clients may lose recent writes in case of concurrent updates).

There is one detail with atomic replace on NFS that is problematic. Due to the stateless nature of NFS, an open file may be removed on the server by another client (the server doesn't know you have the file open), which would later cause an error when we read from the file. Fortunately we can work around this by opening the database file in a specific way[1].

Removable media

The current Nautilus metadata database uses a single tree based on pathnames to store metadata. This becomes quite weird for removable media where the same path may be reused for multiple disks and where one disk can be mounted in different places. Looking at the database it seems like all these files are merged into a single directory, causing various problems.

The new system uses multiple databases. libudev is used to efficiently look up the filesystem UUID and label for a mount and, if available, use that as the database id, storing paths relative to that mount. We also have a standard database for your homedir (not based on UUID etc., as the homedir often migrates between systems) and a fall-back “root” database for everything not matching the previous databases.
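The database selection can be sketched roughly like this. It is a hypothetical Python illustration, not the GVfs code: the mount table is passed in as a plain list of (mount point, UUID, label) tuples instead of being queried from libudev, and the database id strings are made up. A path is matched to the longest mount point (or the homedir) containing it and stored relative to it, with the “root” database as the fall-back.

```python
import os.path


def pick_database(path, mounts, homedir):
    """Return (database_id, relative_path) for an absolute path.

    mounts: list of (mount_point, uuid_or_None, label_or_None) tuples,
            standing in for what libudev would report.
    """
    candidates = [(homedir, "home")]  # homedir db is not tied to a UUID
    for mount_point, uuid, label in mounts:
        if uuid:
            candidates.append((mount_point, "uuid-" + uuid))
        elif label:
            candidates.append((mount_point, "label-" + label))
        # Mounts with neither a UUID nor a label fall through to "root".
    best = ("/", "root")
    for prefix, db in candidates:
        inside = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if inside and len(prefix) > len(best[0]):
            best = (prefix, db)  # the most specific match wins
    prefix, db = best
    return db, os.path.relpath(path, prefix)
```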

This means that we should seamlessly handle removable media as long as there are useful UUIDs or labels and have a somewhat ok fall-back otherwise.

Integration with platform

All this is pretty much invisible to applications. Thanks to the gio/GVfs split and the extensible gio APIs, things are automatically available to all applications without using any new APIs once a new GVfs is installed. Metadata can be read with the normal g_file_query_info() calls by requesting attributes from the “metadata” namespace. Similar standard calls can be used to set metadata.

Also, the standard gio copy, move and remove operations automatically affect the metadata databases. For instance, if you move a file its metadata will automatically move with it.

Here is an example:

$ touch /tmp/testfile
$ gvfs-info -a "metadata::*" /tmp/testfile
$ gvfs-set-attribute /tmp/testfile metadata::some-key "A metadata value"
$ gvfs-info -a "metadata::*" /tmp/testfile
  metadata::some-key: A metadata value
$ gvfs-copy /tmp/testfile /tmp/testfile2
$ gvfs-info -a "metadata::*" /tmp/testfile2
  metadata::some-key: A metadata value

Relation to Tracker

I think I have to mention this since the Tracker team want other developers to use Tracker as a data store for their applications, and I’m instead creating my own database. I’ll try to explain my reasons and how I think these should cooperate.

First of all there are technical reasons why Tracker is not a good fit. It uses sqlite, which is not safe on NFS. It uses a database, so each read operation is an IPC call that gets resolved to a database query, causing performance issues. It is not impossible to make database use efficient, but it requires a different approach than how file I/O normally looks. You need to do larger queries that do as much as possible in one operation, whereas we instead inject many small operations between the ordinary I/O calls (after each stat when reading a directory of files, after each file copy, move or remove, etc).

Secondly, I don't feel good about storing the kind of metadata Nautilus uses in the Tracker database. There are various vague problems here that all interact. I don't like the mixing of user-specified data like custom icons with auto-extracted or generated data. The tracker database is a huge (gigabytes) complex database with information from lots of sources, mostly autogenerated, which makes it likely that the data is not backed up. Also, people having problems with tracker are prone to removing the databases and reindexing just to see if that “fixes it”, or due to database format changes on upgrades. Also, the generic database model seems like overkill for the simple stuff we want to store, like icon positions and spatial window geometry.

Additionally, Tracker is a large dependency, and using it for metadata storage would make it a hard dependency for Nautilus to work at all (to e.g. remember the position of the icons on the desktop). Not everyone wants to use tracker at this point. Some people may want to use another indexer, and some may not want to run Tracker for other reasons. For instance, many people report that system performance suffers when using Tracker. I'm sure this is fixable, but at this point it's imho not yet mature enough to force upon every Gnome user.

I don’t want to be viewed like any kind of opponent of Tracker though. I think it is an excellent project, and I’m interested in using it, fixing issues it has and helping them work on it for integration with Nautilus and the new metadata store.

Tracker already indexes all kinds of information about files (filename, filesize, mtime, etc) so that you can do queries for these things. Similarly it should extract metadata from the metadata store (the size of this pales in comparison to the text indexes anyways, so no worries). To facilitate this I want to work with the Tracker people to ensure tracker can efficiently index the metadata and get updates when metadata changes for a file.

Where to go from here

While some initial code has landed in git, not everything is finished. There are some loose ends in the metadata system itself, plus we need to add code to import the old nautilus metadata store into the new one.

We can also start using metadata in other places now. For instance, the file selector could show emblems and custom icons, etc.


[1] Remove-safe opening a file on NFS:
Link the file to a temporary filename, open the temp file, then unlink the temp file. Now the NFS client on your OS will “magically” rename the temp file to something like .nfsXXXXX and will track this fd to ensure the file gets removed when the fd is closed. Other clients removing the original file will not cause the .nfsXXXX link on the server to be removed.
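A minimal Python sketch of this trick follows; the function name open_remove_safe and the temporary-name scheme are made up for the example. The NFS-specific behaviour (the client renaming the unlinked file to .nfsXXXX) only happens on an NFS mount; on a local filesystem the sketch still works, simply because an open file stays readable after its last name is removed.

```python
import os
import uuid


def open_remove_safe(path):
    """Open `path` for reading so the file stays readable even if another
    client later removes or replaces `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Hypothetical temp-name scheme; it only needs to be unique in the directory.
    tmp = os.path.join(directory, ".%s.open-%s" % (os.path.basename(path), uuid.uuid4().hex))
    os.link(path, tmp)        # second hard link to the same file
    try:
        f = open(tmp, "rb")   # open through the private name
    finally:
        # On NFS, if tmp is open, the client renames it to .nfsXXXX instead
        # of removing it, and cleans it up when the fd is closed.
        os.unlink(tmp)
    return f
```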

The return of Client side windows

For a long time now I’ve been working on the client side windows branch of Gtk+. By now it is mostly feature complete when it comes to normal use. However, one of the drivers of client side windows and the initial reason I started working on it is the ability to do offscreen window rendering. The last two weeks I’ve been spending on getting that to work and integrated into the platform.

I think a video says more than a million words here:

(Original ogg available here)

This is using the current client-side-windows branch of Gtk+, plus my own gtk-in-clutter code available in the client-side-window branch of

Next up is getting the non-X backends working and getting this merged into master.

ext4 vs fsync, my take

There has been a lot of discussion about the ext4 data loss issue, and I see a lot of misconceptions, both about why rename() is used and what guarantees POSIX gives. I’ll try to give the background, and then my opinion on the situation.

There are two basic ways to update a file. You can either truncate the old file and write the new contents, or you can write the new contents to a temporary file and rename it over the old file when finished. The rename method has several advantages, partly based on the fact that rename is atomic. The exact wording from POSIX (IEEE Std 1003.1™, 2003 Edition) is:

In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.

This gives the rename method some useful properties:

  • If the application crashes while writing the new file, the original file is left in place
  • If an application reads the file the same time as someone is updating it the reading application gets either the old or the new file in its entirety. I.e. we will never read a partially finished file, a mixup of two files, or a missing file.
  • If two applications update the file at the same time we will at worst lose the changes from one of the writers, but never cause a corrupted file.

Note that nothing above talks about what happens in the case of a system crash. This is because system crashes are not specified at all by POSIX. In fact, the specified behaviour predates journaled filesystems, where you have any reasonable expectation that recently written data is available at all after a system crash. For instance, a traditional unix filesystem like UFS or ext2 may well lose the entire filesystem on a system crash if you're unlucky, but it is still POSIX compliant.

In addition to the above, POSIX specifies the “fsync” call, which can be used in the rename method. It flushes all in-memory buffers corresponding to the file onto hardware (this is vaguely specified and the exact behaviour is hw and sw dependent), not returning until it's fully saved. If called on the new file before renaming it over the old file, it gives a number of advantages:

  • If there is a hardware I/O error during the write to the disk we can detect and report this.
  • In case of a system crash shortly after the write, it's more likely that we get the new file than the old file (for maximum chance of this you additionally need to fsync the directory the file is in)
  • Some filesystems may order the metadata writes such that the rename is written to disk, but the contents of the new file are not yet on disk. If we crash at this point this is detected on mount and the file is truncated to 0 bytes. Calling fsync() guarantees that this does not happen. [ext4]

However, it also has a number of disadvantages:

  • It forces a write immediately, spinning up the disk and causing more power use and more wear on flash filesystems.
  • It causes a longer wait for the user, waiting for data to be on disk.
  • It causes lower throughput if updating multiple files in a row.
  • Some filesystems guarantee ordering constraints such that fsync more or less implies a full sync of all outstanding buffers, which may cause system-wide performance issues. [ext3]

It should be noted that POSIX, and even ext4, give no guarantee that the file will survive a system crash even when using fsync. For instance, the data could be outstanding in hardware buffers when the crash happens, or the filesystem in use may not be journaled or otherwise robust with respect to crashes. However, in case of a system crash it gives a much better chance of getting the new data rather than the old, and on reordering filesystems like an unpatched ext4 it avoids the truncated files that the rename method can otherwise produce.

Both the fsync and the non-fsync versions have their place. For very important data the guarantees given by fsync are important enough to outweigh the disadvantages. But in many cases the disadvantages make it too heavy to use, and the possible data loss is not as big of an issue (after all, system crashes are pretty uncommon).

So much for the background; now over to my personal opinions on filesystem behaviour. I think that in the default configuration all general purpose filesystems that claim to be robust (be it via journalling or whatever) should do their best to preserve the runtime guarantees of the atomic rename save operation so that they extend to the system crash case too. In other words, given a write to a new file followed by a rename over an old file, we shall find either the old data or the new data. This is less of a requirement than fsync-on-close, but it is a requirement nevertheless, and it does result in a performance loss. However, running a journaled filesystem already has a performance cost, and it is something the user has explicitly chosen in order to have less risk of losing data.

It would be nice if the community could work out a way to express the intent of the save operation to the filesystem, so that we avoid the unnecessary and expensive fsync() call. For instance, we could add an fcntl like F_SETDATAORDERED that tells the kernel to ensure the data is written to the disk before writing the metadata for the file to the disk. With this in place applications could choose whether they want the new file on disk *now*, or just want either the old or the new file without risk of total data loss. (And fall back on fsync if the fcntl is not supported.)
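Since the proposed fcntl does not exist, the following is only a sketch of how an application might use it together with the fsync fallback. F_SETDATAORDERED and both function names are hypothetical; on any current system the #ifdef branch is compiled out and the code always falls back to fsync().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Ask the kernel to write this file's data before its metadata.
   F_SETDATAORDERED is the hypothetical fcntl proposed above; no kernel
   defines it, so today the #else branch is what gets compiled. */
bool request_data_ordered(int fd)
{
#ifdef F_SETDATAORDERED
    return fcntl(fd, F_SETDATAORDERED, 1) == 0;
#else
    (void)fd;
    return false;
#endif
}

/* Call on the new file just before rename(): only pay for fsync() when
   the ordering request was not accepted. */
int prepare_for_rename(int fd, bool ordered)
{
    return ordered ? 0 : fsync(fd);
}
```

The intended flow is to call request_data_ordered() right after creating the temporary file, write the data, call prepare_for_rename(), and then rename() over the old file.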

This is the current status of the rename method on the commonly used Linux filesystems, to the best of my knowledge:
(In this context “safe” means we get either the old or the new version of the file after a crash.)

ext2: No robustness guarantees on system crash at all.

ext3: In the default data=ordered mode it is safe, because data is written before metadata. If you crash before the data is written (5 seconds by default) you get the old data. With data=writeback mode it is unsafe.

ext4: Currently unsafe, with a quite long window where you risk data loss. With the patches queued for 2.6.30 it is safe.

btrfs: Currently unsafe; the maintainer claims that patches are queued for 2.6.30 to make it safe.

XFS: Currently unsafe (as far as I can tell); however, the truncate-and-overwrite method is safe.

Eternal Vigilance!

I’ve spent a lot of time over the years fixing nautilus memory use. The other day I noticed that it seemed to be using a lot of memory again, doing nothing but displaying the desktop:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
14315 alex      20   0  487m  46m  15m S  0.3  1.2   0:00.86 nautilus

So, it’s time for another round of de-bloating. I fired up massif to see what used so much memory, and it turns out that there is a cache in GnomeBG that caches the original desktop background image. We don’t really need that, since we keep the final pixmap for the background around.

It turns out that my desktop image is 2560×1600, which means the unscaled pixbuf uses 12 megs of memory. Just fixing this makes things a bit better:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
16129 alex      20   0  538m  33m  15m S  4.9  0.8   0:00.87 nautilus
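The 12 megs follow from the image dimensions. A small C helper, assuming 8 bits per channel, no alpha channel (3 channels) and rows padded to a multiple of 4 bytes (the padding does not matter here, since 2560 × 3 is already a multiple of 4): 2560 × 3 = 7680 bytes per row, and 7680 × 1600 = 12,288,000 bytes. With an alpha channel it would be about 16 MB instead. The function name is hypothetical.

```c
#include <stddef.h>

/* Bytes for an uncompressed image with 8 bits per channel, assuming each
   row is padded to a multiple of 4 bytes. n_channels is 3 for RGB and 4
   for RGBA. */
size_t pixbuf_bytes(size_t width, size_t height, size_t n_channels)
{
    size_t rowstride = (width * n_channels + 3) & ~(size_t)3;
    return rowstride * height;
}
```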

However, looking at the actual allocations in massif it’s obvious that we’re not actually using this much memory. For a short time when creating the desktop background pixmap we do several large temporary allocations, but these are quickly freed. So it seems we’re suffering from the heap growing and then not being returned to the OS due to fragmentation.

It is ‘well known’ that glibc uses mmap for large (> 128k by default) allocations and that such allocations should be returned to the OS directly when freed. However, this doesn’t seem to happen for some reason. Lots of research follows…

It turns out that this hasn’t been true since about 2006. Glibc now uses a dynamic threshold for when to start using mmap for allocations. It uses the size of freed mmapped memory chunks to update the threshold, and this causes problems for nautilus, where almost all allocations are small or medium sized, but there are a few large allocations when handling the desktop background. The result is that several large temporary allocations go to the heap, never to be returned to the OS.
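To make the dynamic threshold concrete, here is a deliberately simplified model of the behaviour documented in the mallopt(3) man page: the threshold starts at 128 KiB, and freeing an mmapped chunk larger than the current threshold (up to a maximum, 32 MiB on 64-bit systems) raises the threshold to that chunk’s size. This is not glibc’s code; the struct and function names are made up, chunk overhead is ignored, and the trim threshold that glibc adjusts at the same time is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of glibc's dynamic mmap threshold, based on the
   description in mallopt(3); not glibc's actual code. 64-bit values. */
#define MODEL_INITIAL_THRESHOLD ((size_t)128 * 1024)
#define MODEL_THRESHOLD_MAX     ((size_t)32 * 1024 * 1024)

struct malloc_model {
    size_t mmap_threshold;
    bool dynamic; /* cleared once the threshold is set explicitly */
};

void model_init(struct malloc_model *m)
{
    m->mmap_threshold = MODEL_INITIAL_THRESHOLD;
    m->dynamic = true;
}

/* Requests at or above the threshold get their own mmap()ed chunk,
   which is returned to the OS as soon as it is freed. */
bool model_uses_mmap(const struct malloc_model *m, size_t size)
{
    return size >= m->mmap_threshold;
}

/* Freeing a large mmapped chunk raises the threshold to its size, so
   later allocations of that size come from the heap instead. */
void model_free_mmapped(struct malloc_model *m, size_t size)
{
    if (m->dynamic && size > m->mmap_threshold && size <= MODEL_THRESHOLD_MAX)
        m->mmap_threshold = size;
}

/* What mallopt(M_MMAP_THRESHOLD, value) does in this model. */
void model_set_threshold(struct malloc_model *m, size_t value)
{
    m->mmap_threshold = value;
    m->dynamic = false;
}
```

In the model, after one 12 MiB temporary buffer is freed, the next 8 MiB temporary buffer lands on the heap, which is the nautilus pattern described above.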

Enter mallopt(), which lets us set a static mmap threshold. So, we set it back to the old value of 128k:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND          
 4971 alex      20   0  479m  26m  15m S  0.0  0.7   0:00.90 nautilus

Not bad: 20 megs less resident size for a few hours of work. Now nautilus isn’t leading the pack anymore, but comes after apps like gnome-panel, gnome-screensaver and gnome-settings-daemon.
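The mallopt() change itself is a one-liner. A sketch, assuming glibc (mallopt() and M_MMAP_THRESHOLD come from <malloc.h> and are not portable); the wrapper name is hypothetical. According to mallopt(3), setting M_MMAP_THRESHOLD explicitly also disables the dynamic adjustment, and mallopt() returns 1 on success.

```c
#include <malloc.h>
#include <stdbool.h>

/* Pin glibc's mmap threshold to the old 128 KiB default. Setting
   M_MMAP_THRESHOLD explicitly also turns off the dynamic adjustment,
   so large temporary buffers keep going to mmap() and are returned to
   the OS when freed. glibc-specific; returns true on success. */
bool pin_mmap_threshold(void)
{
    return mallopt(M_MMAP_THRESHOLD, 128 * 1024) == 1;
}
```

It would be called once early in main(), before the large temporary allocations happen.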

How to remove flicker from Gtk+

In between taking care of a sick kid and a sick wife, and being sick myself, I have slowly been working on the remaining issues in the client-side-windows branch of Gtk+. The initial and main interest in having client side windows is that it lets us emulate everything that’s needed for widgets to work without any server side windows, which lets us do things like put Gtk+ widgets inside clutter. However, another interesting and not entirely obvious advantage of client side windows is that it allows us to remove flicker. This post will describe how this works and show the effects.

Gtk+ already does quite a lot to avoid flicker. For instance, all drawing in expose events is automatically double buffered so that you never see partially drawn results. The remaining flicker is related to the effects of moving or resizing server side subwindows. Even these are minimized by Gtk+, since many widgets don’t use such windows or only use input-only windows, which don’t cause any visual effects. However, there are still some areas where subwindows are used, mostly in cases where scrolling is involved.

Let’s start with an example of how scrolling works:


This is a regular Evince window showing a PDF, and we want to scroll down. This happens in several steps. First we copy the bottom area of the window to the top of the window:

[Screenshot: the Evince window after the copy]

Then we mark the newly scrolled in area at the bottom as invalid:

[Screenshot: the Evince window with the newly scrolled-in area marked invalid]

As a result, Gtk+ will call the application to redraw the invalid region as soon as it has finished handling the incoming events:


Voila! We have scrolled. (In reality more happened above: the scrollbar area was also marked invalid and repainted, but let’s ignore that for now.)

This example also makes it easy to see where flicker comes from. The drawing of the newly exposed area is double buffered, so the newly drawn area is replaced atomically. However, the initial copy is not done through the Gtk+ drawing system; instead it is done with an XCopyArea directly on the window (not a subwindow move, but with a similar effect). So the X server displays the copy immediately, while there might be some delay before the scrolled-in area is exposed and drawn, causing visual tearing.

Another common problem is widget resizing/moving, which can be seen in my previous blog entry. In this case a widget with a subwindow is moved and/or resized and ends up over another widget. The window move operation is done immediately in the server and results in a copy similar to the above, and then there is some delay before the widgets are redrawn to match it.

Now, client side windows don’t fix this by themselves, but the copies above and all rendering are now under the control of the client (i.e. the app), so we have the tools to do something about it. The solution is to delay the copying until we’re ready to draw everything that will be drawn, so we never show any partial results. Whenever some region of a window is copied we just record the area to be copied and by how much. When handling the expose events for the invalid area, we process the expose up to the point where everything has been drawn into the double buffer. At this point we replay all the copies we recorded, except that we skip copying anything into the area that the expose will draw. Then we blit out the final result of the expose event.
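The deferred-copy idea can be illustrated with a toy model. This is not the GDK implementation: the types, the 8×8 character grid standing in for pixels, and the function names are all hypothetical. Copies are recorded instead of executed; at expose time the recorded copies are replayed (reading the old window contents) while skipping destinations inside the invalid area, the invalid area is painted, and the result is shown in one step. The sketch assumes the invalid area already covers anything a later copy would read out of it, and it does not merge consecutive copies.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical toy types, not the GDK API: a window is a small grid of
   characters standing in for pixels. */
#define WIN_W 8
#define WIN_H 8
#define MAX_MOVES 16

typedef struct { int x, y, w, h; } Rect;
typedef struct { Rect src; int dx, dy; } Move;

typedef struct {
    char onscreen[WIN_H][WIN_W]; /* what the user currently sees */
    Move moves[MAX_MOVES];       /* copies recorded but not yet shown */
    int n_moves;
} ToyWindow;

static bool rect_contains(Rect r, int x, int y)
{
    return x >= r.x && x < r.x + r.w && y >= r.y && y < r.y + r.h;
}

static bool in_window(int x, int y)
{
    return x >= 0 && x < WIN_W && y >= 0 && y < WIN_H;
}

/* Instead of copying right away (as XCopyArea would), only record it. */
bool window_record_move(ToyWindow *win, Rect src, int dx, int dy)
{
    if (win->n_moves == MAX_MOVES)
        return false;
    win->moves[win->n_moves++] = (Move){ src, dx, dy };
    return true;
}

/* Expose handling: replay the recorded copies off-screen, skipping
   destinations the expose will repaint, paint the invalid area, then
   show the result in a single step. */
void window_expose(ToyWindow *win, Rect invalid, char paint)
{
    char buf[WIN_H][WIN_W];
    memcpy(buf, win->onscreen, sizeof buf);

    for (int i = 0; i < win->n_moves; i++) {
        Move m = win->moves[i];
        char before[WIN_H][WIN_W];
        memcpy(before, buf, sizeof before); /* read the pre-copy state */
        for (int y = m.src.y; y < m.src.y + m.src.h; y++)
            for (int x = m.src.x; x < m.src.x + m.src.w; x++) {
                int tx = x + m.dx, ty = y + m.dy;
                if (!in_window(x, y) || !in_window(tx, ty))
                    continue;
                if (rect_contains(invalid, tx, ty))
                    continue; /* the expose repaints this anyway */
                buf[ty][tx] = before[y][x];
            }
    }
    win->n_moves = 0;

    /* Stand-in for the application drawing the invalid area. */
    for (int y = 0; y < WIN_H; y++)
        for (int x = 0; x < WIN_W; x++)
            if (rect_contains(invalid, x, y))
                buf[y][x] = paint;

    /* One final "blit": the user never sees a partially updated window. */
    memcpy(win->onscreen, buf, sizeof buf);
}
```

Scrolling up by two rows then becomes window_record_move(&win, (Rect){0, 2, WIN_W, WIN_H - 2}, 0, -2) followed later by window_expose() on the bottom two rows; until the expose, the screen still shows the old, consistent contents.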

Furthermore, in practice we often do several moves/scrolls of the same region before it is drawn. This works with the above approach, but is a bit wasteful as some regions are copied twice. So, instead of simply keeping track of all copies being made, we try to combine such double copies into a single copy, thus minimizing the copies we actually make in the end.

So, how does this look in the end? It’s kind of hard to capture this kind of flicker with a screen grabber, so here is a video I took with my phone:

Can you tell which one uses the standard Gtk+?

Flicker free Gtk+ continued

Remember my preview of subwindowless Gtk+? It got rid of some flicker, but it was still pretty raw.

I’ve been working on a new version of the subwindowless patch, and today I implemented a cool trick that gives fully flicker-free subwindow move/resize:

This video is done with the same kind of increased X latency as the previous ones, but no flickering is detectable.