When things go really bad…and nobody notices it for years

The following analysis is wrong. The callback organization is not very obvious and not very well documented, but it does NOT POSE ANY RISK OF LOSING DATA! DON’T PICK UP THIS STORY!

This is really a horrific story about design glitches, and a reminder that imperative programming forces you to take extreme care and double-check what you’re doing.

GnomeVFS has an async file operation API, which ensures that your application does not block until the operation is finished. It offers you two callback hooks, one of them called “synchronous” and the other “asynchronous”. Since 2001, the code has worked like this:
For some transfer phases, both the sync and the async callbacks are called. For others, the sync callback is always called, and the async callback is called as well only if it has not already been called within a particular period of time. Each callback may return a flag saying what further action should be taken in the xfer process. How that flag is interpreted depends on the overall state and progress of the xfer machinery.

It turns out that the original idea probably was that the sync callback is used for the really important stuff, being called on every single change of the xfer state machine, while the async callback was mainly meant for expensive user interface updates. Unfortunately, the code (cf. call_progress_often_internal and call_progress_uri) takes both the sync and the async callback’s return value into account, but the async callback is called after the sync callback, and so its return value takes precedence over the sync retval. Calling the async callback after the sync one may make sense in some situations, like informing the user that something just changed in the sync callback. The fact that the sync callback’s return value is overwritten by the async callback’s retval can really be a problem, though, because, as call_progress_often_internal shows, the sync callback can’t always predict whether the async callback will be called right afterwards or not.
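
To make the ordering problem concrete, here is a minimal sketch in C of the dispatch as described above. It is a model of this post's description, not GnomeVFS code: the names xfer_callbacks, model_call_progress_throttled and the two sample callbacks are hypothetical, and the convention that 0 means "abort" and nonzero means "continue" is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical callback signature: returns an action code.
 * Assumed convention for this sketch: 0 = abort, nonzero = continue. */
typedef int (*progress_cb)(void *data);

typedef struct {
    progress_cb sync_cb;    /* run on every state change */
    progress_cb async_cb;   /* run at most once per period */
    void *data;             /* passed through to both callbacks */
    long period_ms;         /* minimum gap between async runs */
    long last_async_ms;     /* timestamp of the last async run */
} xfer_callbacks;

/* Model of a throttled progress call: the sync callback always runs,
 * the async one only if period_ms has elapsed since its last run.
 * When both run, the async return value replaces the sync one. */
int model_call_progress_throttled(xfer_callbacks *cbs, long now_ms)
{
    int retval = 1;  /* assumed default: continue if no callback is set */

    if (cbs->sync_cb != NULL)
        retval = cbs->sync_cb(cbs->data);

    if (cbs->async_cb != NULL &&
        now_ms - cbs->last_async_ms >= cbs->period_ms) {
        cbs->last_async_ms = now_ms;
        retval = cbs->async_cb(cbs->data);  /* overwrites the sync retval */
    }
    return retval;
}

/* Sample callbacks: the sync one wants to abort, the async one
 * knows nothing about that and asks to continue. */
int sample_sync_abort(void *data)     { (void)data; return 0; }
int sample_async_continue(void *data) { (void)data; return 1; }
```

With a period of 100 ms, a call at 50 ms returns the sync callback's 0, while a call at 150 ms returns the async callback's 1 although the sync callback asked to abort. The sync callback cannot tell from the inside which of the two cases it is in.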

In short, to work around this design glitch, you’ll be forced to either use only a sync or only an async callback, or use both and make the async callback aware of the sync callback’s last retval through some internal variable, so that it overrides that retval only if desired. Notice that each time the async callback is called, the sync callback was called right before, so the async callback can either return the sync retval again or return an intentional override value.
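
The second workaround could look roughly like the following sketch. Everything in it is hypothetical (the shared_state struct, its fields, and both callback names); it only illustrates the idea of passing the sync callback's last decision to the async callback through shared state, again assuming 0 means "abort".

```c
#include <stdbool.h>

/* Hypothetical state shared by both callbacks via their user data. */
typedef struct {
    int  last_sync_retval;  /* what the sync callback decided last */
    bool error_seen;        /* set by the caller when the transfer failed */
    bool user_override;     /* set when the UI wants a different action */
    int  override_retval;   /* value to return when user_override is set */
} shared_state;

/* Sync callback: makes the real decision and remembers it. */
int aware_sync_cb(void *data)
{
    shared_state *s = data;
    s->last_sync_retval = s->error_seen ? 0 : 1;  /* assumed: 0 = abort */
    return s->last_sync_retval;
}

/* Async callback: repeats the sync decision unless an override
 * was requested on purpose. */
int aware_async_cb(void *data)
{
    shared_state *s = data;
    if (s->user_override)
        return s->override_retval;
    return s->last_sync_retval;
}
```

Because the async callback returns the remembered value by default, it no longer matters whether it happens to be invoked after the sync callback or not.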

Shockingly, the current Nautilus Xfer code does all its error handling in the async callback, which is not aware of the sync callback’s response, resulting in random behavior depending on the time since the last async invocation and the actual XFer state.

With some good luck, the async code is called each time the transfer does something important. But the GnomeVFS xfer copy_file and copy_directory code, and some other rather important functions, use the callback invocations that only invoke the async callback if it hasn’t already been invoked within a particular period, which is probably due to performance considerations.

Under some circumstances the async code is not invoked, and the Nautilus sync code blindly returns 0 in some situations where that is absolutely not desirable.
No error handling is done in Nautilus’ sync callback, its response does not depend on the state of the state machine, and whether its return value is used or not depends on the period since the last async invocation.

Conclusions
a) GnomeVFS has design flaws
b) Nautilus has design flaws
c) fixing a partially broken or unintuitive API concept is very hard if not impossible, even if the API itself is powerful
d) Data loss is no good

Consequences
a) improve GnomeVFS docs, maybe change async/sync callback handling
b) fix Nautilus by making the async callback sync-aware and moving important stuff into the sync callback (done with some luck, needs testing)
c) write proper GnomeVFSXfer API documentation (TODO)

Update

I’m not so sure anymore whether I got the whole sync/async process right. According to the GnomeVFSProgressCallbackState, the async callback is called “periodically every few hundred miliseconds and whenever user interaction is needed”. I don’t like that architecture at all, and am inclined to modify it so that the user always has to specify a sync callback, while the async callback would be optional and its retval would be ignored.
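
As a rough sketch of that proposed architecture, and nothing more than that: the names below are all hypothetical, and returning -1 for a missing sync callback is an assumption of this example. The async callback returns void, so it cannot influence the transfer at all.

```c
#include <stdbool.h>
#include <stddef.h>

typedef int  (*sync_progress_cb)(void *data);   /* decides the next action */
typedef void (*async_progress_cb)(void *data);  /* UI update only, no retval */

/* Proposed dispatch: the sync callback is mandatory and alone decides
 * the action. The optional async callback runs only when async_due is
 * true. Returns -1 if no sync callback was supplied (assumed error code). */
int proposed_call_progress(sync_progress_cb sync_cb,
                           async_progress_cb async_cb,
                           void *data, bool async_due)
{
    int retval;

    if (sync_cb == NULL)
        return -1;

    retval = sync_cb(data);

    if (async_cb != NULL && async_due)
        async_cb(data);  /* cannot change retval */

    return retval;
}

/* Sample callbacks for trying it out. */
typedef struct { int action; int ui_updates; } demo_state;

int  demo_sync(void *data)  { return ((demo_state *)data)->action; }
void demo_async(void *data) { ((demo_state *)data)->ui_updates++; }
```

Whatever the throttling does, the result is always the sync callback's decision, and the async callback only ever updates the user interface.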

Maybe Christian Kellner also was right some months ago when he concluded that a new GnomeVFS async file operation API is needed.

7 thoughts on “When things go really bad…and nobody notices it for years”

Stu says:

January 11, 2006 at 9:32 am

Hmm when I was using linux exclusively I did have a “feeling” that sometimes nautilus would lose data, esp when copying large amounts of files around but only sometimes… although I did use it a lot, it still seemed like it could sometimes go wrong, maybe this is part of it…
Tiago Bugarin says:

January 11, 2006 at 9:55 am

Christian or any one, please, explain it to me. Will next Gnome release have this stuff fixed or this is thing that will demand more time to fix?
Should I trust Gnome as it is right now or should I go to KDE while waiting for this fix? (I am using Ubuntu Breezy at the time in my pc)
I am not a developer and I can not measure how deep this problem is or how much it will touch my day-to-day work so please don’t understand this as a flame or such. I am just trying to understand.
Thank you all.
Klaus Kinski says:

January 11, 2006 at 7:08 pm

This sounds like a serious issue that can not be tracked and fixed easily. There are so many gnome-vfs depending apps outside and some of them even work around things, others switched to things like curl or neko to do file handling.

Fixing gnome-vfs would mean that it needs a total rewrite in many areas.
Andreas Mohr says:

January 12, 2006 at 12:14 am

I’m not a GNOME user, but almost the *first time* (and certainly the last time, who’d have thunk that) I actually touched nautilus (to do some file operation with my personal SMB folder), I immediately lost the whole folder content and had to ask the (Windows!) IT guys for the last backup – how embarrassing! That was with a nautilus version from early RHEL3, sorry, I don’t remember too many specifics any more.
This told me instantly to stay the “§$% away from such “software”.
BTW, this had been almost the *only* data loss (minus losing a /usr partition due to very experimental kernel once, fortunately backupped) within a decade of almost exclusively using Linux…

OK, so this comment is more flame-style than helpful, but still it tells you something about software reliability.

Anyway, thanks for highlighting this critical issue in such an important low-level component!
Alexander Larsson says:

January 12, 2006 at 1:41 am

This blog entry is wrong. The async callback is always called on errors (although not always on normal progress).
garbeam says:

January 12, 2006 at 3:32 am

In my opinion GnomeVFS is totally over-engineered and totally crap. Instead I recommend 9P:
http://plan9.escet.urjc.es/magic/man2html/5/0intro

Instead of reinventing the wheel, this fs-IO protocol has been used for years adequately in distributed, network-transparent ways. Recently also a Linux kernel module appeared (since 2.6.14) which supports to mount 9P-based file servers to the native VFS.
Erich Schubert says:

January 12, 2006 at 6:17 am

Any chance that KDE and GNOME end up with a common solution?
I guess the current KDE stuff is C++ again, so probably not as well-suited for adoption by GNOME, but maybe there could still be a solution that can be adopted by both worlds sometime.
“Freedesktop.org” pops into my mind somehow… and, I just noticed, there is already some VFS thingy there: http://freedesktop.org/wiki/Software_2fdvfs
