ext4 vs fsync, my take

There has been a lot of discussion about the ext4 data loss issue, and I see a lot of misconceptions, both about why rename() is used and what guarantees POSIX gives. I’ll try to give the background, and then my opinion on the situation.

There are two basic ways to update a file. You can either truncate the old file and write the new contents, or you can write the new contents to a temporary file and rename it over the old file when finished. The rename method has several advantages, partly based on the fact that rename is atomic. The exact wording from POSIX (IEEE Std 1003.1™, 2003 Edition) is:

In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.

This gives the rename method some useful properties:

  • If the application crashes while writing the new file, the original file is left in place
  • If an application reads the file at the same time as someone is updating it, the reading application gets either the old or the new file in its entirety. I.e. we will never read a partially finished file, a mixup of two files, or a missing file.
  • If two applications update the file at the same time we will at worst lose the changes from one of the writers, but never cause a corrupted file.
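
The rename method described above can be sketched roughly as follows. This is a minimal illustration rather than any particular application's save code: the function name `replace_file` and the `.tmp-` prefix are made up for the example, and it deliberately does not call fsync, so it only gives the runtime guarantees listed above, not crash safety.

```python
import os
import tempfile


def replace_file(path, data):
    """Replace the contents of `path` with `data` (bytes) via rename()."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # because rename() cannot move a file across filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # From here on, other processes see either the old file or the
        # complete new file under `path`, never a partial one.
        os.rename(tmp_path, path)
    except BaseException:
        # If writing or renaming fails, the original file is untouched;
        # only the temporary file needs cleaning up.
        os.unlink(tmp_path)
        raise
```

Note that `mkstemp` creates the temporary file with restrictive permissions, so a real implementation would probably also want to copy the mode and ownership of the file it replaces.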

Note that nothing above talks about what happens in the case of a system crash. This is because system crashes are not specified at all by POSIX. In fact, the behaviour specified predates journaled filesystems, where you have any reasonable expectation that recently written data is available at all after a system crash. For instance, a traditional unix filesystem like UFS or ext2 may well lose the entire filesystem on a system crash if you’re unlucky, but it is still POSIX compliant.

In addition to the above, POSIX specifies the “fsync” call, which can be used in the rename method. It flushes all in-memory buffers corresponding to the file onto hardware (this is vaguely specified and the exact behaviour is hardware and software dependent), not returning until the data is fully saved. If called on the new file before renaming it over the old file, it gives a number of advantages:

  • If there is a hardware I/O error during the write to the disk we can detect and report this.
  • In case of a system crash shortly after the write, it’s more likely that we get the new file than the old file (for maximum chance of this you additionally need to fsync the directory the file is in)
  • Some filesystems may order the metadata writes such that the rename is written to disk, but the contents of the new file are not yet on disk. If we crash at this point this is detected on mount and the file is truncated to 0 bytes. Calling fsync() guarantees that this does not happen. [ext4]
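
As a rough sketch of the fsync variant, assuming a POSIX-like system where a directory can be opened read-only and passed to fsync (this works on Linux, but is not guaranteed everywhere), the sequence looks like this. The function name `replace_file_durably` is made up for the example.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes), fsyncing before the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # userspace buffer -> kernel
            os.fsync(tmp.fileno())  # kernel -> disk; raises OSError on I/O error
        # The data is on disk before the rename can be, so a crash cannot
        # leave a renamed but empty file behind.
        os.rename(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    # Optionally fsync the containing directory too, to maximise the chance
    # that the rename itself survives a crash shortly afterwards.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Any `OSError` raised by the fsync propagates to the caller, which is how a hardware I/O error gets detected and reported.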

However, it also has a number of disadvantages:

  • It forces a write immediately, spinning up the disk and causing more power use and more wear on flash filesystems.
  • It causes a longer wait for the user, waiting for data to be on disk.
  • It causes lower throughput if updating multiple files in a row.
  • Some filesystems guarantee ordering constraints such that fsync more or less implies a full sync of all outstanding buffers, which may cause system-wide performance issues. [ext3]

It should be noted that POSIX, and even ext4, give no guarantee that the file will survive a system crash even if using fsync. For instance, the data could be outstanding in hardware buffers when the crash happens, or the filesystem in use may not be journaled or otherwise robust with respect to crashes. However, in case of a system crash, fsync gives a much better chance of getting the new data rather than the old, and on reordering filesystems like an unpatched ext4 it avoids truncated files from the rename method.

Both the fsync and the non-fsync versions have their place. For very important data the guarantees given by fsync are important enough to outweigh the disadvantages. But in many cases the disadvantages make it too heavy to use, and the possible data loss is not as big an issue (after all, system crashes are pretty uncommon).

So much for the background, now over to my personal opinions on filesystem behaviour. I think that in the default configuration all general purpose filesystems that claim to be robust (be it via journalling or whatever) should do their best to preserve the runtime guarantees of the atomic rename save operation so that they extend to the system crash case too. In other words, given a write to a new file followed by a rename over an old file, we shall find either the old data or the new data. This is less of a requirement than fsync-on-close, but a requirement nevertheless, and one that does result in a performance loss. However, just the fact that you’re running a journaled filesystem is a performance cost already, and something the user has explicitly chosen in order to have less risk of losing data.

It would be nice if the community could work out a way to express the intent of the save operation to the filesystem in such a way that we avoid the unnecessarily expensive fsync() call. For instance, we could add an fcntl like F_SETDATAORDERED that tells the kernel to ensure the data is written to the disk before writing the metadata for the file to the disk. With this in place, applications could choose whether they want the new file on disk *now*, or just want either the old or the new file, without risk of total data loss. (And fall back on fsync if the fcntl is not supported.)
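
To make the idea concrete, here is a purely hypothetical sketch of how an application might use such a call. F_SETDATAORDERED does not exist in any kernel or in Python's `fcntl` module; the name is only the proposal from the paragraph above, and the helper `request_ordered_data` and the argument `1` are made up for illustration. Since the constant is not defined anywhere today, the sketch always ends up in the fsync fallback.

```python
import fcntl
import os

# Hypothetical: look up the proposed constant if a future platform ever
# defined it. Today this is always None.
F_SETDATAORDERED = getattr(fcntl, "F_SETDATAORDERED", None)


def request_ordered_data(fd):
    """Ask for data-before-metadata ordering on `fd`, else fall back to fsync.

    Returns "ordered" if the (hypothetical) fcntl was accepted,
    or "fsync" if the fallback was used.
    """
    if F_SETDATAORDERED is not None:
        try:
            fcntl.fcntl(fd, F_SETDATAORDERED, 1)
            return "ordered"
        except OSError:
            pass  # kernel or filesystem does not support the request
    # Fallback: force the data to disk now, before the caller renames.
    os.fsync(fd)
    return "fsync"
```

An application would call this on the temporary file just before the rename, in place of an unconditional fsync.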

This is the current status of the rename method on the commonly used Linux filesystems to my best knowledge:
(In this context “safe” means we get either the old or the new version of the file after a crash.)

ext2: No robustness guarantees on system crash at all.

ext3: In the default data=ordered mode it is safe, because data is written before metadata. If you crash before the data is written (5 seconds by default) you get the old data. With data=writeback mode it is unsafe.

ext4: Currently unsafe, with a quite long window where you risk data loss. With the patches queued for 2.6.30 it is safe.

btrfs: Currently unsafe; the maintainer claims that patches are queued for 2.6.30 to make it safe.

XFS: Currently unsafe (as far as I can tell), however the truncate and overwrite method is safe.

22 Responses to “ext4 vs fsync, my take”

  1. Pavel says:

    do you know whether the patched ext4 does your proposed DATAORDERED solution or just implicitly calls fsync?
    Besides that I also think that POSIX is at fault here by not providing a clean atomic file update.

  2. sam says:

    Thank you for this clarifying summary on the ext4 data loss issue! Your proposed solution sounds reasonable too.

  3. alexl says:

    Pavel: the ext4 patch makes sure data is written before the rename is written to disk, so it doesn’t implicitly fsync() on close. I.e. it has the same behaviour as ext3.

  4. ulrik says:

    Ext4 discussion is all over..

    I prefer a solution without direct flush or fsync. It has been reiterated all over, but what we really REALLY want is atomicity: either the new file or the old file, even after the crash. The ext4 devs and others throw the POSIX book at our heads and say we have to flush to disk to get that. That’s just not an elegant solution. I want my laptop to defer writes and use all of linux awesomeness, working in RAM for a while etc etc.

    Save my battery!
    Don’t truncate my files!
    Make it atomic and recover to either state after crash — pre or post write.

    So the pragmatic solutions are good, but not perfect. Perfect would be preserving ordering to make sure renames can’t erase the whole file.

  5. Ray says:

    For those not following along, two other solutions were proposed on lwn.net:

    http://lwn.net/Articles/323248/

    – Adding a new call “flink()” that does the rename based on file descriptor instead of based on the name.

    – Adding a new flag to open, O_REWRITE, that eliminates some of the hoops app developers have to jump through and makes the ordering entirely the kernel’s problem.

    All three possible solutions have a significant advantage over adding fsync(). In theory they don’t force I/O to happen before the close. The entire operation can be postponed in one chunk to be done when the filesystem is ready.

  6. amano says:

    Alex: On the mailing list you proposed two patches for GIO in the upcoming Gnome 2.26. Which one made it? The first one with all calls replaced by fsync(), the second one for just some cases or even neither one?

  7. alexl says:

    amano:
    Neither is committed yet. However, I’m leaning towards only fsyncing in some cases.

  8. Yevgen Muntyan says:

    Alex, if I read your mail correctly, in the end you were going to call fsync() when saving over an existing file. That is the case for config files, similar to mozilla’s infamous sqlite stuff, which doesn’t sound good. Or did I misunderstand you? Thanks!

  9. Hey Alex,

    Great post – probably the most level-headed one I’ve read about this issue so far! I work on the Mozilla project, and I know and deal with the issues involving fsync all the time, so I understand the desire to stay away from it whenever possible.

  10. alexl says:

    Yevgen:
    It’s not really the same as the firefox case. When writing out a file, if an older version exists already we fsync the new one before renaming it over the old.

    The firefox usecase is different, it keeps a database file open all the time and does in-place writes to it and fsyncs it.

    Of course, the same problem could happen if you use the file replace code to constantly replace a file, but if you’re doing that then you have other efficiency problems anyway that are far more problematic than the fsync.

  11. Luke says:

    This is a great post!

    Is it possible to add an O_TRUNC_ATOMIC extension to fopen() that writes the data for a file to a new (non-directory-linked) inode, but guarantees processes opening the file during the write don’t see the new contents until the file is completely written and closed? This could fall back to O_TRUNC when the filesystem doesn’t implement it.

    Secondly, for the rename case, is it possible to simply track data dependencies in ext4’s reordering code, so that the metadata cannot be updated on disk until the data is fully written? Seems like adding a dependency graph would be a fix.

  13. alexl says:

    Luke:

    Such O_TRUNC_ATOMIC behaviour is very unlike normal posix/unix behaviour. So it’s probably both hard and not a good idea.

    About ext4 dependency tracking, that’s exactly what the patches in 2.6.30 do.

  14. Nikesh says:

    I have been using ext4 on opensuse 11.1 for some time and it’s looking stable, but I am only using it on my desktop, so I’m not sure how it will work when put under server load.

    If anyone would like to run ext4 on opensuse – http://linuxpoison.blogspot.com/2009/01/ext4-support-on-opensuse-111.html

  17. Bojan says:

    Here are a few facts that I think everyone taking part in this discussion should be aware of:

    1. The rename(2) call operates on directory pathnames only. This is explained by Ted nicely here: http://lwn.net/Articles/323745/. The fact that the kernel can present a consistent picture to processes (in terms of both the directories and files) is related to the fact that, of course, the kernel keeps track (internally) of where directory entries point to, which has nothing whatsoever to do with persisting directories or files to disk.

    2. It is clear from the documentation of fsync() that separate calls to commit directories and files _can_ (and should) be made. This clearly spells out that these are two different entities, which users can commit to disk at will.

    3. In the absence of fsync(), the order of commits of both files and directories is generally not specified, which means that the kernel is free to do as it pleases.

    Now, any talk of barriers, transactions and what not is just wishful thinking in the current API. There are simply no such guarantees, and programming to such guarantees is completely non-portable.

    Claiming that a particular file system not implementing such guarantees is broken is false. We all got bitten by this, unfortunately, and the problem is in the application code.

    The problem of configuration files being zeroed on a crash (which should be a rare event) can be solved by using backup files, even in the absence of constant fsync(). An example (probably not a very good one) is here: http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/#comment-2239.

  18. Bojan says:

    > current API

    I mean rename(2) here, of course.

  19. Robert Devi says:

    While the proposed solution will fix the problem nonportably, it is a cop-out.

    POSIX doesn’t make guarantees on a lot of things. Neither does ISO or ANSI. But all trust implementations “to do the right thing” given the environments of use. For instance, in C, when you call malloc() it’s undefined where the memory comes from or even if all requested memory is available right now. But that doesn’t mean that malloc() can allocate memory in video memory (messing up the screen) or read-only memory. While such behaviour is technically legal according to ISO/ANSI, it’s not reasonable behaviour.

    If ext4 wants to be taken seriously as a desktop file system, it must have reasonable behaviour in cases common to desktops, and that includes gracefully surviving failures such as crashes or someone tripping over the power plug.

    Thankfully, ext4 seems to be fixed.
