Archive for March, 2009

ext4 vs fsync, my take

Monday, March 16th, 2009

There has been a lot of discussion about the ext4 data loss issue, and I see a lot of misconceptions, both about why rename() is used and what guarantees POSIX gives. I’ll try to give the background, and then my opinion on the situation.

There are two basic ways to update a file. You can either truncate the old file and write the new contents, or you can write the new contents to a temporary file and rename it over the old file when finished. The rename method have several advantages, partly based on the fact that rename is atomic. The exact wording from POSIX (IEEE Std 1003.1TM, 2003 Edition) is:

In this case, a link named new shall remain visible to other processes throughout the renaming operation and refer either to the file referred to by new or old before the operation began.

This gives the rename method some useful properties:

  • If the application crashes while writing the new file, the original file is left in place
  • If an application reads the file the same time as someone is updating it the reading application gets either the old or the new file in its entirety. I.e. we will never read a partially finished file, a mixup of two files, or a missing file.
  • If two applications update the file at the same time we will at worst lose the changes from one of the writers, but never cause a corrupted file.

Note that nothing above talks about what happens in the case of a system crash. This I because system crashes are not specified at all by POSIX. In fact, the behaviour specified predates journaled filesystems where you have any reasonable expectation that recently written data is availible at all after a system crash. For instance, a traditional unix filesystem like UFS or ext2 may well lose the entire filesystem on a system crash if you’re unlucky, but it is still POSIX compliant.

In addition to the above POSIX specifies the “fsync” call, which can be used in the rename method. It flushes all in-memory buffers corresponding to the file onto hardware (this is vaguely specified and the exact behaviour is hw and sw dependent), not returning until its fully saved. If called on the new file before renaming it over the old file it gives a number of advantages:

  • If there is a hardware I/O error during the write to the disk we can detect and report this.
  • In case of a system crash shortly after the write, its more likely that we get the new file than the old file (for maximum chance of this you additionally need to fsync the directory the file is in)
  • Some filesystems may order the metadata writes such that the rename is written to disk, but the contents of the new file are not yet on disk. If we crash at this point this is detected on mount and the file is truncated to 0 bytes. Calling fsync() guarantees that this does not happen. [ext4]

However, it also has a number of disadvantages:

  • It forces a write immediately, spinning up the disk and causing more power use and more wear on flash filesystems.
  • It causes a longer wait for the user, waiting for data to be on disk.
  • It causes lower throughput if updating multiple files in a row.
  • Some filesystems guarantee ordering constraint such that fsync more or less implies a full sync of all outstanding buffers, which may cause system-wide performance issues. [ext3]

It should be noted that POSIX, and even ext4 gives no guarantees that the file will survive a system crash even if using fsync. For instance, the data could be outstanding in hardware buffers when the crash happens, or the filesystem in use may not be journaled or otherwise be robust wrt crashes. However, in case of a filesystem crash it gives a much better chance of getting the new data rather than the old, and on reordering filesystems like an unpatched ext4 it avoids truncated files from the rename method.

Both the fsync and the non-fsync version has their places. For very important data the guarantees given by fsync are important enough to outweight the disadvantages. But in many cases the disadvantages makes it too heavy to use, and the possible data loss is not as big of an issue (after all, system crashes are pretty uncommon).

So much for the background, now over to my personal opinions on filesystem behaviour. I think that in the default configuration all general purpose filesystem that claim to be robust (be it via journalling or whatever) should do their best to preserve the runtime guarantees of the atomic rename save operation so that they extend to the system crash case too. In other words, given a write to a new file followed by a rename over an old file, we shall find either the old data or the new data. This is a less of a requirement than fsync-on-close, but a requirement nevertheless that does result in a performance loss. However, just the fact that you’re running a journaled filesystem is a performance cost already, and something the user has explicitly chosen in order to have less risk of losing data.

It would be nice if the community could work out a way to express intent of the save operation to the filesystem in such a way that we avoid the unnecessary expensive fsync() call. For instance, we could add a fcntl like F_SETDATAORDERED that tells the kernel to ensure the data is written to the disk before writing the metadata for the file to the disk. With this in place applications could choose either if they want the new file on disk *now*, or just if it wants either the old or the new file, without risk for total data loss. (And fall back on fsync if the fcntl is not supported.)

This is the current status of the rename method on the commonly used Linux filesystems to my best knowledge:
(In this context “safe” means we get either the old or the new version of the file after a crash.)

ext2: No robustness guarantees on system crash at all.

ext3: In the default data=ordered mode it is safe, because data is written before metadata. If you crash before the data is written (5 seconds by default) you get the old data. With data=writeback mode it is unsafe.

ext4: Currently unsafe, with a quite long window where you risk data loss. With the patches queued for 2.6.30 it is safe.

btrfs: Currently unsafe, the maintainer claims that patches are queued for 2.6.30 to make it safe

XFS: Currently unsafe (as far as i can tell), however the truncate and overwrite method is safe.