Maybe somebody with deep kernel knowledge could help me sort out the following problem, which I mentioned some time ago (bug report):
When ejecting a USB mass storage device, the volume monitor notifies the file manager that an unmount happened, although the USB stick has not yet been ejected, only unmounted. The difference matters in this particular case because the buffers for the USB stick have not yet been flushed, i.e. not all data has definitely been written to the stick.
Because of this notification, the USB stick icon is removed from the screen, which suggests to the user that it is OK to remove the stick, although the buffers are not yet flushed and the “eject” command hasn’t even been run.
We have multiple ways of resolving this:
a) Block the unmount signal, i.e. delay it until the whole ejection process is over. Problem: the ejection runs in a different thread, which knows nothing about the volume, only its URI. We’d have to introduce mutex-protected hash tables and some extra glue code. Not particularly attractive.
b) Tell the underlying operating system (Linux in my case) to flush the storage buffers before the unmount takes place, thus ensuring a clean unmount.
Unfortunately, I was unable to figure out the right syscall, although I fiddled around with the kernel for a considerable amount of time.
candidate I: fsync(filedes)
“The fsync() function shall request that all data for the open file
descriptor named by fildes is to be transferred to the storage device
associated with the file described by fildes.”
Oh well, it blocks and returns, but /dev/foo isn’t flushed, no matter what parameter combination I try. I wonder whether “transferred to the storage device” should be read as “the data on the device is consistent” or merely as “the data was piped through some connection”.
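For reference, here is a minimal sketch of what fsync() does cover: the dirty data of the one file behind the descriptor it is given. The helper name write_and_fsync and the file path in the usage note are made up for illustration. As one of the comments below explains, calling fsync() on the device node instead does not reach files written through the mounted file system.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write `data` to `path`, then fsync() that file.
 * Returns 0 on success, -1 on failure. */
int write_and_fsync(const char *path, const char *data)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) {            /* treat short-circuit and errors alike */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Requests that this file's modified data and metadata be written
     * out; other files on the same file system are not affected. */
    int rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A call such as write_and_fsync("/mnt/file", "hello\n") would flush only that one file, which is why this approach does not scale to "everything on the stick".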
candidate II: sync()
Works, flushes buffers for all block devices. Problematic, though: it also syncs unrelated block devices, which may be a huge problem when there are many devices, possibly being unmounted concurrently (at least from the user’s point of view).
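For comparison, and not among the candidates tried here: Linux 2.6.39 and later (with glibc 2.14 or newer) provides syncfs(2), which behaves like sync() but only for the file system containing the file behind a given descriptor. It is Linux-specific, so it does not help with portability. A minimal sketch, with the helper name sync_filesystem_of being a made-up one:

```c
#define _GNU_SOURCE      /* syncfs() is a Linux/glibc extension */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush the file system that contains `path`
 * (for example a mount point). Returns 0 on success, -1 on failure. */
int sync_filesystem_of(const char *path)
{
    assert(path != NULL);

    /* Any file or directory on the target file system will do. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = syncfs(fd);     /* like sync(), but for one file system */
    close(fd);
    return rc;
}
```

Calling sync_filesystem_of("/mnt") before the unmount would leave unrelated devices alone, assuming the stick is mounted on /mnt.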
candidate III: ioctl(filedes, BLKFLSBUF, 0)
This one sounds promising. When investigating the issue I hoped this would work, since e2fsprogs also uses it. However, I was wrong. While according to sys/mount.h it is part of “(t)he read-only stuff”, the actual kernel code checks whether the user has admin capabilities and sets errno to EACCES even if the user has rw permissions on the block device file.
For non-generic block devices, others have submitted patches, as mentioned in a comment in the said bug report, which reduce the permissions required to flush a buffer. I wonder whether there is a traditional way of dealing with unflushed buffers in UNIX, because this has a lot to do with permission models, which is where things get complicated.
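The BLKFLSBUF attempt can be sketched roughly as follows. This is only an illustration, not the code from the bug report: the helper name flush_block_device is made up, and BLKFLSBUF is taken from <linux/fs.h> here. Without admin capabilities the ioctl is expected to fail with EACCES, as described above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>    /* BLKFLSBUF */
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to flush the buffers of the
 * block device at `dev_path`. Returns 0 on success, or a negative
 * errno value on failure. */
int flush_block_device(const char *dev_path)
{
    assert(dev_path != NULL);

    int fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return -errno;

    int rc = 0;
    if (ioctl(fd, BLKFLSBUF, 0) < 0)
        rc = -errno;     /* e.g. EACCES when lacking admin capabilities */

    close(fd);
    return rc;
}
```

An unprivileged call such as flush_block_device("/dev/foo") would then be expected to return -EACCES, even when the user can read and write the device file.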
It would be really nice to have a simple, portable way of achieving what I want: flushed buffers for a specific block device.
5 thoughts on “ioctl, fsync – how to flush block device buffers?”
Your problem with fsync is the interpretation of “all data for the open file descriptor”.
For every inode in use on the system, Linux creates a separate “page cache”. Fsync just writes the modified parts of this cache to the disk.
Different inodes have different page caches. So, an fsync on /dev/foo will not have any effect on the cache of /mnt/file, even though /dev/foo is mounted on /mnt .
I don’t think there’s a portable way to sync a specific filesystem (I don’t even know if there’s a good solution on Linux). AFAIR, sync(2) is not required to wait until all the writes have completed.
What will happen if you try to remount the device? I would not be surprised if the remount was delayed until everything is properly flushed. If so, the proper unmounting procedure could be: umount, mount read-only, umount.
AFAIK “umount /foo/bar” syncs things before returning. I think it’d be weird to keep data in the kernel (be it caches or dirty data or anything) for a filesystem that is _not_ mounted.
But you probably should send this question to email@example.com (no subscription needed)
> AFAIK “umount /foo/bar” syncs things before returning
no, it doesn’t. eject does, though, it uses extra ioctl calls.
> I think it’d be weird to keep data in the kernel (be it caches or dirty data or anything) for a filesystem that is _not_ mounted.
The file system is still mounted when I invoke one of the syscalls, it is done before the actual unmount.
Please read my proposed solution at: