You may remember my posting regarding thumbnail caching.
The problem is that all applications using GnomeThumbnail read the entries of ~/.thumbnails/normal or ~/.thumbnails/large [and ~/.thumbnails/failed] synchronously when the first thumbnail request is made, which takes a noticeable time [more than a few seconds] for a few hundred MB of thumbnails. There is a long-standing bug report about that issue.
Why the cache, in the first place?
The idea was to keep around an in-memory list (a cache) of all available thumbnails, and additionally to store the previously-requested thumbnails in memory, as it is likely that previously-requested thumbnails are used again.
The problematic aspect of the cache is that ~/.thumbnails may contain hundreds of megabytes, which makes a cache refresh very expensive.
First idea: Get rid of cache
My initial idea was to solve this by removing the in-memory cache altogether. While users reported that this solved their problems and didn’t cause any performance impact, this cannot be generalized: Currently, the user-visible thumbnail loading time is dominated by the time to synchronously read the thumbs, not by the cache-lookup time. This will no longer be true once we have asynchronous thumbnail loading in Nautilus [which is not written yet, but should heavily improve performance].
Second idea: Multi-threaded cache voodoo
So the next idea was to use a multi-threaded solution, which would refresh the cache in a worker thread, and circumvent it entirely during refresh when requesting/generating a thumbnail. While it sounds good, it gets really nasty once you realize that POSIX doesn’t specify what happens when a directory changes during readdir(). Assuming you refresh the cache from disk [worker thread], and at the same time generate a new thumbnail without using the cache [main thread], you’ll end up with a modified file system and you cannot be sure about the validity of any entries you read; you’d have to reopen the directory and reread it entirely. This means no cache hits during thumbnail generation, i.e. if you open two directories simultaneously with Nautilus and the thumbs for one are being generated while the thumbs for the other already exist, you’ll get no cache hits at all. Maybe we could use file change notification, but it is platform-dependent and we don’t want to explicitly write code against Linux or some other UNIX and have gazillions of #ifdefs in the code for dnotify, inotify etc.
Third idea: (to-be-written?) on-demand in-memory file systems
I think the smartest solution [that happens to mean no work for us, i.e. just drop our cache, cf. first idea] involves finding an on-demand in-memory file system, which forms an ideal cache for many applications. Dear lazyweb, do you know of any FS that does the following:
- Store the entire FS contents in a file if it’s not mounted
- Load the entire FS directory structure into memory when it is mounted
- Load file contents into memory as they are requested
- Store file contents into memory and back into the image as they are written
- Live in user space, possibly using shared memory. As a bonus, you could tell it to prefetch the FS structure at any given point [in our case as the session starts]
You’d just mount it to ~/.thumbnails and let the voodoo happen. Yes, this is also platform-dependent, but it can be implemented for all platforms and doesn’t make us depend on the platform’s file system capabilities, falling back to a performance-reduced but not memory-intense behavior.
Thanks to the guys at the freenode ##c channel for the fruitful discussion, especially wobster for the POSIX and RAMFS hints.
Comments?
Can’t the thumbnails be attached to the files themselves via the filesystem feature “extended attributes” (EA)? That way there wouldn’t be any visible and annoying Thumbs.db or similar files anywhere….
http://en.wikipedia.org/wiki/Extended_attributes
//fatal
Hmm, I don’t really see the problem with your thread-based idea.
Every time you process a new file you should:
open() it.
fstat() it.
process it.
fstat() it again.
close() it.
Only if the access times in the stat struct did not change while the file was processed do you have a valid cache entry. If they changed, you should drop what you just processed and move the file to the end of your work queue.
The readdir() calls should be nothing more than a feeder for the work queue. Whether files disappear or are modified should not matter to the work queue.
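The steps above could be sketched roughly as follows. This is only an illustration, not GnomeThumbnail code: the names snapshot, drain and max_attempts are made up, and the sketch compares modification time, size and inode rather than access time, on the assumption that reading a file may itself update its atime.

```python
import os
from collections import deque


def snapshot(st):
    # Assumption: mtime, size and inode are compared instead of atime,
    # because processing (reading) the file may itself update atime.
    return (st.st_mtime_ns, st.st_size, st.st_ino)


def drain(paths, process, max_attempts=3):
    """Process each path; return {path: result} for files that stayed stable."""
    queue = deque((path, 1) for path in paths)
    results = {}
    while queue:
        path, attempt = queue.popleft()
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            continue  # the file disappeared: nothing to cache
        try:
            before = snapshot(os.fstat(fd))
            result = process(fd)
            after = snapshot(os.fstat(fd))
        finally:
            os.close(fd)
        if before == after:
            results[path] = result
        elif attempt < max_attempts:
            # The file changed while being processed: drop the result
            # and move the file to the end of the work queue.
            queue.append((path, attempt + 1))
    return results
```

The list of paths would come from the directory scan; a vanished file is simply skipped, and a file that keeps changing is given up on after max_attempts tries.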
This proposal seems extremely… hacky, for something which is arguably such a simple problem. Why not just attempt to read the thumbnail file async as it’s needed? This seems extremely obvious, and works quite well in other situations: Evolution’s email cache, etc.
Let the file system handle keeping the file in memory if it needs be. That may mean it gets left in the block cache.
I don’t really understand why this problem is hard, actually. Maybe I’m missing something.
As for cleaning the cache, seems reasonable to just scan for old files in the background in some fashion and blow them away. Do this in another thread. Easily isolated. Thread does one job and does not need to notify on completion.
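A background cleanup of that kind might look roughly like this. It is a minimal sketch under assumptions: "old" is taken to mean "not modified within max_age_seconds", and the function names are invented for the example. The directory is listed completely before anything is deleted, so the scan itself never removes entries from a directory it is still reading.

```python
import os
import threading
import time


def purge_old(directory, max_age_seconds, now=None):
    """Delete regular files whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    # First pass: only collect candidates, do not modify the directory.
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                age = now - entry.stat(follow_symlinks=False).st_mtime
            except OSError:
                continue  # entry vanished in the meantime
            if age > max_age_seconds:
                stale.append(entry.path)
    # Second pass: delete what was collected.
    removed = []
    for path in stale:
        try:
            os.unlink(path)
            removed.append(path)
        except OSError:
            pass  # already gone or not removable: ignore
    return removed


def purge_in_background(directory, max_age_seconds):
    """Run purge_old in a daemon thread; nobody has to wait for it."""
    thread = threading.Thread(
        target=purge_old, args=(directory, max_age_seconds), daemon=True
    )
    thread.start()
    return thread
```

The thread does one job and reports nothing back; the caller can keep the returned thread object around or ignore it.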
Well, the OS already does what you want (except for loading the directory structure into memory upon mount, which could be done with a userspace daemon). Have you considered just letting the OS worry about doing what it was designed to do?
I think a very important fact these days is that you have a lot of crap in your thumbs directory.
Many programs don’t update their thumbs if they move documents around or delete them. So a big win would be to create a cron job that scans all thumbs and removes them if the associated document is gone.
For my own use I wrote a little Ruby script for this task. With the first run it deleted several thousand thumbs.
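Such a scan could be sketched as below. This is not the Ruby script mentioned above, just an illustration in Python. It assumes that each thumbnail is a PNG carrying the original document's location in a tEXt chunk with the key Thumb::URI, as the freedesktop.org thumbnail specification describes; the function names are made up, and only file:// URIs are checked.

```python
import os
import struct
from urllib.parse import unquote, urlparse

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_thumb_uri(path):
    """Return the Thumb::URI text of a PNG file, or None if there is none."""
    with open(path, "rb") as f:
        if f.read(8) != PNG_SIGNATURE:
            return None
        while True:
            header = f.read(8)  # 4-byte big-endian length + 4-byte chunk type
            if len(header) < 8:
                return None
            length, chunk_type = struct.unpack(">I4s", header)
            if chunk_type == b"IEND":
                return None
            if chunk_type == b"tEXt":
                key, _, value = f.read(length).partition(b"\x00")
                if key == b"Thumb::URI":
                    return value.decode("latin-1")
                f.seek(4, os.SEEK_CUR)  # skip the CRC
            else:
                f.seek(length + 4, os.SEEK_CUR)  # skip data and CRC


def find_orphans(thumb_dir):
    """List thumbnails whose local source document no longer exists."""
    orphans = []
    for name in sorted(os.listdir(thumb_dir)):
        path = os.path.join(thumb_dir, name)
        try:
            uri = read_thumb_uri(path)
        except OSError:
            continue  # unreadable entry or a subdirectory
        if uri is None:
            continue
        parsed = urlparse(uri)
        # Remote or unknown schemes are left alone; only local files are checked.
        if parsed.scheme == "file" and not os.path.exists(unquote(parsed.path)):
            orphans.append(path)
    return orphans
```

The sketch only reports orphans; whether to delete them right away, for example from a cron job, is left to the caller.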
The fastest work is the work you don’t have to do.
Cheers
detlef
4: Cache in the dir the image lives in. Bonus: rmdir gets rid of the cache too.
Yeah. I’d vote for not using .thumbnails, for tons of reasons. It’s just a needless disconnect. I’d just put it in the directory of the actual files, like every other OS on the planet does. If there’s no permission to do so, just don’t. Recalculate them each time. No big deal.
It also clogs up my NFS home dirs. Makes things more difficult than necessary.
Concerning the second idea: what if you generate thumbnails in a separate directory, and you “merge” them in synchronously to the actual thumbnail cache dir?
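One way to picture this suggestion, as a rough sketch only: new thumbnails are written into a staging directory, and a merge step later moves them into the real cache directory. The function names are hypothetical, and the sketch assumes both directories live on the same file system, so that each move is a single rename.

```python
import os


def write_staged(staging_dir, name, data):
    """Write a freshly generated thumbnail into the staging directory."""
    path = os.path.join(staging_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


def merge(staging_dir, cache_dir):
    """Move all staged thumbnails into the cache directory.

    Meant to be called synchronously, at a point where no cache refresh
    is reading cache_dir.
    """
    moved = []
    for name in sorted(os.listdir(staging_dir)):
        # os.replace is a rename; on the same file system the target
        # appears in one step and overwrites any older thumbnail.
        os.replace(os.path.join(staging_dir, name),
                   os.path.join(cache_dir, name))
        moved.append(name)
    return moved
```

The cache directory is then only ever modified inside merge, so a refresh that does not overlap with a merge sees a stable directory.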
How about using a generic metadata daemon, like Tracker?
To me this seems to duplicate new Tracker functionality. Tracker provides on-the-fly thumbnail generation. Would it not be easier to just request thumbnails from Tracker through D-Bus?
Cheers