I get a lot of messages asking me to compare and contrast Storage,
WinFS, and sometimes Dashboard and Medusa. More recently, I’ve gotten a lot of
questions about Spotlight and Beagle. I’ve generally avoided commenting
(which usually means not answering the e-mail…) on these things both because
it’s impossible for me to do an unbiased comparison, and because the
goals seem to be quite different.

  • Medusa, Beagle & Spotlight are similar, though of course Spotlight is
    much more mature. I would call them metadata index systems.
  • Storage & WinFS are similar, though of course WinFS is much more mature.
    I would call them document stores.

Caveat: If indexing and search were the primary goals, a document store
would be a ridiculously overengineered approach. The
Medusa/Beagle/Spotlight model is much more sane if this is your only or
primary goal. I’m not saying this to suggest document stores are better
or worse than metadata indexing systems, only to point out that there’s
an element of apple-orange comparison at work here.

Metadata Index Systems

Medusa:

Medusa was originally written by Eazel, integrated tightly with Nautilus
1.0, and was slated for inclusion with the GNOME 1.4 release. It was
primarily written by Rebecca Schulman, but also had major contributions
from Maciej Stachowiak & some by myself. Medusa ran as root, which
worried some people (but of course, so does updatedb for slocate…),
but unfortunately had a major bug that caused it to be pulled from GNOME
1.4 at the last minute. Rebecca fixed the bug after the release, and
re-architected Medusa to run as a normal user. But unfortunately Eazel
collapsed before GNOME 2.0 and nobody promoted its inclusion. Curtis
Hovey & I later ported it to the GNOME 2.x platform, and Curtis is currently
maintaining it and adding lots of new features / fixes. In particular he
seems to be working on a UI for it. Medusa allowed very fast searches
over large indexes. Indexes were built by scanning the disk every night
(like slocate, unlike Spotlight which does things better). It also
provided a search: URI scheme that allowed creation of dynamic “search
folders”. So you could have a “Spreadsheets” folder for example that
always contained any spreadsheets on your system. The biggest hurdle for
Medusa today is that the set of indexers is not very extensible, and so
it doesn’t know how to index very many different file types.
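
To make the crawl model concrete, here is a rough Python sketch. It assumes an in-memory dictionary in place of Medusa's real on-disk index, and the names `crawl`, `search_by_extension` and `demo` are made up for illustration; none of this is Medusa's actual API. It shows the trade-off described above: searches are cheap lookups against a snapshot, but a file created after the last crawl stays invisible until the next one.

```python
import os
import tempfile


def crawl(root):
    """Walk the tree once and return a snapshot index: path -> metadata."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[path] = {
                "name": name,
                "extension": os.path.splitext(name)[1].lower(),
                "size": os.stat(path).st_size,
            }
    return index


def search_by_extension(index, extension):
    """Answer a query from the snapshot only; the disk is never touched."""
    return sorted(p for p, meta in index.items() if meta["extension"] == extension)


def demo():
    """Show that results go stale between crawls (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "budget.gnumeric"), "w") as f:
            f.write("placeholder")
        index = crawl(root)  # the "nightly" crawl

        # A new spreadsheet appears after the crawl has finished.
        with open(os.path.join(root, "taxes.gnumeric"), "w") as f:
            f.write("placeholder")
        stale = search_by_extension(index, ".gnumeric")

        index = crawl(root)  # the next crawl picks it up
        fresh = search_by_extension(index, ".gnumeric")
        return ([os.path.basename(p) for p in stale],
                [os.path.basename(p) for p in fresh])
```

A "Spreadsheets" search folder in this picture is just a saved call to `search_by_extension`, re-evaluated whenever the folder is opened.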

Spotlight:

Of course I haven’t looked at Spotlight’s code or used it, so what I
know about it is from what Apple has published and discussions with
friends at Apple. Spotlight appears to be a sophisticated, well-implemented
approach to building a metadata layer on top of an existing
file system. Changes to files appear to be noticed at the kernel layer,
and indexers are quickly run to update the metadata cache (with
information about filename, album name, size, file contents, keywords,
etc). I don’t know whether it is guaranteed that indexers will be run
before the data can be accessed, but it is supposed to happen very
quickly in any case so it appears instant to the user. Spotlight is the
work of (among others, there are probably more people I just don’t know)
Pavel Cisler (BeOS tracker & Eazel Nautilus) & Dominic Giampaolo (BeOS
BFS, which had a similarly sophisticated metadata system). A lot of work
has also gone into Spotlight’s UI, for grouping, measuring relevance,
etc. It’s easy to underestimate how much work this is; in some ways the
“indexing” is the easy part. Spotlight appears to index a lot
more than just the filesystem, including things like calendar and mail,
but I don’t know the full extent of what it can do.

Beagle:

My knowledge of Beagle is based on playing with it and reading through
a fair bit of the code, but I could definitely be missing large aspects
because I haven’t talked with Jon. Beagle’s code appears to be fairly
immature at the moment, but I would expect it to grow. It uses a port of
Apache Jakarta’s Lucene. Lucene primarily provides a way to *store*
indexed metadata and do fast *searches* over lots of metadata (including
full text, of course), but it doesn’t provide the indexers for specific
file types. In some sense, Lucene is a specialized “database” for storing
the results of indexers.
Currently Beagle has indexers for HTML, JPEG, MP3, OpenOffice.org (very
cool) and Text. Unlike Medusa (I have no idea about Spotlight for this)
Beagle is designed to index “byte streams” rather than files, so it can
index, e.g. “The current page you are looking at in Epiphany”. This
makes it very compatible w/ Dashboard, since Dashboard wants to index
any and all contextual data, not just things on the hard disk. At the
moment Beagle appears to contain only a very simple UI, so it’s primarily a
document indexing system.
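
The split between per-type indexers and an index store, and the idea of indexing byte streams rather than files, can be sketched in a few lines of Python. This is only an illustration under my own assumptions: `text_filter` and `InvertedIndex` are hypothetical names, the index is a toy in-memory one rather than Lucene, and `epiphany:current-page` is an invented identifier, not a real URI scheme.

```python
import io
import re
from collections import defaultdict


def text_filter(stream):
    """Hypothetical per-type indexer: turn a byte stream into a list of words."""
    text = stream.read().decode("utf-8", errors="replace")
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy stand-in for the store-and-search role that Lucene plays."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of source identifiers

    def add(self, uri, stream, content_filter=text_filter):
        # The index never opens a file itself; it only sees a stream.
        for word in content_filter(stream):
            self.postings[word].add(uri)

    def search(self, query):
        """Return the sources that contain every word of the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        hits = [self.postings.get(word, set()) for word in words]
        return set.intersection(*hits)


index = InvertedIndex()
# One stream comes from a file on disk, the other from a live application.
index.add("file:///home/me/notes.txt", io.BytesIO(b"Buy milk and eggs"))
index.add("epiphany:current-page", io.BytesIO(b"Fresh eggs delivered daily"))
```

Because `add` takes any readable stream, a web page in a browser and a file on disk are indexed the same way, which is the property that makes this design a good fit for Dashboard.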

On the filesystem side, Beagle currently works
like Medusa and requires a “crawler” to update its metadata cache (say
nightly), vs. Spotlight, which updates instantly. Beagle also has
crawlers for Mail and IM logs. Beagle also includes a renderer system
for displaying the relevant metadata etc for different file type
results. AFAIK, Jon Trowbridge at Novell is the person mainly hacking on
Beagle atm, but I think the code was refactored out of Dashboard, and a
number of other contributors are listed.

Document Stores

Both WinFS & Storage are aimed at doing a lot more than document
indexing… in many ways document indexing is only a nice side effect of
their larger aims. Storage and (AFAICT) to a lesser extent WinFS both
intend to store the actual documents themselves inside the store. That
means that more than just metadata is inside the store. Both WinFS &
Storage provide a query system, though WinFS has developed a nice
object-oriented language (which I think they compile to SQL) whereas
Storage currently uses straight SQL, which is harder for other developers
to use.

Storage:

I know most about this so I’ll talk about it most of course 😉 Storage
is fairly immature, and the architecture has shifted a lot in the past
few months.

“storage-store” provides a DBus service that allows fetching objects
over the FreeDesktop DBus, getting their attributes, relating them to
each other, running queries, etc. “storage-store” uses PostgreSQL to
store the structured objects and
perform queries. Because objects are accessed “live” rather than as
“buffers”, changes are instantly propagated across the bus, so multiple
applications or users can work on the same document and instantly see
changes other people make.

I’m currently working on architecture to tie
storage-store into standard IM presence information so you will be able
to see buddy icons of other people and what part of the document they
are working on inside storage applications. I have a lot of user
experience goals for Storage (or more accurately, for applications and
desktop that use storage). You can find information about most of them
on my blog and at the storage homepage. Though these goals are more
important to me than document indexing, and have a lot more impact on
Storage’s architecture as a result, I will focus on document indexing in
order to compare and contrast with the other systems.

libstorage-translators provides a framework for translators that can
take structured object data in the store (metadata and the actual data
itself) and translate it to and from byte streams (such as files). The
goal is not indexing files, but providing a way to move files in and
out of the store. So for example, if your friend sent you a PDF file by
e-mail, you could drag that file into your local store and the
libstorage-translators will automatically decompose the information for
placing in the store (and of course extract lots of metadata like album
name, description, image width, etc etc in the process). Currently I
have only worked on the “importer” side of translators, not the
“exporter”, so they are effectively like indexers. There are currently
importers for: DocBook, HTML, any image format supported by gdk-pixbuf
(JPEG, PNG, BMP, GIF, and several more obscure formats), PDF, text, and
any format supported by gstreamer (MP3, OGG, AVI, MPEG2, etc). Importers
can also create thumbnails for the data for convenient display later.
Storage also includes a renderer system for displaying the relevant
metadata etc for different sorts of results to a query. A major drawback
is that I don’t have translators for common document formats like
Gnumeric or OO.o at the moment.
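
As a rough picture of what a translator pair might look like, here is a Python sketch for plain text. Everything in it is hypothetical (`import_text`, `export_text`, the `TRANSLATORS` table, the dictionary layout); it is not the libstorage-translators API, and as noted above the real exporter side doesn't exist yet. The point it illustrates is that the importer keeps the content itself, not only metadata about it, so the byte stream can be rebuilt later.

```python
def import_text(data):
    """Importer: decompose a byte stream into a structured object."""
    text = data.decode("utf-8")
    lines = text.splitlines()
    return {
        "type": "text",
        "title": lines[0] if lines else "",  # extracted metadata
        "line_count": len(lines),            # extracted metadata
        "content": text,                     # the data itself lives in the store
    }


def export_text(obj):
    """Exporter: rebuild the byte stream from the stored object."""
    return obj["content"].encode("utf-8")


# Hypothetical registry: MIME type -> (importer, exporter).
TRANSLATORS = {"text/plain": (import_text, export_text)}


def add_to_store(store, mime_type, data):
    """Run the matching importer and keep the resulting object."""
    importer, _exporter = TRANSLATORS[mime_type]
    obj = importer(data)
    store.append(obj)
    return obj


store = []
original = b"Shopping list\nmilk\neggs\n"
stored = add_to_store(store, "text/plain", original)
```

An indexer would stop after extracting `title` and `line_count`; a translator also has to be able to get from `stored` back to `original`.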

Queries can either be performed using an SQL-like format (slightly
higher level than SQL but not much, it gets translated to SQL) or using
natural language queries. A large chunk of storage code is currently in
its NL system which uses very sophisticated HPSG grammars and other
techniques to translate human language phrases into the SQL query
format.

A storage:/// VFS URI is provided which automatically invokes
translators when files are dragged into the store. That means you can,
e.g. open a nautilus window to storage:/// and drag files in to add them
to the store. It also provides query folders like Medusa. So for example
you can have a folder “spreadsheets” or “songs by John Lennon that don’t
have the word ‘love’ in them” that is live updated to contain objects
matching those criteria.
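
A query folder is essentially a saved query whose results are shown as folder contents. The Python sketch below uses SQLite from the standard library as a stand-in for PostgreSQL, an invented `songs` table, and a hypothetical `QueryFolder` class; it also assumes “the word ‘love’” refers to the song title. Unlike Storage, which pushes live updates, this sketch simply re-runs the query each time the folder is listed.

```python
import sqlite3


class QueryFolder:
    """A 'folder' whose contents are whatever a saved query returns right now."""

    def __init__(self, db, sql, params=()):
        self.db = db
        self.sql = sql
        self.params = params

    def list(self):
        # Re-evaluate the saved query on every listing.
        return [row[0] for row in self.db.execute(self.sql, self.params)]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE songs (title TEXT, artist TEXT)")
db.executemany(
    "INSERT INTO songs VALUES (?, ?)",
    [
        ("Imagine", "John Lennon"),
        ("Love", "John Lennon"),
        ("Yesterday", "The Beatles"),
    ],
)

lennon_without_love = QueryFolder(
    db,
    "SELECT title FROM songs "
    "WHERE artist = ? AND lower(title) NOT LIKE '%love%' "
    "ORDER BY title",
    ("John Lennon",),
)

before = lennon_without_love.list()

# New objects show up in the folder with no extra bookkeeping.
db.execute("INSERT INTO songs VALUES (?, ?)", ("Jealous Guy", "John Lennon"))
db.execute("INSERT INTO songs VALUES (?, ?)", ("Real Love", "John Lennon"))
after = lennon_without_love.list()
```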

WinFS:

I know the least about WinFS of any of the systems
discussed here. I need to read up on it more… but the last time I looked
at it heavily was more than a year ago when MS was still very elusive.
It looks like a lot of info is up on the web now, so what I’m saying
could be out of date. WinFS is backed by both NTFS & Microsoft’s SQL server.
It provides a very nice API for querying and working with objects.
Currently the set of object types it can use is fixed and predefined by
MS (but the list is long). In the future they will probably open this up
and allow anyone to define new object types. AFAICT, WinFS is currently
targeting primarily the storage of metadata, though it is tightly
coupled to the files themselves stored as byte streams in NTFS. It does
look like in the future they intend to more completely store things in
WinFS. WinFS provides a very cool set of hooks for performing actions in
response to changes in the store. WinFS uses this to provide indexing
services, but users can also define their own actions (e.g. you could
say, “whenever an e-mail from George is created, copy it into my ‘to
burn’ directory”).
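
The hook idea can be sketched as a small observer pattern in Python. This is not the WinFS API, which I haven't used; `Store`, `on_create` and the dictionary-shaped items are all assumptions made for the example. It shows a user-defined action firing when a matching object is created, using the George e-mail rule from above.

```python
class Store:
    """Toy object store that runs registered hooks whenever an item is created."""

    def __init__(self):
        self.items = []
        self.hooks = []  # list of (predicate, action) pairs

    def on_create(self, predicate, action):
        """Register an action to run for every new item matching the predicate."""
        self.hooks.append((predicate, action))

    def create(self, item):
        self.items.append(item)
        for predicate, action in self.hooks:
            if predicate(item):
                action(item)


to_burn = []  # stands in for the "to burn" directory
store = Store()
store.on_create(
    lambda item: item.get("type") == "email" and item.get("from") == "George",
    to_burn.append,
)

store.create({"type": "email", "from": "George", "subject": "Photos"})
store.create({"type": "email", "from": "Alice", "subject": "Lunch"})
```

An indexing service fits the same shape: it would be one more hook, registered with a predicate that matches everything.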