semantic desktop

I named this post “Tracker” first as I started writing from that perspective, but the problems I’m about to talk about are more related to what is called the “semantic desktop” and are not specific to Tracker, which is just the GNOME implementation of that idea.
This post is a collection of my thoughts on this whole topic. What I originally wanted to do was improve Epiphany’s history handling. Epiphany still deletes your history after 10 days for performance reasons. When people suggested Tracker, I started investigating it, both for this purpose and in general.

How did this all start?

It gained traction when people realized that a lot of places on the desktop refer to the same things, but they all do it incompatibly. (me, me, me, me, me and I might be in your IRC, mail program and feed reader, too.) So they set out to change it. Unfortunately, such a change would have required changes to almost all applications. And that is hard. An easier approach is to just index all files and collect the information from them without touching the applications. And thus, Tracker and Beagle were born and competed at doing just that.
However, indexing has lots of problems. Not only do you need to support all those ever-changing file formats, you also need to do the indexing. And that takes lots of CPU and IO and is duplicated work and storage. So the idea was born to instead write plugins that grab the information from the applications while they are running.
But still, people weren’t convinced, as the only thing they got from this was search tools, even if those update automatically. And their data is still duplicated.

What’s a semantic desktop anyway?

Well, it’s actually quite simple. It’s just a bunch of statements in the form <subject> <predicate> <object> (called triples), like “Evolution sends emails”. Those statements come complete with a huge spec and lots of buzzwords, but it’s always just about <subject> <predicate> <object>.
Unfortunately, statements alone don’t help you a whole lot: there’s a huge difference between “Evolution sends emails” and “I send emails”. To tell them apart you need a dictionary that defines what the words mean (called an ontology). The one used by Tracker is the Nepomuk ontology.
And when you have stored lots of triples according to your ontologies, then you can query them (using SPARQL). See Philip’s posts (1, 2) for an intro.
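
To make the triple idea concrete, here is a minimal sketch in Python. It is not Tracker's API and not SPARQL: the `TripleStore` class, the `match` method and the predicate names `sends` and `is-a` are all made up for illustration. It only shows how storing statements and querying them by pattern fit together.

```python
from typing import Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Hypothetical in-memory store; a real triple-store is far more involved."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        # None acts as a wildcard, similar in spirit to a query variable.
        for triple in sorted(self._triples):
            if subject is not None and triple.subject != subject:
                continue
            if predicate is not None and triple.predicate != predicate:
                continue
            if obj is not None and triple.obj != obj:
                continue
            yield triple


store = TripleStore()
store.add("Evolution", "sends", "emails")
store.add("me", "sends", "emails")
# The "dictionary" part: statements that say what kind of thing a subject is.
store.add("Evolution", "is-a", "application")
store.add("me", "is-a", "person")

# Everything that sends emails, whatever kind of thing it is.
senders = [t.subject for t in store.match(predicate="sends", obj="emails")]

# Only the senders that are people, which needs the "is-a" statements.
people_who_send = [s for s in senders if list(store.match(s, "is-a", "person"))]
```

Without the `is-a` statements the two senders are indistinguishable, which is the job the ontology does in the real thing.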

So why is that awesome?

If all your data is stored this way, you can easily access information from other applications without having to parse their formats. And you can easily talk to the other applications about that data. So you can have a button in evolution for IM or one in empathy to send emails. Implementing something like Wave should be kinda trivial.
And of course, you get an awesome search.
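
As a rough sketch of what that integration could look like, assume a mail client and an IM client both write into one shared set of triples. The identifiers (`mail:42`, `person:kelly`) and predicates (`from`, `im-address`) below are invented for this example and are not taken from the Nepomuk ontology.

```python
# One shared pool of statements, written by two hypothetical applications.
triples = {
    # Written by the mail client.
    ("mail:42", "from", "person:kelly"),
    ("mail:42", "subject", "Lunch?"),
    # Written by the IM client.
    ("person:kelly", "im-address", "kelly@im.example.org"),
}


def objects(subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def im_addresses_for_sender(mail):
    """Follow mail -> sender -> IM address without parsing any app's files."""
    return [
        address
        for sender in objects(mail, "from")
        for address in objects(sender, "im-address")
    ]


result = im_addresses_for_sender("mail:42")
```

The mail client never had to know how the IM client stores its contacts. Note that this only works because both applications agreed to call the person `person:kelly`.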

No downsides?

Of course there are downsides. For a start, one has to agree on the ontologies: How should all the data be represented? (Do we use a model like OOXML, ODF or HTML for storing documents?) Then there also is the question about security. (Should all apps be able to get all passwords? Or should everyone be able to delete all data?) It’s definitely not easy to get right.

How does Tracker help?

Tracker tries to solve 2 problems: It tries to supply a storage and query daemon for all the data (called a triple-store) and it tries to provide the infrastructure for indexing files. The storage backend makes sense. Its architecture is sound, it’s fast and you can send useful queries its way. It has a crew of coders developing it that know their stuff. So yes, it’s the thing you want to use. Unless you don’t buy into the semantic desktop hype.

What about the indexing?

Well, the whole idea of indexing the files on my computer is problematic. The biggest problem I have is that the Tracker people lack the expertise to know what data to index and how. It doesn’t help a whole lot if Tracker parses all JPEG files in the Pictures/ folder when the real data is stored in F-Spot. It doesn’t help a whole lot when you have Empathy and Evolution plugins that both have a contact named Kelly Hildebrand, but you don’t know if they’re talking about the same person. You just end up with a bunch of unrelated data.
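
The Kelly Hildebrand problem can be sketched in a few lines of Python. The subject identifiers and the predicates `full-name`, `im-address`, `email` and `same-as` are hypothetical, chosen only to illustrate the point.

```python
from collections import defaultdict

# What two independent plugins might have written (made-up identifiers).
triples = [
    ("empathy:contact/7", "full-name", "Kelly Hildebrand"),
    ("empathy:contact/7", "im-address", "kelly@im.example.org"),
    ("evolution:contact/19", "full-name", "Kelly Hildebrand"),
    ("evolution:contact/19", "email", "k.hildebrand@example.com"),
]


def group_by_name(statements):
    """Group subjects by display name; equal names are only a hint."""
    by_name = defaultdict(list)
    for subject, predicate, obj in statements:
        if predicate == "full-name":
            by_name[obj].append(subject)
    return {name: sorted(subjects) for name, subjects in by_name.items()}


candidates = group_by_name(triples)

# Nothing in the store says the two subjects are the same person.
explicitly_linked = any(predicate == "same-as" for _, predicate, _ in triples)
```

The store ends up with two separate subjects that share a name. Merging them on the name alone would be a guess, since two different people can both be called Kelly Hildebrand.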

There was something about hype?

Yeah, the semantic desktop has been an ongoing hype for a while without showing great results. Google Desktop is the best example. Beagle was supposed to be awesome but hasn’t had a release for a while, let alone Dashboard, and it hasn’t caught on in GNOME, either, even though we have been talking about it for more than 3 years.
But then, Nokia still builds on Tracker for Harmattan, and the Zeitgeist team tries to collect data from multiple applications and make use of it in innovative ways. People are definitely still trying. But it’s not clear to me that anyone has figured out a way to make use of it yet.

Now, do I want to use it in my application or not?

Tracker is not up to the quality standards people are used to from GNOME software. It’s an exciting and rapidly changing code base with lots of downright idiotic behaviors – like crashing when it accidentally takes more than 80MB memory while parsing a large file – and unsolved problems. Some parts don’t compile, the API is practically undocumented and the dependency list is not small (at least if you wanna hack on it). It also ships a tool to delete your database, which is nice for debugging, but somewhat like shipping a tool that does rm -rf ~. In summary, I am reminded of the GStreamer 0.7 or Gtk 1.3 days: products with solid foundations and a potentially bright future ahead, but not there yet. So it’s at best beta quality. I’d call it alpha.
There is an active development team, but that team is focused on the next Maemo release and not on desktop integration. This is quite important, because it likely means that development will focus only on new applications for the Maemo platform and not on porting old applications (in particular GNOME ones) to use Tracker. And that in turn means there will not be deeper desktop integration. Unless someone comes up and works on it.

So I don’t want to use Tracker?

The idea of the semantic desktop has great potential. If every application makes its data available for every other application in a common data store that everybody agrees on, you can get very nice integration of that data. But that requires that it’s not treated as an add-on that crawls the desktop itself, but that applications start using Tracker as their exclusive primary data store. Until EDS is just a compatibility frontend for Tracker, it’s not there yet.
So if you use Tracker, you will not have to port your application to use it in the future, when GNOME requires it. You also get rid of the Save button in your application and gain automatic backup, crash recovery and full text search. But if you don’t use Tracker, you don’t save your data on unfinished software, and you don’t have to rip it out when Nokia (or whoever) figures out that Tracker is not the future.

Conclusion

I have no idea if Tracker or the semantic desktop is the right idea. I don’t even know if it is the right idea for a new Epiphany history backend. It’s probably roughly the same amount of work I have to do in both cases. But I’m worried about its (self)perception as an add-on instead of as an integral part of every application.

21 comments ↓

#1 Emmanuele Bassi on 07.23.09 at 21:22

I don’t think applications should use tracker as their storage; first and foremost because *it’s not ready yet*. second of all because you have to wonder what will happen if my 16 GB of music gets indexed along with my ~2GB of emails and all my documents. and without throwing things like my .75TB of videos. do I need to search my emails dealing with my local fruit&veg delivery? sure, let’s wait half an hour while I render the ontologies across a SQLite database the size of my home on my Atom netbook. I *don’t* think so.

what applications should do, instead, is opening their databases (which can, and mostly are) optimized for speed and data retrieval, to tracker. applications should talk and push their data, and tracker should be *the* API needed to talk to all applications on the platform.

for this reason, tracker should have a first-class library – what I’ve been asking for years, incidentally, and never got replies past the “use D-Bus to submit a query and then demarshal random a{sv} or a{sa{ussubba{sv}} yourself” kind of reply.

#2 Michael Schurter on 07.23.09 at 21:33

EDS was a great example to mention. Despite using gmail for most of my mail, I find myself constantly going back to Evolution because it does an amazing job of aggregating all my various mail accounts, calendars, contact lists, etc.

None of the use cases I’ve read for indexers like Tracker and Beagle have ever excited me. I rarely find myself needing to search for documents. The one feature of Zeitgeist that has excited me is the potential of a timeline view, so I can see what I was doing at a certain time whether it was coding or web browsing.

Thanks for the excellent post! As a Gnome user I’ve seen a number of these technologies mentioned, but never summarized so well!

#3 Greg on 07.23.09 at 21:37

Perhaps the Wizbit guys have some ideas about backends…

http://www.wizbit.org/drupal/node/2

#4 Olivier Le Thanh on 07.23.09 at 21:48

What I don’t get is if Tracker is supposed to be used to store whole documents (like couchDB which seems promising) or just their metadata so applications can make relations between them.

It worries me a bit to have the two mixed so freely without separation. Because I don’t really care if it loses what the indexer collected since it’s only a cache for data but if you lose a document I spent time writing I’m going to be really angry.

#5 Marco Barisione on 07.23.09 at 21:56

I started working on a project to fetch IM contacts with Telepathy and store them in tracker, using the nco and pimo ontologies as base. I’m also planning to write an EDS backend that just uses tracker as storage to represent EContacts (but I know it will be painful).

To be honest I’m not sure this will be the right solution or it will be terrible because of performance and other problems.

#6 Chani on 07.24.09 at 07:19

evolution storing things in tracker?
ok, I’m even *more* confused… tracker, strigi, akonadi, couchdb, nepomuk – how do all these things relate to each other? where do they help each other? where do they overlap? what’s each one actually responsible for?

some of this has been explained to me before, but I keep forgetting it and getting confused all over again. I never really grokked it.

#7 otte on 07.24.09 at 07:45

I’m pretty much convinced by now that the crucial thing for Tracker’s success is if applications will use it as their way to store data. And that means it needs to provide (almost) all the features of a file system. If Tracker doesn’t replace the file system, it will always be perceived as a kludge.
The example here is Email: Putting the metadata in Tracker but saving attachments in files is not superior, but worse than just mmapping a big mbox file.

And the reason is the same as what I mentioned in my blog post: If it’s not the primary data storage, people will not really care about getting it right: So what if the Tracker database got deleted, just rebuild it from the real data. So what if this isn’t supported in Tracker, use the real data if you need it.

But one thing is right: It’s not there yet. Certainly not the code, but I think the developers don’t realize it themselves yet.
In today’s blog at http://jamiemcc.livejournal.com/13021.html Jamie for example only talks about the features you get when you already have a working database, but not how you get there or what applications need.

#8 Richard on 07.24.09 at 08:00

I just want to eat other people’s data.

Thanks for this explanation of Tracker, by the way. It’s one of the clearer and more concise ones I’ve read.

#9 Marco Barisione on 07.24.09 at 08:45

@Chani:
tracker-store: a daemon that stores metadata in a DB
tracker-indexer: a file indexer that stores the metadata on indexed stuff in tracker-store
strigi: similar to tracker for KDE
akonadi: storing system for PIM and emails for KDE, should store stuff in strigi (not sure at all tbh)
nepomuk: ontologies, i.e. the ways of representing the metadata and their relationship

#10 Will Stephenson on 07.24.09 at 09:41

@Marco:

Your description of the KDE semantic and storage system is inaccurate.

nepomuk: ontologies and (in KDE) implementation of metadata storage system
strigi: file metadata extraction utilities; metadata is stored by nepomuk
akonadi: type independent caching service around mostly PIM backends (eg maildir, IMAP, RSS, …), not a permanent store but a single place to go to for your data. Designed for cross-desktop use!

#11 Rax on 07.24.09 at 10:19

that’s an interesting article.

Perhaps other applications should not be tied directly to tracker like you suggest.
How about the desktop environment having a generic interface for feeding this semantic information from any desktop application. One could then plug in any kind of back end, like Tracker, to listen for this information as it is written.

The desktop can then also be configured to enable or disable this semantic information.

Well, I haven’t thought about this too much, so it might not make sense ;)

#12 Marco Barisione on 07.24.09 at 11:43

@Will Stephenson:
Sorry, I’m not really an expert on the KDE stuff :)

@Rax:
That’s already almost like that, application basically will just talk SPARQL to tracker.

#13 Seif Lotfy on 07.24.09 at 12:24

I truly believe Tracker is going the right direction by providing a central storage for all the data using an RDF ontology. But still a semantic desktop is only the data representation. I think what is missing is something that makes sense of it. Having a good technology won’t help out much unless you define some use cases for it. I sometimes fear that Tracker is targeting the mobile platform and not the desktop, thus focusing most of its use cases on Maemo.

Personally at the Zeitgeist team we are looking forward to use the Tracker Storage as well as triggering the indexer.

#14 Jamie McCracken on 07.24.09 at 12:31

I’m pretty much convinced by now that the crucial thing for Tracker’s success is if applications will use it as their way to store data. And that means it needs to provide (almost) all the features of a file system. If Tracker doesn’t replace the file system, it will always be perceived as a kludge.

Not quite. Tracker is designed to fill in the missing features from a file system rather than replace it. It certainly was not designed for storing files although that is possible (possibly with version control and behaving like a document store). I don’t believe it’s necessary or critical to its success though

#15 Jamie McCracken on 07.24.09 at 12:36

Tracker-storage should be fast regardless of its size or whether it’s used for indexing

This is because the ontology is decomposed into individual tables and each table’s I/O performance should be independent of the others (this is unlike any other triple store, which tends to store everything in one or two big tables and therefore suffers scalability problems and performance issues when the DB gets too large)

Having a GB of emails is *not* going to slow down search of music files or metadata storage of application data

#16 pvanhoof on 07.24.09 at 13:48

Guys, more coding less abstract debating. We still have a lot of work to do on Tracker (really).

Yes .. we as a team have a focus on mobile, but we are most definitely not going to stop you from helping us make Tracker rock on your desktop. We are very interested in your experiments, patches and work to improve the desktop experience of Tracker.

We already put some effort into integrating with the desktop. I for example coded an EPlugin for Evolution to get the metadata out of Evolution and into our RDF store. I’m also proposing patches to Evolution so that it would fetch bodystructure from IMAP, which will help us get more metadata out of E-mails than what Evolution itself has right now. We can only do as much as our time allows us to do. And Tracker has several very high priority TODO items itself too.

For example named-graph support, a quadruple (done), backup (done) & restore (depends on a better Turtle parser), a better Turtle parser than raptor’s, etc

I know everybody wants this yesterday. Then join us on the development of it. We can put you to work instantly, just ask us (Jürg, me, Martyn, Carlos, Ivan, Jamie, Ottela and Urho).

Cheers, and don’t go too abstract about all this. Clone the repository and code a bit instead.

#17 Links 24/07/2009: Germany GNU/Linux Adoption High, FSF Speaks on TPB | Boycott Novell on 07.24.09 at 20:15

[...] semantic desktop I named this post “Tracker” first as I started writing from that perspective, but the problems I’m about to talk are more related to what is called “semantic desktop” and not specific to Tracker, which is just the GNOME implementation to that idea. This post is a collection of my thoughts on this whole topic. What I originally wanted to do was improve Epiphany’s history handling. Epiphany still deletes your history after 10 days for performance reasons. When people suggesting Tracker I started investigating it, both for this purpose and in general. [...]

#18 Burke on 07.26.09 at 22:45

So, what’s the reason you do not consider working together with the Nepomuk/Strigi people? They’ve done a lot of work until now and you could benefit from their experience. I guess you could even share some code if not even more

#19 pvanhoof on 07.27.09 at 08:11

@Burke: Tracker does integrate with streamanalyzer, which is Strigi’s extraction library. We do cooperate with Jos Vandenoever and Evgeny Egorochkin who are the authors of Strigi+Streamanalyzer.

#20 Leo Sauermann on 08.03.09 at 22:25

just for the sake of completion and namedropping :-)

the semantic desktop is a bit more than you said above. It’s a vision started by Stefan Decker and me in 2003 to help people communicate better and keep their personal information in a radically new way. That means, there is a lot more in it than just triples… but good to pick the name as title for the post :-)

http://en.wikipedia.org/wiki/Semantic_desktop

#21 On Hierarchical File Systems and Storage Location « Thorwil’s on 08.23.09 at 09:46

[...] topic is old enough to be discussed in the FLOSS [...]