Sparql Experiments in GNOMEland

Optimizing SPARQL queries for Tracker, part 2

20/01/2011

In the last post, we learnt a way to avoid OPTIONAL blocks in queries. However, we also ran into a problem: what happens when you want to fetch an optional resource, and the predicate chain that links that resource to the subject includes a multi-valued predicate?

To illustrate this problem, let’s imagine we want to retrieve all music resources, along with their tags, if they have some. The straightforward way to write this query would be:
SELECT ?m ?tagLabel WHERE { ?m a nmm:MusicPiece OPTIONAL { ?m nao:hasTag ?tag . ?tag nao:prefLabel ?tagLabel } } [1]

If you read the previous article, and forgot about multi-valued predicates, you might try something like:
SELECT ?m nao:prefLabel(nao:hasTag(?m)) WHERE { ?m a nmm:MusicPiece } [2]

… but that query is not valid. Why? Because nao:hasTag is not single valued: a resource can have several tags. If you get several results using a predicate function, Tracker will concatenate them using a separator character, by default “,”. So the query
SELECT ?m nao:hasTag(?m) WHERE { ?m a nmm:MusicPiece } [3]
could return a line like:
urnOfMusicPiece urnOfTag1,urnOfTag2,urnOfTag3.
So what you get in the second column is actually not the identifier of a resource, but a string with URNs encoded inside. And there is no way the nao:prefLabel “function” can work on that.
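To make the problem more tangible, here is a toy in-memory model in Python. It is only a sketch of the behaviour described above, not Tracker’s implementation: the TRIPLES data, the URNs and the predicate_function helper are all made up for illustration. Also note that the real query 2 is rejected as invalid, whereas this model just finds nothing.

```python
# Toy in-memory model, NOT Tracker's implementation: triples are
# (subject, predicate, object) tuples, and every name below is made up.
TRIPLES = [
    ("urn:music:1", "nao:hasTag", "urn:tag:1"),
    ("urn:music:1", "nao:hasTag", "urn:tag:2"),
    ("urn:tag:1", "nao:prefLabel", "rock"),
    ("urn:tag:2", "nao:prefLabel", "live"),
]


def predicate_function(predicate, subject, separator=","):
    """Mimic a predicate function: join every matching value into one string."""
    values = [o for (s, p, o) in TRIPLES if s == subject and p == predicate]
    return separator.join(values)


# nao:hasTag(?m) yields a single string, not a list of resources.
tags = predicate_function("nao:hasTag", "urn:music:1")

# Chaining nao:prefLabel on top looks up a subject literally named
# "urn:tag:1,urn:tag:2", which does not exist, so nothing matches.
labels = predicate_function("nao:prefLabel", tags)

print(repr(tags))    # 'urn:tag:1,urn:tag:2'
print(repr(labels))  # ''
```

The outer lookup never sees two tags, only one string that happens to contain a comma.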

There is an alternative solution though: the so-called scalar selects. A scalar select is a SELECT block returning one line and one column, which can therefore be treated as a scalar value. That type of select can be added to our query’s projections:
SELECT ?m (SELECT GROUP_CONCAT(nao:prefLabel(?tag), ":") WHERE { ?m nao:hasTag ?tag }) WHERE { ?m a nmm:MusicPiece } [4]

Yes, this does look a bit like black magic. But we’ll break it down into pieces. First, if you remove the scalar select, you get back to our most basic query, selecting all music resources. Now let’s analyse the scalar select itself, first without the GROUP_CONCAT:
SELECT nao:prefLabel(?tag) WHERE { ?m nao:hasTag ?tag } [5]

Query 5 has nothing really special about it. The only detail is that ?m is not defined in the scalar select: its definition comes from the “main” one. Scalar selects in projections are evaluated after the WHERE pattern, which means you can use values from the “main” select in a scalar select in the projections, but not the other way around.
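If the evaluation order sounds abstract, the following Python sketch may help. It only models the order of operations (outer pattern first, then the projection once per result line) and says nothing about how Tracker actually implements it. The HAS_TAG and PREF_LABEL dictionaries and both function names are hypothetical.

```python
# Toy model of the evaluation order only; not how Tracker is implemented.
HAS_TAG = {
    "urn:music:1": ["urn:tag:1", "urn:tag:2"],
    "urn:music:2": ["urn:tag:2"],
}
PREF_LABEL = {"urn:tag:1": "rock", "urn:tag:2": "live"}


def main_where():
    """Stand-in for the outer pattern { ?m a nmm:MusicPiece }."""
    return sorted(HAS_TAG)


def scalar_select(m):
    """Stand-in for query 5, run with ?m already bound by the outer query."""
    labels = [PREF_LABEL[tag] for tag in HAS_TAG[m]]
    # A scalar select yields one value, so only the first line is kept here.
    return labels[0]


# The outer WHERE runs first; the projection is then computed per result line.
rows = [(m, scalar_select(m)) for m in main_where()]
print(rows)  # [('urn:music:1', 'rock'), ('urn:music:2', 'live')]
```

The inner function can read m, but nothing it computes is visible to main_where, which mirrors the “not the other way around” restriction. The second tag of the first piece is lost, which is exactly the problem GROUP_CONCAT addresses.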

Now on to GROUP_CONCAT: if our resource ?m happens to have several tags, our scalar select will return more than one line, and additional results will be discarded (Tracker implicitly adds “LIMIT 1” to scalar selects). Not good. GROUP_CONCAT takes all results, and concatenates them together using a defined separator. In our case, we get a list of tag labels separated by “:”. So, a result line from query 4 might look like:
urnOfMusicPiece tagLabel1:tagLabel2:tagLabel3
If there are no tags, the second column will simply be empty. Of course, this approach requires a bit of string splitting on the application side, but this is usually much cheaper than the OPTIONAL block. And if you’re really going to use this kind of query, choosing a better separator than “:” is a good idea: ASCII has some control characters, like 0x1E (record separator), that are much less likely to show up in tag labels. You can write it as \u001E in SPARQL.
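The string splitting on the application side can be as small as the sketch below. It is written in Python purely for illustration and assumes the query used \u001E as the GROUP_CONCAT separator. The split_tag_labels name and the sample row are made up.

```python
# ASCII record separator (0x1E), the same character as "\u001E" in the query.
SEPARATOR = "\x1e"


def split_tag_labels(column, separator=SEPARATOR):
    """Turn one GROUP_CONCAT column into a list of tag labels."""
    # str.split on an empty string gives [''], so handle "no tags" explicitly.
    if not column:
        return []
    return column.split(separator)


# Hypothetical result line: (urn, concatenated labels).
row = ("urn:music:1", "rock\x1elive\x1e80s")
print(split_tag_labels(row[1]))  # ['rock', 'live', '80s']
print(split_tag_labels(""))      # []
```

The explicit empty check matters: without it, a resource with no tags would come back as a list containing one empty label.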

PS. To answer the question “How do I know if a predicate is single or multi valued?”, you can read the ontology reference documentation, and look for the “cardinality” property of the predicates you’re using.

Update: the ASCII control character I wanted to mention was not 0x2E but 0x1E

Posted by Adrien Bustany
Filed in Uncategorized
Tags: Sparql, Tracker


Optimizing SPARQL queries for Tracker, part 1

15/01/2011

My current job involves working with Tracker, for the first time not as a developer but as a user. This is quite a cool change, as I can now be on the side of those bitching when things don’t work as they should 🙂

As you probably know by now, Tracker is an RDF database (and a set of programs to exploit it). However, it is a bit special for various reasons:
1. Tracker’s ontologies are fixed (changes are supported in a limited way), which means you should stick to the installed data schemas (ontologies), as opposed to being allowed to store any triple in the database.
2. Tracker uses SQLite. On the good side, it means Tracker is rather light on system resources (it usually idles at around 4MB RSS, and can go maybe up to 25-30MB when running a big query). On the bad side, it means that not every operation is fast, since SQLite is an on-disk database. As my job involves using Tracker on devices with not so much memory or CPU power, it is very important to know what is fast and what is expensive. And that is precisely what this post is about.

OPTIONAL blocks and predicate functions

OPTIONAL blocks are one of the very costly operations in Tracker. Let’s say you want to query all music resources, and their title. You could run this query:
SELECT ?urn ?title WHERE { ?urn a nmm:MusicPiece; nie:title ?title } [1]

However, this query will only return resources that do have a title. Resources without title will not match the query. To also get resources without a title, we can write:
SELECT ?urn ?title WHERE { ?urn a nmm:MusicPiece OPTIONAL { ?urn nie:title ?title } } [2]

Now, we have an OPTIONAL block, which makes our query slower. In this particular example, the speed difference might be negligible, but I’ve already seen 10x speedups on some queries optimized to use as few OPTIONAL blocks as possible.

The faster solution is to use “predicate functions”, a non-standard SPARQL feature that allows us to use predicates as functions on the query variables. Query [2] rewritten to use predicate functions would be:
SELECT ?urn nie:title(?urn) WHERE { ?urn a nmm:MusicPiece } [3]

In that case, the second column in our results would be an empty string when there is no title. If this is faster, you might wonder why Tracker does not internally convert OPTIONAL blocks to predicate functions. The answer is that OPTIONAL blocks allow you to do more things, which are not always possible with predicate functions. When using an OPTIONAL block, you define a SPARQL variable (?title in our example), which you can reuse in other patterns. This is not the case when using predicate functions.

You can chain predicate functions. If you also want to get the album title of each music resource, you can write:
SELECT ?urn nie:title(?urn) nie:title(nmm:musicAlbum(?urn)) WHERE { ?urn a nmm:MusicPiece } [4]

If you use a predicate function on a predicate that can have more than one value, values will be joined with a user-defined separator (by default, “,”):
SELECT ?urn nie:keyword(?urn) WHERE { ?urn a nmm:MusicPiece } [5]

However, predicate functions don’t work on lists. That means if you have a chain of predicate functions p₁(p₂(…p_n(?variable))), the query will only be valid if p₂, p₃…p_n are single valued.
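The rule can be written down as a tiny check. The Python sketch below is only an illustration: the cardinalities in SINGLE_VALUED are assumptions taken from the examples in these posts (check the ontology reference for the real values), and chain_is_valid is a hypothetical helper, not something Tracker provides.

```python
# Cardinalities as implied by the examples in these posts; these are
# assumptions, not values read from the ontology. True means "single valued".
SINGLE_VALUED = {
    "nie:title": True,
    "nmm:musicAlbum": True,
    "nao:prefLabel": True,
    "nao:hasTag": False,
    "nie:keyword": False,
}


def chain_is_valid(chain):
    """chain lists predicates outermost first: [p1, p2, ..., pn]."""
    # Only the outermost function (p1) may be multi valued; every inner one
    # feeds its result to another function and must return a single resource.
    return all(SINGLE_VALUED[p] for p in chain[1:])


print(chain_is_valid(["nie:title", "nmm:musicAlbum"]))  # True
print(chain_is_valid(["nie:keyword"]))                  # True
print(chain_is_valid(["nao:prefLabel", "nao:hasTag"]))  # False
```

A multi-valued predicate is fine in the outermost position because its joined string is the final result and is never passed on to another function.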

If one of the predicates is not single valued, you will either have to use an OPTIONAL block… Or wait until the next blog post, where I’ll present an alternative solution 🙂

Posted by Adrien Bustany
Tags: Sparql, Tracker


SPARQL learning tips for the curious developer

13/01/2011

Recently, an increasing number of people have been dropping by on IRC (#tracker on GimpNet) with nice ideas for projects using Tracker. Some of them are looking to use Tracker on a server, or to access it from languages other than C or Vala, which are use cases we don’t really support right now (although our DBus interface is of course language agnostic, it is not really the preferred IPC), and some others are just curious about the idea of having a global metadata database.

The common factor among all those users is that at some point they start playing with SPARQL (Tracker being an RDF database, SPARQL is the query language to access the data). And, inevitably, they ask us where they can find documentation… The problem is, there is of course documentation about SPARQL on the W3C website, but many users find it hard to follow. Personally, the only section I use in the W3C doc is the SPARQL grammar reference. However, we also have various SPARQL examples in the Tracker documentation on Gnome Live, and a page explaining the non-standard SPARQL features supported by Tracker. Those two pages are usually enough to get people started, and allow them to write their first queries.

I initially intended this article to be about how to write fast SPARQL queries, but I will split that part into another post, to keep the size reasonable.

And remember, Tracker will be part of GNOME 3.0, so now is the best time to learn about it! The project is evolving at a tremendous pace, every weekly release being loaded with fixes and performance improvements. If you still have memories of Tracker 0.6, be sure to erase them carefully, and take a fresh look at Tracker 0.9!

Posted by Adrien Bustany
Tags: Sparql, Tracker
