Goodbye Tracker, hello TinySPARQL and LocalSearch

It is our pleasure (Sam’s and mine) to announce the following project renames:

– The Tracker SPARQL library will now be known as TinySPARQL
– The Tracker Miners indexer will now be known as LocalSearch, or GNOME LocalSearch if you prefer.

On one hand, this meets the increasingly popular demand to avoid the terms “Tracker” and “Miner” (in case the subtleties of language evade you, these have gained some bad connotations over the years), and on the other hand it fixes the blunder made when Tracker 1.x conceded its name to the SPARQL library as library and indexer were split for 2.x. That has been a source of confusion about the extent and intent of each component, even among GNOME peers.

I am personally hopeful about the rename to TinySPARQL, as a great amount of effort has been spent in making it a most credible implementation of the SPARQL/RDF/RDFS set of standards, which feels underutilized with the library mostly typecast as a gateway to the indexer data. Perhaps the new name will help it reach wider audiences that will appreciate a lightweight alternative to the available enterprise-level SPARQL engines.

But I shouldn’t diminish the importance of the rename to LocalSearch. At a time when other operating systems are approaching search in Orwellian ways in the name of convenience, it becomes important to take a stand. We think the new name is a simple enough guide of our goals and non-goals.

This is not just a superficial rename. However, we the maintainers have grown wary of approaches that would require us to bump the version to 4.x and update all the projects depending on TinySPARQL/LocalSearch in a synchronized manner. So we have gone for a soft approach that is backwards compatible for the most part and allows a transitional path. Here’s what you can do to “update”:

In C projects

You are encouraged to use the tinysparql-3.0.pc file to detect the TinySPARQL library. The tracker-sparql-3.0.pc file is still provided for backwards compatibility. In code, use #include <tinysparql.h> to include the SPARQL library headers.

In the future we might consider an API migration scheme that deprecates all API under the old namespace and replaces it with objects in the new namespace. Meanwhile, all API objects and symbols keep the “Tracker” namespace.

From gobject-introspection bindings

You can get a “sneak peek” by using the new namespaces right away, e.g. from python code:

#!/usr/bin/python3

import gi, sys
gi.require_version('Tsparql', '3.0')
from gi.repository import GLib, Gio, Tsparql

try:
    connection = Tsparql.SparqlConnection.new(
        Tsparql.SparqlConnectionFlags.NONE,
        None, # Database location, None creates it in-memory
        Tsparql.sparql_get_ontology_nepomuk(), # Ontology location
        None)

    # Create a resource containing RDF data
    resource = Tsparql.Resource.new(None)
    resource.set_uri('rdf:type', 'nmm:MusicPiece')

    # Create a batch, and add the resource to it
    batch = connection.create_batch()
    batch.add_resource(None, resource)

    # Execute the batch to insert the data
    batch.execute()

    connection.close()

except Exception as e:
    print('Error: {0}'.format(e))
    sys.exit(-1)

Remember to make your project depend on tinysparql-3.0.pc to ensure an up-to-date enough version to have these new binding definitions. Of course backwards compatibility is preserved, and there will be GIR/typelib files for the old namespaces.

In Vala code

There is also a new VAPI file, to describe the TinySPARQL API under the Tsparql namespace:

// valac --pkg tsparql-3.0 tsparql.vala
using Tsparql;

int main (string[] argv) {
  try {
    var conn = Tsparql.Connection.new (Tsparql.ConnectionFlags.NONE, null, Tsparql.get_ontology_nepomuk(), null);

    // Create a resource containing RDF data
    var resource = new Tsparql.Resource(null);
    resource.set_uri ("rdf:type", "nmm:MusicPiece");

    // Create a batch, and add the resource to it
    var batch = conn.create_batch ();
    batch.add_resource (null, resource);

    // Execute the batch to insert the data
    batch.execute ();

    conn.close ();
  } catch (GLib.Error e) {
    stderr.printf ("Error: %s\n", e.message);
  }

  return 0;
}

And same as before, there will also be a VAPI file with the old namespace.

Accessing LocalSearch data

The LocalSearch indexer data will now alternatively be offered via the org.freedesktop.LocalSearch3 D-Bus name, as well as through the org.freedesktop.Tracker3.Miner.Files D-Bus name for backwards compatibility. This can be used as usual with the TinySPARQL D-Bus based connection, e.g. from Python:

#!/usr/bin/python3

import gi, sys
gi.require_version('Tsparql', '3.0')
from gi.repository import Tsparql

conn = Tsparql.SparqlConnection.bus_new(
    'org.freedesktop.LocalSearch3', None)
cursor = conn.query('select ("Hello World" as ?str) {}', None)
cursor.next()
print(cursor.get_string(0))

Command line utilities

The command line utilities for both projects have parted ways. Instead of a single command line utility with extensible subcommands, there will be distinct tinysparql and localsearch command line tools, each providing their own sensible set of subcommands. We trust it will be reasonably low effort to adapt to these changes, and reasonably intuitive.

Test utilities

In order to test search-related code, some projects were using the tracker-testutils-3.0.pc file to locate the tracker-sandbox helper script and implement a test harness for search features. This has now become more of a first class citizen as the localsearch test-sandbox subcommand. There is no longer a pkg-config file; we recommend detecting the presence of the localsearch command in build systems.

A note to distributors

The repositories for these projects are https://gitlab.gnome.org/GNOME/tinysparql and https://gitlab.gnome.org/GNOME/localsearch, and the distributed tarballs will have tinysparql-x.y.z and localsearch-x.y.z filenames (starting with tinysparql-3.8.alpha and localsearch-3.8.alpha, which are already out of the door). We understand that a project rename will involve more changes on your end(s) than on ours, but we will greatly appreciate it if you help us establish the new identities and follow up on the rename.

On CVE-2023-43641

As you might have read about already, there was a vulnerability in libcue that took a side gig demonstrating a sandbox escape in tracker-miners.

The good news first, so you can skip the rest: this is fixed in the tracker-miners 3.6.1/3.5.3/3.4.5/3.3.2 versions released on Sept 28th 2023, about a couple of weeks ago. The relevant changes are also in the tracker-miners-3.2 and tracker-miners-3.1 branches, but I didn’t get around to doing releases for those. If you didn’t update yet, please do so already.

Background

The seccomp jail in the metadata extractor is far from new; it was introduced during development for tracker-1.12 in 2016 (yup, tracker, before the tracker-miners indexers spun off from the monolithic package). Before that, the file metadata extraction task was already split into a separate process from filesystem structure indexing, for stability and resource consumption reasons.

Seccomp comes in different sizes: it allows anything from creating very permissive sandboxes where system calls are allowed unless blocked by some rule, to creating paranoid sandboxes where every system call is disallowed by default (with a degree of harshness of choice) unless allowed by some rule. Tracker took the most restrictive approach, making every system call fail with SIGSYS by default, and carving the holes necessary for correct operation of the many dependencies within the operational parameters established by yours truly (no outside network access, no filesystem write access, no execution of subprocesses). The list of rules is quite readable in the source code; every syscall there has been individually and painstakingly collected, evaluated, and pondered for inclusion.
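For illustration, here is roughly what such a default-deny filter looks like when expressed through the libseccomp Python bindings. This is not Tracker’s actual ruleset (that lives in C inside tracker-extract), just a hedged sketch of the approach: kill on anything unknown, explicit allowances for what is needed, and argument-based rules only where the rudimentary filters permit.

#!/usr/bin/python3
# A sketch of a default-deny seccomp filter with the libseccomp Python
# bindings. Not Tracker's real rules; the syscall list is shortened for brevity.

import errno
import os
import seccomp

# Any syscall not matched by a rule below kills the thread with SIGSYS
f = seccomp.SyscallFilter(seccomp.KILL)

# A few of the holes needed for basic operation (the real list is much longer)
for name in ('read', 'write', 'close', 'fstat', 'mmap', 'munmap',
             'brk', 'futex', 'exit', 'exit_group'):
    f.add_rule(seccomp.ALLOW, name)

# Argument filters are rudimentary: e.g. allow openat() only when the
# write bits are not set in the flags argument (read-only opens)
f.add_rule(seccomp.ALLOW, 'openat',
           seccomp.Arg(2, seccomp.MASKED_EQ, os.O_WRONLY | os.O_RDWR, 0))

# Some syscalls can instead be made to fail softly with an error code
f.add_rule(seccomp.ERRNO(errno.EPERM), 'socket')

f.load()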

But the tracker-extract process as a whole still had to do a number of things that could not work under those parameters:

  1. It wanted to read some settings, and GSettings/Dconf will anyway require r/w access to /run/user/.../dconf/user, or complain loudly.
  2. There is infrastructure to write detailed error reports (visible through the tracker3 status commandline tool) for files where metadata extraction failed. These error reports have helped improve tracker-miners quality and are a valuable resource.
  3. The tracker-extract process depends on GStreamer, and is a likely candidate for being the first to fork and indirectly rewrite the GST registry cache file on gst_init().

While in an ideal world we could work our way through highly specific seccomp rules, in practice the filters based on system call arguments are pretty rudimentary, e.g. not allowing for (sub)string checks. So we cannot carve out allowances for these specific things, not without opening the sandbox significantly.

Rules for thee, but not for me. Given the existing tracker-extract design with threads dedicated to extraction, the pragmatic approach taken back then was to make the main thread (already in charge of dispatching tasks to those dedicated threads) unrestricted and in charge of those other duties, and to apply the seccomp ruleset to every thread involved in dealing with the data. While not perfectly airtight, the sandbox in this case is already a defense in depth measure, only applicable after other vulnerabilities (most likely, in 3rd party libraries) were already successfully exploited; all that can be done at that point is to deter and limit the extent of the damage, so the sandbox role is to make it not immediately obvious how to do something harmful without needle precision. While the “1-click exploit in GNOME” headline is surely juicy, it had not happened in the 7 years the sandbox existed, despite our uncontrollable hose of dependencies (hi, GStreamer), or our other dependencies with a history of CVEs (hi Poppler, libXML et al), so the sandbox has at least seemed to fulfill the deterrence role.

But there is that: I knew about the potential for a case like this and didn’t act in time. I personally am not sure this is a much better defense than “I didn’t know better!”.

The serendipity

The discovery of CVE-2023-43641 was doubly serendipitous. On one hand, Kevin Backhouse from GitHub Security Lab quickly struck gold, unwittingly managing to corrupt data in a way where it was the unconstrained thread that hit the trap being laid. To me, fortune struck by getting to deal with security issues “the good way”. For something that is normally as pleasant as a tooth extraction, I must give Kevin five stars in management and diligence.

And as said, this didn’t catch me entirely by surprise. I even talked about the potential case with co-maintainer Sam Thursfield as recently as Fosdem this year, and funnily it was high up in my to-do list, only dropped from the 3.6 schedule due to time restrictions. Even though the requirements (to reiterate: no indiscriminate network access, no write access, no execution) and the outliers didn’t budge, there were ideas floating around on how to handle them. While the fixes were not as straightforward as just “init seccomp on main(), duh”, there were no moments of hesitation in addressing them; you can read the merge request commits for the details on how those were handled.

The end result is a tracker-extract main() function that pretty literally looks like:

int
main (void)
{
  lower_process_priorities();
  init_seccomp();
  return do_main();
}

I.e. the seccomp jail now affects the main thread and every other thread spawned from it, while barely extending the seccomp ruleset (mostly, to make some syscalls fail softly with error codes, instead of hard through SIGSYS), and without giving up on any of the impositions set.

The aftermath

Of course, one does not go willy-nilly adding paranoid sandbox restrictions to a behemoth like GStreamer and call it an early evening. C library implementations will also issue different sets of system calls on different architectures and compile-time flags, so transitioning to a fully sandboxed process still means adding new restrictions to a non-trivial amount of code, and will definitely stir the SIGSYS bug report pot for some time to come, due to legit code hitting the newly set restrictions. This is already being the case, a thank you to the early reporters. The newly confined code is well trodden, so optimistically the grounds will settle back to relatively boring after another round of releases or two.

Reflections

While we could start splitting hairs about the tracker-extract process being allowed to do this or that while it shouldn’t, I think the current sandbox will keep tracker-miners itself out of the way of CVEs for a long, long time after this *knocks on wood*. I’m sure this will result in more pairs of eyes looking, and patches flowing and hahaha who am I kidding. I’ll likely think/work alone through ways to further decrease the exposed surface on future cycles out of the spotlight, after the waves of armchair comments have long been waded through. For now, any vulnerability exploit already has a much harder time at succeeding in evil.

I’d like to eventually see some tighter control on the chain of dependencies that tracker-miners has; maintaining a paranoid seccomp sandbox has very high friction with running arbitrarily large amounts of code out of your control, and I’m sure that didn’t just get better. There are reasons we typically commit to only supporting the latest and greatest upstream; we cannot decently assure forward compatibility of the seccomp sandbox.

And regardless, now more than ever I am happy with relying on third party libraries to deal with parsers and formats. It still is the only way to maintain a generic purpose indexer and keep a piece of your sanity. To the people willing to step up and rewrite prime Rust fodder, there’s gold if you pull from the thread of tracker-extract module dependencies.

Conclusion

I’d personally like to apologize to the libcue maintainers, as this likely blew out of proportion for them due to my inaction. Anything less dangerous-looking than a “1 click exploit” would have likely reduced the severity of the issue and the media coverage significantly.

And thanks again to Kevin for top notch management of the issue. My personal silver lining will be that this was pretty much a triumph of “enough eyes make bugs shallow”.

Getting the best of tablet pads

In case they needed an introduction, pads are these collections of buttons and tactile sensors (ring or strip shaped) most typically found along the side of drawing tablets. These devices will be today’s topic.

[Image: the Wacom ExpressKey Remote, a rare case of disembodied pad. Picture from Wacom.]

A bit of context

The concept behind pads is simple: a collection of triggers with no specific action associated, so users can configure the actions that best suit their tastes and setups. Other quality differences like the number of pressure levels aside, entry-level consumer tablets often have few buttons, while more advanced tablets typically get more buttons, and “modes” that multiply the amount of available mappable actions. If you are a pro, it is likely that you want quick access to more features, and so on.

When this kind of device came around many years ago and support for them was added in X11, they sat in a rough spot. As an X11 input driver, you either tell the X server that you are adding a device with buttons and valuators that drives the pointer sprite (e.g. a mouse), or you are adding a device with keycodes and levels (e.g. keyboards) subject to keymap translations. A pad device is notably neither of those things; the way to shove the square peg into the round hole was to make them pointers, thus they would send button and scroll events directed towards the pointer position, they would just not move the pointer.

We are also talking about a time when applications had (many still do) a hard time recognizing input from different source devices; to them a button press is just a button press, so those pads as-is would just click and scroll on things, quite far from the customizable-actions promised land. The pragmatic approach taken in GNOME (Shell now, settings-daemon back in the day) was grabbing those devices at the session level, not letting them be seen by applications, and converting them into something immediately useful. Given the limited choices, these pad button and scroll events got converted to keycombos.

This at least kept two promises: it’s a) customizable and b) universal. Although it long felt like falling short, since keycombos have some implicit problems associated:

  1. A small set of keycombos is close to universal, but many are not. So the typical choice is either universal but seldom used settings, or highly app-specific keycombos that are latently bollocks in any other application.
  2. Keycombos are also subject to developer/translator changes, etc. You might find your keycombos no longer hold if you change locale, or environment, or update your app.

But along came Wayland! Pads would not need to be subject to the limitations of the X server driver API and could become their own entity with their own semantics, and indeed they became a first class citizen of the Wayland tablet protocol available in wayland-protocols. This allowed us to let pad events be sent to clients, and have them either correctly interpreted in clients/toolkits as input from a pad device or safely ignored, instead of misinterpreted as another kind of input. The Wayland protocol opened the possibilities even further, like being able to provide readable strings for compositor feedback. I blogged about this… in 2016.

What since?

A drawback of the previous state of things and the massive change of approach is that no client was prepared to be in charge of pad events, so any integral support for pads had to start from the very bottom. But also, pads are most useful in very specific setups and applications, which at the time were nowhere near having Wayland support. In the GTK world, the requisites for it to work were:

  1. Using GTK >= 3.22
  2. Using GAction for application actions
  3. Using GtkPadController to hook actions

While that looks simple enough, the flagship applications were barely taking bites at the first step at the time, and after that, the second was not a small undertaking either for applications with hundreds of options.

Some weeks ago, I was fortunate to attend Wilber Week in Amsterdam, and was very glad to meet my GIMP friends again (and Inkscape/Blender folks!) despite my week-long stint with a commuter’s life after so long.

[Image: a baked Wilber. A moment I missed, immortalized in a splash screen.]

Over there, I learned that not long ago, GIMP finally ported away from GtkAction to GAction, lifting the last barrier to get pad actions working as envisioned, in one app at least :).

Since this is something I’ve been touting for so long to GIMP maintainers, of course I had to volunteer for the task, now up in a merge request.

Of course, most of the gist there is the configuration UI and the configuration serialization/deserialization; the conversion to actions is still done using a GtkPadController created from the configuration.

[Screenshot of the GIMP pad configuration UI]
While the UI could get more polish by using libwacom (e.g. nicer names, or pre-populating buttons/rings/strips in the available modes), it also needs to work with tablets not recognized by libwacom, or (maybe someday) on other platforms. The important part is that the application may define the action associated with a pad feature, and give it a friendly name. These also show up in the GNOME Shell pad OSD:

[Screenshot of the GNOME Shell pad OSD]

In this case it works as a cheat sheet of sorts, since there are pads with extreme combinations of buttons/modes (e.g. the ExpressKey Remote has 17 buttons by 3 modes), and per-app mapping does not make things easier to remember by itself.

Conclusion

An improved form of pad support has been stalled for too long, and it is amazing that it can already start to roll out. It looks like Inkscape is also ripe for such improvements, and during Wilber Week I was glad to hear a very positive attitude towards Wayland from Blender developers. Maybe the tides are turning to make this often neglected device shine?

A call to GNOME app developers

While these flagship applications are key for a major leap in support for pad devices, perhaps your application can do a little bit to help, if you see your app as a possible part of a designer/artist/etc. workflow. Using GtkPadController is rather easy with a fixed set of actions, so exposing a small set of useful actions could extend the usefulness of pad devices even further. This is for example what Nautilus does. You can also draw inspiration from the “Paint” GTK demo.
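For a rough idea of what that looks like in practice, here is a minimal sketch using the GTK 3 GtkPadController API from Python; the window, action group, action name and label below are made up for the example, and a real application would hook the action to something meaningful.

#!/usr/bin/python3
# A minimal, hypothetical example of mapping a pad button to a GAction
# through GtkPadController (GTK >= 3.22); names here are illustrative.

import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gio, Gtk

def on_undo(action, param):
    print('Undo triggered from the pad')

win = Gtk.Window(title='Pad demo')
win.connect('destroy', Gtk.main_quit)

# The actions that pad buttons/rings/strips will be able to trigger
group = Gio.SimpleActionGroup()
undo = Gio.SimpleAction.new('undo', None)
undo.connect('activate', on_undo)
group.add_action(undo)

# A None pad device means the controller reacts to every pad
controller = Gtk.PadController.new(win, group, None)
# Map pad button 0, in every mode (-1), to the "undo" action; the label
# is what compositors may show in e.g. the GNOME Shell pad OSD
controller.set_action(Gtk.PadActionType.BUTTON, 0, -1, 'Undo', 'undo')

win.show_all()
Gtk.main()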

Tracker 3.x, a retrospect.

Time flies, for better or for worse. The last time I bored you with ramblings on this blog was more than 2 years ago already, prepping up for Tracker 3.0. Since I’m sure you don’t need a general catch up about these last 2 years, let’s stay on that same subject.

Nowadays, we are very close to GNOME 43, and an accompanying 3.4.0 release of the Tracker SPARQL library and data miners, that is 4 minor releases ahead! What happened since then? Most immediately after that previous blog post, the 3.0 release rolled in, the uncertainty behind all major structural changes vanished, and the transition could largely be called a success: the overhauled internals for complete SPARQL support stood ground without large regressions; the promises of portals and data isolation delivered and to this day remain unchallenged (except for spare requests to let more pieces of metadata through); the increased genericity and versatility kept fostering further improvements.

Overall, there have been no major regrets, and we are now sitting comfortably in 3.x. And that was all good, since we could spend all that time keeping up with the improvements rather than fixing fallout. Let’s revisit what happened since.

Tracker (SPARQL library)

Testing

Ever since 3.0, test coverage has been growing fairly steadily. A very good thing about SPARQL and the Tracker API is that it is all very “circular”: every data format used in information exchange must be both parsed and produced by the SPARQL implementation, every external request made is also an external request it should be able to serve, and so on.

This makes it fairly easy to reach all corners in testing coverage, despite the involved complexity. It has also been quite a rule for some time that SPARQL language compliance fixes come with tests. The accumulated result over the years is a fairly large collection of tests of the internal machinery; there are over 330 subtests already for the SPARQL language alone.

As of the day of writing, Tracker stands at 76.4% coverage; we are getting very good at catching deviations from how the SPARQL library should behave, and decently good at catching how it should not misbehave. Simply following W3C standards and recommendations pays off here too, since it settles the direction and resolves most matters about what the right behavior is.

Developer Experience

3.0 marked the point where being able to create private Tracker databases with custom data models transitioned from an easter egg to a first-class feature. This also means that developers can now write an ontology (or data model, or schema, pick a name) that suits their data like a glove, instead of using the default Nepomuk one, which is well-trodden and literally written by academics, but will be overkill for the needs of individual applications.
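As an illustration of what that looks like from an application, here is a hedged sketch of opening a private store with a custom ontology; the database and ontology paths are made up, the latter standing for a directory with your own .ontology definitions.

#!/usr/bin/python3
# A hypothetical private store using a custom ontology instead of Nepomuk.

import gi
gi.require_version('Tracker', '3.0')
from gi.repository import Gio, Tracker

conn = Tracker.SparqlConnection.new(
    Tracker.SparqlConnectionFlags.NONE,
    Gio.File.new_for_path('my-app-data'),    # on-disk database location
    Gio.File.new_for_path('./ontology'),     # directory with the data model
    None)

# From here on it is plain SPARQL against your own classes and properties
conn.update('INSERT DATA { <myapp:thing1> a rdfs:Resource }', None)
conn.close()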

And of course there is room for failure in writing those ontologies. Last year, GSoC student Abanoub Ghadban worked hard on “breaking it”, polishing the experience and trying to produce helpful warnings, so developer mistakes are easily visible and solvable. The CLI tools provided facilitate these checks, e.g. creating a temporary endpoint that loads the ontology being edited, and running queries against it.

The documentation front also got steady improvements: the API itself is 100% documented and there is now a fully fleshed out SPARQL tutorial. Also, drawing inspiration from SQLite, there are now miscellaneous docs on some implementation details like limits, a discussion on the security considerations of the implemented specs, or extensions and interpretations of the SPARQL spec. The examples have been modernized as well, and are additionally written in Python and JavaScript.

A great blunder of how Tracker tended to be used in applications was having SPARQL mixed in code, or worse, built through string manipulation. The latter got better API-wise in the past with compiled statements, but the mix of code and database logic was still prevalent. Since 3.3, there is support for loading and creating compiled statements from query files located in GResources. This neatly addresses the separation of queries and code, keeps them indissoluble from the produced binary, and preserves the benefits of compiled statements (compile once, run many times).
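A small sketch of how that might look from Python follows; the bus name is the files miner, while the GResource path, query file and parameter are hypothetical.

#!/usr/bin/python3
# Loading a compiled statement from a GResource (API available since 3.3).

import gi
gi.require_version('Tracker', '3.0')
from gi.repository import Tracker

conn = Tracker.SparqlConnection.bus_new(
    'org.freedesktop.Tracker3.Miner.Files', None, None)

# The query text lives in a .rq file compiled into the program's GResources,
# e.g. "SELECT ?f { ?f a nfo:FileDataObject ; nfo:fileName ~name }"
stmt = conn.load_statement_from_gresource(
    '/org/example/App/queries/file-by-name.rq', None)
stmt.bind_string('name', 'meson.build')
cursor = stmt.execute(None)
while cursor.next():
    print(cursor.get_string(0))

cursor.close()
conn.close()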

Performance

One of the benefits of the API provided by Tracker is that it gives a great amount of leeway for internal refactors without altering the surface. We are largely just constrained, backwards-compatibility-wise, by the underlying database format. This has allowed further optimizations to happen under the hood since 3.0, both in terms of database updates and queries. Databases also now take less space, especially in the presence of many blank nodes.

The greatest performance boost for data producers, though, can be obtained through the TrackerBatch API (since 3.1). Prior to it, a TrackerResource would normally be used to build RDF data, then used to produce a SPARQL update, and the SPARQL update parsed to generate and apply the RDF changes. This new API can efficiently traverse a series of TrackerResources (describing RDF already) and turn them into database modifications, skipping the SPARQL middle man altogether.

In the git tree, there is now a small utility to benchmark certain uses of the Tracker API; let’s see how the output looks on a modern Intel i7 for an in-memory database:

Batch size: 5000, Individual test duration: 30 sec
Opening in-memory database…
                           Test		Elements	Elems/sec	Min         	Max         	Avg
   Resource batch update (sync)		6169883.801	205662.793	4.292 usec	5.615 usec	4.862 usec
     SPARQL batch update (sync)		2430664.747	81022.158	11.889 usec	14.255 usec	12.342 usec
   Resource modification (sync)		4440988.603	148032.953	6.588 usec	8.438 usec	6.755 usec
  Resource insert+delete (sync)		3033137.552	101104.585	9.689 usec	12.669 usec	9.891 usec
Prepared statement query (sync)		8566182.714	285539.424	3.000 usec	745.000 usec	3.502 usec
            SPARQL query (sync)		1329076.956	44302.565	21.000 usec	189.000 usec	22.572 usec

After the usual disclaimer that this benchmark utility greatly relies on CPU and disk characteristics, and your mileage may vary, there’s a few things to highlight here:

  • Using modern APIs always pays off. Inserting data directly from TrackerResource (Resource batch update) is 2.5x faster than inserting data through SPARQL updates (SPARQL batch update), and querying through prepared statements (Prepared statement query) is ~6.5x faster than repeating SPARQL queries (SPARQL query).
  • Even though the tests are run synchronously on the main loop, queries can be greatly parallelized, so the actual throughput on a modern machine will be much higher in reality. Updates are single-threaded though.
  • Used the right way, Tracker code is never in the hot paths; the merits there go to SQLite and the production of the data itself. You are expected to get results that are in the same ballpark as using SQLite directly, given similar volumes and layout of data.

But this snapshot does not fully highlight the improvements done. For a reference baseline, a backported version of this benchmark on the same computer over Tracker 3.0.x gives:

Batch size: 5000, Individual test duration: 30 sec
Opening in-memory database…
                           Test		Elements	Elems/sec	Min         	Max         	Avg
     SPARQL batch update (sync)		1387346.192	46244.873	17.035 usec	24.290 usec	21.624 usec
   Resource modification (sync)		259923.682	8664.123	49.863 usec	122.236 usec	115.418 usec
  Resource insert+delete (sync)		707638.539	23587.951	41.593 usec	73.702 usec	42.395 usec
Prepared statement query (sync)		7729898.742	257663.291	3.000 usec	527.000 usec	3.881 usec
            SPARQL query (sync)		888896.319	29629.877	31.000 usec	180.000 usec	33.750 usec

Looking past the lack of TrackerBatch there, it’s still easy to see there have been massive improvements pretty much all over the board. As we already encourage: using the latest Tracker gives you the best Tracker.

Data serialization

SPARQL was very much thought out with the task of storing RDF data, querying into RDF data, and moving RDF data around. From the tiniest resource/value existence check, to full content dumps, everything is one query away.

What we were lacking was a consistent way to convert all that data into something that could easily be piped through, saved, processed, etc. To make this task easy, there is now API to serialize and deserialize data between a Tracker database and the popular RDF file formats. This is performed efficiently, with flat RAM usage on both ends; it is even possible to pipe these APIs together, with the RDF data never existing anywhere at once during the process.
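For a rough idea of how this looks from the outside, here is a hedged sketch dumping the description of a resource as Turtle through the (async-only) serialization API; the bus name and resource IRI are purely illustrative.

#!/usr/bin/python3
# Serializing query results straight to an RDF format (available since 3.3).

import gi
gi.require_version('Tracker', '3.0')
from gi.repository import GLib, Tracker

loop = GLib.MainLoop()
conn = Tracker.SparqlConnection.bus_new(
    'org.freedesktop.Tracker3.Miner.Files', None, None)

def on_serialized(connection, res, *user_data):
    # The result is a GInputStream producing RDF, never held in full in memory
    stream = connection.serialize_finish(res)
    data = stream.read_bytes(64 * 1024, None)  # first 64 KiB, enough for a demo
    print(data.get_data().decode('utf-8'))
    loop.quit()

conn.serialize_async(Tracker.SerializeFlags.NONE,
                     Tracker.RdfFormat.TURTLE,
                     'DESCRIBE <urn:some-resource>',
                     None, on_serialized)
loop.run()
conn.close()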

And since this serialization to RDF formats is a builtin feature of HTTP endpoints, it has allowed us to level up our support for these. A Tracker HTTP endpoint is now entirely compliant and indistinguishable from the larger players.

The most immediately useful application of this serialization support is in the CLI tools; the import and export commands now use this API and can deal with these formats. But what is this for? Is this driven by a level of completionism that borders on sickness? Well, yes, but there are of course plans around these features, more on that later.

Tracker Miners

Performance

You might think the SPARQL library improvements above would be the largest improvement the filesystem miner could get, and you would be wrong.

Part of the raison d’être of a filesystem indexer is to stay up-to-date with filesystem changes. In the GNOME world, this catching up is usually done through a GFileMonitor, which provides a GLib-friendly way to do the dirty job of setting up and tracking an inotify handle to watch changes on an individual directory for you. What is wrong with that? Nothing, unless you do it at a large scale like indexers do. Each of those GFileMonitors is backed by a pollable FD, and a GSource wrapping it, and iterating a GMainContext that has thousands of GSources attached to it massively, thoroughly sucks.

Is this a case of Tracker abusing a perfectly fine GLib API? Or on the contrary is this a case of bad GLib API design? I will let you debate on that, as I am unclear myself.

The first solution to alleviate that (since 3.1) was delegating file monitoring to a separate thread, so the GMainContext that is expensive to iterate only affects file monitors, as opposed to everything else that goes on. Later on, FANotify finally gained the missing features that made it suitable for indexers (not requiring CAP_SYS_ADMIN was one of them) and Tracker Miners got an implementation for it (since 3.3). Most notably, with this kernel API it is only necessary to poll a single FD to receive events for all FANotify marks set on the filesystem.
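The gist of the “file monitors on their own thread” idea can be sketched in a few lines of Python; this is of course not Tracker’s actual implementation, just the GLib pattern it builds on: push a thread-default GMainContext so the monitors attach their sources there instead of weighing down the default main loop.

#!/usr/bin/python3
# A toy version of delegating GFileMonitors to a dedicated thread/context.

import threading
from gi.repository import Gio, GLib

def monitor_thread(paths):
    # Sources created below get attached to this context, not the default one
    context = GLib.MainContext.new()
    context.push_thread_default()

    monitors = []
    for path in paths:
        directory = Gio.File.new_for_path(path)
        monitor = directory.monitor_directory(Gio.FileMonitorFlags.NONE, None)
        monitor.connect('changed',
                        lambda mon, child, other, event: print(child.get_path(), event))
        monitors.append(monitor)  # keep them alive

    GLib.MainLoop.new(context, False).run()

threading.Thread(target=monitor_thread, args=(['/tmp'],), daemon=True).start()

# The default main loop stays light, no matter how many monitors exist
GLib.MainLoop().run()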

In what sounds like a case of miscommunication between kernel developers developing independent new features that didn’t mix well, it is unfortunately not possible nowadays to mix the bleeding edge in file monitors (FANotify) with the bleeding edge in filesystems (btrfs); for these (and other) cases Tracker Miners will still fall back on plain GLib/inotify. Hopefully the situation will be resolved at some point.

In 3.1, the filesystem indexer also implemented flow control mechanisms that allow its RAM usage to stay mostly flat independently of the filesystem size and layout. At the peak of its activity, tracker-miner-fs-3 uses 30-40MB here (per gnome-system-monitor), and idles at 5MB. Needless to say, it is also many times faster than its past 3.0 self.

But all this was about tracker-miner-fs-3, the daemon in charge of monitoring filesystem changes and mirroring file/folder structure into the database. What about tracker-extract-3, the daemon in charge of nitty-gritty file metadata extraction? When this step needs to happen (say, on newly indexed or modified files), it is by all accounts expensive; now, by the sheer magic of everything else shrinking, it comparatively only got worse. There is a reason we avoid it happening frequently at all costs.

But what is slow there? Roughly speaking, it should be just a loop going through files, getting the metadata and inserting it, and that should be fucking fast as per the benchmarks above, right? Right, the problem is in the “getting the metadata” step. This will wildly fluctuate depending on the files scattered in the filesystem, their mimetypes, and the libraries used to extract their metadata. The plain text extractor or the in-tree MP3 extractor are capable of opening, extracting metadata from, and closing multiple thousands of files per second. All the external libraries used for metadata extraction (yes, all of them) are slower, ranging from several times slower to up to 4 orders of magnitude, also depending on the input files (I curated some infernal PDFs). The worst offenders are Poppler, GStreamer and libtiff.

As is evident here (you don’t need to believe me, add TRACKER_DEBUG=statistics to /etc/environment and reset/restart the miner), most libraries dealing with files and formats optimize for library resources being long lived across an application lifetime, while optimizing the creation and disposal of those library resources is often overlooked. The metadata extraction daemon faces that hard fact file after file, so its slowness is just a reflection of the slowness of these libraries in setting themselves up. If, after all of this, someone thinks the filesystem indexer is slow, that is where the money is.

Extending and maintaining metadata

Although the focus has mainly been on making things work reliably, rather than going crazy with extending the metadata stored (yet), a point worth noting here is last year’s GSoC work from student Nishit Patel, who worked on indexing creation time (on the filesystems that support/enable it), and allowing for its search all across the stack.

We also got support for a number of game file formats (mainly, retro ones), which GNOME Games (now Highscore) readily made use of. LOL, jk.

Handling and following failures

Whenever a file is broken or corrupted, or a 3rd party library crashes or produces a syscall that is caught by seccomp, the tracker-extract-3 daemon will quit (with varying degrees of gracefulness) and be taught on the next restart to avoid the file that triggered this situation. This is not precisely new behavior; what is new is that these failures are now recorded and can be easily inspected over the CLI with tracker3 status. Most bugs we receive about broken extraction are reactive (e.g. “why does Music not show this file?”); this allows for a more proactive approach to fixing metadata extraction bugs, if users happen to look there and cooperate.

There is also a slight possibility that extraction bugs are due to Tracker itself, but these are largely a thing of the past.

Coming up next…

I very much cheered when I learnt of the “Local first” initiative. In fact, I so much anticipated it that I literally anticipated it. Development of the serialization APIs started sometime around the last year, with a plan to provide facilities to transparently and neatly synchronize RDF data across instances in multiple machines owned by the same user.

Who wants that? Certainly not the filesystem indexer. However, there’s indeed a desire to avoid reliance on third party services for user sensitive data like their own health information, chat logs, or bookmarked sites. With some truckloads of optimism, I would hope that this becomes a cornerstone of that goal, for applications under the GNOME umbrella that need to deal with a non-trivial amount of data.

How would that work? What do we need to get there? We need a query language that supports it (check, duh), a data model that can handle the different patterns that might emerge in synchronizing data (check), a way to make these machines talk to each other (check), and a way to diff missing data (check). All the pieces are really set, so what is missing is putting them together, of course drawing inspiration from Christian Hergert’s Bonsai to make machines discover each other.

And of course there is still very much a desire to keep the heart of content applications compelling and relevant. There’s still opportunities to further extend and link the metadata stored by the filesystem indexer, perhaps with the help of the actual semantic web that lives out there. We already have a number of universal identifiers available (musicbrainz tags, IMDB IDs, game rom IDs) to interrelate and cross-reference data.

Now that the codebase features are settled and working well, we can start thinking about new fancy features. Stay tuned for the next installment of this series in 2024, when I talk about Tracker 3.8.0, or perhaps Tracker 3.5.20. If you made it this far, you have my appreciation, until the next time!

Tracker 2.99.1 and miners released

TL;DR: $TITLE, and a call for distributors to make it easily available in stable distros, more about that at the bottom.

Sometime this week (or last, depending on how you count), Tracker 2.99.1 was released. Sam has been doing a fantastic series of blog posts documenting the progress. With my blogging frequency I’m far from stealing his thunder :), but I will still add some retrospect here to highlight how important of a milestone this is.

First of all, let’s give an idea of the magnitude of the changes so far:


[carlos@irma tracker]$ git diff origin/tracker-2.3..origin/master --stat -- docs examples src tests utils |tail -n 1
788 files changed, 20475 insertions(+), 66384 deletions(-)

[carlos@irma tracker-miners]$ git diff origin/tracker-miners-2.3..origin/master --stat -- data docs src tests | tail -n 1
354 files changed, 39422 insertions(+), 6027 deletions(-)

What did happen there? A little more than half of the insertions in tracker-miners (and corresponding deletions in tracker) can be attributed to code from libtracker-miner, libtracker-control and corresponding tests moving to tracker-miners. Those libraries are no longer public, but given those are either unused or easily replaceable, that’s not even the most notable change :).

The changes globally could be described as “things falling in place”, Tracker got more cohesive, versatile and tested than it ever was, we put a lot of care and attention to detail, and we hope you like the result. Let’s break down the highlights.

Understanding SPARQL

Sometime a couple of years ago, I got fed up after several failed attempts at implementing support for property paths, which wound up in a rewrite of the SPARQL parser. This was part of Tracker 2.2.0 and brought its own benefits, ancient history.

Getting to the point, having the expression tree in the new parser closely modeled after the SPARQL 1.1 grammar definition helped get a perfect snapshot of what we don’t do, what we don’t do correctly and what we do extra. The parser was made to accept all correct SPARQL, and we had an `_unimplemented()` define in place to error out when interpreting the expression tree.

But that also gave me something to grep through and sigh at; this turned into many further reads of the SPARQL 1.1 specs, and a number of ideas about how to tackle them, or whether we could at all given compatibility concerns, as for some things we were limited by our own database structure.

Fast forward to today, the define is gone. Tracker covers the SPARQL 1.1 language in its entirety, warts and everything. The spec is from 2013, we just got there 7 years late :). Most notably, there’s:

  • Graphs: In a triple store, the aptly named triples consist of subject/predicate/object, and they belong within graphs. The object may point to elements in other graphs.

    In prior versions, we “supported graphs” in the language, but those were more a property of the triple’s object. This changes the semantics slightly in appearance but in fundamental ways, e.g. now two graphs may have the same triple, and the ownership of the triple is backwards if subject and object are in different graphs.

    Now the implementation of graphs perfectly matches the description, and becomes a good isolated unit to let access in the case of sandboxing.

    We also support the whole-graph ADD/MOVE/CLEAR/LOAD/DROP operations on graphs, to ease their management.

  • Services: The SERVICE syntax allows federating portions of your query graph pattern to external services, and operating transparently on that data as if it were local. This is not exactly new in Tracker 2.99.x, but it now supports D-Bus services in addition to HTTP ones. More notes about why this is key further down.
  • New query forms, DESCRIBE/CONSTRUCT: This syntax sits alongside SELECT. DESCRIBE is a simple form to get RDF triples fully describing a resource, CONSTRUCT is a more powerful data extraction clause that allows serializing arbitrary portions of the triple set, even all of it, and even across RDF schemas.

Of all 11 documents from the SPARQL recommendations, we are essentially only missing support for HTTP endpoints to entirely pass for a SPARQL 1.1 store. We obviously don’t mean to compete with enterprise-level databases, but we are completionists and will get to implementing the full recommendations someday :).

There is no central store

The tracker-store service got stripped of everything that made it special. You were already able to create private stores; making those public via D-Bus is now one API call away. And its simple D-Bus API to perform/restore backups is now superseded by the CONSTRUCT and LOAD syntax.

We have essentially democratized triple stores; in this picture (and a sandboxed world) it does not make sense to have a singleton default one, so the tracker-store process itself is no more. Each miner (Filesystem, RSS) has its own store, made public on its D-Bus name. The TrackerSparqlConnection constructors let you specifically create a local store, or connect to a specific D-Bus/HTTP service.
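The “one API call away” part refers to the endpoint objects. A hedged sketch of publishing a private store on the session bus could look as follows, assuming the TrackerEndpointDBus API and the Nepomuk ontology helper; the setup itself is illustrative.

#!/usr/bin/python3
# Exposing an application-private store to other processes over D-Bus.

import gi
gi.require_version('Tracker', '3.0')
from gi.repository import GLib, Gio, Tracker

# A private, in-memory store owned by this process
conn = Tracker.SparqlConnection.new(
    Tracker.SparqlConnectionFlags.NONE, None,
    Tracker.sparql_get_ontology_nepomuk(), None)

# Publishing it is the "one API call away" part; clients reach it through
# this process' bus name, e.g. via a SERVICE clause or bus_new()
bus = Gio.bus_get_sync(Gio.BusType.SESSION, None)
endpoint = Tracker.EndpointDBus.new(conn, bus, None, None)

GLib.MainLoop().run()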

No central service? New paradigm!

Did you use to store/modify data in tracker-store? There’s some bad news: It’s no longer for you to do that, scram from our lawn!

You are still much welcome to create your own private store, there you can do as you please, even rolling something else than Nepomuk.

But wait, how can you keep your own store and still consume data indexed by the Tracker miners? Here comes the SERVICE syntax into play, allowing you to deal with miner data and your own altogether. A simple hypothetical example:

# Query favorite files
SELECT ?u {
  SERVICE <dbus:org.freedesktop.Tracker3.Miner.Files> {
    ?u a nfo:FileDataObject
  }
  ?u mylocaldata:isFavorite true
}

As per the grammar definition, the SERVICE syntax can only be used from Query forms, not Update ones. This is essentially the language conspiring to keep a clear ownership model, where other services are not yours to modify.

If you are only interested in accessing one service, you can use tracker_sparql_connection_bus_new and perform queries directly against the remote service.

A web presence

It’s all about appearance these days, that’s why newscasters don’t switch the half of the suit they wear. A long time ago, we used to have the tracker-project.org domain; the domain expired and eventually got squatted.

That normally sucks on its own, but for us it was a bit of a pickle: RDF (and our own ontologies) stands largely on URIs, which means live software producing links out of our control, and those links going to pastes/bugs/forums all over the internet. Luckily for us, tracker-project.org is a terrible choice of name for a porn site.

We couldn’t simply do the change either; in many regards those links were ABI. With 3.x on the way, ABI was no longer a problem, and Sam did things properly, so we have a site, and a proper repository of ontologies.

Nepomuk is dead, long live Nepomuk

Nepomuk is a dead project. Despite its site being currently alive, it’s been dead for extended periods of time over the last 2 years. That’s 11.5M EUR of your European taxpayer money slowly fading away.

We no longer think we should consider it “an upstream”, so we have decided to go our own way. After some minor sanitization and URI rewriting, the Nepomuk ontology is preserved mostly as-is, under our own control.

But remember, Nepomuk is just our “reference” ontology, a swiss army knife for whatever might need to be stored in a desktop. You can always roll your own.

Tracker-miner-fs data layout

For sandboxing to be any useful, there must be some actual data separation. The tracker-miner-fs service now stores things in several graphs:

  • tracker:FileSystem
  • tracker:Audio
  • tracker:Video
  • tracker:Documents
  • tracker:Software

And it commits further to the separation between “Data Objects” (e.g. files) and “Information Elements” (e.g. what their content represents). Both aspects of a “file” still reference each other, but they simply used to be the same resource previously.

The tracker:FileSystem graph is the backbone of file system data, it contains all file Data Objects, and folders. All other graphs store the related Information Elements (eg. a song in a flac file).

Resources are interconnected between graphs, depending on the graphs you have access to, you will get a partial (yet coherent) view of the data.
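As a small illustration of how that partial view looks in queries, here is a hypothetical example scoping a match to one of the graphs above (assuming the Nepomuk music classes and the files miner bus name):

#!/usr/bin/python3
# Querying only the tracker:Audio graph of the filesystem miner.

import gi
gi.require_version('Tracker', '3.0')
from gi.repository import Tracker

conn = Tracker.SparqlConnection.bus_new(
    'org.freedesktop.Tracker3.Miner.Files', None, None)
cursor = conn.query('''
    SELECT ?title {
      GRAPH tracker:Audio {
        ?song a nmm:MusicPiece ;
              nie:title ?title
      }
    }''', None)
while cursor.next():
    print(cursor.get_string(0))

cursor.close()
conn.close()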

CLI improvements

We have been doing some changes around our CLI tools. With tracker shifting its scope to being a good SPARQL triple store, the base set of CLI tools revolves around that, and can be seen as an equivalent of the sqlite3 CLI command.

We also have some SPARQL specific sugar, like tracker endpoint that lets you create transient SPARQL services.

All miner-specific subcommands, or those that relied implicitly on their details, did move to the tracker-miners repo; the tracker3 command is extensible to allow this.

Documentation

In case this was not clear, we want to be a general purpose data storage solution. We did spend quite some time improving and extending the developer and ontology documentation, adding migration notes… there’s even an incipient SPARQL tutorial!

There is a sneak preview of the API documentation at our site. It’s nice being able to tell that again!

Better tests

Tracker additionally ships a small helper Python library to make it easy to write tests against Tracker infrastructure. There are many new and deeper tests all over the place, e.g. around new syntax support.

Up next…

You’ve seen some talk about sandboxing, but nothing about sandboxing itself. That’s right, support for it is in a branch and will probably be part of 2.99.2. Now the path is paved for it to be transparent.

We currently are starting the race to update users. Sam got some nice progress on nautilus, and I just got started at shaving a yak on a cricket.

The porting is not completely straightforward. With few nice exceptions, a good amount of the Tracker code around is stuck in some time-frozen, “as long as it works”, cargo-culted state. This sounds like a good opportunity to modernize queries, and introduce the usage of compiled statements. We are optimistic that we’ll get most major players ported in time, and have made 3.x able to install and run in parallel in case we miss the goal.

A call to application developers

We are no longer just “that indexer thingy”. If you need to store data with more depth than a table. If you missed your database design and relational algebra classes, or don’t miss them at all. We’ve got to talk :), come visit us at #tracker.

A call to distributors

We made tracker and tracker-miners 3.x able to install and run in parallel to tracker 2.x, and we expect users to get updated to it over time.

Given that it will get reflected in nightly flatpaks, and Tracker miners are host services, we recommend that tracker3 development releases are made available or easy to install in current stable distribution releases. Early testers and ourselves will thank you.

Gnome-shell Hackfest 2019 – Day 3

As promised, some late notes on the 3rd and last day of the gnome-shell hackfest, so yesterday!

Some highlights from my partial view:

  • We had a mind blowing in-depth discussion about the per-crtc frame clocks idea that’s been floating around for a while. What started as “light” before-bedtime conversation the previous night continued the day after, straining our neurons in front of a whiteboard. We came out wiser nonetheless, and have a much more concrete idea about how it should work.
  • Georges updated his merge request to replace Cogl structs with graphene ones. This now passes CI and was merged \o/
  • Much patch review happened in place, and some other pretty notable refactors and cleanups were merged.
  • The evening was more rushed than usual, with some people leaving already. The general feeling seemed good!
  • In my personal opinion the outcome was pretty good too. There’s been progress at multiple levels and new ideas sparked, you should look forward to posts from others :). It was also great to put a face to some IRC nicks, and meet again all the familiar ones.

Kudos to the RevSpace members and especially Hans, without them this hackfest couldn’t have happened.

Gnome-shell Hackfest 2019 – Day 2

Well, we are starting the 3rd and last day of this hackfest… I’ll write about yesterday, which probably means tomorrow I’ll blog about today :).

Some highlights of what I was able to participate in or witness:

  • Roman Gilg of KDE fame came to the hackfest, it was a nice opportunity to discuss mixed DPI densities for X11/Xwayland clients. We first thought about having one server per pixel density, but later on we realized we might not be that far from actually isolating all X11 clients from each other, so why stop there.
  • The conversation drifted into other topics relevant to desktop interoperation. We discussed window activation and focus stealing prevention, a topic “fixed” in GNOME but through a private protocol. I already had a protocol draft around, which was sent today to the wayland-devel ML.
  • A plan was devised for what is left of Xwayland-on-demand, and an implementation is in progress.
  • The designers have been doing some exploration and research on how we interact with windows, the overview and the applications menu, and thinking about alternatives. At the end of the day they’ve demoed to us the direction they think we should take.

    I am very much not a designer and I don’t want to spoil their fine work here, so stay tuned for updates from them :).

  • As the social event, we had a very nice BBQ with some hackerspace members. Again kindly organized by Revspace.

Gnome-shell Hackfest 2019 – Day 1

So today kickstarted the gnome-shell hackfest in Leidschendam, the Netherlands.

There’s a decent number of attendants from multiple parties (Red Hat, Canonical, Endless, Purism, …). We all brought various items and future plans for discussion, and have a number of merge requests in various states to go through. Some exciting keywords are Graphene, YUV, mixed DPI, Xwayland-on-demand, …

But that is not all! Our finest designers also got together here, and I overheard they are discussing usability of the lock screen, among other topics.

This event wouldn’t have been possible without the Revspace hackerspace people and specially our host Hans de Goede. They kindly provided the venue and necessary material, I am deeply thankful for that.

As there are various discussions going on simultaneously it’s kind of hard to keep track of everything, but I’ll do my best to report back over this blog. Stay tuned!

What am I doing with Tracker?

“Colored net” by Chris Vees (priorité maison) is licensed under CC BY-NC-ND 2.0

Some years ago I was asked to come up with some support for sandboxed apps wrt indexed data. This drummed up into Tracker 2.0 and domain ontologies, allowing those sandboxed apps to keep their own private data and collection of Tracker services to populate it.

Fast forward to today and… this is still largely unused; Tracker-using flatpak applications still whitelist org.freedesktop.Tracker, and are thus allowed to read and change content there. Although I’ve been told it’s been mostly lack of time… I cannot blame them, domain ontologies offer the perfect isolation at the cost of the perfect duplication. It may do the job, but is far from optimal.

So I got asked again “do we have a credible story for sandboxed Tracker?”. One way or another, it seems we don’t, back to the drawing board.

Somehow, the web world seems to share some problems with our case, and seems to handle it with some degree of success. Let’s have a look at some excerpts of the Sparql 1.1 recommendation (emphasis mine):

RDF is often used to represent, among other things, personal information, social networks, metadata about digital artifacts, as well as to provide a means of integration over disparate sources of information.

A Graph Store is a mutable container of RDF graphs managed by a single service. […] named graphs can be added to or deleted from a Graph Store. […] a Graph Store can keep local copies of RDF graphs defined elsewhere […] independently of the original graph.

The execution of a SERVICE pattern may fail due to several reasons: the remote service may be down, the service IRI may not be dereferenceable, or the endpoint may return an error to the query. […] Queries may explicitly allow failed SERVICE requests with the use of the SILENT keyword. […] (SERVICE pattern) results are returned to the federated query processor and are combined with results from the rest of the query.

So according to Sparql 1.1, we have multiple “Graph Stores” that manage multiple RDF graphs. They may federate queries to other endpoints with disparate RDF formats and whose availability may vary. This remote data is transparent, and may be used directly or processed for local storage.

Let’s look back at Tracker, we have a single Graph Store, which really is not that good at graphs. Responsibility of keeping that data updated is spread across multiple services, and ownership of that data is equally scattered.

It snapped for me: if we transpose those same concepts from the web to the network of local services that your session is, we can use those same mechanisms to cut a number of drawbacks short:

  • Ownership is clear: If a service wants to store data, it would get its own Graph Store instead of modifying “the one”. Unless explicitly supported, Graph Stores cannot be updated from the outside.
  • So is lifetime: There’s been debate about whether data indexed “in Tracker” is permanent data or a cache. Everyone would get to decide their best fit, unaffected by everyone else’s decisions. The data from tracker-miners would totally be a cache BTW :).
  • Increases trustability: If Graph Stores cannot be tampered with externally, you can trust their content to represent the best effort of their only producer, instead of the minimum common denominator of all services updating “the Graph Store”.
  • Gives a mechanism for data isolation: Graph Stores may choose limiting the number of graphs seen on queries federated from other services.
  • Is sandboxing friendly: From inside a sandbox, you may get limited access to the other endpoints you see, or to the graphs offered. Updates are also limited by nature.
  • But works the same without a sandbox. It also has some benefits, like reducing data duplication, and makes for smaller databases.

Domain ontologies from Tracker 2.0 also handle some of those differently, but very very roughly. So the first thing to do to get to that RDF nirvana was muscling up that Sparql support in Tracker, and so I did! I already had some “how could it be possible to do…” plans in my head to tackle most of those, but unfortunately they require changes to the internal storage format.

As it seemed the time to do one (FTR, the storage format has been “unchanged” since 0.15), I couldn’t just do the bare minimum work; it seemed too good an opportunity to miss, instead of maybe having to make future format changes for leftover Sparql 1.1 syntax support.

Things ended up escalating into https://gitlab.gnome.org/GNOME/tracker/commits/wip/carlosg/sparql1.1, where it can be said that Tracker supports 100% of the Sparql 1.1 syntax. No buts, maybe bugs.

Some notable additions are:

  • Graphs are fully supported there, along with all graph management syntax.
  • Support for query federation through SERVICE {}
  • Data dumping through DESCRIBE and CONSTRUCT query forms.
  • Data loading through LOAD update form.
  • The pesky negated property path operator.
  • Support for rdf:langString and rdf:List
  • All missing builtin functions

This is working well, and is almost drop-in (one’s got to mind the graph semantics), so making it material for GNOME 3.34 starts to sound realistic.

As Sparql 1.1 is a recommendation finished in 2013, and no other newer versions seem to be in the works, I think it can be said Tracker is reaching maturity. Only the HTTP Graph Store Protocol (because why not) remains as the big-ish item before we can reasonably claim to implement all 11 documents. Note that Tracker’s bet for RDF and Sparql started at a time when 1.0 was the current document and 1.1 just an early draft.

And sandboxing support? You might guess already the features it’ll draw from. It’s coming along, actually using Tracker as described above will go a bit deeper than the required query language syntax, more on that when I have the relevant pieces in place. I just thought I’d stop a moment to announce this huge milestone :).

A mutter and gnome-shell update

Some personal highlights:

Emoji OSK

The reworked OSK was featured a couple of cycles ago, but a notable thing that was still missing from the design reference was emoji input.

No more, sitting in a branch as of yet:

This UI feeds from the same emoji list as GtkEmojiChooser, and applies the same categorization/grouping; all the additional variants of an emoji are available in a popover. There’s also a (less catchy) keypad UI in place, ultimately hooked to applications through GtkInputPurpose.

I do expect this to be in place for 3.32 for the Wayland session.

X11 vs Wayland

Ever since the Wayland work started on mutter, there have been ideas and talks about how mutter “core” should become detached from X11 code. It has been a long and slow process, every design decision has been directed towards this goal, we leaped forward in the 2017 GSoC, and e.g. Georges sums up some of his own recent work in this area.

For me it started with a “Hey, I think we are not that far off” comment in #gnome-shell earlier this cycle. Famous last words. After rewriting several, many, seemingly unrelated subsystems, and shuffling things here and there, we are at a point where gnome-shell might run with --no-x11 set. A little push more and we will be able to launch mutter as a pure Wayland compositor that just spawns Xwayland on demand.

What’s after that? It’s certainly an important milestone, but by no means are we done here. Also, gnome-settings-daemon consists for the most part of X11 clients, which spoils the fun by requiring Xwayland very early in a real session, guess what’s next!

At the moment about 80% of the patches have been merged. I cannot assure at this point that it will all be in place for 3.32, but 3.34 most surely. But here’s a small yet extreme proof of work:

Performance

It’s been nice to see some of the performance improvements I did last cycle finally being merged. Some notable ones, like the one that stopped triggering full surface redraws on every surface invalidation. I also managed to get some blocking operations out of the main loop, which should fix many of the seemingly random stalls some people were seeing.

Those are already in 3.31.x, with many other nice fixes in this area from Georges, Daniel Van Vugt et al.

Fosdem

As a minor note, I will be attending Fosdem and the GTK+ Hackfest happening right after. Feel free to say hi or find Wally, whichever comes first.