Developing an application with TinySPARQL in 2025

Back a couple of months ago, I was given the opportunity to talk at LAS about search in GNOME, and the ideas floating around to improve it. Part of the talk was dedicated to touting the benefits of TinySPARQL as the base for filesystem search, and how in solving the crazy LocalSearch usecases we ended up with a very versatile tool for managing application data, either application-private or shared with other peers.

It was no one else than our (then) future ED in a trench coat (I figure!) who forced my hand in the question round into teasing an application I had been playing with, to showcase how TinySPARQL should be used in modern applications. Now, after finally having spent some more time on it, I feel it’s up to a decent enough level of polish to introduce it more formally.

Behold Rissole

Rissole is a simple RSS feed reader, intended let you read articles in a distraction free way, and to keep them all for posterity. It also sports a extremely responsive full-text search over all those articles, even on huge data sets. It is built as a flatpak, you can ATM download it from CI to try it, meanwhile it reaches flathub and GNOME Circle (?). Your contributions are welcome!

So, let’s break down how it works, and what does TinySPARQL bring to the table.

Structuring the data

The first thing a database needs is a definition about how the data is structured. TinySPARQL is strongly based on RDF principles, and depends on RDF Schema for these data definitions. You have the internet at your fingertips to read more about these, but the gist is that it allows the declaration of data in a object-oriented manner, with classes and inheritance:


mfo:FeedMessage a rdfs:Class ;
    rdfs:subClassOf mfo:FeedElement .

mfo:Enclosure a rdfs:Class ;
    rdfs:subClassOf mfo:FeedElement .

One can declare properties on these classes:


mfo:downloadedTime a rdf:Property ;
    nrl:maxCardinality 1 ;
    rdfs:domain mfo:FeedMessage ;
    rdfs:range xsd:dateTime .

And make some of these properties point to other entities of specific (sub)types, this is the key that makes TinySPARQL a graph database:


mfo:enclosureList a rdf:Property ;
    rdfs:domain mfo:FeedMessage ;
    rdfs:range mfo:Enclosure .

In practical terms, a database needs some guidance on what data access patterns are most expected. Being a RSS reader, sorting things by date will be prominent, and we want full-text search on content. So we declare it on these properties:


nie:plainTextContent a rdf:Property ;
    nrl:maxCardinality 1 ;
    rdfs:domain nie:InformationElement ;
    rdfs:range xsd:string ;
    nrl:fulltextIndexed true .

nie:contentLastModified a rdf:Property ;
    nrl:maxCardinality 1 ;
    nrl:indexed true ;
    rdfs:subPropertyOf nie:informationElementDate ;
    rdfs:domain nie:InformationElement ;
    rdfs:range xsd:dateTime .

The full set of definitions will declare what is permitted for the database to contain, the class hierarchy and their properties, how do resources of a specific class interrelate with other classes… In essence, how the information graph is allowed to grow. This is its ontology (semi-literally, its view of the world, whoooah duude). You can read more in detail how these declarations work at the TinySPARQL documentation.

This information is kept in files separated from code, built in as a GResource in the application binary, and used during initialization to create a database at a location in control of the application:


    let mut store_path = glib::user_data_dir();
    store_path.push("rissole");
    store_path.push("db");

    obj.imp()
        .connection
        .set(tsparql::SparqlConnection::new(
            tsparql::SparqlConnectionFlags::NONE,
            Some(&gio::File::for_path(store_path)),
            Some(&gio::File::for_uri(
                "resource:///com/github/garnacho/Rissole/ontology",
            )),
            gio::Cancellable::NONE,
        )?)
        .unwrap();

So there’s a first advantage right here, compared to other libraries and approaches: The application only has to declare this ontology without much (or any) further code to support supporting code, compare to going through the design/normalization steps for your database design, and. having to CREATE TABLE your way to it with SQLite.

Handling structure updates

If you are developing an application that needs to store a non-trivial amount of data. It often comes as a second thought how to deal with new data being necessary, stored data being no longer necessary, and other post-deployment data/schema migrations. Rarely things come up exactly right at the first try.

With few documented exceptions, TinySPARQL is able to handle these changes to the database structure by itself, applying the necessary changes to convert a pre-existing database into the new format declared by the application. This also happens at initialization time, from the application-provided ontology.

But of course, besides the data structure, there might also be data content that might some kind of conversion or migration, this is where an application might still need some supporting code. Even then, SPARQL offers the necessary syntax to convert data, from small to big, from minor to radical changes. With the CONSTRUCT query form, you can generate any RDF graph from any other RDF graph.

For Rissole, I’ve gone with a subset of the Nepomuk ontology, which does contain much embedded knowledge about the best ways to lay data in a graph database. As such I don’t expect major changes or gotchas in the data, but this remains a possibility for the future, e.g. if we were to move to another emerging ontology, or any other less radical data migrations that might crop up.

So here’s the second advantage, compare to having to ALTER TABLE your way to new database schemas, or handle data migration for each individual table, and ensuring you will not paint yourself into a corner in the future.

Querying data

Now we have a database! We can write the queries that will feed the application UI. Of course, the language to write these in is SPARQL, there are plenty of resources over it on the internet, and TinySPARQL has its own tutorial in the documentation.

One feature that sets TinySPARQL apart from other SPARQL engines in terms of developer experience is the support for parameterized values in SPARQL queries, through a little bit of non-standard syntax and the TrackerSparqlStatement API, you can compile SPARQL queries into reusable statements, which can be executed with different arguments, and will compile to an intermediate representation, resulting in faster execution when reused. Statements are also the way to go in terms of security, in order to avoid query injection situations. This is e.g. Rissole (simplified) search query:


SELECT
    ?urn
    ?title
{
    ?urn a mfo:FeedMessage ;
        nie:title ?title ;
        fts:match ~match .
}

Which allows me to funnel a GtkEntry content right away in the ~match without caring about character escaping or other validation. These queries may also be stored in GResource, and live as separate files in the project tree, and be loaded/compiled early during application startup once, so they are reusable during the rest of the application lifetime:


fn load_statement(&self, query: &str) -> tsparql::SparqlStatement {
    let base_path = "/com/github/garnacho/Rissole/queries/";

    let stmt = self
        .imp()
        .connection
        .get()
        .unwrap()
        .load_statement_from_gresource(&(base_path.to_owned() + query), gio::Cancellable::NONE)
        .unwrap()
        .expect(&format!("Failed to load {}", query));

    stmt
}

...

// Pre-loading an statement
obj.imp()
    .search_entries
    .set(obj.load_statement("search_entries.rq"))
    .unwrap();

...

// Running a search
pub fn search(&self, search_terms: &str) -> tsparql::SparqlCursor {
    let stmt = self.imp().search_entries.get().unwrap();

    stmt.bind_string("match", search_terms);
    stmt.execute(gio::Cancellable::NONE).unwrap()
}

This data is of course all introspectable with the gresource CLI tool, and I can run these queries from a file using the tinysparql query CLI command, either on the application database itself, or on a separate in-memory testing database created through e.g.tinysparql endpoint --ontology-path ./src/ontology --dbus-service=a.b.c.

Here’s the third advantage for application development. Queries are 100% separate from code, introspectable, and able to be run standalone for testing, while the code remains highly semantic.

Inserting and updating data

When inserting data, we have two major pieces of API to help with the task, each with their own strengths:

TrackerSparqlStatement also works for SPARQL update queries.

TrackerResource offers more of a builder API to generate RDF data.

These can be either executed standalone, or combined/accumulated in a TrackerBatch for a transactional behavior. Batches do improve performance by clustering writes to the database, and database stability by making these changes either succeed or fail atomically (TinySPARQL is fully ACID).

This interaction is the most application dependent (concretely, retrieving the data to insert to the database), but here is some links to Rissole code for reference, using TrackerResource to store RSS feed data, and using TrackerSparqlStatement to delete RSS feeds.

And here is the fourth advantage for your application, an async friendly mechanism to efficiently manage large amounts of data, ready for use.

Full-text search

For some reason, there tends to be some magical thinking revolving databases and how these make things fast. And the most damned pattern of all can be typically at the heart of search UIs: substring matching. What feels wonderful during initial development in small datasets soon slows to a crawl in larger ones. See, an index is little more than a tree, you can look up exact items on relatively low big O, lookup by prefix with slightly higher one, and for anything else (substring, suffix) there will be nothing to do but a linear search. Sure, the database engine will comply, however painstakingly.

What makes full-text search fundamentally different? This is a specialized index that performs an effort to pre-tokenize the text, so that each parsed word and term is represented individually, and can be looked up independently (either prefix or exact matches). At the expense of a slightly higher insertion cost (i.e. the usually scarce operation), this provides response times measured in milliseconds when searching for terms (i.e. the usual operation) regardless of their position in the text, even on really large data sets. Of course this is a gross simplification (SQLite has extensive documentation about the details), but I hopefully shined enough light into why full-text search can make things fast in a way a traditional index can not.

I am largely parroting a SQLite feature here, and yes, this might also be available for you if using SQLite, but the fact that I’ve already taught you in this post how to use it in TinySPARQL (declaring nrl:fulltextIndexed on the searchable properties, using fts:match in queries to match on them) does again have quite some contrast with rolling your own database creation code. So here’s another advantage.

Backups and other bulk operations

After you got your data stored, is it enshrined? Are there forward plans to get the data back again out of there? Is the backup strategy cp?

TinySPARQL (and the SPARQL/RDF combo at its core) boldly says no. Data is fully introspectable, and the query language is powerful enough to extract even full data dumps at a single query, if you wished so. This is for example available through the command line with tinysparql export and tinysparql import, the full database content can be serialized into any of the supported RDF formats, and can be either post-processed or snapshot into other SPARQL databases from there.

A “small” detail I have not mentioned so far is the (optional) major network transparency of TinySPARQL, since for the most part it is irrelevant on usecases like Rissole. Coming from web standards, of course network awareness is a big component of SPARQL. In TinySPARQL, creating an endpoint to publicly access a database is an explicit choice of made through API, and so it is possible to access other endpoints either from a dedicated connection or by extending your local queries. Why do I bring this up here? I talked at Guadec 2023 about Emergence, a local-first oriented data synchronization mechanism between devices owned by the same user. Network transparency sits at the heart of this mechanism, which could make Rissole able to synchronize data between devices, or any other application that made use of it.

And this is the last advantage I’ll bring up today, a solid standards-based forward plan to the stored data.

Closing note

If you develop an application that does need to store data, future you might appreciate some forward thinking on how to handle a lifetime’s worth of it. More artisan solutions like SQLite or file-based storage might set you up quickly for other funnier development and thus be a temptation, but will likely decrease rapidly in performance unless you know very well what you are doing, and will certainly increase your project’s technical debt over time.

TinySPARQL wraps all major advantages of SQLite with a versatile data model and query language strongly based on open standards. The degree of separation between the data model and the code makes both neater and more easily testable. And it’s got forward plans in terms of future data changes, backups, and migrations.

As everything is always subject to improvement, there’s some things that could do for a better developer experience:

Query and schema definition files could be linted/validated as a preprocess step when embedding in a GResource, just as we validate GtkBuilder files
TinySPARQL’s builtin web IDE started during last year’s GSoC should move forward, so we have an alternative to the CLI
There could be graphical ways to visualize and edit these schemas
Similar thing, but to visually browse a database content
I would not dislike if some of these were implemented in/tied to GNOME Builder
It would be nice to have a more direct way to funnel the results of a SPARQL query into UI. Sadly, the GListModel interface API mandates random access and does not play nice with cursor-alike APIs as it is common with databases. This at least excludes making TrackerSparqlCursor just implement GListModel.
A more streamlined website to teach and showcase these benefits, currently tinysparql.org points to the developer documentation (extensive otoh, but does not make a great landing page).

Even though the developer experience would be more buttered up, there’s a solid core that is already a leap compared to other more artisan solutions, in a few areas. I would also like to point out that Rissole is not the first instance here, there are also Polari and Health using TinySPARQL databases this way, and mostly up-to-date in these best practices. Rissole is just my shiny new excuse to talk about this in detail, other application developers might appreciate the resource, and I’d wish it became one of many, so Emergence finally has a worthwhile purpose.

Last but not least, I would like to thank Kristi, Anisa, and all organizers at LAS for a great conference.

Embracing sysexts for system development under Silverblue

Due to my circumstances, I might be perhaps interested in dogfooding a larger number of GNOME system/session components on a daily basis than the average.

So far, I have been using jhbuild to help me with this deed, mostly in the form of jhbuild make to selectively build projects out of their git tree. See, there’s a point in life where writing long-winded CLI commands stop making you feel smart and work the opposite way, jhbuild had a few advantages I liked:

I could reset and rebuild build trees without having to remember project-specific meson args.
The build dir did not pollute the source dir, and would be wiped out without any loss.
The main command is pretty fast to type with minimal finger motion for something done so frequently, jh<tab>.

This, combined with my habit to use Fedora Rawhide also meant I did not require to rebuild the world to get up-to-date dependencies, keeping the number of miscellaneous modules built to a minimum.

This was all true even after Silverblue came around, and Florian unlocked the “run GNOME as built from toolbox” achievement. I adopted this methodology, but still using jhbuild to build things inside that toolbox, for the sake of convenience.

Enter sysext-utils

Meanwhile, systemd sysexts came around as a way to install “extensions” to the base install, even over atomic distributions, paving a way for development of system components to happen in these distributions. More recently Martín Abente brought an excellent set of utilities to ease building such sysexts.

This is a great step in the direction towards sysexts as a developer testing method. However, there is a big drawback for users of atomic distributions: to build these sysexts you must have all necessary build dependencies in your host system. Basically, desecrating your small and lean atomic install with tens to hundreds of packages. While for GNOME OS it may be true that it comes “with batteries included”, feels like a very big margin to miss the point with Silverblue, where the base install is minimal and you are supposed to carry development with toolbox, install apps with flatpak, etc etc.

What is necessary

Ideally, in these systems, we’d want:

A toolbox matching the version of the host system.
With all development tools and dependencies installed
The sysexts to be created from inside the toolbox
The sysexts to be installed in the host system
But also, the installed sysexts need to be visible from inside the toolbox, so that we can build things depending on them

The most natural way to achieve both last points is building things so they install in /usr/local, as this will allow us to also mount this location from the host inside the toolbox, in order to build things that depend on our own sysexts.

And last, I want an easy way to manage these projects that does not get in the middle of things, is fast to type, etc.

Introducing gg

So I’ve made a small script to help myself on these tasks. It can be installed at ~/.local/bin along with sysext-utils, and be used in a host shell to generate, install and generally manage a number of sysexts.

Sysexts-utils is almost there for this, I however needed some local hacks to help me get by:

– Since I have these are installed at ~/.local, but they will be run with pkexec to do things as root, the python library lookup paths had to be altered in the executable scripts (sysext-utils#10).
– They are ATM somewhat implicitly prepared to always install things at /usr, I had to alter paths in code to e.g. generate GSettings schemas at the right location (sysext-utils#11).

Hopefully these will be eventually sorted out. But with this I got 1) a pristine atomic setup and 2) My tooling in ~/.local 3) all the development environment in my home dir, 4) a simple and fast way to manage a number of projects. Just most I ever wanted from jhbuild.

This tool is a hack to put things together, done mainly so it’s intuitive and easy to myself. So far been using it for a week with few regrets except the frequent password prompts. If you think it’s useful for you too, you’re welcome.

Goodbye Tracker, hello TinySPARQL and LocalSearch

It is our pleasure (Sam’s and mine) to announce the following project renames:

– The Tracker SPARQL library will now be known as TinySPARQL
– The Tracker Miners indexer will now be known as LocalSearch, or GNOME LocalSearch if you prefer.

This on one hand will meet the increasing popular demand to have the terms “Tracker” and “Miner” avoided (In case the subtleties of language evade you, these have gained some bad connotations over the years), and on the other hand fixes the blunder done when Tracker 1.x conceded its name to the SPARQL library when library and indexer were split for 2.x. This has been a source of confusion wrt the extent and intent of each component, even among GNOME peers.

I am personally hopeful for the rename of TinySPARQL, as there has been a great amount of effort spent in making it a most credible implementation of the SPARQL/RDF/RDFS sets of standards, which feels underutilized with the library being mostly typecast as a gateway to the indexer data. Perhaps the new name will help reach to wider audiences that will be appreciative of a lightweight alternative to the available enterprise-level SPARQL engines.

But I shouldn’t diminish the importance of the rename of LocalSearch, especially at a time when other operating systems are approaching search in orwellian ways in the name of convenience, it becomes important to take a stand. We think the new name is a simple enough guide of our goals and non-goals.

This is not just a superficial rename, however we the maintainers have grown wary of approaches that would require us to bump version to 4.x, and updating all the projects depending on TinySPARQL/LocalSearch in a synchronized manner. So we have gone for a soft approach that will be backwards compatible for the most part, and allow a transitional path. Here’s what you can do to “update”:

In C projects

You are encouraged to use the tinysparql-3.0.pc file to detect the TinySPARQL library. The tracker-sparql-3.0.pc file is still provided for backwards compatibility. In code, use #include <tinysparql.h> to include the SPARQL library headers.

In the future we might consider an API migration scheme that deprecates all API under the old namespace and replaces with with objects in the new namespace. Meanwhile, all API objects and symbols keep the “Tracker” namespace.

From gobject-introspection bindings

You can get a “sneak peek” by using the new namespaces right away, e.g. from python code:


#!/usr/bin/python3

import gi, sys
gi.require_version('Tsparql', '3.0')
from gi.repository import GLib, Gio, Tsparql

try:
    connection = Tsparql.SparqlConnection.new(
        Tsparql.SparqlConnectionFlags.NONE,
        None, # Database location, None creates it in-memory
        Tsparql.sparql_get_ontology_nepomuk(), # Ontology location
        None)

    # Create a resource containing RDF data
    resource = Tsparql.Resource.new(None)
    resource.set_uri('rdf:type', 'nmm:MusicPiece')

    # Create a batch, and add the resource to it
    batch = connection.create_batch()
    batch.add_resource(None, resource)

    # Execute the batch to insert the data
    batch.execute()

    connection.close()

except Exception as e:
    print('Error: {0}'.format(e))
    sys.exit(-1)

Remember to make your project depend on tinysparql-3.0.pc to ensure an up-to-date enough version to have these new binding definitions. Of course backwards compatibility is preserved, and there will be GIR/typelib files for the old namespaces.

In Vala code

There is also a new VAPI file, to describe the TinySPARQL API under the Tsparql namespace:


// valac --pkg tsparql-3.0 tsparql.vala
using Tsparql;

int main (string[] argv) {
  try {
    var conn = Tsparql.Connection.new (Tsparql.ConnectionFlags.NONE, null, Tsparql.get_ontology_nepomuk(), null);

    // Create a resource containing RDF data
    var resource = new Tsparql.Resource(null);
    resource.set_uri ("rdf:type", "nmm:MusicPiece");

    // Create a batch, and add the resource to it
    var batch = conn.create_batch ();
    batch.add_resource (null, resource);

    // Execute the batch to insert the data
    batch.execute ();

    conn.close ();
  } catch (GLib.Error e) {
    stderr.printf ("Error: %s\n", e.message);
  }

  return 0;
}

And same as before, there will be also a VAPI file with the old namespace.

Accessing LocalSearch data

The LocalSearch indexer data will now alternatively be offered via the org.freedesktop.LocalSearch3 D-Bus name, as well as through the org.freedesktop.Tracker3.Miner.Files D-Bus name for backwards compatibility. This can be used as usual with the TinySPARQL D-Bus based connection, e.g. from Python:


#!/usr/bin/python3

import gi, sys
gi.require_version('Tsparql', '3.0')
from gi.repository import Tsparql

conn = Tsparql.SparqlConnection.bus_new(
    'org.freedesktop.LocalSearch3', None)
cursor = conn.query('select ("Hello World" as ?str) {}', None)
cursor.next()
print(cursor.get_string(0))

Command line utilities

The command line utilities for both projects have parted ways. Instead of a single command line utility with extensible subcommands, the will be distinct tinysparql and localsearch command line tools, each providing their own sensible set of subcommands. We trust it will be reasonably low effort to adapt to these changes, and reasonably intuitive.

Test utilities

In order to test search-related code, some projects were using the tracker-testutils-3.0.pc file to locate the tracker-sandbox helper script to help them implement a test harness for search features. This has now become more of a first class citizen as a localsearch test-sandbox subcommand. There is no longer a pkgconf file, we recommend to detect the presence of the localsearch command in build systems.

A note to distributors

The repositories for these projects are https://gitlab.gnome.org/GNOME/tinysparql and https://gitlab.gnome.org/GNOME/localsearch, and the tarballs distributed will have tinysparql-x.y.z and localsearch-x.y.z filenames (starting with tinysparql-3.8.alpha and localsearch-3.8.alpha which are already out of the door). We are understanding that a project rename will involve a more changes on your end(s) than ours, but we will greatly appreciate if you help us establish the new identities and follow up on the rename.

On CVE-2023-43641

As you might have read already about, there was a vulnerability in libcue that took a side gig demonstrating a sandbox escape in tracker-miners.

The good news first so you can skip the rest, this is fixed in the tracker-miners 3.6.1/3.5.3/3.4.5/3.3.2 versions released on Sept 28th 2023, about a couple weeks ago. The relevant changes are also in the tracker-miners-3.2 and tracker-miners-3.1 branches, but I didn’t wind up to doing releases for those. If you didn’t update yet, please do so already.

Background

The seccomp jail in the metadata extractor is far from new, it was introduced during development for tracker-1.12 in 2016 (yup, tracker, before the tracker-miners indexers spun off the monolithic package). Before that, the file metadata extraction task was already split as a separate process from filesystem structure indexing for stability and resource consumption reasons.

Seccomp comes in different sizes, it allows from creating very permissive sandboxes where system calls are allowed unless blocked by some rule, to creating paranoid sandboxes where every system call is disallowed by default (with a degree of harshness of choice) unless allowed by some rule. Tracker took the most restrictive approach, making every system call fail with SIGSYS by default, and carving the holes necessary for correct operation of the many dependencies within the operational parameters established by yours truly (no outside network access, no filesystem write access, no execution of subprocesses). The list of rules is quite readable in the source code, every syscall there has been individually and painstakingly collected, evaluated, and pondered for inclusion.

But the tracker-extract process as a whole still had to do a number of things that could not work under those parameters:

It wanted to read some settings, and GSettings/Dconf will anyways require r/w access to /run/user/.../dconf/user, or complain loudly.
There is infrastructure to write detailed error reports (visible through the tracker3 status commandline tool) for files where metadata extraction failed. These error reports have helped improve tracker-miners quality and are a valuable resource.
The tracker-extract process depends on GStreamer, and is a likely candidate for being the first to fork and indirectly rewrite the GST registry cache file on gst_init().

While in an ideal world we could work our way through highly specific seccomp rules, in practice the filters based on system call arguments are pretty rudimentary, e.g. not allowing for (sub)string checks. So we cannot carve out allowances for these specific things, not without opening the sandbox significantly.

Rules for thee, but not for me. Given the existing tracker-extract design with threads dedicated to extraction, the pragmatic approach taken back then was to make the main thread (in charge already of dispatching tasks to those dedicated threads) unrestricted and in charge of those other duties, and applying the seccomp ruleset to every thread involved in dealing with the data. While not perfectly airtight, the sandbox in this case is already a defense in depth measure, only applicable after other vulnerabilities (most likely, in 3rd party libraries) were already successfully exploited, all that can be done at that point is to deter and limit the extent of the damage, so the sandbox role is to make it not immediately obvious to make something harmful without needle precision. While the “1-click exploit in GNOME” headline is surely juicy, it had not happened for these 7 years the sandbox existed, despite our uncontrollable hose of dependencies (hi, GStreamer), or our other dependencies with a history of CVEs (hi Poppler, libXML et al), so the sandbox has at least seemed to fulfill the deterrence role.

But there is that, I knew about the potential of a case like this and didn’t act in time. I personally am not sure this is a much better defense than “I didn’t know better!”.

The serenpidity

The discovery of CVE-2023-43641 was doubly serendipitous, on one hand Kevin Backhouse from GitHub Security Lab quickly struck gold, unwittingly managing to corrupt data in a way that it was the unconstrained thread finding the trap being layered. To me, fortune struck by dealing with security issues “the good way”. For something that normally goes as pleasant as a tooth extraction, I must give Kevin five stars in management and diligence.

And as said, this didn’t catch me entirely by surprise. I even talked about the potential case with co-maintainer Sam Thursfield as recently as Fosdem this year, and funnily it was high up in my to-do list, only dropped from 3.6 schedule due to time restrictions. Even though the requirements (to reiterate: no indiscriminate network access, no write access, no execution) and the outliers didn’t budge, there were ideas floating on how to handle them. While the fixes were not as straightforward as just “init seccomp on main(), duh”, there was no moments of hesitation in addressing them, you can read the merge request commits for the details on how those were handled.

The end result is a tracker-extract main() function that pretty literally looks like:


main ()
{
  lower_process_priorities();
  init_seccomp();
  do_main();
}

I.e. the seccomp jail will affect the main thread and every other thread spawned from it, while barely extending the seccomp ruleset (mostly, to make some syscalls fail softly with error codes, instead of hard through SIGSYS), and not giving up in any of the impositions set.

The aftermath

Of course, one does not go willy-nilly adding paranoid sandbox restrictions to a behemoth like GStreamer and call it an early evening, also C library implementations will call different sets of system calls in different architectures and compile-time flags, transitioning to a fully sandboxed process still means adding new restrictions to an non-trivial amount of code, and will definitely stir the SIGSYS bug report pot for some time to come due to legit code hitting the newly set restrictions. This is already being the case, a thank you to the early reporters. The newly confined code is very trotted, so optimistically the grounds will settle back to relatively boring after another round of releases or two.

Reflections

While we could start splitting hairs about the tracker-extract process being allowed to do this or that while it shouldn’t, I think the current sandbox will keep tracker-miners itself out of the way of CVEs for a long long time after this *knocks on wood*. I’m sure this will result in more pairs of eyes looking, and patches flowing and hahaha who am I kidding. I’ll likely think/work alone through ways to decrease further the exposed surface on future cycles out of the spotlight, after the waves of armchair comments were long waded. For now, any vulnerability exploit do already have a much harder time at succeeding in evil.

I’d like to eventually see some tighter control on the chain of dependencies that tracker-miners has, maintaining a paranoid seccomp sandbox has a very high friction with running arbitrarily large amounts of code out of your control, and I’m sure that didn’t just get better. There are reasons we typically commit to only supporting the latest and greatest upstream, we cannot decently assure forward compatibility of the seccomp sandbox.

And regardless, now more than ever I am happy with relying on third party libraries to deal with parsers and formats. It still is the only way to maintain a generic purpose indexer and keep a piece of your sanity. To the people willing to step up and rewrite prime Rust fodder, there’s gold if you pull from the thread of tracker-extract module dependencies.

Conclusion

I’d personally want to apologize to libcue maintainers, as this likely blew out of proportion to them due to my inaction. Anything less dangerous looking than “1 click exploit” would have likely reduced the severity of the issue and the media coverage significantly.

And thanks again to Kevin for top notch management of the issue. My personal silver lining will be that this was pretty much a triumph of “enough eyes make bugs shallow”.

Getting the best of tablet pads

In case they needed introduction, pads are these collections of buttons and tactile sensors (ring or strip shaped) most typically found along the side of drawing tablets. These devices will be the topic of today.

Picture of someone using a pad — The Wacom ExpressKey Remote is a rare case of disembodied pad. Picture from Wacom.

A bit of context

The concept behind pads is simple, a collection of triggers with no specific action associated, so users can configure the actions that best suit their tastes and setups. Other quality differences like the number of pressure levels aside, entry-level consumer tablets often have few buttons, more advanced tablets typically get more buttons, and “modes” that multiply the amount of available mappable actions. It you are a pro, it is likely that you want quick access to more features, and so on.

When this kind of devices came around back in many years ago, and support for them was added in X11, they sat on a rough spot. As a X11 input driver, you either tell the X server that you are adding a device with buttons and valuators that drives the pointer sprite (e.g. a mouse), or you are adding a device with keycodes and levels (e.g. keyboards) subject to keymap translations. A pad device is notably neither of those things, the way to shove the square peg on the round hole was to make them pointers, thus they would send button and scroll events directed towards the pointer position, they would just not move the pointer.

We are here also talking about a time that applications had (many still do) a hard time recognizing input from different source devices, to them a button press is just a button press, so those pads as-is would just click and scroll on things, quite far from the customizable actions promised land. The pragmatic approach taken in GNOME (Shell now, settings-daemon back in the day) was grabbing those devices at the session level, not letting them be seen by applications and converting them into something immediately useful. Given the limit of choices, these pad button and scroll events got converted to keycombos.

This at least kept two promises, it’s a) customizable and b) universal. Although it felt like falling short for long, since keycombos have some implicit problems associated:

A small set of keycombos is close to universal, but many are not. So the typical choice is either universal but seldom used settings, or highly app-specific keycombos that are latently bollocks in any other application.
Keycombos are also subject to developer/translator changes, etc. You might find your keycombos no longer hold if you change locale, or environment, or update your app.

But along came Wayland! Pads would not need to be subject to the limitations of X server driver API and could become their own entity with their own semantics, and indeed they became a first class citizen of the Wayland tablet protocol available in wayland-protocols. This allowed us to let pad events be sent to clients, have it be either correctly interpreted in clients/toolkits as input from a pad device or safely ignored, instead of misinterpreted as another kind of input. The Wayland protocol opened even further the possibilities like being able to provide readable strings for compositor feedback. I blogged about this… in 2016.

What since?

A drawback from the previous state of things and the massive change of approach is that no client was prepared to be in charge of pad events, so any integral support for pads had to start from the very bottom. But also, pads are most useful in very specific setups and applications, that at the time were nowhere near having Wayland support. In the GTK world, the requisites for it to work were:

Using GTK >= 3.22
Using GAction for application actions
Using GtkPadController to hook actions

While that looks simple enough, the flagship applications were barely taking bits at the first step at the time, and after that, the second was not a small undertaking either with applications with hundreds of options.

Some weeks ago, I was fortunate to attend the Wilber Week at Amsterdam, and was very glad to meet my GIMP friends again (and Inkscape/Blender folks!) despite my week-long stint with a commuter’s life after so long.

Picture of a baked Wilber. — A moment I missed, immortalized at a splash screen.

Over there, I learned that not long ago, GIMP did finally port away from GtkAction into GAction, finally lifting the last barrier to get pad actions working as envisioned, in one app at least :).

Since this is something I’ve been touting for so long to GIMP maintainers, of course I had to volunteer for the task, now up in a merge request.

Of course, most of the gist there is the configuration UI, and the configuration serialization/deserialization, the conversion to actions is still done using GtkPadController created from the configuration.

While the UI could get more polish by using libwacom (e.g. nicer names, or pre-populating buttons/rings/strips in the available modes), it also needs to work with tablets not recognized by libwacom, or (maybe someday) in other platforms. The important part is that the application may define the action associated with a pad feature, and give it a friendly name. These do also show in the GNOME Shell pad OSD:

Which in this case works as a cheat sheet of sorts, since there are pads with extreme combinations of buttons/modes (e.g. the Express Key Remote has 17 buttons by 3 modes), and per-app mapping does not make things easier to remember on itself.

Conclusion

An improved form of pad support has been stalled for too long, and it is amazing that it can already start to roll out. It looks like Inkscape is also ripe for such improvements, and during Wilber Week I was glad to hear a very positive attitude towards Wayland from Blender developers. Maybe the tides are turning to make this often neglected device shine?

A call to GNOME app developers

While these flagship applications are key for a major leap in support for pad devices, perhaps your application can do a little bit to help, if you see your app as possible part of a designer/artist/etc workflow. Using GtkPadController is rather easy with a fixed set of actions, so exposing an small set of useful actions could extend the usefulness of pad devices even further. This is for example what Nautilus does. You can also draw inspiration from the “Paint” GTK demo.

Tracker 3.x, a retrospect.

Time files, for better or for the worst. The last time I bored you with ramblings on this blog was more than 2 years ago already, prepping up for Tracker 3.0. Since I’m sure you don’t need a general catch up about these last 2 years, let’s stay on that same subject.

Nowadays, we are very close to GNOME 43, and an accompanying 3.4.0 release of Tracker SPARQL library and data miners, that is 4 minor releases ahead! What happened since then? Most immediately after that previous blog post, the 3.0 release did roll in, the uncertainty behind all major structural changes vanished, and the transition could largely be called a success: The overhauled internals for complete SPARQL support stood ground without large regressions; the promises of portals and data isolation delivered and to this day remains unchallenged (except for spare requests to let more pieces of metadata through); the increased genericity and versatility kept fostering further improvements.

Overall, there’s been no major regrets, and we are now sitting comfortably in 3.x. And that was all good, since we could use all that time not fixing fallout in keeping up with the improvements. Let’s revisit what happened since.

Tracker (SPARQL library)

Testing

Ever since 3.0, test coverage has been growing fairly steadily. A very good thing about SPARQL and Tracker API is that it is very “circular” altogether, every data format used in information exchange must be both parsed and produced by the SPARQL implementation, every external request made is also a external request it should be able to serve, and so on.

This makes it fairly easy to reach all corners in testing coverage, despite the involved complexity. It has been also quite a rule for some time that SPARQL language compliance fixes come with tests. The accumulated result over the years is a fairly large collection of tests to the internal machinery, there’s over 330 subtests already for the SPARQL language alone.

To the day of writing, Tracker stands at 76.4% coverage (Up-to-date for the posterity: ), we are getting very good at catching deviations from how the SPARQL library should behave, and decently good at catching how it should not misbehave. Simply following W3C standards and recommendations pays off here too, since it settles the direction and resolves most matters about what the right behavior is.

Developer Experience

3.0 marked the point where being able to create private Tracker databases with custom data models transitioned from an easter egg to a first-class feature. This also means that developers can now write an ontology (or data model, or schema, pick a name) that suits their data like a glove, instead of using the default Nepomuk one, which is well-trodden and literally written by academics, but will be overkill for the needs of individual applications.

And of course there is room for failure in writing those ontologies. Last year, GSoC student Abanoub Ghadban worked hard on “breaking it”, polishing the experience and trying to produce helpful warnings, so developer mistakes are easily visible and solvable. The CLI tools provided facilitate these checks, e.g. creating a temporary endpoint that loads the ontology being edited, and running queries against it.

The documentation front got also steady improvements, the API itself is 100% documented and there is now a fully fleshed out SPARQL tutorial. Also, drawing inspiration from SQLite, there are now miscellaneous docs on some implementation details like limits, a discussion on the security considerations of the implemented specs, or extensions and interpretations of the SPARQL spec. The examples have been also modernized, and written in Python and Javascript in addition.

A great blunder of how Tracker tended to be used in applications was having SPARQL mixed in code, or worse, built through string manipulation. The latter got better API-wise in the past with compiled statements, but the mix of code and database logic was still prevalent. Since 3.3, there is support for loading and creating compiled statements from query files located in GResources. This neatly addresses the separation of queries and code, while keeping them indissoluble to the produced binary, while preserving the benefits of compiled statements (compile once, run many times).

Performance

One of the benefits of the API provided by Tracker is that it gives a great amount of leeway in internal refactors without altering the surface. We are largely just backwards-compatibility constrained by the underlying database format. This has allowed for further optimizations to happen under the hood since 3.0, database updates both in terms of database changes and queries. Databases also take now less space, specially in the presence of many blank nodes.

Although the greatest performance boost for data producers can be obtained through the TrackerBatch API (Since 3.1). Prior to it, normally a TrackerResource would be used build RDF data, then used to produce a SPARQL update, and the SPARQL update parsed to generate and apply the RDF changes. This new API can efficiently traverse a series of TrackerResources (describing RDF already) and turn them to database modifications skipping the SPARQL middle man altogether.

In the git tree, there now is a small utility to benchmark certain uses of Tracker API, let’s see how the output looks on a modern Intel i7 for an in-memory database:

Batch size: 5000, Individual test duration: 30 sec
Opening in-memory database…
                           Test		Elements	Elems/sec	Min         	Max         	Avg
   Resource batch update (sync)		6169883.801	205662.793	4.292 usec	5.615 usec	4.862 usec
     SPARQL batch update (sync)		2430664.747	81022.158	11.889 usec	14.255 usec	12.342 usec
   Resource modification (sync)		4440988.603	148032.953	6.588 usec	8.438 usec	6.755 usec
  Resource insert+delete (sync)		3033137.552	101104.585	9.689 usec	12.669 usec	9.891 usec
Prepared statement query (sync)		8566182.714	285539.424	3.000 usec	745.000 usec	3.502 usec
            SPARQL query (sync)		1329076.956	44302.565	21.000 usec	189.000 usec	22.572 usec

After the usual disclaimer that this benchmark utility greatly relies on CPU and disk characteristics, and your mileage may vary, there’s a few things to highlight here:

Using modern APIs always pays off. Inserting data directly from TrackerResource (Resource batch update) is 2.5x faster than inserting data through SPARQL updates (SPARQL batch update), and querying through prepared statements (Prepared statement query) is ~6.5x faster than repeating SPARQL queries (SPARQL query).
Even though the tests are run synchronously on the main loop, queries can be greatly parallelized, so the actual throughput on a modern machine will be much higher in reality. Updates are single-threaded though.
Used the right way, Tracker code is never in the hot paths, merits there go to SQLite and production of data itself. You are expected to get results that are in the same ballpark than using SQLite directly, given similar volumes and layout of data.

But this snapshot does not fully highlight the improvements done. For a reference baseline, a backported version of this benchmark on the same computer over Tracker 3.0.x gives:

Batch size: 5000, Individual test duration: 30 sec
Opening in-memory database…
                           Test		Elements	Elems/sec	Min         	Max         	Avg
     SPARQL batch update (sync)		1387346.192	46244.873	17.035 usec	24.290 usec	21.624 usec
   Resource modification (sync)		259923.682	8664.123	49.863 usec	122.236 usec	115.418 usec
  Resource insert+delete (sync)		707638.539	23587.951	41.593 usec	73.702 usec	42.395 usec
Prepared statement query (sync)		7729898.742	257663.291	3.000 usec	527.000 usec	3.881 usec
            SPARQL query (sync)		888896.319	29629.877	31.000 usec	180.000 usec	33.750 usec

Looking past the lack of TrackerBatch there, it’s still easy to see there has been massive improvements pretty much all over the board. As we already encourage, using the latest Tracker gives you the best Tracker.

Data serialization

SPARQL was very much thought out with the task of storing RDF data, querying into RDF data, and moving RDF data around. From the tiniest resource/value existence check, to full content dumps, everything is one query away.

What we were lacking is a consistent way to convert all that data into something that could easily be piped through, saved, processed, etc. To make this task easy, there is now API to serialize and deserialize data between a Tracker database and the popular RDF file formats. This is performed efficiently, with flat RAM usage on both ends, it is even possible to pipe these APIs with the RDF data never existing anywhere at once during the process.

And since this serialization to RDF formats is a builtin feature of HTTP endpoints, it has allowed us to level up our support for these. A Tracker HTTP endpoint is now entirely compliant and indistinguishable from the larger players.

The most immediately useful usage of this serialization support is in the CLI tools, the import and export commands now use this API and can deal with these formats. But what is this for? Is this driven by a level of completionism that borders the sickness? Well, yes, but there are of course plans around these features, more on that later.

Tracker Miners

Performance

You might think the SPARQL library improvements above would be the larger improvement the filesystem miner could get, and you would be wrong.

Part of the raison d’être of a filesystem indexer is to stay up-to-date with filesystem changes. In the GNOME world, this catching up is usually done through a GFileMonitor, which provides a GLib-friendly way to do the dirty job of setting up an tracking an inotify handle to track changes on an individual directory for you. What is wrong with that? Nothing, unless you do that at a large scale like indexers do. Each of those GFileMonitors is backed by a pollable FD, and a GSource wrapping it, and iterating a GMainContext that has thousands of GSources attached to it massively, thoroughly sucks.

Is this a case of Tracker abusing a perfectly fine GLib API? Or on the contrary is this a case of bad GLib API design? I will let you debate on that, as I am unclear myself.

The first solution to alleviate that (since 3.1) was delegating file monitoring to a separate thread, so the GMainContext that is expensive to iterate only affects file monitors, as opposed to everything that goes on. Later on, FANotify finally gained the missing features that made it suitable for indexers (not requiring CAP_SYS_ADMIN was one of them) and Tracker Miners got an implementation for it (since 3.3). Most notably here, with this kernel API it is only necessary to poll a single FD to receive events for all FANotify marks set on the filesystem.

In what it sounds like a case of miscommunication between kernel developers developing independent new features that didn’t mix well, unfortunately it’s not possible nowadays to mix the bleeding edge in file monitors (FANotify) with the bleeding edge in filesystems (btrfs), for these (and other) cases Tracker Miners will still fallback on plain glib/inotify. Hopefully the situation will be resolved at some point.

At 3.1, the filesystem indexer also implemented flow control mechanisms that allowed its RAM usage to stay mostly flat independently of the filesystem size and layout. At the peak of its activity, tracker-miner-fs-3 uses 30-40MB here (per gnome-system-monitor), and idles at 5MB. Needless to say, it is also many times faster than its past 3.0 self.

But all this was about tracker-miner-fs-3, the daemon in charge of monitoring filesystem changes and mirroring file/folder structure into the database. What about tracker-extract-3, the daemon in charge of nitty-gritty file metadata extraction? When this step needs to happen (say, newly indexed or modified files). It’s for all accounts expensive, now by the sheer magic of everything else shrinking, it comparatively only got worse. There is a reason we avoid that from happening frequently at all costs.

But what is slow there? Roughly speaking, it should be just a loop going through files, getting the metadata and inserting it, and that should be fucking fast as per the benchmarks above, right? Right, the problem is in the “getting the metadata” step. This will wildly fluctuate depending on the files scattered in the filesystem, their mimetypes, and the libraries used to extract their metadata. The plain text extractor or the in-tree MP3 extractor are capable of opening, extracting metadata from, and closing multiple thousands of files per second. All the external libraries used for metadata extraction (yes, all of them) are slower, ranging from several times over to up to 4 orders of magnitude, also depending on the input files (I curated some infernal PDFs). The worst offenders are Poppler, GStreamer and libtiff.

As it’s evident here (You don’t need to believe me, add TRACKER_DEBUG=statistics to /etc/environment and reset/restart the miner), most libraries dealing with files and formats optimize for library resources being long lived across an application lifetime, while optimizing the creation and disposal of these library resources is often overlooked. The metadata extraction daemon faces that hard fact file after file, so its slowness is just a reflection of the slowness of these libraries in setting themselves up. If, after all of this, someone thinks the filesystem indexer is slow, that is where the money is.

Extending and maintaining metadata

Although the focus has been mainly on making things work reliably, rather than going crazy with extending the metadata stored (yet), a point worth noting here is last year’s GSoC work from student Nishit Patel, who worked on indexing creation time (in the filesystems that support/enable it), and allowing for its search all across the stack.

We also got support for a number of game file formats (mainly, retro ones), which GNOME Games (now Highscore) readily made use of. LOL, jk.

Handling and following failures

Whenever a file is broken or corrupted, or a 3rd party library crashes or produces a syscall that is caught up by seccomp, the tracker-extract-3 daemon would quit (with varying degrees of gracefulness) and be taught on the next restart to avoid the file that triggered this situation. This is not precisely new behavior, what is new though is that these failures are now recorded and can be easily inspected over the CLI with tracker3 status. Most bugs we receive about broken extraction are reactive (e.g. “why does Music not show this file?”), this would allow for a more proactive approach to fixing metadata extraction bugs, if users happen to look there and cooperate.

There is also a slight possibility that extraction bugs are due to Tracker itself, but these are largely a think of the past.

Coming up next…

I very much cheered when I learnt of the “Local first” initiative. In fact, I so much anticipated it that I literally anticipated it. Development of the serialization APIs started sometime around the last year, with a plan to provide facilities to transparently and neatly synchronize RDF data across instances in multiple machines owned by the same user.

Who wants that? Certainly not the filesystem indexer. However, there’s indeed a desire to avoid reliance on third party services for user sensitive data like their own health information, chat logs, or bookmarked sites. With some truckloads of optimism, I would hope that this becomes a cornerstone of that goal, for applications under the GNOME umbrella that need to deal with a non-trivial amount of data.

How would that work? What do we need to get there? We need a query language that supports it (check, duh), a data model that can handle the different patterns that might emerge in synchronizing data (check), a way to make these machines talk between each other (check), and a way to diff missing data (check). All the pieces are really set, so what is missing is putting those together, of course drawing inspiration from Christian Hergert’s Bonsai to make machines discover each other.

And of course there is still very much a desire to keep the heart of content applications compelling and relevant. There’s still opportunities to further extend and link the metadata stored by the filesystem indexer, perhaps with the help of the actual semantic web that lives out there. We already have a number of universal identifiers available (musicbrainz tags, IMDB IDs, game rom IDs) to interrelate and cross-reference data.

Now that the codebase features are settled and working well, we can start thinking on new fancy features, stay tuned for the next installment of this series in 2024, when I talk about Tracker 3.8.0, or perhaps Tracker 3.5.20. If you made it this far, you have my appreciation, until the next time!

Tracker 2.99.1 and miners released

TL;DR: $TITLE, and a call for distributors to make it easily available in stable distros, more about that at the bottom.

Sometime this week (or last, depending how you count), Tracker 2.99.1 was released. Sam has been doing a fantastic series of blog posts documenting the progress. With my blogging frequency I’m far from stealing his thunder :), I will still add some retrospect here to highlight how important of a milestone this is.

First of all, let’s give an idea of the magnitude of the changes so far:

[carlos@irma tracker]$ git diff origin/tracker-2.3..origin/master --stat -- docs examples src tests utils |tail -n 1 788 files changed, 20475 insertions(+), 66384 deletions(-)

[carlos@irma tracker-miners]$ git diff origin/tracker-miners-2.3..origin/master --stat -- data docs src tests | tail -n 1 354 files changed, 39422 insertions(+), 6027 deletions(-)

What did happen there? A little more than half of the insertions in tracker-miners (and corresponding deletions in tracker) can be attributed to code from libtracker-miner, libtracker-control and corresponding tests moving to tracker-miners. Those libraries are no longer public, but given those are either unused or easily replaceable, that’s not even the most notable change :).

The changes globally could be described as “things falling in place”, Tracker got more cohesive, versatile and tested than it ever was, we put a lot of care and attention to detail, and we hope you like the result. Let’s break down the highlights.

Understanding SPARQL

Sometime a couple years ago, I got fed up after several failed attempts at implementing support for property paths, this wound up into a rewrite of the SPARQL parser. This was part of Tracker 2.2.0 and brought its own benefits, ancient history.

Getting to the point, having the expression tree in the new parser closely modeled after SPARQL 1.1 grammar definition helped getting a perfect snapshot of what we don’t do, what we don’t do correctly and what we do extra. The parser was made to accept all correct SPARQL, and we had an `_unimplemented()` define in place to error out when interpreting the expression tree.

But that also gave me something to grep through and sigh, this turned into many further reads of SPARQL 1.1 specs, and a number of ideas about how to tackle them. Or if we weren’t restricted by compatibility concerns, as for some things we were limited by our own database structure.

Fast forward to today, the define is gone. Tracker covers the SPARQL 1.1 language in its entirety, warts and everything. The spec is from 2013, we just got there 7 years late :). Most notably, there’s:

Graphs: In a triple store, the aptly named triples consist of subject/predicate/object, and they belong within graphs. The object may point to elements in other graphs.
In prior versions, we “supported graphs” in the language, but those were more a property of the triple’s object. This changes the semantics slightly in appearance but in fundamental ways, eg. no two graphs may have the same triple, and the ownership of the triple is backwards if subject and object are in different graphs.

Now the implementation of graphs perfectly matches the description, and becomes a good isolated unit to let access in the case of sandboxing.

We also support the ADD/MOVE/CLEAR/LOAD/DROP wholesome operations on graphs, to ease their management.
Services: The SERVICE syntax allows to federate portions of your query graph pattern to external services, and operate transparently on that data as if local. This is not exactly new in Tracker 2.99.x, but now supports dbus services in addition to http ones. More notes about why this is key further down.
New query forms, DESCRIBE/CONSTRUCT: This syntax sits alongside SELECT. DESCRIBE is a simple form to get RDF triples fully describing a resource, CONSTRUCT is a more powerful data extraction clause that allows serializing arbitrary portions of the triple set, even all of it, and even across RDF schemas.

Of all 11 documents from the SPARQL recommendations, we are essentially missing support for HTTP endpoints to entirely pass for a SPARQL 1.1 store. We obviously don’t mean to compete wrt enterprise-level databases, but we are completionists and will get to implementing the full recommendations someday :).

There is no central store

The tracker-store service got stripped of everything that makes it special. You were already able to create private stores, making those public via DBus is now one API call away. And its simple DBus API to perform/restore backups is now superseded by CONSTRUCT and LOAD syntax.

We have essentially democratized triple stores, in this picture (and a sandboxed world) it does not make sense to have a singleton default one, so the tracker-store process itself is no more. Each miner (Filesystem, RSS) has its own store, made public on its DBus name. TrackerSparqlConnection constructors let you specifically create a local store, or connect to a specific DBus/HTTP service.

No central service? New paradigm!

Did you use to store/modify data in tracker-store? There’s some bad news: It’s no longer for you to do that, scram from our lawn!

You are still much welcome to create your own private store, there you can do as you please, even rolling something else than Nepomuk.

But wait, how can you keep your own store and still consume data indexed by tracker miners? Here comes the SERVICE syntax to play, allowing you to deal with miner data and your own altogether. A simple hypothetical example:


# Query favorite files
SELECT ?u {
  SERVICE <dbus:org.freedesktop.Tracker3.Miner.Files> {
    ?u a nfo:FileDataObject
  }
  ?u mylocaldata:isFavorite true
}

As per the grammar definition, the SERVICE syntax can only be used from Query forms, not Update ones. This is essentially the language conspiring to keep a clear ownership model, where other services are not yours to modify.

If you are only interested in accessing one service, you can use tracker_sparql_connection_bus_new and perform queries directly to the remote service.

A web presence

It’s all about appearance these days, that’s why newscasters don’t switch the half of the suit they wear. A long time ago, we used to have the tracker-project.org domain, the domain expired and eventually got squatted.

That normally sucks on itself, for us it was a bit of a pickle, as RDF (and our own ontologies) stands largely on URIs, that means live software producing links out of our control, and it going to pastes/bugs/forums all over the internet. Luckily for us, tracker-project.org is a terrible choice of name for a porn site.

We couldn’t simply do the change either, in many regards those links were ABI. With 3.x on the way, ABI was no longer a problem, Sam did things properly so we have a site, and a proper repository of ontologies.

Nepomuk is dead, long live Nepomuk

Nepomuk is a dead project. Despite its site being currently alive, it’s been dead for extended periods of time over the last 2 years. That’s 11.5M EUR of your european taxpayer money slowly fading away.

We no longer consider we should consider it “an upstream”, so we have decided to go our own. After some minor sanitization and URI rewriting, the Nepomuk ontology is preserved mostly as-is, under our own control.

But remember, Nepomuk is just our “reference” ontology, a swiss army knife for whatever a might need to be stored in a desktop. You can always roll your own.

Tracker-miner-fs data layout

For sandboxing to be any useful, there must be some actual data separation. The tracker-miner-fs service now stores things in several graphs:

tracker:FileSystem
tracker:Audio
tracker:Video
tracker:Documents
tracker:Software

And commits further to the separation between “Data Objects” (e.g. files) and “Information Elements” (e.g. what its content represents). Both aspects of a “file” still reference each other, but simply used to be the same previously.

The tracker:FileSystem graph is the backbone of file system data, it contains all file Data Objects, and folders. All other graphs store the related Information Elements (eg. a song in a flac file).

Resources are interconnected between graphs, depending on the graphs you have access to, you will get a partial (yet coherent) view of the data.

CLI improvements

We have been doing some changes around our CLI tools, with tracker shifting its scope to being a good SPARQL triple store, the base set of CLI tools revolves around that, and can be seen as an equivalent to sqlite3 CLI command.

We also have some SPARQL specific sugar, like tracker endpoint that lets you create transient SPARQL services.

All miner-specific subcommands, or those that relied implicitly on their details did move to the tracker-miners repo, the tracker3 command is extensible to allow this.

Documentation

In case this was not clear, we want to be a general purpose data storage solution. We did spend quite some time improving and extending the developer and ontology documentation, adding migration notes… there’s even an incipient SPARQL tutorial!

There is a sneak preview of the API documentation at our site. It’s nice being able to tell that again!

Better tests

Tracker additionally ships a small helper python library to make it easy writing tests against Tracker infrastructure. There’s many new and deeper tests all over the place, e.g. around new syntax support.

Up next…

You’ve seen some talk about sandboxing, but nothing about sandboxing itself. That’s right, support for it is in a branch and will probably be part of 2.99.2. Now the path is paved for it to be transparent.

We currently are starting the race to update users. Sam got some nice progress on nautilus, and I just got started at shaving a yak on a cricket.

The porting is not completely straightforward. With few nice exceptions, a good amount of the tracker code around is stuck in some time frozen “as long as it works”, cargo-culted state. This sounds like a good opportunity to modernize queries, and introduce the usage of compiled statements. We are optimist that we’ll get most major players ported in time, and made 3.x able to install and run in parallel in case we miss the goal.

A call to application developers

We are no longer just “that indexer thingy”. If you need to store data with more depth than a table. If you missed your database design and relational algebra classes, or don’t miss them at all. We’ve got to talk :), come visit us at #tracker.

A call to distributors

We made tracker and tracker-miners 3.x able to install and run in parallel to tracker 2.x, and we expect users to get updated to it over time.

Given that it will get reflected in nightly flatpaks, and Tracker miners are host services, we recommend that tracker3 development releases are made available or easy to install in current stable distribution releases. Early testers and ourselves will thank you.

Gnome-shell Hackfest 2019 – Day 3

As promised, some late notes on the 3rd and last day of the gnome-shell hackfest, so yesterday!

Some highlights from my partial view:

We had a mind blowing in depth discussion about the per-crtc frame clocks idea that’s been floating around for a while. What started as “light” before-bedtime conversation the previous night continued the day after straining our neurons in front of a whiteboard. We came out wiser nonetheless, and have a much more concrete idea about how should it work.
Georges updated his merge request to replace Cogl structs with graphene ones. This now passes CI and was merged \o/
Much patch review happened in place, and some other pretty notable refactors and cleanups were merged.
The evening was more rushed than usual, with some people leaving already. The general feeling seemed good!
In my personal opinion the outcome was pretty good too. There’s been progress at multiple levels and new ideas sparked, you should look forward to posts from others :). It was also great to put a face to some IRC nicks, and meet again all the familiar ones.

Kudos to the RevSpace members and especially Hans, without them this hackfest couldn’t have happened.

Gnome-shell Hackfest 2019 – Day 2

Well, we are starting the 3rd and last day of this hackfest… I’ll write about yesterday, which probably means tomorrow I’ll blog about today :).

Some highlights of what I was able to participate/witness:

Roman Gilg of KDE fame came to the hackfest, it was a nice opportunity to discuss mixed DPI densities for X11/Xwayland clients. We first thought about having one server per pixel density, but later on we realized we might not be that far from actually isolating all X11 clients from each other, so why stop there.
The conversation drifted into other topics relevant to desktop interoperation. We did discuss about window activation and focus stealing prevention, this is a topic “fixed” in Gnome but in a private protocol. I had already a protocol draft around which was sent today to wayland-devel ML.
A plan was devised for what is left of Xwayland-on-demand, and an implementation is in progress.
The designers have been doing some exploration and research on how we interact with windows, the overview and the applications menu, and thinking about alternatives. At the end of the day they’ve demoed to us the direction they think we should take.
I am very much not a designer and I don’t want to spoil their fine work here, so stay tuned for updates from them :).
As the social event, we had a very nice BBQ with some hackerspace members. Again kindly organized by Revspace.

Gnome-shell Hackfest 2019 – Day 1

So today kickstarted the gnome-shell hackfest in Leidschendam, the Netherlands.

There’s a decent number of attendants from multiple parties (Red Hat, Canonical, Endless, Purism, …). We all brought various items and future plans for discussion, and have a number of merge requests in various states to go through. Some exciting keywords are Graphene, YUV, mixed DPI, Xwayland-on-demand, …

But that is not all! Our finest designers also got together here, and I overheard they are discussing usability of the lock screen between other topics.

This event wouldn’t have been possible without the Revspace hackerspace people and specially our host Hans de Goede. They kindly provided the venue and necessary material, I am deeply thankful for that.

As there are various discussions going on simultaneously it’s kind of hard to keep track of everything, but I’ll do my best to report back over this blog. Stay tuned!