Tracker 3.x, a retrospective.

Time flies, for better or for worse. The last time I bored you with ramblings on this blog was more than 2 years ago already, prepping for Tracker 3.0. Since I’m sure you don’t need a general catch-up on these last 2 years, let’s stay on that same subject.

Nowadays, we are very close to GNOME 43 and an accompanying 3.4.0 release of the Tracker SPARQL library and data miners, that is 4 minor releases ahead! What happened in between? Most immediately after that previous blog post, the 3.0 release did roll in, the uncertainty behind all the major structural changes vanished, and the transition could largely be called a success: the overhauled internals for complete SPARQL support stood their ground without large regressions; the promises of portals and data isolation delivered, and to this day remain unchallenged (except for occasional requests to let more pieces of metadata through); and the increased genericity and versatility kept fostering further improvements.

Overall, there have been no major regrets, and we are now sitting comfortably in 3.x. And that was all good, since we could spend all that time keeping up the improvements instead of fixing fallout. Let’s revisit what happened since.

Tracker (SPARQL library)

Testing

Ever since 3.0, test coverage has been growing fairly steadily. A very good thing about SPARQL and the Tracker API is that they are very “circular” altogether: every data format used in information exchange must be both parsed and produced by the SPARQL implementation, every external request made is also an external request it should be able to serve, and so on.

This makes it fairly easy to reach all corners in test coverage, despite the complexity involved. It has also been quite a rule for some time that SPARQL language compliance fixes come with tests. The accumulated result over the years is a fairly large collection of tests of the internal machinery; there are over 330 subtests for the SPARQL language alone.

As of this writing, Tracker stands at 76.4% coverage. We are getting very good at catching deviations from how the SPARQL library should behave, and decently good at catching how it should not misbehave. Simply following W3C standards and recommendations pays off here too, since it settles the direction and resolves most questions about what the right behavior is.

Developer Experience

3.0 marked the point where being able to create private Tracker databases with custom data models went from an easter egg to a first-class feature. This also means that developers can now write an ontology (or data model, or schema, pick a name) that fits their data like a glove, instead of using the default Nepomuk one, which is well-trodden and literally written by academics, but may be overkill for the needs of individual applications.
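For illustration, a minimal sketch of what opening such a private store looks like with the 3.x API; the paths here are hypothetical:

#include <libtracker-sparql/tracker-sparql.h>

static TrackerSparqlConnection *
open_private_store (GError **error)
{
  /* Database location plus the directory holding the app's custom
   * .ontology files; both paths are made up for this example. */
  g_autoptr(GFile) store = g_file_new_for_path ("/path/to/app-data/db");
  g_autoptr(GFile) ontology = g_file_new_for_path ("/path/to/app/ontology");

  /* Creates (or opens) a private database that uses the custom data
   * model instead of the default Nepomuk one. */
  return tracker_sparql_connection_new (TRACKER_SPARQL_CONNECTION_FLAGS_NONE,
                                        store, ontology,
                                        NULL /* cancellable */, error);
}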

And of course there is room for failure in writing those ontologies. Last year, GSoC student Abanoub Ghadban worked hard on “breaking it”, polishing the experience and trying to produce helpful warnings, so developer mistakes are easily visible and fixable. The CLI tools facilitate these checks, e.g. creating a temporary endpoint that loads the ontology being edited and running queries against it.

The documentation front also saw steady improvements: the API itself is 100% documented and there is now a fully fleshed-out SPARQL tutorial. Also, drawing inspiration from SQLite, there are now miscellaneous docs on implementation details like limits, a discussion of the security considerations of the implemented specs, and the extensions and interpretations of the SPARQL spec. The examples have also been modernized, and are now written in Python and JavaScript as well.

A great blunder in how Tracker tended to be used in applications was having SPARQL mixed into code, or worse, built through string manipulation. The latter got better API-wise in the past with compiled statements, but the mix of code and database logic was still prevalent. Since 3.3, there is support for loading compiled statements from query files located in GResources. This neatly addresses the separation of queries and code, keeps them indissoluble from the produced binary, and preserves the benefits of compiled statements (compile once, run many times).
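A minimal sketch of how this looks; the resource path and query file are hypothetical, the point being that the query lives in a .rq file compiled into the GResource rather than in the C code:

#include <libtracker-sparql/tracker-sparql.h>

static void
print_songs_by (TrackerSparqlConnection *conn,
                const gchar             *artist)
{
  g_autoptr(GError) error = NULL;
  g_autoptr(TrackerSparqlStatement) stmt = NULL;
  g_autoptr(TrackerSparqlCursor) cursor = NULL;

  /* Compile the query shipped inside the binary, once. */
  stmt = tracker_sparql_connection_load_statement_from_gresource (
      conn, "/org/example/App/queries/songs-by-artist.rq", NULL, &error);
  if (!stmt)
    return;

  /* Bind the ~artist parameter referenced inside the query file,
   * then run it; the statement can be kept around and rerun. */
  tracker_sparql_statement_bind_string (stmt, "artist", artist);
  cursor = tracker_sparql_statement_execute (stmt, NULL, &error);

  while (cursor && tracker_sparql_cursor_next (cursor, NULL, NULL))
    g_print ("%s\n", tracker_sparql_cursor_get_string (cursor, 0, NULL));
}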

Performance

One of the benefits of the API provided by Tracker is that it gives a great amount of leeway for internal refactors without altering the surface. We are largely constrained only by backwards compatibility with the underlying database format. This has allowed further optimizations under the hood since 3.0, both in database updates and in queries. Databases also take up less space now, especially in the presence of many blank nodes.

The greatest performance boost for data producers, though, is obtained through the TrackerBatch API (since 3.1). Prior to it, a TrackerResource would typically be used to build RDF data, then used to produce a SPARQL update, and the SPARQL update parsed to generate and apply the RDF changes. This new API can efficiently traverse a series of TrackerResources (which already describe RDF) and turn them into database modifications, skipping the SPARQL middleman altogether.
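A rough sketch of that flow; graph handling is elided and the property names are only illustrative of Nepomuk usage:

#include <libtracker-sparql/tracker-sparql.h>

/* Describe a file as RDF with TrackerResource, then push it through
 * TrackerBatch (3.1+), skipping SPARQL generation and parsing entirely. */
static gboolean
insert_file_info (TrackerSparqlConnection *conn,
                  const gchar             *uri,
                  GError                 **error)
{
  g_autoptr(TrackerResource) resource = tracker_resource_new (uri);
  g_autoptr(TrackerBatch) batch = NULL;

  tracker_resource_set_uri (resource, "rdf:type", "nfo:FileDataObject");
  tracker_resource_set_string (resource, "nie:url", uri);

  batch = tracker_sparql_connection_create_batch (conn);
  /* NULL picks the default graph; a real producer would likely
   * target a specific graph. */
  tracker_batch_add_resource (batch, NULL, resource);

  return tracker_batch_execute (batch, NULL, error);
}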

In the git tree there is now a small utility to benchmark certain uses of the Tracker API; let’s see how the output looks on a modern Intel i7 for an in-memory database:

Batch size: 5000, Individual test duration: 30 sec
Opening in-memory database…
                           Test		Elements	Elems/sec	Min         	Max         	Avg
   Resource batch update (sync)		6169883.801	205662.793	4.292 usec	5.615 usec	4.862 usec
     SPARQL batch update (sync)		2430664.747	81022.158	11.889 usec	14.255 usec	12.342 usec
   Resource modification (sync)		4440988.603	148032.953	6.588 usec	8.438 usec	6.755 usec
  Resource insert+delete (sync)		3033137.552	101104.585	9.689 usec	12.669 usec	9.891 usec
Prepared statement query (sync)		8566182.714	285539.424	3.000 usec	745.000 usec	3.502 usec
            SPARQL query (sync)		1329076.956	44302.565	21.000 usec	189.000 usec	22.572 usec

After the usual disclaimer that this benchmark greatly depends on CPU and disk characteristics, and your mileage may vary, there are a few things to highlight here:

  • Using modern APIs always pays off. Inserting data directly from TrackerResource (Resource batch update) is 2.5x faster than inserting data through SPARQL updates (SPARQL batch update), and querying through prepared statements (Prepared statement query) is ~6.5x faster than repeating SPARQL queries (SPARQL query).
  • Even though the tests are run synchronously on the main loop, queries can be greatly parallelized, so the actual throughput on a modern machine will be much higher in reality. Updates are single-threaded though.
  • Used the right way, Tracker code is never in the hot paths; the merits there go to SQLite and the production of the data itself. You can expect results in the same ballpark as using SQLite directly, given similar volumes and layout of data.

But this snapshot does not fully convey the improvements made. For a reference baseline, a backported version of this benchmark run on the same computer against Tracker 3.0.x gives:

Batch size: 5000, Individual test duration: 30 sec
Opening in-memory database…
                           Test		Elements	Elems/sec	Min         	Max         	Avg
     SPARQL batch update (sync)		1387346.192	46244.873	17.035 usec	24.290 usec	21.624 usec
   Resource modification (sync)		259923.682	8664.123	49.863 usec	122.236 usec	115.418 usec
  Resource insert+delete (sync)		707638.539	23587.951	41.593 usec	73.702 usec	42.395 usec
Prepared statement query (sync)		7729898.742	257663.291	3.000 usec	527.000 usec	3.881 usec
            SPARQL query (sync)		888896.319	29629.877	31.000 usec	180.000 usec	33.750 usec

Looking past the lack of TrackerBatch there, it’s still easy to see there have been massive improvements pretty much across the board. As we already encourage: using the latest Tracker gives you the best Tracker.

Data serialization

SPARQL was very much designed for the tasks of storing RDF data, querying RDF data, and moving RDF data around. From the tiniest resource/value existence check to full content dumps, everything is one query away.

What we were lacking was a consistent way to convert all that data into something that could easily be piped around, saved, processed, etc. To make this task easy, there is now API to serialize and deserialize data between a Tracker database and the popular RDF file formats. This is done efficiently, with flat RAM usage on both ends; it is even possible to chain these APIs so that the full RDF data never exists anywhere at once during the process.
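As a sketch of the shape of this API (I’m quoting the 3.3 serialization entry points from memory, so treat the exact flag and enum names as approximate), pulling a Turtle dump out of a connection looks roughly like this:

#include <libtracker-sparql/tracker-sparql.h>

static void
serialize_cb (GObject *source, GAsyncResult *res, gpointer user_data)
{
  g_autoptr(GError) error = NULL;
  /* The result is a plain GInputStream producing Turtle, ready to be
   * spliced into a file, a socket, or a deserializing connection. */
  g_autoptr(GInputStream) stream =
    tracker_sparql_connection_serialize_finish (TRACKER_SPARQL_CONNECTION (source),
                                                res, &error);

  /* ... g_output_stream_splice() it wherever it needs to go ... */
}

static void
dump_everything (TrackerSparqlConnection *conn)
{
  /* DESCRIBE hands back the full RDF for the matched resources, which
   * the serializer turns into the requested format on the fly. */
  tracker_sparql_connection_serialize_async (conn,
                                             TRACKER_SERIALIZE_FLAGS_NONE,
                                             TRACKER_RDF_FORMAT_TURTLE,
                                             "DESCRIBE ?r { ?r a rdfs:Resource }",
                                             NULL, serialize_cb, NULL);
}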

And since serialization to RDF formats is a builtin feature of HTTP endpoints, it has allowed us to level up our support for those. A Tracker HTTP endpoint is now fully compliant and indistinguishable from the larger players.

The most immediately useful application of this serialization support is in the CLI tools: the import and export commands now use this API and can deal with these formats. But what is this for? Is this driven by a level of completionism that borders on sickness? Well, yes, but there are of course plans around these features, more on that later.

Tracker Miners

Performance

You might think the SPARQL library improvements above would be the largest improvement the filesystem miner could get, and you would be wrong.

Part of the raison d’être of a filesystem indexer is to stay up-to-date with filesystem changes. In the GNOME world, this catching up is usually done through GFileMonitor, which does the dirty job of setting up and tracking an inotify handle for changes on an individual directory, in a GLib-friendly way. What is wrong with that? Nothing, unless you do it at the scale indexers do. Each of those GFileMonitors is backed by a pollable FD and a GSource wrapping it, and iterating a GMainContext with thousands of GSources attached to it massively, thoroughly sucks.
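To make the scale problem concrete, this is roughly the pattern an indexer ends up with when using GFileMonitor: one monitor per directory, each costing an inotify watch, an FD and a GSource on the iterating GMainContext. A simplified sketch, not what tracker-miner-fs literally does:

#include <gio/gio.h>

/* One GFileMonitor per directory: fine for a handful of folders,
 * painful when an indexer needs tens of thousands of them. */
static GFileMonitor *
watch_directory (const gchar *path,
                 GCallback    changed_cb,
                 gpointer     user_data)
{
  g_autoptr(GFile) dir = g_file_new_for_path (path);
  GFileMonitor *monitor;

  monitor = g_file_monitor_directory (dir, G_FILE_MONITOR_NONE, NULL, NULL);
  if (monitor)
    g_signal_connect (monitor, "changed", changed_cb, user_data);

  return monitor;
}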

Is this a case of Tracker abusing a perfectly fine GLib API? Or on the contrary is this a case of bad GLib API design? I will let you debate on that, as I am unclear myself.

The first solution to alleviate this (since 3.1) was delegating file monitoring to a separate thread, so the GMainContext that is expensive to iterate only affects file monitors, as opposed to everything else going on. Later on, FANotify finally gained the missing features that made it suitable for indexers (not requiring CAP_SYS_ADMIN was one of them) and Tracker Miners got an implementation for it (since 3.3). Most notably, with this kernel API it is only necessary to poll a single FD to receive events for all FANotify marks set on the filesystem.

In what sounds like a case of miscommunication between kernel developers working on independent new features that didn’t mix well, it is unfortunately not possible nowadays to mix the bleeding edge in file monitoring (FANotify) with the bleeding edge in filesystems (btrfs); for these (and other) cases Tracker Miners will still fall back on plain GLib/inotify. Hopefully the situation will be resolved at some point.

In 3.1, the filesystem indexer also gained flow control mechanisms that allow its RAM usage to stay mostly flat independently of filesystem size and layout. At the peak of its activity, tracker-miner-fs-3 uses 30-40MB here (per gnome-system-monitor), and idles at 5MB. Needless to say, it is also many times faster than its 3.0 self.

But all this was about tracker-miner-fs-3, the daemon in charge of monitoring filesystem changes and mirroring the file/folder structure into the database. What about tracker-extract-3, the daemon in charge of nitty-gritty file metadata extraction? When this step needs to happen (say, for newly indexed or modified files), it is by all accounts expensive; now, by the sheer magic of everything else shrinking, it comparatively only got worse. There is a reason we avoid it happening frequently at all costs.

But what is slow there? Roughly speaking, it should be just a loop going through files, getting the metadata and inserting it, and that should be fucking fast as per the benchmarks above, right? Right; the problem is in the “getting the metadata” step. This will fluctuate wildly depending on the files scattered across the filesystem, their mimetypes, and the libraries used to extract their metadata. The plain text extractor or the in-tree MP3 extractor are capable of opening, extracting metadata from, and closing multiple thousands of files per second. All the external libraries used for metadata extraction (yes, all of them) are slower, ranging from several times slower up to 4 orders of magnitude, also depending on the input files (I curated some infernal PDFs). The worst offenders are Poppler, GStreamer and libtiff.

As is evident here (you don’t need to believe me, add TRACKER_DEBUG=statistics to /etc/environment and reset/restart the miner), most libraries dealing with files and formats optimize for library resources being long-lived across an application’s lifetime, while the creation and disposal of those resources is often overlooked. The metadata extraction daemon faces that hard fact file after file, so its slowness is just a reflection of the slowness of these libraries in setting themselves up. If, after all of this, someone thinks the filesystem indexer is slow, that is where the money is.

Extending and maintaining metadata

Although the focus has mainly been on making things work reliably, rather than going crazy with extending the metadata stored (yet), a point worth noting here is last year’s GSoC work from student Nishit Patel, who worked on indexing creation time (on the filesystems that support/enable it) and making it searchable all across the stack.

We also got support for a number of game file formats (mainly, retro ones), which GNOME Games (now Highscore) readily made use of. LOL, jk.

Handling and following failures

Whenever a file is broken or corrupted, or a 3rd party library crashes or issues a syscall that is caught by seccomp, the tracker-extract-3 daemon will quit (with varying degrees of gracefulness) and be taught on the next restart to avoid the file that triggered the situation. This is not precisely new behavior; what is new is that these failures are now recorded and can easily be inspected over the CLI with tracker3 status. Most bugs we receive about broken extraction are reactive (e.g. “why does Music not show this file?”); this allows a more proactive approach to fixing metadata extraction bugs, if users happen to look there and cooperate.

There is also a slight possibility that extraction bugs are due to Tracker itself, but those are largely a thing of the past.

Coming up next…

I very much cheered when I learnt of the “Local first” initiative. In fact, I anticipated it so much that I literally anticipated it: development of the serialization APIs started sometime around last year, with a plan to provide facilities to transparently and neatly synchronize RDF data across instances on multiple machines owned by the same user.

Who wants that? Certainly not the filesystem indexer. However, there is indeed a desire to avoid reliance on third-party services for sensitive user data like health information, chat logs, or bookmarked sites. With some truckloads of optimism, I would hope that this becomes a cornerstone of that goal for applications under the GNOME umbrella that need to deal with a non-trivial amount of data.

How would that work? What do we need to get there? We need a query language that supports it (check, duh), a data model that can handle the different patterns that might emerge in synchronizing data (check), a way to make these machines talk between each other (check), and a way to diff missing data (check). All the pieces are really set, so what is missing is putting those together, of course drawing inspiration from Christian Hergert’s Bonsai to make machines discover each other.

And of course there is still very much a desire to keep the heart of content applications compelling and relevant. There are still opportunities to further extend and link the metadata stored by the filesystem indexer, perhaps with the help of the actual semantic web that lives out there. We already have a number of universal identifiers available (MusicBrainz tags, IMDB IDs, game ROM IDs) to interrelate and cross-reference data.

Now that the codebase and its features are settled and working well, we can start thinking about fancy new features. Stay tuned for the next installment of this series in 2024, when I talk about Tracker 3.8.0, or perhaps Tracker 3.5.20. If you made it this far, you have my appreciation. Until next time!

Tracker 2.99.1 and miners released

TL;DR: $TITLE, and a call for distributors to make it easily available in stable distros, more about that at the bottom.

Sometime this week (or last, depending on how you count), Tracker 2.99.1 was released. Sam has been doing a fantastic series of blog posts documenting the progress. With my blogging frequency I’m far from stealing his thunder :), but I will still add some retrospect here to highlight how important a milestone this is.

First of all, let’s give an idea of the magnitude of the changes so far:


[carlos@irma tracker]$ git diff origin/tracker-2.3..origin/master --stat -- docs examples src tests utils |tail -n 1
788 files changed, 20475 insertions(+), 66384 deletions(-)

[carlos@irma tracker-miners]$ git diff origin/tracker-miners-2.3..origin/master --stat -- data docs src tests | tail -n 1
354 files changed, 39422 insertions(+), 6027 deletions(-)

What did happen there? A little more than half of the insertions in tracker-miners (and corresponding deletions in tracker) can be attributed to code from libtracker-miner, libtracker-control and corresponding tests moving to tracker-miners. Those libraries are no longer public, but given those are either unused or easily replaceable, that’s not even the most notable change :).

The changes globally could be described as “things falling into place”: Tracker got more cohesive, versatile and tested than it ever was, we put a lot of care and attention to detail into it, and we hope you like the result. Let’s break down the highlights.

Understanding SPARQL

Sometime a couple of years ago, I got fed up after several failed attempts at implementing support for property paths, which wound up in a rewrite of the SPARQL parser. This was part of Tracker 2.2.0 and brought its own benefits; ancient history.

Getting to the point, having the expression tree in the new parser closely modeled after the SPARQL 1.1 grammar definition helped get a perfect snapshot of what we don’t do, what we don’t do correctly, and what we do extra. The parser was made to accept all correct SPARQL, and we had an `_unimplemented()` define in place to error out when interpreting the expression tree.

But that also gave me something to grep through and sigh at. This turned into many further reads of the SPARQL 1.1 specs, and a number of ideas about how to tackle the missing pieces, or about whether we were restricted by compatibility concerns, as for some things we were limited by our own database structure.

Fast forward to today, and the define is gone. Tracker covers the SPARQL 1.1 language in its entirety, warts and all. The spec is from 2013; we just got there 7 years late :). Most notably, there’s:

  • Graphs: In a triple store, the aptly named triples consist of subject/predicate/object, and they belong within graphs. The object may point to elements in other graphs.

    In prior versions, we “supported graphs” in the language, but those were more a property of the triple’s object. That changed the semantics in ways that look slight but are fundamental, e.g. no two graphs could hold the same triple, and the ownership of a triple was backwards if subject and object were in different graphs.

    Now the implementation of graphs perfectly matches the description, and a graph becomes a good isolated unit to grant access to in the case of sandboxing.

    We also support the ADD/MOVE/CLEAR/LOAD/DROP whole-graph operations, to ease their management.

  • Services: The SERVICE syntax allows federating portions of your query graph pattern to external services, and operating transparently on that data as if it were local. This is not exactly new in Tracker 2.99.x, but it now supports D-Bus services in addition to HTTP ones. More notes about why this is key further down.
  • New query forms, DESCRIBE/CONSTRUCT: This syntax sits alongside SELECT. DESCRIBE is a simple form to get the RDF triples fully describing a resource; CONSTRUCT is a more powerful data extraction clause that allows serializing arbitrary portions of the triple set, even all of it, and even across RDF schemas.

Of all 11 documents in the SPARQL recommendations, we are essentially only missing support for HTTP endpoints to entirely pass for a SPARQL 1.1 store. We obviously don’t mean to compete with enterprise-level databases, but we are completionists and will get to implementing the full recommendations someday :).

There is no central store

The tracker-store service got stripped of everything that makes it special. You were already able to create private stores; making those public via DBus is now one API call away. And its simple DBus API to perform/restore backups is now superseded by the CONSTRUCT and LOAD syntax.

We have essentially democratized triple stores. In this picture (and a sandboxed world) it does not make sense to have a singleton default one, so the tracker-store process itself is no more. Each miner (Filesystem, RSS) has its own store, made public on its DBus name. TrackerSparqlConnection constructors let you specifically create a local store, or connect to a specific DBus/HTTP service.
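That “one API call” is essentially TrackerEndpointDBus; a hedged sketch of how a service would expose its private store on the session bus:

#include <libtracker-sparql/tracker-sparql.h>

/* Expose an existing private connection as a SPARQL endpoint on the
 * session bus, so other processes can reach it through SERVICE {} or
 * tracker_sparql_connection_bus_new(). */
static TrackerEndpointDBus *
publish_store (TrackerSparqlConnection *conn,
               GError                 **error)
{
  g_autoptr(GDBusConnection) bus =
    g_bus_get_sync (G_BUS_TYPE_SESSION, NULL, error);

  if (!bus)
    return NULL;

  /* NULL falls back to the default endpoint object path. */
  return tracker_endpoint_dbus_new (conn, bus, NULL, NULL, error);
}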

No central service? New paradigm!

Did you use to store/modify data in tracker-store? There’s some bad news: that’s no longer for you to do, scram from our lawn!

You are still much welcome to create your own private store; there you can do as you please, even rolling something other than Nepomuk.

But wait, how can you keep your own store and still consume the data indexed by the Tracker miners? Here is where the SERVICE syntax comes into play, allowing you to deal with miner data and your own altogether. A simple hypothetical example:

# Query favorite files
SELECT ?u {
  SERVICE <dbus:org.freedesktop.Tracker3.Miner.Files> {
    ?u a nfo:FileDataObject
  }
  ?u mylocaldata:isFavorite true
}

As per the grammar definition, the SERVICE syntax can only be used in query forms, not update ones. This is essentially the language conspiring to keep a clear ownership model, where other services are not yours to modify.

If you are only interested in accessing one service, you can use tracker_sparql_connection_bus_new and perform queries directly against the remote service.
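Something along these lines (a sketch, with error handling kept to a minimum):

#include <libtracker-sparql/tracker-sparql.h>

static void
list_indexed_files (void)
{
  g_autoptr(GError) error = NULL;
  g_autoptr(TrackerSparqlCursor) cursor = NULL;

  /* Connect straight to the filesystem miner's endpoint on the session bus. */
  g_autoptr(TrackerSparqlConnection) conn =
    tracker_sparql_connection_bus_new ("org.freedesktop.Tracker3.Miner.Files",
                                       NULL /* default object path */,
                                       NULL /* default session bus */,
                                       &error);
  if (!conn)
    return;

  cursor = tracker_sparql_connection_query (conn,
                                            "SELECT ?u { ?u a nfo:FileDataObject }",
                                            NULL, &error);
  while (cursor && tracker_sparql_cursor_next (cursor, NULL, NULL))
    g_print ("%s\n", tracker_sparql_cursor_get_string (cursor, 0, NULL));
}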

A web presence

It’s all about appearance these days; that’s why newscasters don’t switch the half of the suit they wear. A long time ago we used to have the tracker-project.org domain; the domain expired and eventually got squatted.

That normally sucks on its own; for us it was a bit of a pickle, as RDF (and our own ontologies) stands largely on URIs. That means live software producing links out of our control, and those links going to pastes/bugs/forums all over the internet. Luckily for us, tracker-project.org is a terrible choice of name for a porn site.

We couldn’t simply do the change either; in many regards those links were ABI. With 3.x on the way, ABI was no longer a problem. Sam did things properly, so we now have a site and a proper repository of ontologies.

Nepomuk is dead, long live Nepomuk

Nepomuk is a dead project. Despite its site being currently alive, it’s been dead for extended periods of time over the last 2 years. That’s 11.5M EUR of your European taxpayer money slowly fading away.

We no longer think we should consider it “an upstream”, so we have decided to go our own way. After some minor sanitization and URI rewriting, the Nepomuk ontology is preserved mostly as-is, under our own control.

But remember, Nepomuk is just our “reference” ontology, a swiss army knife for whatever might need to be stored in a desktop. You can always roll your own.

Tracker-miner-fs data layout

For sandboxing to be any useful, there must be some actual data separation. The tracker-miner-fs service now stores things in several graphs:

  • tracker:FileSystem
  • tracker:Audio
  • tracker:Video
  • tracker:Documents
  • tracker:Software

And it commits further to the separation between “Data Objects” (e.g. files) and “Information Elements” (e.g. what their content represents). Both aspects of a “file” still reference each other, but they simply used to be the same resource previously.

The tracker:FileSystem graph is the backbone of filesystem data; it contains all file Data Objects and folders. All other graphs store the related Information Elements (e.g. a song in a FLAC file).

Resources are interconnected between graphs; depending on the graphs you have access to, you will get a partial (yet coherent) view of the data.

CLI improvements

We have been making some changes around our CLI tools. With tracker shifting its scope to being a good SPARQL triple store, the base set of CLI tools revolves around that, and can be seen as an equivalent of the sqlite3 CLI command.

We also have some SPARQL specific sugar, like tracker endpoint that lets you create transient SPARQL services.

All miner-specific subcommands, and those that relied implicitly on their details, moved to the tracker-miners repo; the tracker3 command is extensible to allow this.

Documentation

In case this was not clear: we want to be a general-purpose data storage solution. We spent quite some time improving and extending the developer and ontology documentation, adding migration notes… there’s even an incipient SPARQL tutorial!

There is a sneak preview of the API documentation at our site. It’s nice being able to tell that again!

Better tests

Tracker additionally ships a small helper Python library to make it easy to write tests against Tracker infrastructure. There are many new and deeper tests all over the place, e.g. around the new syntax support.

Up next…

You’ve seen some talk about sandboxing, but nothing about sandboxing itself. That’s right, support for it is in a branch and will probably be part of 2.99.2. Now the path is paved for it to be transparent.

We are now starting the race to update users. Sam got some nice progress on nautilus, and I just got started at shaving a yak on a cricket.

The porting is not completely straightforward. With a few nice exceptions, a good amount of the Tracker code out there is stuck in some time-frozen, “as long as it works”, cargo-culted state. This sounds like a good opportunity to modernize queries and introduce the usage of compiled statements. We are optimistic that we’ll get most major players ported in time, and we made 3.x able to install and run in parallel in case we miss the goal.

A call to application developers

We are no longer just “that indexer thingy”. If you need to store data with more depth than a table, if you missed your database design and relational algebra classes, or don’t miss them at all: we’ve got to talk :). Come visit us at #tracker.

A call to distributors

We made tracker and tracker-miners 3.x able to install and run in parallel to tracker 2.x, and we expect users to get updated to it over time.

Given that it will get reflected in nightly flatpaks, and Tracker miners are host services, we recommend that tracker3 development releases are made available or easy to install in current stable distribution releases. Early testers and ourselves will thank you.

Gnome-shell Hackfest 2019 – Day 3

As promised, some late notes on the 3rd and last day of the gnome-shell hackfest, so yesterday!

Some highlights from my partial view:

  • We had a mind-blowing, in-depth discussion about the per-CRTC frame clocks idea that’s been floating around for a while. What started as “light” before-bedtime conversation the previous night continued the day after, straining our neurons in front of a whiteboard. We came out wiser nonetheless, and have a much more concrete idea about how it should work.
  • Georges updated his merge request to replace Cogl structs with graphene ones. This now passes CI and was merged \o/
  • Much patch review happened in place, and some other pretty notable refactors and cleanups were merged.
  • The evening was more rushed than usual, with some people leaving already. The general feeling seemed good!
  • In my personal opinion the outcome was pretty good too. There’s been progress at multiple levels and new ideas sparked, you should look forward to posts from others :). It was also great to put a face to some IRC nicks, and meet again all the familiar ones.

Kudos to the RevSpace members and especially Hans, without them this hackfest couldn’t have happened.

Gnome-shell Hackfest 2019 – Day 2

Well, we are starting the 3rd and last day of this hackfest… I’ll write about yesterday, which probably means tomorrow I’ll blog about today :).

Some highlights of what I was able to participate in or witness:

  • Roman Gilg of KDE fame came to the hackfest; it was a nice opportunity to discuss mixed DPI densities for X11/Xwayland clients. We first thought about having one server per pixel density, but later on we realized we might not be that far from actually isolating all X11 clients from each other, so why stop there.
  • The conversation drifted into other topics relevant to desktop interoperation. We discussed window activation and focus stealing prevention, a topic “fixed” in Gnome but through a private protocol. I already had a protocol draft around, which was sent today to the wayland-devel ML.
  • A plan was devised for what is left of Xwayland-on-demand, and an implementation is in progress.
  • The designers have been doing some exploration and research on how we interact with windows, the overview and the applications menu, and thinking about alternatives. At the end of the day they’ve demoed to us the direction they think we should take.

    I am very much not a designer and I don’t want to spoil their fine work here, so stay tuned for updates from them :).

  • As the social event, we had a very nice BBQ with some hackerspace members. Again kindly organized by Revspace.

Gnome-shell Hackfest 2019 – Day 1

So today kickstarted the gnome-shell hackfest in Leidschendam, the Netherlands.

There’s a decent number of attendants from multiple parties (Red Hat, Canonical, Endless, Purism, …). We all brought various items and future plans for discussion, and have a number of merge requests in various states to go through. Some exciting keywords are Graphene, YUV, mixed DPI, Xwayland-on-demand, …

But that is not all! Our finest designers also got together here, and I overheard they are discussing usability of the lock screen, among other topics.

This event wouldn’t have been possible without the Revspace hackerspace people and especially our host Hans de Goede. They kindly provided the venue and necessary material; I am deeply thankful for that.

As there are various discussions going on simultaneously it’s kind of hard to keep track of everything, but I’ll do my best to report back over this blog. Stay tuned!

What am I doing with Tracker?

“Colored net” by Chris Vees (priorité maison) is licensed under CC BY-NC-ND 2.0

Some years ago I was asked to come up with some support for sandboxed apps wrt indexed data. This drummed up into Tracker 2.0 and domain ontologies, allowing those sandboxed apps to keep their own private data and a collection of Tracker services to populate it.

Fast forward to today and… this is still largely unused. Tracker-using Flatpak applications still whitelist org.freedesktop.Tracker, and are thus allowed to read and change content there. Although I’ve been told it’s mostly been lack of time… I cannot blame them; domain ontologies offer perfect isolation at the cost of perfect duplication. It may do the job, but it is far from optimal.

So I got asked again: “do we have a credible story for sandboxed Tracker?”. One way or another, it seems we don’t; back to the drawing board.

Somehow, the web world seems to share some problems with our case, and seems to handle it with some degree of success. Let’s have a look at some excerpts of the Sparql 1.1 recommendation (emphasis mine):

RDF is often used to represent, among other things, personal information, social networks, metadata about digital artifacts, as well as to provide a means of integration over disparate sources of information.

A Graph Store is a mutable container of RDF graphs managed by a single service. […] named graphs can be added to or deleted from a Graph Store. […] a Graph Store can keep local copies of RDF graphs defined elsewhere […] independently of the original graph.

The execution of a SERVICE pattern may fail due to several reasons: the remote service may be down, the service IRI may not be dereferenceable, or the endpoint may return an error to the query. […] Queries may explicitly allow failed SERVICE requests with the use of the SILENT keyword. […] (SERVICE pattern) results are returned to the federated query processor and are combined with results from the rest of the query.

So according to Sparql 1.1, we have multiple “Graph Stores” that manage multiple RDF graphs. They may federate queries to other endpoints with disparate RDF formats and whose availability may vary. This remote data is transparent, and may be used directly or processed for local storage.

Let’s look back at Tracker: we have a single Graph Store, which really is not that good at graphs. Responsibility for keeping that data updated is spread across multiple services, and ownership of that data is equally scattered.

It clicked for me: if we transpose those same concepts from the web to the network of local services that your session is, we can use those same mechanisms to cut a number of drawbacks short:

  • Ownership is clear: If a service wants to store data, it would get its own Graph Store instead of modifying “the one”. Unless explicitly supported, Graph Stores cannot be updated from the outside.
  • So is lifetime: There’s been debate about whether data indexed “in Tracker” is permanent data or a cache. Everyone would get to decide their best fit, unaffected by everyone else’s decisions. The data from tracker-miners would totally be a cache BTW :).
  • Increases trustability: If Graph Stores cannot be tampered with externally, you can trust their content to represent the best effort of their only producer, instead of the minimum common denominator of all services updating “the Graph Store”.
  • Gives a mechanism for data isolation: Graph Stores may choose limiting the number of graphs seen on queries federated from other services.
  • Is sandboxing friendly: From inside a sandbox, you may get limited access to the other endpoints you see, or to the graphs offered. Updates are also limited by nature.
  • But it works the same without a sandbox. It also has some benefits, like reducing data duplication and making for smaller databases.

Domain ontologies from Tracker 2.0 also handle some of those differently, but very, very roughly. So the first thing to do to get to that RDF nirvana was muscling up the Sparql support in Tracker, and so I did! I already had some “how could it be possible to do…” plans in my head to tackle most of those, but unfortunately they require changes to the internal storage format.

As it seemed the time to do one (FTR, the storage format has been “unchanged” since 0.15), I couldn’t just do the bare minimum work; it seemed too good an opportunity to miss, rather than maybe having to make future format changes for leftover Sparql 1.1 syntax support.

Things ended up escalating into https://gitlab.gnome.org/GNOME/tracker/commits/wip/carlosg/sparql1.1, where it can be said that Tracker supports 100% of the Sparql 1.1 syntax. No buts, maybe bugs.

Some notable additions are:

  • Graphs are fully supported there, along with all graph management syntax.
  • Support for query federation through SERVICE {}
  • Data dumping through DESCRIBE and CONSTRUCT query forms.
  • Data loading through LOAD update form.
  • The pesky negated property path operator.
  • Support for rdf:langString and rdf:List
  • All missing builtin functions

This is working well, and is almost a drop-in (one’s got to mind the graph semantics), so making it material for Gnome 3.34 starts to sound realistic.

As Sparql 1.1 is a recommendation finished in 2013, and no newer versions seem to be in the works, I think it can be said Tracker is reaching maturity. Only the HTTP Graph Store Protocol (because why not) remains as the big-ish item before we can reasonably say we implement all 11 documents. Note that Tracker’s bet on RDF and Sparql started at a time when 1.0 was the current document and 1.1 just an early draft.

And sandboxing support? You might already guess the features it’ll draw from. It’s coming along; actually using Tracker as described above will go a bit deeper than the required query language syntax, more on that when I have the relevant pieces in place. I just thought I’d stop a moment to announce this huge milestone :).

A mutter and gnome-shell update

Some personal highlights:

Emoji OSK

The reworked OSK was featured a couple of cycles ago, but a notable thing that was still missing from the design reference was emoji input.

No more, sitting in a branch as of yet:

This UI feeds from the same emoji list as GtkEmojiChooser and applies the same categorization/grouping; all the additional variants of an emoji are available in a popover. There’s also a (less catchy) keypad UI in place, ultimately hooked up to applications through the GtkInputPurpose.

I do expect this to be in place for 3.32 for the Wayland session.

X11 vs Wayland

Ever since the Wayland work started on mutter, there have been ideas and talks about how mutter “core” should become detached from X11 code. It has been a long and slow process; every design decision has been directed towards this goal, we leaped forward in the 2017 GSoC, and e.g. Georges sums up some of his own recent work in this area.

For me it started with a “Hey, I think we are not that far off” comment in #gnome-shell earlier this cycle. Famous last words. After rewriting several, many, seemingly unrelated subsystems, and shuffling things here and there, we are now at a point where gnome-shell might run with --no-x11 set. A little push more and we will be able to launch mutter as a pure Wayland compositor that just spawns Xwayland on demand.

What’s after that? It’s certainly an important milestone, but by no means are we done here. Also, gnome-settings-daemon consists for the most part of X11 clients, which spoils the fun by requiring Xwayland very early in a real session; guess what’s next!

At the moment about 80% of the patches have been merged. I cannot assure at this point that it will all be in place for 3.32, but 3.34 most surely. But here’s a small yet extreme proof of work:

Performance

It’s been nice to see some of the performance improvements I did last cycle finally being merged. Some notable ones, like the one that stopped triggering full surface redraws on every surface invalidation. I also managed to get some blocking operations out of the main loop, which should fix many of the seemingly random stalls some people were seeing.

Those are already in 3.31.x, with many other nice fixes in this area from Georges, Daniel Van Vugt et al.

Fosdem

As a minor note, I will be attending Fosdem and the GTK+ Hackfest happening right after. Feel free to say hi or find Wally, whichever comes first.

On the track for 3.32

It happens sneakily, but there’s more going on on the Tracker front than the occasional fallout. Yesterday 2.2.0-alpha1 was released, containing some notable changes.

On and off during the last year, I’ve been working on a massive rework of the SPARQL parser. The current parser was fairly solid, but hard to extend for some of the syntax in the SPARQL 1.1 spec. After multiple attempts and failures at implementing property paths, I convinced myself this was the way forward.

The main difference is that the previous parser was more of a serializer to SQL, just minimal state was preserved across the operation. The new parser does construct an expression tree so that nodes may be shuffled/reevaluated. This allows some sweet things:

  • Property paths are a nice resource for writing more idiomatic SPARQL; most property path operators are within reach now. There’s currently support for sequence paths:

    # Get all files in my homedir
    SELECT ?elem {
      ?elem nfo:belongsToContainer/nie:url 'file:///home/carlos'
    }
    


    And inverse paths:

    # Get all files in my homedir by inverting
    # the child to container relation
    SELECT ?elem {
      ?homedir nie:url 'file:///home/carlos' ;
               ^nfo:belongsToContainer ?elem
    }
    

    There are harder ones like + and * that will require recursive selects, and there’s the negation (!) operator, which is not possible to implement yet.

  • We now have prepared statements! A TrackerSparqlStatement object was introduced, capable of holding a query with parameters which can be set/replaced prior to execution.

    conn = tracker_sparql_connection_get (NULL, NULL);
    stmt = tracker_sparql_connection_query_statement (conn,
                                                      "SELECT ?u { ?u fts:match ~term }",
                                                      NULL, NULL);
    
    tracker_sparql_statement_bind_string (stmt, "term", search_term);
    cursor = tracker_sparql_statement_execute (stmt, NULL, NULL);
    

    This is long-sought protection against injections. The object is cacheable and can service multiple cursors asynchronously, so it will also be an improvement for frequent queries.

  • More concise SQL is generated at places, which brings slight improvements on SQLite query planning.

This also got the ideas churning towards future plans, the trend being a generic triple store as Sparql 1.1 capable as possible. There are also some ideas about better data isolation for Flatpak and sandboxes in general (seeing the currently supported approach didn’t catch on). Those will eventually happen in this or following cycles, but I’ll reserve that for another blog post.

An eye was kept on memory usage too (mostly unrealized ideas from the performance hackfest earlier this year): tracker-store has been made to automatically shut down when unneeded (ideally most of the time, since it just takes care of updates and the unruly apps that use the bus connection), and tracker-miner-fs took over the functionality of tracker-miner-apps. That’s 2 fewer processes in your default session.

In general, we’re on the way to an exciting release, and there’s more to come!

Performance hackfest

Last evening I came back from the GNOME performance hackfest in Cambridge. There was plenty of activity, clear skies, and pub evenings. Here are some incomplete and unordered items, just the ones I could do/remember/witness/talk about/overhear:

  • Xwayland 1.20 seems to be a big battery saver. Christian Kellner noticed that X11 Firefox playing Youtube could take his laptop to >20W consumption, traced to fairly intensive GPU activity. One of the first things we did was trying master, which dropped power draw to 8-9W. We presumed this was due to the implementation of the Present extension.
  • I was looking into dropping the gnome-shell usage of AtspiEventListener for the OSK. It is really taxing on CPU usage (even if the events we want are a minuscule subset, gnome-shell will forever get all that D-Bus traffic, and a11y is massively verbose), plus it slowly but steadily leaks memory.

    For the other remaining path I started looking into at least being able to deinitialize it. The leak deserves investigation, but I thought my time could be better invested on other things than learning yet another codebase.

  • Jonas Ådahl and Christian Hergert worked towards having Mutter dump detailed per-frame information, and Sysprof able to visualize it. This is quite exciting, as all classic options just let us know where we spend time overall, but don’t tell us whether we missed the frame mark, nor why precisely that would be. Update: I’ve been pointed out that Eric Anholt also worked on GPU perf events in mesa/vc4, so this info could also be visualized through sysprof.
  • Peter Robinson and Marco Trevisan ran into some unexpected trouble when booting GNOME on an ARM board with no input devices whatsoever. I helped a bit with debugging and ideas, and Marco did some patches to neatly handle this situation.
  • Hans de Goede did some nice progress towards having the GDM session consume as little as possible while switched away from it.
  • Some patch review went on, Jonas/Marco/me spent some time looking at a screen very close and discussing the mipmapping optimizations from Daniel Van Vugt.
  • I worked towards fixing the reported artifact from my patches to aggressively cache paint volumes. These are basically one-off cases where individual ClutterActors break the invariants that would make caching possible.
  • Christian Kellner picked up my idea of performing pointer picking purely on the CPU side when the stage purely consists of 2D actors, instead of using the usual GL approach of “repaint in distinctive colors, read pixel to perform hit detection” which is certainly necessary for 3D, but quite a big roundtrip for 2D.
  • Alberto Ruiz and Richard Hughes talked about how to improve gnome-software memory usage in the background.
  • Alberto and me briefly dabbled with the idea of having a specific search provider API more tied to Tracker, in order to ease the many context switches triggered by overview search.
  • On the train ride back, I unstashed and continued work on a WIP tracker-miners patch to have tracker-extract able to shut down on inactivity. One less daemon to have usually running.

Overall, it was a nice and productive event. IMO, having people with good knowledge both deep in the stack and wide across GNOME was decisive; I hope we can repeat this feat again soon!

OSK update

There’s been a rumor that I was working on improving the gnome-shell on-screen keyboard. What’s been up here? Let me show you!

The design has been based on the mockups at https://wiki.gnome.org/Design/OS/ScreenKeyboard, here’s how it looks in English (mind you, hasn’t gone through theming wizards):

The keymaps get generated from CLDR (see here), which helped boost the number of supported scripts (cf. caribou). Some visual examples:

As you can see there are still a few ugly ones; the layouts aren’t as uniform as one might expect. These issues will be resolved over time.

The additional supported scripts don’t mean much without a way to send those fancy chars/strings to the client. We traditionally were just able to send forged keyboard events, which means we were restricted to keycodes that had a representation in the current keymap. On X11 we are kind of stuck with that, but we can do better on Wayland; this work relies on a simplified version of the text input protocol that I’m doing the last proofreading on before proposing as v3 (the branches currently use a private copy). Using a specific protocol allows sending UTF-8 strings independently of the keymap, which is also very convenient for text completion.

But there are keymaps where CLDR doesn’t dare to go; prominent examples are Chinese or Japanese. For those, I’m looking into properly leveraging IBus so pinyin-like input methods work by feeding their results into the suggestions box:

Ni Hao!

The suggestion box even kind of works with the typing booster ibus IM. But you have to explicitly activate it, there is room for improvement here in the future.

And there is of course still bad stuff and todo items. Some languages like Korean have neither a layout nor input methods that accept latin input, so they are badly handled (read: not at all). It would also be nice to support shape-based input.

Other missing things from the mockups are the special numeric and emoji keymaps; there’s some unpushed work towards supporting those, but I had to draw the line somewhere!

The work has been pushed in mutter, gtk+ and gnome-shell branches, which I hope will get timely polished and merged this cycle 🙂