AppStream and the related AppData are XML formats that have been adopted by thousands of upstream projects and are being used in about a dozen different client programs. The AppStream metadata shipped in Fedora is currently a huge 13Mb XML file, which with gzip compresses down to a more reasonable 3.6Mb. AppStream is awesome; it provides translations of lots of useful data into basically all languages and includes screenshots for almost everything. GNOME Software is built around AppStream, and we even use a slightly extended version of the same XML format to ship firmware update metadata from the LVFS to fwupd.
XML does have two giant weaknesses. The first is that you have to decompress and then parse the files – which might include all the ~300 tiny AppData files as well as the distro-provided AppStream files, if you want to list installed applications not provided by the distro. Seeking lots of small files isn’t so slow on a SSD, and loading+decompressing a small file is actually quicker than loading an uncompressed larger file. Parsing an XML file typically means you set up some callbacks, which then get called for every start tag, text section, then end tag – so for a 13Mb XML document that’s nested very deeply you have to do a lot of callbacks. This means you have to process the description of GIMP in every language before you can even see if Shotwell exists at all.
The typical way parsing XML involves creating a “node tree” when parsing the XML. This allows you treat the XML document as a Document Object Model (DOM) which allows you to navigate the tree and parse the contents in an object oriented way. This means you typically allocate on the heap the nodes themselves, plus copies of all the string data. AsNode
in libappstream-glib has a few tricks to reduce RSS usage after parsing, which includes:
- Interning common element names like
description
, p
, ul
, li
- Freeing all the nodes, but retaining all the node data
- Ignoring node data for languages you don’t understand
- Reference counting the strings from the nodes into the various appstream-glib GObjects
This still has a both drawbacks; we need to store in hot memory all the screenshot URLs of all the apps you’re never going to search for, and we also need to parse all these long translated descriptions data just to find out if gimp.desktop
is actually installable. Deduplicating strings at runtime takes nontrivial amounts of CPU and means we build a huge hash table that uses nearly as much RSS as we save by deduplicating.
On a modern system, parsing ~300 files takes less than a second, and the total RSS is only a few tens of Mb – which is fine, right? Except on resource constrained machines it takes 20+ seconds to start, and 40Mb is nearly 10% of the total memory available on the system. We have exactly the same problem with fwupd, where we get one giant file from the LVFS, all of which gets stored in RSS even though you never have the hardware that it matches against. Slow starting of fwupd and gnome-software is one of the reasons they stay resident, and don’t shutdown on idle and restart when required.
We can do better.
We do need to keep the source format, but that doesn’t mean we can’t create a managed cache to do some clever things. Traditionally I’ve been quite vocal against squashing structured XML data into databases like sqlite and Xapian as it’s like pushing a square peg into a round hole, and forces you to think like a database doing 10 level nested joins to query some simple thing. What we want to use is something like XPath, where you can query data using the XML structure itself.
We also want to be able to navigate the XML document as if it was a DOM, i.e. be able to jump from one node to it’s sibling without parsing all the great, great, great, grandchild nodes to get there. This means storing the offset to the sibling in a binary file.
If we’re creating a cache, we might as well do the string deduplication at creation time once, rather than every time we load the data. This has the added benefit in that we’re converting the string data from variable length strings that you compare using strcmp()
to quarks that you can compare just by checking two integers. This is much faster, as any SAT solver will tell you. If we’re storing a string table, we can also store the NUL byte. This seems wasteful at first, but has one huge advantage – you can mmap()
the string table. In fact, you can mmap the entire cache. If you order the string table in a sensible way then you store all the related data in one block (e.g. the <id> values) so that you don’t jump all over the cache invalidating almost everything just for a common query. mmap’ing the strings means you can avoid strdup()
ing every string just in case; in the case of memory pressure the kernel automatically reclaims the memory, and the next time automatically loads it from disk as required. It’s almost magic.
I’ve spent the last few days prototyping a library, which is called libxmlb until someone comes up with a better name. I’ve got a test branch of fwupd that I’ve ported from libappstream-glib and I’m happy to say that RSS has reduced from 3Mb (peak 3.61Mb) to 1Mb (peak 1.07Mb) and the startup time has gone from 280ms to 250ms. Unless I’ve missed something drastic I’m going to port gnome-software too, and will expect even bigger savings as the amount of XML is two orders of magnitude larger.
So, how do I use this thing. First, lets create a baseline doing things the old way:
$ time appstream-util search gimp.desktop
real 0m0.645s
user 0m0.800s
sys 0m0.184s
To create a binary cache:
$ time xb-tool compile appstream.xmlb /usr/share/app-info/xmls/* /usr/share/appdata/* /usr/share/metainfo/*
real 0m0.497s
user 0m0.453s
sys 0m0.028s
$ time xb-tool compile appstream.xmlb /usr/share/app-info/xmls/* /usr/share/appdata/* /usr/share/metainfo/*
real 0m0.016s
user 0m0.004s
sys 0m0.006s
Notice the second time it compiled nearly instantly, as none of the filename or modification timestamps of the sources changed. This is exactly what programs would do every time they are launched.
$ df -h appstream.xmlb
4.2M appstream.xmlb
$ time xb-tool query appstream.xmlb "components/component[@type='desktop']/id[text()='firefox.desktop']"
RESULT: <id>firefox.desktop</id>
RESULT: <id>firefox.desktop</id>
RESULT: <id>firefox.desktop</id>
real 0m0.008s
user 0m0.007s
sys 0m0.002s
8ms includes the time to load the file, search for all the components that match the query and the time to export the XML. You get three results as there’s one AppData file, one entry in the distro AppStream, and an extra one shipped by Fedora to make Firefox featured in gnome-software. You can see the whole XML component of each result by appending /..
to the query. Unlike appstream-glib, libxmlb doesn’t try to merge components – which makes it much less magic, and a whole lot simpler.
Some questions answered:
- Why not just use a GVariant blob?: I did initially, and the cache was huge. The deeply nested structure was packed inefficiently as you have to assume everything is a hash table of
a{sv}
. It was also slow to load; not much faster than just parsing the XML. It also wasn’t possible to implement the zero-copy XPath queries this way.
- Is this API and ABI stable?: Not yet, as soon as gnome-software is ported.
- You implemented XPath in c‽: No, only a tiny subset. See the README.md
Comments welcome.