Speeding up AppStream: mmap’ing XML using libxmlb

AppStream and the related AppData are XML formats that have been adopted by thousands of upstream projects and are being used in about a dozen different client programs. The AppStream metadata shipped in Fedora is currently a huge 13Mb XML file, which with gzip compresses down to a more reasonable 3.6Mb. AppStream is awesome; it provides translations of lots of useful data into basically all languages and includes screenshots for almost everything. GNOME Software is built around AppStream, and we even use a slightly extended version of the same XML format to ship firmware update metadata from the LVFS to fwupd.

XML does have two giant weaknesses. The first is that you have to decompress and then parse the files – which might include all the ~300 tiny AppData files as well as the distro-provided AppStream files, if you want to list installed applications not provided by the distro. Seeking lots of small files isn’t so slow on a SSD, and loading+decompressing a small file is actually quicker than loading an uncompressed larger file. Parsing an XML file typically means you set up some callbacks, which then get called for every start tag, text section, then end tag – so for a 13Mb XML document that’s nested very deeply you have to do a lot of callbacks. This means you have to process the description of GIMP in every language before you can even see if Shotwell exists at all.

The typical way parsing XML involves creating a “node tree” when parsing the XML. This allows you treat the XML document as a Document Object Model (DOM) which allows you to navigate the tree and parse the contents in an object oriented way. This means you typically allocate on the heap the nodes themselves, plus copies of all the string data. AsNode in libappstream-glib has a few tricks to reduce RSS usage after parsing, which includes:

Interning common element names like description, p, ul, li
Freeing all the nodes, but retaining all the node data
Ignoring node data for languages you don’t understand
Reference counting the strings from the nodes into the various appstream-glib GObjects

This still has a both drawbacks; we need to store in hot memory all the screenshot URLs of all the apps you’re never going to search for, and we also need to parse all these long translated descriptions data just to find out if gimp.desktop is actually installable. Deduplicating strings at runtime takes nontrivial amounts of CPU and means we build a huge hash table that uses nearly as much RSS as we save by deduplicating.

On a modern system, parsing ~300 files takes less than a second, and the total RSS is only a few tens of Mb – which is fine, right? Except on resource constrained machines it takes 20+ seconds to start, and 40Mb is nearly 10% of the total memory available on the system. We have exactly the same problem with fwupd, where we get one giant file from the LVFS, all of which gets stored in RSS even though you never have the hardware that it matches against. Slow starting of fwupd and gnome-software is one of the reasons they stay resident, and don’t shutdown on idle and restart when required.

We can do better.

We do need to keep the source format, but that doesn’t mean we can’t create a managed cache to do some clever things. Traditionally I’ve been quite vocal against squashing structured XML data into databases like sqlite and Xapian as it’s like pushing a square peg into a round hole, and forces you to think like a database doing 10 level nested joins to query some simple thing. What we want to use is something like XPath, where you can query data using the XML structure itself.

We also want to be able to navigate the XML document as if it was a DOM, i.e. be able to jump from one node to it’s sibling without parsing all the great, great, great, grandchild nodes to get there. This means storing the offset to the sibling in a binary file.

If we’re creating a cache, we might as well do the string deduplication at creation time once, rather than every time we load the data. This has the added benefit in that we’re converting the string data from variable length strings that you compare using strcmp() to quarks that you can compare just by checking two integers. This is much faster, as any SAT solver will tell you. If we’re storing a string table, we can also store the NUL byte. This seems wasteful at first, but has one huge advantage – you can mmap() the string table. In fact, you can mmap the entire cache. If you order the string table in a sensible way then you store all the related data in one block (e.g. the <id> values) so that you don’t jump all over the cache invalidating almost everything just for a common query. mmap’ing the strings means you can avoid strdup()ing every string just in case; in the case of memory pressure the kernel automatically reclaims the memory, and the next time automatically loads it from disk as required. It’s almost magic.

I’ve spent the last few days prototyping a library, which is called libxmlb until someone comes up with a better name. I’ve got a test branch of fwupd that I’ve ported from libappstream-glib and I’m happy to say that RSS has reduced from 3Mb (peak 3.61Mb) to 1Mb (peak 1.07Mb) and the startup time has gone from 280ms to 250ms. Unless I’ve missed something drastic I’m going to port gnome-software too, and will expect even bigger savings as the amount of XML is two orders of magnitude larger.

So, how do I use this thing. First, lets create a baseline doing things the old way:

$ time appstream-util search gimp.desktop
real	0m0.645s
user	0m0.800s
sys	0m0.184s

To create a binary cache:

$ time xb-tool compile appstream.xmlb /usr/share/app-info/xmls/* /usr/share/appdata/* /usr/share/metainfo/*
real	0m0.497s
user	0m0.453s
sys	0m0.028s

$ time xb-tool compile appstream.xmlb /usr/share/app-info/xmls/* /usr/share/appdata/* /usr/share/metainfo/*
real	0m0.016s
user	0m0.004s
sys	0m0.006s

Notice the second time it compiled nearly instantly, as none of the filename or modification timestamps of the sources changed. This is exactly what programs would do every time they are launched.

$ df -h appstream.xmlb
4.2M	appstream.xmlb

$ time xb-tool query appstream.xmlb "components/component[@type='desktop']/id[text()='firefox.desktop']"
RESULT: <id>firefox.desktop</id>
RESULT: <id>firefox.desktop</id>
RESULT: <id>firefox.desktop</id>
real	0m0.008s
user	0m0.007s
sys	0m0.002s

8ms includes the time to load the file, search for all the components that match the query and the time to export the XML. You get three results as there’s one AppData file, one entry in the distro AppStream, and an extra one shipped by Fedora to make Firefox featured in gnome-software. You can see the whole XML component of each result by appending /.. to the query. Unlike appstream-glib, libxmlb doesn’t try to merge components – which makes it much less magic, and a whole lot simpler.

Some questions answered:

Why not just use a GVariant blob?: I did initially, and the cache was huge. The deeply nested structure was packed inefficiently as you have to assume everything is a hash table of a{sv}. It was also slow to load; not much faster than just parsing the XML. It also wasn’t possible to implement the zero-copy XPath queries this way.
Is this API and ABI stable?: Not yet, as soon as gnome-software is ported.
You implemented XPath in c‽: No, only a tiny subset. See the README.md

Comments welcome.

10 thoughts on “Speeding up AppStream: mmap’ing XML using libxmlb”

Zander says:

September 20, 2018 at 6:56 pm

Name Suggestion: blobex

Either as “blob-ex” or “blo-bex” from reworking X-M-L as Blob-X so-to-speak

I dunno it’s a thought ¯\_(ツ)_/¯
Clive C. Hunt says:

September 20, 2018 at 9:10 pm

AppStream is a classic example of using XML for the wrong job. Misuses of XML like this are the reason why people hate tag soup, even though it actually makes sense for document markup.

For heavens sake, stop using XML for data serialization. You, in particular, seem to repeat this same mistake in all your open source projects.

[WORDPRESS HASHCASH] The poster sent us ‘0 which is not a hashcash value.
1. hughsie says:
  
  September 20, 2018 at 9:31 pm
  
  And your suggestion is…?
E. Raymond says:

September 20, 2018 at 9:13 pm

Also titled: “How I half-fixed a self-inflicted problem”
Alex says:

September 21, 2018 at 9:26 am

Any reason against using https://developers.google.com/protocol-buffers/ instead of XML? Would it worth to do a prototype to check how it performs?
1. hughsie says:
  
  September 21, 2018 at 10:40 am
  
  Can you generate protobuf for C now?
  1. liberforce says:
    
    September 28, 2018 at 10:01 am
    
    I’m completely incompetent on this area, but after reading the comments of Alex and Clive C. Hunt, I wondered if something similar to protobuf for C was available.
    
    Found this format comparion chart.
    https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats
    
    It turns out there’s Thrift which seems to do more or less the same as protobuf. It was originally developped by Facebook but is under the Apache Foundation. It can generate C output, and icing on the cake is that it spits glib-style C.
    https://thrift.apache.org/tutorial/c_glib
    
    C support is a bit sub-par:
    https://thrift.apache.org/docs/Languages (doc link there are broken, though)
    However, it supports Binary and Compact protocols:
    https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
    https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md
    
    And here’s a recent comparison about serialization formats:
    http://labs.criteo.com/2017/05/serialization/
    And the conclusion:
    « Protobuf and Thrift have similar performances, in terms of file sizes and serialization/deserialization time. The slightly better performances of Thrift did not outweigh the easier and less risky integration of Protobuf as it was already in use in our systems, thus the final choice.
    
    Protobuf also has a better documentation, whereas Thrift lacks it. »
    
    Might be worth a look.
Stuart Axon says:

September 21, 2018 at 10:07 am

There seems to be some info around about using SIMD instructions to help parse XML.

This could be a good speedup and possibly kind of fun to implement.
James says:

September 22, 2018 at 5:19 am

Nice to hear that our software is getting faster!

Would writing this new lib be a good candidate for something that could be in rust? Maybe that would make it safer, still performant, and a good direction for GNOME to move in– one tiny piece at a time.

Also, it’s free rust practice for you =D

Thanks!
1. hughsie says:
  
  September 22, 2018 at 7:43 am
  
  I did consider this; unfortunately I need something that can run on RHEL 7 without EPEL and also on all the weird architectures that RHEL has to run on.

Comments are closed.

Published by

hughsie

10 thoughts on “Speeding up AppStream: mmap’ing XML using libxmlb”