XML, GMarkup, and all that jazz

I was asked to talk about how to use GMarkup. This is a brief introduction; there are many people more qualified to talk about it than I am. These are my opinions and not those of the project or my employer. If you want to suggest a change or report a mistake, suggest away.

Firstly, why you shouldn’t use GMarkup.
Don’t use GMarkup if all you want is to store a simple list of settings. Instead, either use gconf, or if what you want is a file on disk, use GKeyFile, which lets you write things like:

[favourites]
icecream=chocolate
film=Better Than Chocolate
poem=Jenny kiss'd me when we met

in the style of .ini files. These are much more user-friendly.

Don’t use GMarkup if you want to parse actual arbitrary XML files. Instead, use libxml, which is beautiful and wonderful and fast and accurate. GMarkup is made to be easy to use, not to cope with everything XML allows.

Do use GMarkup if you want a way to store reasonably complicated data on disk, in a new format you’re making up.

Why GMarkup files are not XML.
XML is big and scary and complicated and spiky. People pretend it is simple. It isn’t. GMarkup files differ in many ways from XML, which makes them easier to use but also less flexible. Here are some ways in which a file can be XML but not GMarkup:

  • There is no character code but Unicode, and UTF-8 is its encoding. GMarkup does not attempt to screw around with UTF-16, ASCII, ISO646, or, heaven help us, EBCDIC. That way madness lies.
  • There are five predefined entities: &amp; for &, &lt; for <, &gt; for >, &quot; for ", and &apos; for '. You cannot define any new ones, but you can use character references (giving the code point explicitly, like &#9731; or &#x2603; for a snowman, ☃).
  • Processing instructions, doctypes, and comments aren’t specially treated, and there is no validation.

There are also a few subtle ways in which a file can be parsable by GMarkup but not be valid XML. However, these are officially invalid GMarkup even though they work fine, if you can follow that. Many people don’t care, but they should.

Okay, so how do we get going?
There are two ways people deal with XML: either as a tree, or as a series of events. GMarkup always sees them as a series of events. There are five kinds of event which can happen:

  • The start of an element
  • The end of an element
  • Some text (inside an element)
  • Some other stuff (processing instructions, comments, and doctypes)
  • An error

Let’s imagine we have this file, called simple.xml:

<zoo>
  <animal noise="roar">lion</animal>
  <animal noise="sniffle">bunny</animal>
  <animal noise="lol">cat</animal>
  <keeper/>
</zoo>

This will be seen by the parser as a series of events, as follows:

  • Start of “zoo”.
  • Start of “animal”, with a “noise” attribute of “roar”.
  • The text “lion”.
  • End of “animal”.
  • Start of “animal”, with a “noise” attribute of “sniffle”.
  • The text “bunny”.
  • End of “animal”.
  • Start of “animal”, with a “noise” attribute of “lol”.
  • The text “cat”.
  • End of “animal”.
  • Start of “keeper”.
  • End of “keeper”.
  • End of “zoo”.

(Actually there’ll be some extra text which is just whitespace, but let’s ignore that for now.)

There are two kinds of objects to deal with.
One is a GMarkupParser: it lists what to do in each of the five cases given above. In each case we give a function which knows how to handle opening elements, or closing elements, or whatever. If we don’t care about that case, we can say NULL. The signatures needed for each of these functions are given in the API documentation.

The second kind of object is a GMarkupParseContext. You construct this, feed it text, which it will parse, and then eventually destroy it. It would be nice if there was a function which would just read in a file and deal with it, but there isn’t. Fortunately, we have g_file_get_contents(), which is almost as good, if we can assume there’s memory available to store the whole file at once.

So let’s say we want to print the animals’ noises from the file above.

  1. Decide which kinds of events we need to know about. We need to know when elements open so that we can pick up the animal noise, and when text comes past giving the animal name, so we can print it. It would be possible to free the noise when we need to get the next noise, but it would be easier to free it when we see </animal>, so let’s do it like that. Processing instructions and errors we can ignore for the sake of example.
  2. Write functions to handle each one.
  3. Write a GMarkupParser listing the name of each function.
  4. Write something to load the file into memory and parse it.

Here’s some less-than-beautiful example code to do that.

#include <glib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

gchar *current_animal_noise = NULL;

/* The handler functions. */

void start_element (GMarkupParseContext *context,
    const gchar         *element_name,
    const gchar        **attribute_names,
    const gchar        **attribute_values,
    gpointer             user_data,
    GError             **error) {

  const gchar **name_cursor = attribute_names;
  const gchar **value_cursor = attribute_values;

  while (*name_cursor) {
    if (strcmp (*name_cursor, "noise") == 0) {
      /* Free any earlier value so it cannot leak. */
      g_free (current_animal_noise);
      current_animal_noise = g_strdup (*value_cursor);
    }

    name_cursor++;
    value_cursor++;
  }
}

void text(GMarkupParseContext *context,
    const gchar         *text,
    gsize                text_len,
    gpointer             user_data,
    GError             **error)
{
  /* Note that "text" is not a regular C string: it is
   * not null-terminated. This is the reason for the
   * unusual %.*s format below, which prints at most
   * text_len bytes. The precision has to be an int.
   */
  if (current_animal_noise)
    printf("I am a %.*s and I go %s. Can you do it?\n",
        (int) text_len, text, current_animal_noise);
}

void end_element (GMarkupParseContext *context,
    const gchar         *element_name,
    gpointer             user_data,
    GError             **error)
{
  if (current_animal_noise)
    { 
      g_free (current_animal_noise);
      current_animal_noise = NULL;
    }
}

/* The list of what handler does what. */
static GMarkupParser parser = {
  start_element,
  end_element,
  text,
  NULL,
  NULL
};

/* Code to grab the file into memory and parse it. */
int main() {
  char *text;
  gsize length;
  GMarkupParseContext *context = g_markup_parse_context_new (
      &parser,
      0,
      NULL,
      NULL);

  /* seriously crummy error checking */

  if (g_file_get_contents ("simple.xml", &text, &length, NULL) == FALSE) {
    printf("Couldn't load XML\n");
    exit(255);
  }

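  /* end_parse() is what notices a document that stops too early,
   * for example one with an unclosed <zoo>.
   */
  if (g_markup_parse_context_parse (context, text, length, NULL) == FALSE ||
      g_markup_parse_context_end_parse (context, NULL) == FALSE) {
    printf("Parse failed\n");
    exit(255);
  }

  g_free(text);
  g_markup_parse_context_free (context);
  return 0;
}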
/* EOF */

Save that as simple.c. If you have the GLib development files properly installed (GMarkup is part of GLib), then typing

gcc simple.c $(pkg-config glib-2.0 --cflags --libs) -o simple

will compile the program, and running it with ./simple will give you

I am a lion and I go roar. Can you do it?
I am a bunny and I go sniffle. Can you do it?
I am a cat and I go lol. Can you do it?

I think that was enough to whet your appetite, but there’s a whole lot more to know. You can read more here. If you want to see a real-life example, Metacity uses exactly this sort of arrangement for its theme files. (Later: Julien Puydt shares memories of how schema handling in gconf was written using GMarkup.) Any questions?

Photo: Day-old chick, GFDL, from here, by Fir0002, modified by Dcoetzee, Editor at Large, and tthurman.


Responses to “XML, GMarkup, and all that jazz”

  1. While your parser functions correctly when you feed the entire file, it won’t necessarily give the right results if the input is fed to the parser in chunks.

    If I split the animal name over a chunk boundary, GMarkup may call your text() handler twice, which might lead to output like:

    I am a li and I go roar. Can you do it?
    I am a on and I go roar. Can you do it?

    To handle this case correctly, you’d need to accumulate the text (maybe using GString) when you’re inside an interesting element and then print the message from the end_element() handler.

  2. @James: That’s a very good point.

    It might be nice if there was a flag you could pass into GMarkup to make it do the buffering for itself, the same way it clearly does for tags it hasn’t seen the whole of.

  3. I guess the reason the SAX APIs are structured like this is so that they can pass through character data directly from the input buffer without copying (returning spans delimited by tags or entities).

    In contrast with element start/end tags, the text content can be quite large (unless your XML format does everything through attributes …) so this makes sense. If you can avoid having to make a copy of every byte of the input, you probably should …
