GXml and the on-the-fly post-parsing technique

I think this is new, so I’ll describe a technique now used in GXml to parse a large set of nodes in an XML document.

Parsing Large XML documents

If you have a large XML document whose root has many child nodes, the standard technique is to read all of them, including each child’s own children, to build the XML tree. This process can take a while.

New on-the-fly parser

GXml now has a new custom parser called StreamReader. It reads the root element and its children, but without any attributes and without any of the children’s children. Instead, the attributes and the children’s children are stored on the fly as a string, for the root and for each child. This lets GXml build the top level of the document almost as fast as the data arrives from the IO stream, loading large XML documents up to 400% faster than the previous technique already present in GXml.
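To make the idea concrete, here is a minimal sketch of a shallow read. It is written in Python for illustration only and is not GXml’s StreamReader: the function name read_shallow and the regular-expression scanner are assumptions, and the scanner deliberately ignores comments, CDATA sections and '>' inside attribute values. It finds the root’s direct children and keeps each one, attributes and descendants included, as an unparsed string.

```python
import re

# Simplified tag matcher: assumes no comments, CDATA sections, or '>' inside
# attribute values. A real parser has to handle all of those.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)([^<>]*?)(/?)>")


def read_shallow(xml_text):
    """Return (root_name, [(child_name, raw_text), ...]).

    Only the root's direct children are identified. Each child's attributes
    and descendants stay inside raw_text as an unparsed string.
    """
    depth = 0
    root_name = None
    children = []
    child_name = None
    child_start = 0
    for match in TAG.finditer(xml_text):
        closing, name, _attrs, self_closing = match.groups()
        if closing:
            depth -= 1
            if depth == 1:  # a direct child of the root just ended
                children.append((child_name, xml_text[child_start:match.end()]))
        elif self_closing:
            if depth == 0:
                root_name = name
            elif depth == 1:
                children.append((name, match.group(0)))
        else:
            if depth == 0:
                root_name = name
            elif depth == 1:  # a direct child of the root starts here
                child_name, child_start = name, match.start()
            depth += 1
    return root_name, children


doc = (
    "<library>"
    "<book id='1'><title>A</title></book>"
    "<book id='2'><title>B</title></book>"
    "<note/>"
    "</library>"
)
root_name, children = read_shallow(doc)
print(root_name, [name for name, _ in children])
```

The scan never builds a tree for anything below the root’s children, which is where the time saving in a shallow read would come from.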

On-the-fly-post-parsing

With this on-the-fly post-parsing technique, you can’t access the children’s children or the root’s attributes immediately after the first read. You have to parse them from a temporary location in the GXml.Element class using the new GXml.Element.parse_buffer() method, which uses the standard method already present in GXml to parse the root’s properties and the children’s children. When GXml.Element.parse_buffer() is called on the root, all the children’s children are parsed recursively. You can also choose to parse just one of the root’s children, which is really convenient when you need only one child node of the root in a large XML document.
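The deferred step can be sketched as follows. This is a Python illustration under stated assumptions, not GXml’s code: LazyElement is a hypothetical class, its parse_buffer() only borrows the name of the GXml method, and Python’s xml.etree.ElementTree stands in for the standard parser. Each element keeps its unparsed text and is fully parsed only when asked.

```python
import xml.etree.ElementTree as ET


class LazyElement:
    """Hypothetical stand-in for an element read in shallow mode."""

    def __init__(self, name, buffer):
        self.name = name
        self.buffer = buffer   # unparsed source text of this element
        self.tree = None       # stays None until parse_buffer() is called

    def parse_buffer(self):
        """Parse the stored text with a regular parser, once, on demand."""
        if self.tree is None:
            self.tree = ET.fromstring(self.buffer)
        return self.tree


# Pretend a shallow read already produced these raw child strings.
children = [
    LazyElement("book", "<book id='1'><title>A</title></book>"),
    LazyElement("book", "<book id='2'><title>B</title></book>"),
]

# Only the second child is needed, so only it is parsed.
second = children[1].parse_buffer()
print(second.get("id"), second.find("title").text)
```

The first child is never parsed here, which mirrors the case where just one child of the root is needed from a large document.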

Multi-threading parsing

Currently, GXml.Element.parse_buffer_async(), when called on the root element, uses GLib.ThreadPool to parse each child in its own thread, using as many threads as are usable on your system (less one). The expected behavior is a parsing boost over the standard technique used in GXml: Xml.TextReader from the veteran libxml2 library, running on just one thread. For now, though, calling GXml.Element.parse_buffer_async() on the document’s root takes about the same time as standard parsing. This may be a limitation of libxml2, because many Xml.TextReader instances run at the same time parsing the element’s children, or a limitation of GLib.ThreadPool. Maybe the solution is a step away.
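The shape of this approach can be sketched with a thread pool. Again this is a Python illustration, not GXml’s implementation: parse_children_in_pool is a hypothetical name, concurrent.futures.ThreadPoolExecutor stands in for GLib.ThreadPool, and xml.etree.ElementTree stands in for Xml.TextReader. The pool size follows the "usable threads less one" rule described above.

```python
import os
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def parse_children_in_pool(buffers):
    """Parse each raw child string in a worker thread; keep input order."""
    # Usable threads less one, but never fewer than one worker.
    workers = max(1, (os.cpu_count() or 2) - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ET.fromstring, buffers))


# Raw child strings, as a shallow read might have stored them.
buffers = [f"<item n='{i}'><value>{i * i}</value></item>" for i in range(8)]
parsed = parse_children_in_pool(buffers)
print([elem.find("value").text for elem in parsed])
```

In CPython the global interpreter lock may keep this from running any faster than a single thread, so the sketch shows only the structure of the technique and says nothing about the timings observed in GXml.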

Author: despinosa

Linux and GNOME user, full time, since 2001. Current maintainer of GXml and contributor to other projects, mainly on GObject Introspection support.
