When I first started doing GNOME documentation, our documents were translated manually, as a whole, and with no change-tracking. That is to say, they were never translated. Then Danilo came along and wrote xml2po, and I integrated it into the build utilities in gnome-doc-utils. Suddenly, all our XML documents could be translated with standard PO files. And we started seeing actual documentation translation.
Fast forward six years. We’re still using xml2po, and it’s serving us well. Various web-based translation editors and trackers have appeared that can handle XML documents. Some of them use xml2po underneath; some use the same concepts in their own implementation. Meanwhile, the W3C created ITS, a common vocabulary for specifying things that translators and translation tools might want to know about an XML format. But we’re not using it, and I don’t know anybody in the greater GNOME ecosystem who is.
Presenting itstool: an ITS-based tool for converting XML files to PO files and back again.
This is an experiment I started over the weekend. The idea is to have a general xml2po-like tool that contains no vocabulary-specific logic. All it knows is XML and ITS, and everything it needs to know about a format is specified in an ITS rules file. Look at the Mallard ITS file as an example. The basics of how to handle Mallard are in 13 lines.
This has a lot of potential. For starters, you don’t need to patch a program or write a plug-in to support a new XML vocabulary. All you need to do is provide an ITS file. There’s a chance that the people who developed the vocabulary will already have ITS definitions in place.
What’s more, you can embed ITS attributes and rules directly into your XML document, extending or overriding global rules. This is huge. This is the “mark something as untranslatable” feature that translators want, provided by a W3C recommendation that other tools might actually understand as well.
For an example, take a look at the DocBook Element Reference for Mallard. If you convert this to a PO file using itstool, you’ll see a bunch of messages that look like this:
msgid "<code href=\"http://www.docbook.org/tdg/en/html/abbrev.html\">abbrev</code>"
Yawn. It’s some markup, and a URL, and the name of an XML element. Nobody needs to translate that, but there’s no way a general tool can know that. But the author knows, so the author can use the its:translate attribute to save the translators some work:
<td its:translate="no"><p>
<code href="http://www.docbook.org/tdg/en/html/abbrev.html">abbrev</code>
</p></td>
Problem solved. This one pesky message will no longer appear in the PO file. That’s a big step forward for us. Unfortunately, there are 417 of those on that page. That’s a lot of typing. Fortunately, ITS also provides a standard way to specify rules, and you can put those rules directly inside your document. For this page, I happen to know that the first column of every row is untranslatable. And I know there’s no row spanning that would cause the first td to be in anything other than the first column. So rather than type its:translate="no" 417 times, I can put what I know right in the XML, where itstool can find it.
<its:rules xmlns:its="http://www.w3.org/2005/11/its" xmlns:mal="http://projectmallard.org/1.0/" version="1.0">
<its:translateRule translate="no" selector="//mal:tr/mal:td[1]"/>
</its:rules>
We can just put this inside a Mallard info element. This is valid because Mallard allows any element from an external namespace inside the info element. And now itstool drops all 417 of those pesky messages from the PO file.
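To make the effect of an embedded rule concrete, here is a rough Python sketch. It is not how itstool is implemented: it uses the limited XPath subset in Python's standard ElementTree as a stand-in for a real XPath engine, only understands translate="no" rules, and treats every p and title as one message. The two-row sample page and the helper names (untranslatable, messages) are made up for illustration.

```python
import xml.etree.ElementTree as ET

MAL = "http://projectmallard.org/1.0/"
NS = {"mal": MAL, "its": "http://www.w3.org/2005/11/its"}

# A made-up, two-row stand-in for the real element reference page.
PAGE = """\
<page xmlns="http://projectmallard.org/1.0/"
      xmlns:mal="http://projectmallard.org/1.0/"
      xmlns:its="http://www.w3.org/2005/11/its"
      id="docbook">
  <info>
    <its:rules version="1.0">
      <its:translateRule translate="no" selector="//mal:tr/mal:td[1]"/>
    </its:rules>
  </info>
  <title>DocBook Element Reference</title>
  <table>
    <tr><td><p><code>abbrev</code></p></td><td><p>An abbreviation</p></td></tr>
    <tr><td><p><code>acronym</code></p></td><td><p>An acronym</p></td></tr>
  </table>
</page>
"""

def untranslatable(root):
    """Collect every element switched off by an embedded translateRule."""
    off = set()
    for rule in root.iterfind(".//its:rules/its:translateRule", NS):
        if rule.get("translate") != "no":
            continue  # simplification: only "no" rules are handled here
        selector = rule.get("selector")
        # ElementTree only accepts relative paths, so "//x" becomes ".//x".
        path = "." + selector if selector.startswith("//") else selector
        for hit in root.iterfind(path, NS):
            off.update(hit.iter())  # the setting applies to descendants too
    return off

def messages(root):
    """Return the text of each translatable p and title, in document order."""
    off = untranslatable(root)
    wanted = {f"{{{MAL}}}p", f"{{{MAL}}}title"}
    return ["".join(el.itertext()).strip()
            for el in root.iter()
            if el.tag in wanted and el not in off]

result = messages(ET.fromstring(PAGE))
print(result)
```

On this sample, only the page title and the second-column descriptions survive as messages. The first-column element names are gone without a single its:translate attribute in the table itself.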
There are some bits that ITS doesn’t provide, and for those, we’ll need extension rules. Annoyingly, ITS doesn’t have a rule to specify space-preserving elements. I think the idea is that your DTD or XSD can specify this using xml:space. But I come from the RELAX NG school of thought, where validation is validation, and processing logic is something else. The two formats I care about most, DocBook and Mallard, are both RNG-based, so I had to create an extension rule for that. That’s done.
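For anyone wondering why space preservation matters to a PO converter, here is a minimal sketch of the behaviour such a rule would switch on. It says nothing about the actual extension syntax. The PRESERVE set and the message_text function are hypothetical, with element names borrowed from DocBook and Mallard purely as examples.

```python
import re

# Hypothetical: element names that a space-preserving rule has selected.
PRESERVE = {"screen", "programlisting"}

def message_text(tag, text):
    """Return the text as it would appear in a PO message."""
    if tag in PRESERVE:
        return text  # line breaks and indentation are significant here
    # Everywhere else, runs of whitespace are just source formatting.
    return re.sub(r"\s+", " ", text).strip()

para = message_text("p", "  Hello\n   world ")
listing = message_text("screen", "$ ls\n  a.txt\n")
```

Without some way to flag those elements, a tool has to either normalize everything and mangle listings, or normalize nothing and hand translators messages full of incidental indentation.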
Then we have xml2po’s awesome ability to create messages for external files like images. We can’t translate the images using xml2po, of course. But xml2po can let translators know when images have changed, or when new images were added. That’s another extension rule. It’s not in yet, but I have a syntax for it, and I think it will be easy.
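One way to get that kind of change notification is to put a checksum of the file into the message, so the message itself changes whenever the file does. The sketch below assumes that approach. The "@@image" message format is modelled on what xml2po emits as best I recall, so treat the exact format as an assumption, and image_msgid is a hypothetical helper, not part of any tool.

```python
import hashlib
import tempfile
from pathlib import Path

def image_msgid(path, ref):
    """Build a message whose text changes whenever the file's bytes change."""
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    return f"@@image: '{ref}'; md5={digest}"

# Simulate a screenshot being replaced between two releases.
with tempfile.TemporaryDirectory() as tmp:
    png = Path(tmp) / "window.png"
    png.write_bytes(b"first screenshot")
    before = image_msgid(png, "figures/window.png")
    png.write_bytes(b"second screenshot")
    after = image_msgid(png, "figures/window.png")

print(before != after)
```

Because the two messages differ, the usual PO merge would flag the old translation as outdated, which is the cue for the translator to retake the screenshot.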
Finally, there are translator credits. This one is harder, but I think it can still be specified in XML as an extension. I have a prototype of how the XML might look in the ITS files for DocBook and Mallard, but no working code yet.
Some things were easier than with xml2po’s approach, and some things were harder. On the whole, I think the ITS-based approach can take us further, and lines up more with standards that exist outside GNOME. I’m going to keep experimenting with this, and perhaps try to get in touch with other ITS folks. We’ll see where it goes.