Earlier this week, the W3C released the Internationalization Tag Set (ITS) 2.0 as a recommendation. This is a big leap from ITS 1.0 in terms of functionality, and I’m proud to have played a small part in the development of it. Today, I released ITS Tool 2.0.0 with full support for ITS 2.0. Notably, this release supports:

  • Parameters in selectors, including the ability for users to override parameters on the command line.
  • Preserve Space, a data category that allows you to specify which elements are space-preserving. This was based in part on a similar extension data category from ITS Tool 1.
  • External Resource, a data category that allows you to locate referenced resources like images and videos. This was based in part on a similar extension data category from ITS Tool 1.
  • Locale Filter, a data category that allows you to exclude content from localized copies based on locale. This replaces the considerably more limited dropRule extension from ITS Tool 1.
  • ID Value, a data category that allows you to specify potentially complex IDs for elements.
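
To make the Locale Filter idea more concrete, here is a minimal Python sketch of the include/exclude decision. The localeFilterList and localeFilterType names come from ITS 2.0, but the function names are made up for illustration, the matching is a simplified subtag comparison rather than full BCP 47 language-range filtering, and none of this is ITS Tool's actual code.

```python
def locale_matches(lang_range, locale):
    """Simplified match: '*' matches anything; otherwise each subtag of the
    range must equal the corresponding subtag of the locale, or be '*'."""
    if lang_range == "*":
        return True
    range_parts = lang_range.lower().split("-")
    locale_parts = locale.lower().split("-")
    if len(range_parts) > len(locale_parts):
        return False
    return all(r in ("*", l) for r, l in zip(range_parts, locale_parts))


def keep_for_locale(locale, filter_list, filter_type="include"):
    """Decide whether content stays in the copy localized for `locale`.

    filter_list mimics a localeFilterList value (comma-separated ranges);
    filter_type mimics localeFilterType ("include" or "exclude").
    """
    ranges = [r.strip() for r in filter_list.split(",") if r.strip()]
    matched = any(locale_matches(r, locale) for r in ranges)
    return matched if filter_type == "include" else not matched
```

With a filter list of "en-CA, fr-CA", keep_for_locale("fr-CA", ...) keeps the content and keep_for_locale("en-US", ...) drops it.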

This release also includes a number of features beyond ITS 2.0 support, such as an option to preserve entity references, an option to load external DTDs, and built-in rules for DocBook 5. This is the biggest release of ITS Tool since it was first released. I’d appreciate people trying it out and reporting bugs.

I do have plans for more features, including:

  • An option to follow XIncludes, automatically processing any included files. This is distinct from just merging the XIncludes in that the files are handled individually, as if you had specified them each on the command line. This is really useful in setups that have a large pool of common content that gets XIncluded in different deliverables so that a single PO file can easily reflect all the translatable strings for one deliverable.

  • Support for some sort of readiness data category that can specify whether individual segments are ready for translation. This would put information in the PO file about whether each PO message is finalized yet. This is useful for partial or slushy freezes, where you want to notify translators of finished content, but you can’t commit to freezing all your deliverables.

    We discussed a data category for this in ITS 2.0, but it wasn’t ready in time. Other implementers are interested in this, so we will probably use a shared extension that works in other tools.

  • The ability to specify repeatable elements for multi-lingual documents. In ITS Tool 1.2.0, I added the ability to join multiple translations into a single multi-lingual file, such as those that GNOME now uses for AppData files. Unfortunately, you currently have no control over which elements are repeated with distinct xml:lang attributes. Right now, ITS Tool assumes that whatever elements it uses for segmentation are the repeatable elements. But in files like AppData files, you may want to segment on the p elements, but repeat the description elements.

    I tried to bend the ITS 2.0 Target Pointer data category to support this use case, but ultimately decided it was too different and would needlessly complicate the specification.

  • HTML support. ITS 2.0 officially supports HTML5, and specifies exactly how ITS information is mapped to HTML. It does not, however, require implementations to support HTML5. For now, ITS Tool is just an XML tool, but I’d like to add support for HTML5. The libxml2 HTML parser doesn’t cut it, unfortunately, so I need to find something that does, and preferably something that I can map to libxml2’s data structures so I can use the same behind-the-scenes logic.
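
As a rough illustration of the XInclude idea from the list above, the first step is just discovering which files a deliverable pulls in, so each can be handled as if it had been named on the command line. This is a sketch only: the included_files name and the sample document are hypothetical, it uses Python's standard ElementTree rather than libxml2, and it is not how ITS Tool implements (or will implement) the option.

```python
import xml.etree.ElementTree as ET

XI_NS = "http://www.w3.org/2001/XInclude"


def included_files(xml_text):
    """Return the href of every xi:include element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        inc.get("href")
        for inc in root.iter("{%s}include" % XI_NS)
        if inc.get("href")
    ]


# Hypothetical deliverable that pulls in shared content.
book = """<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="common/legal.xml"/>
  <chapter><xi:include href="common/install.xml"/></chapter>
</book>"""

files = included_files(book)
```

A real implementation would also need to resolve each href against the including file and recurse into the included files.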

If you’re interested in using ITS Tool for your XML document translation, get in touch. Leave a comment or email me at shaunm at gnome dot org. I’m always happy to help people set up better translation processes.

ITS Tool Released

2011-04-26

Last October, I blogged about itstool, a tool I developed to translate XML documents with PO files using ITS rules. Today, I released version 1.0.0 of ITS Tool on the new ITS Tool web site. If you’ve used xml2po before, you’re familiar with the basic idea: PO messages are extracted from XML files, and translated messages are merged with the source to produce localized XML files. If you’re not already translating your documents using a message-based format, you need to start. Your translators will thank you.
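
For anyone who hasn't seen the message-based approach, the round trip can be sketched in a few lines of Python. This is a toy under loud assumptions: a plain dict stands in for a PO file, only the text of p elements is treated as a message, and the extract and merge functions are invented for illustration. ITS Tool itself handles inline markup, ITS rules, and real PO files.

```python
import copy
import xml.etree.ElementTree as ET


def extract(root, tag="p"):
    """Collect the text of each matching element as a source message."""
    return [el.text for el in root.iter(tag) if el.text and el.text.strip()]


def merge(root, translations, tag="p"):
    """Return a localized copy; untranslated messages keep the source text."""
    localized = copy.deepcopy(root)
    for el in localized.iter(tag):
        if el.text in translations:
            el.text = translations[el.text]
    return localized


source = ET.fromstring(
    "<page><title>Hello</title><p>Save the file.</p><p>Quit.</p></page>"
)
messages = extract(source)
localized = merge(source, {"Save the file.": "Enregistrez le fichier."})
result = ET.tostring(localized, encoding="unicode")
```

The point of working per message is visible even here: the second paragraph has no translation yet, so it falls back to the source text instead of blocking the whole document.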

ITS Tool takes the same idea as tools like xml2po, but the implementation is done entirely in terms of rules from the W3C Internationalization Tag Set. You don’t have to patch it to create a mode for a new XML format. You just need to provide a standard ITS file. Better still, if you mix XML vocabularies in a single file, ITS Tool can apply the rules for all matching formats.

Translators will be happy to know that we can now mark things as untranslatable using the standard its:translate attribute, or using custom its:translateRule elements. This is a long-requested feature that will help cut down the amount of unnecessary cruft that translators have to look at.

In addition to the features we get from standard ITS data categories, ITS Tool provides some custom extension rules to support features like translator credits and external file tracking. There are a few more features I’d like to provide as well, such as adding extra Mallard link titles and specifying transliteration-only messages.

I’ll be working on the GNOME build tools to switch GNOME’s documentation over to itstool for 3.2. Most messages in the PO files will be the same as with xml2po, so it won’t introduce much extra work for translators.

But ITS Tool is not just a GNOME project. It’s free software, under the GPL 3. It’s built on Python and libxml2, and can be used by any project for their XML documents. If you use an XML format that isn’t handled by the built-in ITS rules, you can pass your own custom ITS rules. Or if it’s a common format, submit those rules upstream. I encourage everybody working with XML documents to try ITS Tool and let me know how well it works and what can be done to improve it.

Open Help Conference

Mallard+TTML+ITS

2010-11-08

I just blogged about Mallard+TTML. And earlier I blogged about XML translations with ITS.

One of the nice features of ITS Tool is that it can merge ITS definitions from multiple sources. So when you embed TTML into Mallard, you don’t need to have a specific Mallard+TTML mode. Instead, the Mallard ITS definitions get applied, and the TTML ITS definitions get applied on top of them. I just added this ITS file to itstool git:

<its:rules
    xmlns:its="http://www.w3.org/2005/11/its"
    xmlns:tt="http://www.w3.org/ns/ttml"
    its:version="1.0">
  <its:withinTextRule withinText="yes" selector="//tt:p//*"/>
</its:rules>

This ensures that inline markup like tt:span doesn’t get split off with a placeholder into a separate translation unit. This one file applies the ITS definitions whether your TTML is embedded in Mallard, DocBook, XHTML, or any other format, or even if you have a standalone TTML file.
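
Here is a small Python sketch of what that rule buys you. It evaluates the same selector (written as .//tt:p//* for the standard ElementTree API) and builds a message for a tt:p with and without the within-text set. The message_text function, the sample TTML, and the "[placeholder]" string are all hypothetical stand-ins, not ITS Tool's real code or placeholder syntax.

```python
import xml.etree.ElementTree as ET

NS = {"tt": "http://www.w3.org/ns/ttml"}

doc = ET.fromstring(
    '<tt:tt xmlns:tt="http://www.w3.org/ns/ttml"><tt:body><tt:div>'
    "<tt:p>Press <tt:span>Save</tt:span> to continue.</tt:p>"
    "</tt:div></tt:body></tt:tt>"
)

# Elements matched by the withinTextRule selector //tt:p//*
within_text = set(doc.findall(".//tt:p//*", NS))


def message_text(elem, inline):
    """Build a message for elem, keeping children in `inline` as markup
    and replacing every other child with a placeholder."""
    parts = [elem.text or ""]
    for child in elem:
        if child in inline:
            name = child.tag.split("}")[-1]  # local name, namespace dropped
            parts.append("<%s>%s</%s>" % (name, message_text(child, inline), name))
        else:
            parts.append("[placeholder]")
        parts.append(child.tail or "")
    return "".join(parts)


p = doc.find(".//tt:p", NS)
with_rule = message_text(p, within_text)
without_rule = message_text(p, set())
```

With the rule applied, translators see the whole sentence with the span in place; without it, they get a sentence with a hole in it and a separate one-word message.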

When I first started doing GNOME documentation, our documents were translated manually, as a whole, and with no change-tracking. That is to say, they were never translated. Then Danilo came along and wrote xml2po, and I integrated it into the build utilities in gnome-doc-utils. Suddenly, all our XML documents could be translated with standard PO files. And we started seeing actual documentation translation.

Fast forward six years. We’re still using xml2po, and it’s serving us well. Various web-based translation editors and trackers have appeared that can handle XML documents. Some of them use xml2po underneath; some use the same concepts in their own implementation. Meanwhile, the W3C created ITS, a common vocabulary for specifying things that translators and translation tools might want to know about an XML format. But we’re not using it, and I don’t know anybody in the greater GNOME ecosystem who is.

Presenting itstool: an ITS-based tool for converting XML files to PO files and back again.

This is an experiment I started over the weekend. The idea is to have a general xml2po-like tool that contains no vocabulary-specific logic. All it knows is XML and ITS, and everything you need to know about a format is specified with an ITS rules file. Look at the Mallard ITS file as an example. The basics of how to handle Mallard are in 13 lines.

This has a lot of potential. For starters, you don’t need to patch a program or write a plug-in to support a new XML vocabulary. All you need to do is provide an ITS file. There’s a chance that the people who developed the vocabulary will already have ITS definitions in place.

What’s more, you can embed ITS attributes and rules directly into your XML document, extending or overriding global rules. This is huge. This is the “mark something as untranslatable” feature that translators want, provided by a W3C recommendation that other tools might actually understand as well.

For an example, take a look at the DocBook Element Reference for Mallard. If you convert this to a PO file using itstool, you’ll see a bunch of messages that look like this:

msgid "<code href=\"http://www.docbook.org/tdg/en/html/abbrev.html\">abbrev</code>"

Yawn. It’s some markup, and a URL, and the name of an XML element. Nobody needs to translate that, but there’s no way a general tool can know that. But the author knows, so the author can use the its:translate attribute to save the translators some work:

<td its:translate="no"><p>
<code href="http://www.docbook.org/tdg/en/html/abbrev.html">abbrev</code>
</p></td>

Problem solved. This one pesky message will no longer appear in the PO file. That’s a big step forward for us. Unfortunately, there are 417 of those on that page. That’s a lot of typing. Fortunately, ITS also provides a standard way to specify rules, and you can put those rules directly inside your document. For this page, I happen to know that the first column of every row is untranslatable. And I know there’s no row spanning that would cause the first td to be in anything other than the first column. So rather than type its:translate="no" 417 times, I can put what I know right in the XML, where itstool can find it.

<its:rules xmlns:its="http://www.w3.org/2005/11/its">
<its:translateRule translate="no" selector="//mal:tr/mal:td[1]"/>
</its:rules>

We can just put this inside a Mallard info element. This is valid because Mallard allows any element from an external namespace inside the info element. And now itstool drops all 417 of those pesky messages from the PO file.
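
To show the effect of such a rule, here is a simplified Python sketch. It uses the standard ElementTree API rather than libxml2, a two-row stand-in for the real table, and a made-up is_translatable helper. It assumes that a local its:translate attribute wins over the rule and that the nearest ancestor with any translate information decides, which is a simplification of ITS precedence and inheritance, not itstool's actual logic.

```python
import xml.etree.ElementTree as ET

MAL_NS = "http://projectmallard.org/1.0/"
ITS_NS = "http://www.w3.org/2005/11/its"
NS = {"mal": MAL_NS}

# Tiny stand-in for the element reference table.
page = ET.fromstring("""
<page xmlns="http://projectmallard.org/1.0/">
  <table>
    <tr><td><p>abbrev</p></td><td><p>An abbreviation</p></td></tr>
    <tr><td><p>acronym</p></td><td><p>A pronounceable abbreviation</p></td></tr>
  </table>
</page>
""")

# ElementTree has no parent pointers, so build a child -> parent map.
parent = {child: elem for elem in page.iter() for child in elem}

# Nodes selected by the translateRule selector //mal:tr/mal:td[1]
untranslatable = set(page.findall(".//mal:tr/mal:td[1]", NS))


def is_translatable(elem):
    """Walk up from elem; the nearest translate information decides."""
    while elem is not None:
        local = elem.get("{%s}translate" % ITS_NS)
        if local is not None:
            return local == "yes"
        if elem in untranslatable:
            return False
        elem = parent.get(elem)
    return True  # default for elements


messages = [
    p.text for p in page.iter("{%s}p" % MAL_NS) if is_translatable(p)
]
```

Only the second-column paragraphs survive as messages, which is the behavior described above, scaled down from 417 rows to two.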

There are some bits that ITS doesn’t provide, and for those, we’ll need extension rules. Annoyingly, ITS doesn’t have a rule to specify space-preserving elements. I think the idea is that your DTD or XSD can specify this using xml:space. But I come from the RELAX NG school of thought, where validation is validation, and processing logic is something else. The two formats I care about most, DocBook and Mallard, are both RNG-based, so I had to create an extension rule for that. That’s done.

Then we have xml2po’s awesome ability to create messages for external files like images. We can’t translate the images using xml2po, of course. But xml2po can let translators know when images have changed, or when new images were added. That’s another extension rule. It’s not in yet, but I have a syntax for it, and I think it will be easy.

Finally, there are translator credits. This one is harder, but I think it can still be specified in XML as an extension. I have a prototype of how the XML might look in the ITS files for DocBook and Mallard, but no working code yet.

Some things were easier than with xml2po’s approach, and some things were harder. On the whole, I think the ITS-based approach can take us further, and lines up more with standards that exist outside GNOME. I’m going to keep experimenting with this, and perhaps try to get in touch with other ITS folks. We’ll see where it goes.