» Improving input method support in GTK+-based apps Simos says…

Improving input method support in GTK+-based apps

January 30, 2008

When a bug report gets long with many comments, it gets more difficult for someone to get the full picture of what is going on. I’ll attempt to summarise here what’s being said in Bug 321896, Synch gdkkeysyms.h / gtkimcontextsimple.c with X.org 6.9/7.0.

GTK+-based applications use the GTK+ Input Method by default to let users type in different languages. Some scripts are very complex (such as SE Asian scripts), and in this case SCIM is used, replacing the GTK+ Input Method. One can even disable GTK+ IM altogether and use the basic X Input Method (XIM), which is provided by the Xorg server, by setting GTK_IM_MODULE to xim. However, the majority of users have GTK+ IM enabled.

For both GTK+ IM and XIM, the keyboard layouts are managed by the xkeyboard-config project and Sergey Udaltsov. A keyboard layout is simply a mapping of keyboard keys to Unicode characters, but you can also have compose sequences for some characters using what we call dead keys. When you press a dead key nothing appears on screen, but when you press a letter immediately afterwards you get an accented character such as á. This functionality is commonly used to add accents, and there is a big table (1.3MB) of these compose sequences and the Unicode characters they produce.

If you change your keyboard layout (System/Preferences/Keyboard/Layout) to something like U.S. English International (with dead keys), then the ' key on your keyboard becomes dead_acute, and the compose sequence

<dead_acute> <a>  : "á"   U00E1 # LATIN SMALL LETTER A WITH ACUTE

works when you press ' and then a.
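To make the format of such an entry concrete, here is a minimal sketch in Python 3 that splits a compose line into its key names, the output string and the optional keysym name. The regular expression and the function name parse_compose_line are my own assumptions for illustration. They cover only simple lines like the one above (no escaped quotes, no include directives), not the full Compose file syntax.

```python
import re

# Matches a simple Compose-style line such as
#   <dead_acute> <a>  : "á"   U00E1 # LATIN SMALL LETTER A WITH ACUTE
# This pattern is an illustrative assumption, not the full grammar.
COMPOSE_LINE = re.compile(
    r'^\s*(?P<keys>(?:<\w+>\s*)+):\s*"(?P<text>[^"]*)"'
    r'(?:\s+(?P<keysym>\w+))?'
)

def parse_compose_line(line):
    """Return (key names, output string, keysym name or None).

    Returns None when the line does not look like a compose entry.
    """
    match = COMPOSE_LINE.match(line)
    if match is None:
        return None
    keys = re.findall(r'<(\w+)>', match.group('keys'))
    return keys, match.group('text'), match.group('keysym')

line = '<dead_acute> <a>  : "á"   U00E1 # LATIN SMALL LETTER A WITH ACUTE'
keys, text, keysym = parse_compose_line(line)
print(keys, text, keysym)
```

The trailing comment after # is ignored, since it only documents the character name.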

There is an issue with compose sequences and input methods; XIM maintains the official upstream version of the compose sequences, and projects such as GTK+ and SCIM carry their own copies of that table.

The issue with GTK+ regarding the compose sequences is that it has a very old version compared to what is available upstream. This is what Bug 321896 is about.

The bug would have been resolved much earlier were it not for the insistence of the GTK+ maintainers on cutting the fat and reducing the size of the table (~6000 entries) with clever optimisations.

Tor suggested a clever optimisation: a good number of compose sequences (which look like <dead_acute> <a> : “á”) resemble the decomposed form (à la Unicode) of those characters. Thus, we can let the user type what she wants and then try Unicode normalisation to see whether the sequence composes to a single Unicode character. Let’s demonstrate in Python:

$ python
>>> import unicodedata
>>> sequence=[65, 0x301]     # That's 'A' and the combining acute accent
>>> result = unicodedata.normalize('NFC',"".join(map(unichr, sequence)))
>>> result
u'\xc1'
>>> print len(result)
1
>>> print result
Á

That long line above takes the array, applies the unichr() function to each member so that they become Unicode characters, and then joins them into a single string. Finally, it normalises the (decomposed) string to a single character. The fact that the resulting string has length 1 (a single character) is key to this optimisation. Over 1000 compose sequences can be removed from the compose table through this optimisation. This includes a big chunk of the Latin Unicode blocks, a few dozen Cyrillic characters, all of modern Greek and Greek polytonic, some Indic languages (are they actually used?) and other miscellaneous sequences.
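The same check can be wrapped in a small function. This is only a sketch in Python 3 (where chr() replaces unichr() and every string is Unicode), not the code from the patch. The DEAD_KEY_TO_COMBINING mapping and the name compose_by_normalisation are assumptions of mine, and only a handful of dead keys are listed.

```python
import unicodedata

# Hypothetical mapping from dead-key names to Unicode combining characters.
# Only a few entries are listed, for illustration.
DEAD_KEY_TO_COMBINING = {
    'dead_grave': '\u0300',
    'dead_acute': '\u0301',
    'dead_circumflex': '\u0302',
    'dead_tilde': '\u0303',
    'dead_diaeresis': '\u0308',
}

def compose_by_normalisation(dead_keys, base):
    """Try to compose dead keys followed by a base letter using NFC.

    Returns the single composed character, or None when normalisation
    does not collapse the sequence to one character. In that case an
    explicit entry in the compose table would still be needed.
    """
    try:
        marks = ''.join(DEAD_KEY_TO_COMBINING[name] for name in dead_keys)
    except KeyError:
        return None  # unknown dead key: leave it to the table
    # The user types the dead keys first, but Unicode puts the base first.
    composed = unicodedata.normalize('NFC', base + marks)
    return composed if len(composed) == 1 else None

print(compose_by_normalisation(['dead_acute'], 'a'))
print(compose_by_normalisation(['dead_acute'], 'q'))
```

The second call returns None because Unicode has no precomposed q with acute, so the normalised string keeps two code points.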

Matthias laid out the requirements for optimising the remaining compose sequences: ① the table has to be static const so that a single copy is shared all over the place, ② the first column (out of six) is repeated too often, so use subtables, and ③ each row ends with a varying number of zeroes, so trim those zeroes as well. This also required generating the optimised table automatically with a script.
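Requirements ② and ③ can be pictured with a rough sketch. The Python 3 code below is not the generator script from the patch, and Python cannot express the static const part of requirement ①. It only shows the idea of grouping rows by their first key and dropping the zero padding. The function names, the row layout (five key slots plus a result) and the sample key codes are assumptions for illustration.

```python
from collections import defaultdict

def strip_padding(row):
    """Drop the trailing zeroes that pad a fixed-width row of key codes."""
    keys = list(row)
    while keys and keys[-1] == 0:
        keys.pop()
    return tuple(keys)

def build_compact_table(rows):
    """Group compose rows into subtables keyed by the first key code.

    `rows` is a list of (padded key codes, resulting code point).
    Returns {first_key: {remaining_length: [(rest_of_keys, code_point)]}},
    so the first key is stored once and no padding is kept.
    """
    table = defaultdict(lambda: defaultdict(list))
    for padded, code_point in sorted(rows):
        keys = strip_padding(padded)
        first, rest = keys[0], keys[1:]
        table[first][len(rest)].append((rest, code_point))
    return table

def lookup(table, keys):
    """Return the code point for a typed key sequence, or None."""
    first, rest = keys[0], tuple(keys[1:])
    for candidate, code_point in table.get(first, {}).get(len(rest), []):
        if candidate == rest:
            return code_point
    return None

# Illustrative rows: dead_acute + a, dead_acute + e, Multi_key + ' + a.
rows = [
    ((0xFE51, 0x61, 0, 0, 0), 0xE1),
    ((0xFE51, 0x65, 0, 0, 0), 0xE9),
    ((0xFF20, 0x27, 0x61, 0, 0), 0xE1),
]
table = build_compact_table(rows)
print(hex(lookup(table, (0xFE51, 0x61))))
```

Grouping by sequence length inside each subtable means a lookup only compares against rows of the right size.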

The work is not finished yet, and the patch needs testing. The highest priority is to check that keyboard layouts do not regress (that is, compose sequences with dead keys must continue to work alongside any new sequences).

With an updated compose table in GTK+, one can write things like ⒼⓃⓄⓂⒺ and all variations of accents on characters, in an easier way.

I’d like to thank Matthias and Tor for their support in this work. And Jeff for adding this blog to Planet GNOME!

Posted by simos
Filed in gnome
Tags: gnome, gtk, input method, normalisation, python, unicode


5 Responses to “Improving input method support in GTK+-based apps”

Sergey Udaltsov Says:

January 30, 2008 at 10:26 pm
Congratulations on getting into Planet GNOME!
Pádraig Brady Says:

January 31, 2008 at 12:02 am
Nice info thanks!
I’m always surprised at how useful Python is for manipulating Unicode.

A few years ago I noticed the differences between the X and GTK input methods.
It’s great to see this being finally resolved.
Данило Says:

January 31, 2008 at 12:34 am
Hey Simos, does this mean we’re finally going to get support for multi-char (decomposed, those which can’t be composed) accenting as well? (i.e. what Xcompose supports since ages ago)

That’s a must for Cyrillic and many other languages since Unicode is not including many of the accented characters, yet they are needed.
simos Says:

January 31, 2008 at 2:44 am
Sergey: Thanks!

Pádraig: Thanks for the link.

Данило: It appears doable to add this support once the patch gets included. If there is a way (using glib-provided functions) to determine if a sequence of Cyrillic characters is in a valid decomposed form, then this can be added with a few lines of code. I believe there is no such thing, so the ideal solution would be to add a new table using data from NormalizationTest.txt (ftp.unicode.org), targeting those languages that are really affected (because the file is sort of huge).
Mi blog lah! » task update (el) Says:

February 19, 2008 at 2:04 pm
[…] patch for polytonic support in GTK+; due to the almost closed window for adding new […]
