In Improving input method support in GTK+-based apps, we talked about some work to update the list of compose sequences that GTK+ knows to the latest version that comes from Xorg. From 691 compose sequences, we now support over 5000.

The patch has landed in GTK+ (trunk), and here are instructions for testing.

  1. If you have not used jhbuild before, read the jhbuild instructions and install it.
  2. Add the following to your ~/.jhbuildrc file
    branches['gtk+'] = None    # Makes sure you build from the trunk of GTK+
  3. Install gtk+ using the command (see the comment of James on this post on how to avoid Step 5 below)
    jhbuild build gtk+
  4. About 40 minutes later, and about 700MB of space (~600MB for source, ~100MB for installation of files) consumed, you should get a working copy of GTK+ 2.12.
  5. You can use this compiled version of GTK+ by running
    jhbuild shell

    This should give you a new shell, and whatever you run from here will use our fresh GTK+. Try running “gedit”. You will notice that the theme is different; it uses the default theme due to the special GTK+. This shell has set special environment variables so that program that run will use the fresh GTK+. The rest of the libraries come from our distribution.

  6. If you try to type compose sequences, you will notice no improvement. This is because at the moment jhbuild builds the branch 2.12 of GTK+ and not trunk. We need to download GTK+ from trunk and rebuild.
    cd ~/checkout/gnome2/
    mv gtk+ gtk+-branch-2.12
    svn co svn://svn.gnome.org/svn/gtk+/trunk gtk+
    jhbuild build --no-network gtk+
  7. Perform Step 4 and get gedit running.

How to test?

  • Setup a keyboard layout that supports a good variety of dead keys. My preference is GBr (United Kingdom). Here, AltGr+[];’#/ and AltGr+{}:@~? produce different dead keys. You press one of these combinations and then you press a letter. If such a combination exists, then it gets printed. For example, the old GTK+ produces öõóôòx åōőxxx. The new GTK+ produces öõóôòọ åōőǒŏȯ (12 dead keys).
  • Setup Greek, Polytonic (Ancient Greek). The dead keys are [];’ {}:@ AltGr+[] (10 dead keys). Produce characters such as ᾅᾂᾷῗὕὒᾥᾢῷ.
  • Try compose sequences as described from the upstream file at XOrg. For example,
    ComposeKey+(   1 0 )  produces ⑩. Try the same for 0-20, a-zA-Z.
  • Other miscellaneous, Ṩǟấẫǡ (using GBr layout)

The next step would be to parse the list of compose sequences and produce a documentation file.

Typically, if you want to type characters with accents, such as á, ë, ś, you need to configure a suitable keyboard layout that includes compose sequences for those characters. The produced characters are what we call as precomposed characters; which were included in the early stages of Unicode. Nowdays, the idea is that you do not need to define á as a distinct character because it can be represented as a and ´, where the latter is a combining diacritic.

When put together a character and a combining diacritic, they fuse together, producing a seemingly single character. á is a precomposed (really one character), while á is letter a and the combining diacritic called acute (two characters). You can type the latter á by

  1. Type a
  2. Press Ctrl+Shift+u, then type 301, then press space bar.

Western languages do not really require combining marks, so the existing keyboard layouts do not use them. Other scripts, such as the Congolese keyboard layout (based on Latin) make good use of them.

Gedit, pango and combining diacritics

This is gedit showing off pango and DejaVu fonts (default font in major distributions).

Line 3 is a bit of an extreme, showing a sandwich of combining diacritics.

Line 4 shows the base character a with the combining diacritics from the Unicode range 0x300 to 0x315.

Both lines 3 and 4 were produced easily with a modified keyboard layout, which is show below.

Line 5 is just me being silly. You can have combining diacritics that enclose your base character.

$ cat  /usr/share/X11/xkb/symbols/combining
partial alphanumeric_keys alternate_group
xkb_symbols "combining" {

    name[Group1] = "Combining diacritics";

    key.type[Group1] = "FOUR_LEVEL";

    key <AD11> { [ NoSymbol, NoSymbol, 0x1000300, 0x1000301 ] }; // à   á
    key <AD12> { [ NoSymbol, NoSymbol, 0x1000302, 0x1000303 ] }; // â   ã

    key <AC10> { [ NoSymbol, NoSymbol, 0x1000304, 0x1000305 ] }; // ā   a̅
    key <AC11> { [ NoSymbol, NoSymbol, 0x1000306, 0x1000307 ] }; // ă   ȧ
    key <BKSL> { [ NoSymbol, NoSymbol, 0x1000308, 0x1000309 ] }; // ä    ả

    key <AB08> { [ NoSymbol, NoSymbol, 0x1000310, 0x1000311 ] }; // a̐     ȃ
    key <AB09> { [ NoSymbol, NoSymbol, 0x1000312, 0x1000313 ] }; // a̒     a̓
    key <AB10> { [ NoSymbol, NoSymbol, 0x1000314, 0x1000315 ] }; // a̔     a̕
};
$ diff -u /usr/share/X11/xkb/symbols/us.ORIGINAL /usr/share/X11/xkb/symbols/us
--- /usr/share/X11/xkb/symbols/us.ORIGINAL      2008-02-20 11:11:13.000000000 +0000
+++ /usr/share/X11/xkb/symbols/us       2008-02-20 13:02:07.000000000 +0000
@@ -492,3 +492,12 @@
     name[Group1]= "U.S. English - Macintosh";
 };

+partial alphanumeric_keys modifier_keys
+xkb_symbols "combining_us" {
+
+    include "us"
+    include "combining"
+
+    key.type[Group1] = "FOUR_LEVEL";
+    name[Group1] = "U.S. English - Combining";
+};
$ diff -u /usr/share/X11/xkb/rules/xorg.xml.ORIGINAL /usr/share/X11/xkb/rules/xorg.xml
--- /usr/share/X11/xkb/rules/xorg.xml.ORIGINAL  2008-02-20 11:27:00.000000000 +0000
+++ /usr/share/X11/xkb/rules/xorg.xml   2008-02-20 11:27:48.000000000 +0000
@@ -3643,6 +3643,12 @@
             <description xml:lang="zh_TW">Macintosh</description>
           </configItem>
         </variant>
+        <variant>
+          <configItem>
+            <name>combining_us</name>
+            <description>Combining</description>
+          </configItem>
+        </variant>
       </variantList>
     </layout>
     <layout>
$ _

Then, you select this keyboard layout (U.S. English) and variant (Combining) in the Keyboard Indicator applet.

Unlike dead keys, with combining diacritics you first type the base character (such as a) and then any combining diacritics.
Our sample layout variant puts the diacritics in the physical keys for [];’#,./. For example,

  • a + AltGr+[ : à
  • a + AltGr+Shift+[ : á
  • a + AltGr+[ + AltGr+’ : ằ

If your language has needs that can be solved with combining diacritics, this is how they are solved.

It is quite important to create keyboard layouts for all languages, and actually make good use of them.

Garrett asks how to type squiggles and dots in GNOME; that is, how to type characters such as á à ä ã â ą ȩ ę ő ǰ ǩ ǒ ġ ṅ ȯ ṁ ė.

There are several ways, and one can choose depending on how frequently they need to type them or how much time they need to invest learning.

① One option is to start the Character Map (Applications/Accessories/Character Map), pick the character, copy and paste it. This is good for rare characters and weird situations such as

┏━━━━━━━━━━━━━━━━━━━━━━━┓

⟁⟁⟁⟁♥♀★★▶◀☆♀░░░▒▒▒▓▓▓▙▚▛▙▙▙▞

The Unicode standard, apart from defining characters for languages, it also defines symbols, dingbats and all sort of things. If your distribution is based on the DejaVu fonts (such as Ubuntu), then you are probably covered for many of these symbols. If you do not have a suitable font, or you use Windows, you will be wondering what the hell I am talking about.

② Another option is to use the Character Palette applet which shows an applet on the panel with a configurable small repertoire of characters such as áàéíñó½©ث€. You select one of the characters with the mouse, and wherever you middle-click, this character is typed. This is an improvement over ①, and good when you want to type often rare characters. It is not convenient to type characters found normally on a keyboard layout.

③ To type characters normally found in a specific language(s), it is good to setup a suitable keyboard layout. For this, it is good to add the Keyboard Indicator applet; right click on the panel, click Add to panel… and choose the Keyboard Indicator from the Utilities section. The US English keyboard layout (Default variant) does not provide any interesting characters apart from those shown printed on the keys of a US Keyboard.
Keyboard layout US Intl with dead keys
The US English International (with dead keys) variant might be a better option,

Keyboard layout GB

Or the United Kingdom layout.

You can get a similar image for your layout when you right-click on the Keyboard Indicator applet, then click Show Current Layout.

Each key in the images contain up to four letters. Starting from bottom-left and going clock-wise, these are the keys produced when

ⓐ you press the key

ⓑ you press the key with Shift (or Caps Lock)

ⓒ you press the key with AltGr and Shift (or Caps Lock)

ⓓ you press the key with AltGr

For example, with the UK keyboard layout, the key G produces g, G, Ŋ, ŋ.

If AltGr + Shift + letter does not work for you, see the FDO Bug #2871 Different results for shift-altgr and altgr-shift.

Using the appropriate keyboard layout is the way to go when writing text that require squiggles. You can either choose a layout with dead keys (meaning that some keys lose their normal functionality), or you can pick a layout that still allows you to have dead keys but are available when you press AltGr + key. For example, in the UK Keyboard layout – Default variant, AltGr + ; + a produces á, or AltGr+Shift+]+e produces ē.

OLPC showing the keyboard

Photo by titanas.

The OLPC uses those four level for the keyboard layout. You can see the all the variations printed on the keyboard. Click on the image, choose Large size for the details.

④ Another option to produce more characters on the keyboard is to enable the compose key, and use compose sequences. A compose sequence looks similar to what we described above (i.e. AltGr+Shift+]+e to ē) but the idea is that we use it for characters we want to be available across different keyboard layouts that you may have enabled.
Configuring the compose key
The compose key is very powerful functionality, thus it is not enabled by default, and lays hidden in the Layout Options tab. I prefer to set it to Menu, but every person has their own preference.

For example,

  • Compose key + – + a produces ã,
  • Compose key + < + c produces č
  • Compose key + 1 + s produces ¹ (Superscript on 1. Try to replace 1 with 2.)
  • Compose key + + + – procudes ±

Currently, GTK+ provides 640 such compose sequences involving the Compose key, and hopefully soon it will increase to over 3000.

The Compose key is known as Multi_key in the source code (Xorg, GTK+, etc).

The Compose key compose sequences offer the ability to define smart mnemonics on how to produce characters. It is much easier to type ComposeKey + 1 + s rather than remembering the codepoint value of ¹ (1 superscript). As with many things open-source, there are too many options, and with the Compose key there is the issue of which shall we pick as a sensible default, and how to make it prominent for those who might want to use it.

It appears to me that there should be more effort to promote the functionality that is provided with the standard keyboard layouts (choose a better keyboard layout, produce characters provided in the third and fourth levels, etc). In this respect, Compose key compose sequences should complement after the main discussion on keyboard layouts take place.

⑤ There is a last issue on switching keyboard layouts to cover in a separate post.

When a bug report gets long with many comments, it gets more difficult for someone to get the full picture of what is going on. I’ll attempt to summarise here what’s being said in Bug 321896, Synch gdkkeysyms.h / gtkimcontextsimple.c with X.org 6.9/7.0.

GTK+-based applications use by default the GTK+ Input Method in order to let users type in different languages. Some scripts are very complex (such as SE Asian scripts) and in this case SCIM is used, replacing the GTK+ Input Method. One can even disable GTK+ IM altogether and use the basic X Input Method (XIM) which is provided by the Xorg server, by setting GTK_IM_MODULE to xim. However, the majority of the users have GTK+ IM enabled.

Between GTK+ IM and XIM, the keyboard layouts are being managed by the xkeyboard-config project and Sergey Udaltsov. A keyboard layout is simply a mapping of keyboard keys to Unicode characters, but you can also have compose sequences for some characters using what we call dead keys. When you press a dead key nothing appears on screen but when you press a letter immediately afterwards, you can get an á. This functionality is common to add accents, and there is a big table for these compose sequences (1.3MB) and what Unicode characters they produce.

If you change your keyboard layout (System/Preferences/Keyboard/Layout) to something like U.S. English International (with dead keys), then the ‘ key on your keyboard becomes dead_acute, and the compose sequence

<dead_acute> <a>  : "á"   U00E1 # LATIN SMALL LETTER A WITH ACUTE

works when you press and then a.

There is an issue with compose sequences and input methods; XIM maintains the official upstream version of the compose sequences, and projects such as GTK+ and SCIM carry their own copies of that table.

The issue with GTK+ regarding the compose sequences is that it has a very old version compared to what is available upstream. This is what Bug 321896 is about.

The bug would be have been resolved much much earlier if it wasn’t for the insistence of the GTK+ maintainers to cut the fat and reduce the size of the table (~6000 entries) with clever optimisations.

Tor suggested a clever optimisation; a good number of compose sequences (which looks like <dead_acute> <a> : “á”) resemble the decomposed form (a la Unicode) of those characters. Thus, we can let the user type what she wants, and we can try Unicode normalisation to see if the sequence is composed to a single Unicode character. Lets demonstrate in Python,

$ python
>>> import unicodedata
>>> sequence=[65, 0x301]     # That's 'a' and acute
>>> result = unicodedata.normalize('NFC',"".join(map(unichr, sequence)))
>>> result
u'\xc1'
>>> print len(result)
1
>>> print result
Á

That long line above takes the array, applies the unichr() function on each member so that they become Unicode characters and then joins them in a single string. Finally, it normalises the (decomposed) string to a single character. The fact that the resulting string has length 1 (single character) is key to this optimisation. Over 1000 compose sequences can be removed from the compose table through this optimisation. This includes a big chunk of the Latin Unicode blocks, about a few dozens of Cyrillic characters, all of modern Greek and Greek polytonic, some Indic languages (are they actually used?) and other misc sequences.

Matthias laid out the requirements for the optimisation of the remaining compose sequences; ① it has to be static const so a single copy is shared all over the place, ② the first column (out of six) is repeated too often, thus use subtables, and ③ each row ends with a varying number of zeroes, so cut on those zeroes as well. This also required the automatic generation of the optimised table using a script.

The work has not finished yet, and requires testing of the patch. The high priority testing is that keyboard layouts do not get any regressions (that is, compose sequences with dead keys must continue to work along with any new sequences).

With an updated compose table in GTK+, one can write things like ⒼⓃⓄⓂⒺ and all variations of accents on characters, in an easier way.

I’ld like to thank Matthias and Tor for their support in this work. And Jeff for adding this blog to Planet GNOME!