Need help with non-latin alphabet

Felix Kaser is working for Collabora, writing the EmpathyLiveSearch widget, which is similar to the N900’s HildonLiveSearch. The goal is to make it easy to search your contact list in a smart way: a contact matches only if one of its words starts with your search string. For example, if you have a contact “Xavier Claessens”, typing “Xav” or “Cla” will show it, but typing “ier” or “ssens” will not. The match is of course case-insensitive, so typing “xav” will match as well.

Where things get more complicated is that our code also tries to strip accent marks. For example, if you have a contact “Gaëtan”, typing “gae” will match it. This is done by calling g_unicode_canonical_decomposition() on each character and keeping only the first code point of the result.
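To make the behaviour concrete, here is a minimal sketch of the matching described above. It is written in Python with the standard `unicodedata` module rather than the C/GLib code Empathy actually uses, so the function names (`strip_char`, `split_words`, `live_search_match`) are illustrative assumptions and not the widget's API. It also assumes a single-word query.

```python
import unicodedata


def strip_char(ch):
    """Keep only the first code point of the canonical decomposition of ch."""
    return unicodedata.normalize("NFD", ch)[0]


def split_words(name):
    """Split a name into stripped, lowercased words on non-alphanumeric characters."""
    words, current = [], []
    for ch in name:
        if ch.isalnum():
            current.append(strip_char(ch).lower())
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def live_search_match(name, query):
    """Return True if any word of name starts with the (single-word) query."""
    needle = "".join(strip_char(ch).lower() for ch in query)
    if not needle:
        return True  # an empty search string shows every contact
    return any(word.startswith(needle) for word in split_words(name))
```

With this sketch, `live_search_match("Xavier Claessens", "cla")` and `live_search_match("Gaëtan", "gae")` are true, while `live_search_match("Xavier Claessens", "ier")` is false.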

I’m writing unit tests for that matching algorithm to make sure it works as intended. Being a French speaker, I can easily test that letters such as é, è, ç, à and ï are stripped correctly to keep only the base letter without the accent marks. But I would like to include tests for non-Latin scripts, like Arabic, Chinese, Korean, etc. I don’t know whether such “accent marks” that can be stripped make sense in any other script, but if you know, please give me some example strings.

Strings must be encoded in UTF-8 of course!

Thanks.

Update: Empathy in git master now has the live search merged. Please give it a try and see if it matches your needs. It has the matching algorithm described above, surely not perfect in all languages, but already better than nothing.

Of course, I’m interested in feedback: does it fail horribly with your language, or is it acceptable? All I can tell is that it is perfect in French :D

30 Responses to “Need help with non-latin alphabet”

  1. As for decomposition: wouldn’t it make more sense to transliterate both the input and the output to ascii and compare the result instead? Basically find a C equivalent of Unidecode (http://pypi.python.org/pypi/Unidecode/). Doing things the way you describe will likely break for all Asian alphabets.

    BTW: Any chance of matching beginning of any word (\b) instead? I know a lot of people, including me, who name their contacts ‘First “nickname” Last’. Only being able to search for the first name is not very helpful.

  2. xclaesse says:

    @Patryk: How could it transliterate Asian letters to ASCII ?!? I looked in the Korean Unicode table and tried with ‘각’ (no idea what it means). Transliterating it to ASCII gives: ‘?’. But my decomposition idea (actually the idea comes from E-D-S) gives ‘각’ -> ‘ᄀ’ . I’m wondering if that makes any sense in Korean…

    Of course it splits the contact names into “words”, using g_unichar_isalnum(). See the example I wrote in the post: searching “Cla” inside “Xavier Claessens” matches.

  3. xclaesse says:

    Hum, one point for transliteration is that it does ‘œuf’ -> ‘oeuf’, which is the right thing for French. But I don’t think ligatures like that are used in names anyway.

    Also, German people probably prefer typing “Joergen” to match “Jörgen”, whereas French people prefer typing “Jorgen”. And that’s done correctly with transliteration…

    I would like to hear from Asian/Arabic people here for advice :-)
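One way to serve both habits at once is to index each name under several search keys. The sketch below is only an illustration in Python: the `GERMAN_EXPANSIONS` table and the `search_keys` and `matches` helpers are hypothetical names, not something Empathy or GLib provides.

```python
import unicodedata

# Hypothetical expansion table for German-style typing (illustrative only).
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def base(ch):
    """First code point of the canonical decomposition (accent stripped)."""
    return unicodedata.normalize("NFD", ch)[0]


def search_keys(name):
    """Return the set of lowercase keys under which a name can be found."""
    lowered = unicodedata.normalize("NFC", name).lower()
    stripped = "".join(base(ch) for ch in lowered)
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, base(ch)) for ch in lowered)
    return {stripped, expanded}


def matches(name, query):
    """True if the query is a prefix of any key of the name."""
    needle = query.lower()
    return any(key.startswith(needle) for key in search_keys(name))
```

Here “Jörgen” is indexed as both “jorgen” and “joergen”, so either spelling finds the contact.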

  4. Tor Lillqvist says:

    Actually, it doesn’t make sense to strip “accents” like ¨ in all languages that use the Latin script either.

    For instance, in Swedish and Finnish the letters Ä and Ö are separate letters and not considered “accented variants” of A and O. (Except in a purely mechanical writing sense, of course. But then, in that sense for instance Q is just a variant of O, too;)

    The issue is more complex than you think, even for just Latin.

  5. Given a complete transliteration table (not sure if one exists) you have perfect support for all of the languages. As for using transliteration vs decomposition – see the Bei Jing example at the Unidecode’s homepage :)

  6. Jeroen Hoek says:

    I’m afraid Chinese, Korean, and Japanese are more complicated than that. For names, all of them tend to use Chinese characters, which may, or may not, correspond to a single sound you could write out in the Roman script for sorting and searching purposes as you describe them. None of these languages use diacritic marks as such.

    Japanese contacts in an address book always have the name in there in two forms: the real name in Chinese characters, and the reading of those characters, mostly in the Japanese hiragana script. If you want to search Japanese names, you would use both, the latter of which fortunately is fairly easy to map to Roman letters.

    Sample Japanese name (in Japanese order; family, given name):

    Name: 田中奈々美
    Reading: たなか ななみ
    Romanized: Tanaka Nanami (or Nanami Tanaka)

    Of course, in countries that use the Roman script, people will often use an already romanized version of their name, mostly without any diacritics at all, so the point may be moot. Even Evolution doesn’t seem to allow for reading extensions in its contacts (X-PHONETIC-FIRST-NAME in vCard and such). People in Japan, China, Taiwan and South Korea often use software and telephones tailored to their languages and culture.

  7. xclaesse says:

    @Tor: Note that the goal is absolutely not to be linguistically correct, but to be fast to type on a keyboard. The goal is to easily find your contacts, and you probably don’t want to type composed letters on your keyboard for that (at least not in French).

  8. xclaesse says:

    @Jeroen Hoek:
    Name: 田中奈々美
    Reading: たなか ななみ
    Romanized: Tanaka Nanami (or Nanami Tanaka)

    So is it possible to transliterate “田中奈々美” to “Tanaka Nanami” ? I’m testing using that function:

    g_convert (basename, -1, "ASCII//translit", "UTF-8", NULL, NULL, NULL);

    But here it only gives “?” string… Maybe if I had my system installed with Chinese locale it would work?

    If your Empathy contact list has all names with Chinese characters, like “田中奈々美”, what would you type on keyboard to search one of them ideally? The romanized name?

  9. xclaesse says:

    Oops, sorry, s/chinese/Japanese/ in my previous message.

  10. Tor Lillqvist says:

    No, it’s the “reading” (たなか ななみ) that could be transliterated to Latin script.

  11. Janne says:

    On Swedish keyboards at least, “å”, “ä”, and “ö” are separate keys; I strongly suspect the same goes for Norwegian, Danish, Icelandic and Finnish keyboards as well. Stripping the accent marks for these languages is simply wrong – it’d be like finding “Xavier” or “Claessens” when typing a “k”.

    As for Japanese, there are no accent marks or anything of the sort to look for. Jeroen above has it right; all you can do is match prefix characters.

    And note that there is no regular mapping between the name in kanji and the pronunciation. Many people have completely irregular readings or characters for their names; it’s common enough that nobody bats an eye when people ask them how they write or pronounce their name. It goes both ways – a common name like “ritsuko” may have several dozen ways of writing it in kanji, some of which don’t use any existing sound for their characters. And likewise a common name kanji compound may have a completely offbeat pronunciation.

    It’s kind of like if it was completely normal for people to write their name “John”, but pronounce it as “Benjamin”.

  12. Janne says:

    “If your Empathy contact list has all names with Chinese characters, like “田中奈々美”, what would you type on keyboard to search one of them ideally?”

    You would write “田中奈々美” of course. Or, if you knew only the pronunciation, you’d write “たなかななみ”. People will of course search in their own language and character set, not in some other.

  13. Xavier:

    g_convert assumes all strings are in the current locale. You might want to look at libicu’s transliteration data.

  14. Jeroen Hoek says:

    Well, that’s the tricky bit for Asian names. As a human I would guess that 田中奈々美 is read as Tanaka Nanami, but there are many names that can’t be figured out that easily because their readings vary. Japanese name readings are notoriously difficult to figure out, even for the Japanese.

    This is why Japanese calling cards always include the reading of the name in the phonetic hiragana script; たなか ななみ in the above example. Japanese cell phones have address books that allow you to enter these readings as well as the name in Chinese characters. I believe vCard can be extended to do so too. If you search using either Roman characters, or one of the two phonetic Japanese scripts (hiragana and katakana), you would expect the reading of the name to be searched against as well.

    In short, you cannot transliterate the Chinese characters for Japanese names without the readings that specific person uses. You need additional fields in the contact data that tell you those readings. Chinese and Korean may (or may not) be easier, as Patryk suggested; my field of expertise is Japanese.

    In the fictional case of 田中奈々美, I would search for “tan” or “nanami” or something similar.

  15. foo says:

    Why is this specific to Empathy? I want this for tracker search too!!

  16. Jussi Kukkonen says:

    Xavier, Tor’s point was that matching “äö” to “ao” would be very confusing and seen as clearly wrong by anyone speaking Nordic languages.

  17. Jussi Kukkonen says:

    sorry, hit submit a bit early:

    The other thing is that Scandinavian keyboards include “ä” and “ö”, so they really aren’t composed letters for us, although I see the problem everyone else may have…

  18. Jeroen Hoek says:

    (about multiple readings)

    Example: this is a (female) given name that can be read as either Sachiko or Yukiko:

    幸子

    Additionally, ENAMDICT also lists the rarer Kōko (or Kouko or Kohko depending on your romanization preference), Kazuko, Keiko, Sakiko, Takako, Toshiko, and Tomoko.

  19. xclaesse says:

    @foo: My plan is to experiment it in Empathy, then propose it to be included in GTK+ itself.

  20. Alban says:

    Don’t forget to test the Polish “Ł”:
    https://bugs.maemo.org/show_bug.cgi?id=9948
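The reason “Ł” deserves its own test is that, unlike “é” or “ï”, it has no canonical decomposition in Unicode, so the decompose-and-keep-the-first-code-point approach leaves it unchanged. A small Python sketch of the problem and one possible workaround follows; the `EXTRA_BASE_LETTERS` table is a hypothetical illustration, not part of any library.

```python
import unicodedata

# Hypothetical table (illustrative only): letters with a stroke have no
# canonical decomposition, so they need an explicit mapping.
EXTRA_BASE_LETTERS = {"Ł": "L", "ł": "l", "Ø": "O", "ø": "o"}


def base_letter(ch):
    """Map ch to its base letter, consulting the explicit table first."""
    if ch in EXTRA_BASE_LETTERS:
        return EXTRA_BASE_LETTERS[ch]
    return unicodedata.normalize("NFD", ch)[0]


def strip_name(name):
    """Apply base_letter to every character of a name."""
    return "".join(base_letter(ch) for ch in name)
```

Without the table, “Łukasz” stays “Łukasz”; with it, the name becomes “Lukasz”.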

  21. Adam Etienne says:

    As for Korean, ‘ᄀ’ is just the sound ‘k’.. for ‘각’ you can start to match with only this sound but before matching the next character you’ll need to check if the whole character was entered, with the 2 others parts, ㅏ and again ᄀ.

    Hope it’s clear and it helps, otherwise you can contact me.
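For testing purposes, the decomposition of a Hangul syllable into its parts can be inspected with canonical decomposition (NFD). This is a Python sketch using the standard `unicodedata` module; `to_jamo` and `jamo_prefix_match` are hypothetical helper names, and comparing full decompositions instead of keeping only the first code point is an assumption about how the matching could be extended, not what Empathy does.

```python
import unicodedata


def to_jamo(text):
    """List the code points of the canonical decomposition of text."""
    return ["U+%04X" % ord(c) for c in unicodedata.normalize("NFD", text)]


def jamo_prefix_match(name, query):
    """True if the decomposed query is a prefix of the decomposed name."""
    decomposed_name = unicodedata.normalize("NFD", name)
    decomposed_query = unicodedata.normalize("NFD", query)
    return decomposed_name.startswith(decomposed_query)
```

`to_jamo("각")` lists three conjoining jamo (initial consonant, vowel, final consonant), so a partially typed syllable such as “가” is a prefix of “각”, while “간” is not.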

  22. Benja says:

    In Spanish, all the accented vowels (áéíóúü) are equivalent to their base vowels. However, ‘ñ’ is not equivalent to ‘n’; it is a different character with its own key on the keyboard.

  23. behrooz says:

    here’s some Persian characters: گنوم خداس

    I think if everything is OK with this string, there won’t be any problem with Arabic.

  24. For Arabic, it may be a good idea to strip diacritics even if people rarely use them, let alone in names (I think however that they are considered as 2 characters and not as a composed one). For hamza (e.g. matching ا to أ or إ), I believe the canonical decomposition should work.

  25. antono says:

    Esperanto symbols:

    Eĥoŝanĝo ĉiuĵaŭde
    =
    Ehosango ciujaude

  26. Aleksander says:

    @foo:
    Tracker Search Tool (0.8/0.9) already does this. Full Text Search engine actually does several things with the input string before using it for search:
    * Casefolding
    * NFKD normalization
    * Removal of combining diacritical marks (currently using libunac, but there’s a branch to drop libunac not merged yet). As you pointed out, the easiest way of removing the combining diacritical marks is to perform a compatibility decomposition and then iterate through all the characters in the grapheme removing those which happen to be combining diacritical marks.
    * Stemming (currently disabled, as it doesn’t work very well).
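The first three steps of the list above can be sketched in a few lines. This is a Python illustration using the standard `unicodedata` module, not Tracker's actual C code; `fold_for_search` is a hypothetical name, stemming is left out, and treating every character of category Mn (nonspacing mark) as a combining diacritical mark is an assumption of this sketch.

```python
import unicodedata


def fold_for_search(text):
    """Casefold, NFKD-normalize, then drop nonspacing combining marks."""
    folded = text.casefold()
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
```

Applied to both the indexed text and the search string, this makes “espana” find “España”, and it turns the Esperanto sample “Eĥoŝanĝo ĉiuĵaŭde” into “ehosango ciujaude”.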

    @xclaesse:
    I would say it’s quite fair to remove the combining diacritical marks while doing text search, so that’s nothing to worry about. The best example would be when you want to look for the Spanish word “España” and you don’t have a Spanish keyboard. In this case, it makes sense to be able to use “espana” as the search word and get the proper “España” result. The problem comes with other combining marks (not accents) in other languages. It seems other search engines like Google perform some kind of ‘unaccenting’ for non-Latin Unicode scripts, but this doesn’t seem to be as clear-cut as Latin-script unaccenting. I’m not sure whether in some other scripts, like the Arabic ones, you can actually compose a grapheme using only combining marks, without any base character; that could make things more difficult. We’ve actually got a bug report about this in Tracker:
    https://bugzilla.gnome.org/show_bug.cgi?id=588753

    Oh, and we’ve got some unit tests about just exactly this issue. Look for the unaccenting-related tests in:
    http://git.gnome.org/browse/tracker/tree/tests/libtracker-fts/tracker-parser-test.c?h=drop-unac

  27. Tor Lillqvist says:

    For languages where certain “accented” (Latin) letters are in fact not perceived as being accented, it would be quite surprising and unnatural if a search for names beginning with A turned up a list of names starting with A, Ä and Å intermixed. Etc. So I do think that the “unaccenting” behaviour should be dependent on language (of the user, or perhaps even of the data…).
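A language-dependent version of the stripping could look roughly like the following Python sketch. The `DISTINCT_LETTERS` table, its language codes and the `strip_accents` function are all hypothetical and far from exhaustive; they only illustrate the idea of exempting letters that speakers of a given language consider letters in their own right.

```python
import unicodedata

# Hypothetical per-language table (illustrative, not exhaustive): letters
# that are perceived as distinct letters rather than accented variants.
DISTINCT_LETTERS = {
    "sv": set("åäö"),
    "fi": set("åäö"),
    "es": set("ñ"),
}


def strip_accents(text, lang):
    """Strip accents except on letters that are distinct in the given language."""
    keep = DISTINCT_LETTERS.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch.lower() in keep:
            out.append(ch)  # a letter in its own right: leave untouched
        else:
            # otherwise keep the first code point of the decomposition
            out.append(unicodedata.normalize("NFD", ch)[0])
    return "".join(out)
```

With a Swedish setting, “Åsa Öberg” is left alone; with a French setting it becomes “Asa Oberg”; with a Spanish setting, “José Peña” becomes “Jose Peña”.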

  28. Tor:

    Part of the problem is being able to find people whose names contain letters not reachable with the active input method.

  29. Christian says:

    In German you have ä, ö, ü and ß. On a German keyboard layout you can type these directly. However when I type on a different layout, which is the case on one of my machines, I *need* mapping of these to a, o, u and ss because I simply can’t type the exact letters.

  30. Janne says:

    If the problem is the inability to enter accents, then you should solve it by making entering them possible, not by breaking search.

    For instance, an optional button row with ¨, ` and so on right under or next to the search input; a click, and the last entered letter would get the accent added.