Need help with non-Latin alphabets

Felix Kaser is working for Collabora, writing the EmpathyLiveSearch widget. It is similar to the N900’s HildonLiveSearch. The goal is to make it easy to search your contact list in a smart way: a contact matches only if one of the words in its name starts with your search string. For example, if you have a contact “Xavier Claessens”, typing “Xav” or “Cla” will show it, but typing “ier” or “ssens” will not. The match is of course case-insensitive, so typing “xav” will match as well.
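To make the rule concrete, here is a minimal Python sketch of word-prefix matching. It is not the Empathy code (which is written in C); the function name `matches` and the choice to split words on whitespace are assumptions made for illustration.

```python
def matches(contact_name: str, query: str) -> bool:
    """Return True if any word of contact_name starts with query.

    Hypothetical helper: words are assumed to be separated by whitespace,
    and case is ignored by case-folding both sides.
    """
    needle = query.casefold()
    return any(word.casefold().startswith(needle)
               for word in contact_name.split())


if __name__ == "__main__":
    for query in ("Xav", "Cla", "xav", "ier", "ssens"):
        print(query, matches("Xavier Claessens", query))
```

With the contact “Xavier Claessens”, the first three queries match and the last two do not.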

Where things get more complicated is that our code also tries to strip accent marks. For example, if you have a contact “Gaëtan”, typing “gae” will match it. This is done by calling g_unicode_canonical_decomposition() on each character and keeping only the first Unicode character of the result.
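The same idea can be sketched in Python, using `unicodedata.normalize("NFD", ...)` as a rough stand-in for g_unicode_canonical_decomposition(). This is an illustration of the technique only, not the Empathy implementation, and the function name `strip_accents` is made up for the example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Keep only the first code point of each character's canonical decomposition.

    Hypothetical helper: assumes the input uses precomposed characters
    (for example "ë" as the single code point U+00EB).
    """
    stripped = []
    for ch in text:
        # NFD splits a precomposed character into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        stripped.append(decomposed[0])
    return "".join(stripped)


if __name__ == "__main__":
    print(strip_accents("Gaëtan"))
    print(strip_accents("éèçàï"))
    print(strip_accents("Gaëtan").casefold().startswith("gae"))
```

One limitation of this sketch: if the input is already decomposed, so that the accent is stored as a separate combining character, that combining character is processed on its own and is not removed.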

I’m writing unit tests for that matching algorithm to make sure it works as intended. Being a French speaker, I can easily test that letters such as éèçàï are stripped correctly, keeping only the base letter without the accent marks. But I would like to include tests for other, non-Latin scripts, like Arabic, Chinese, Korean, etc. I don’t know whether “accent marks” that can be stripped make sense in any other script, but if you know, please give me some example strings.

Strings must be encoded in UTF-8 of course!


Update: Empathy in git master now has the live search merged. Please give it a try and see if it matches your needs. It uses the matching algorithm described above, which is surely not perfect in all languages, but is already better than nothing.

Of course, I’m interested in feedback: does it fail horribly with your language, or is it acceptable? All I can tell is that it is perfect in French :D