Should UI strings in source code have non-ASCII characters?

May 14, 2008

There is a discussion going on at desktop-devel about whether the UI strings in the source code should also have non-ASCII characters. For example, should typical strings with double-quotes have those fancy Unicode double quotes?

printf(_("Could not find file “%s”\n"), filename);

instead of

printf(_("Could not find file \"%s\"\n"), filename);

The general view from the replies is to go ahead and add those nice Unicode characters.

Actually, there are UI messages already with non-ASCII characters (the ellipsis character, …) in GNOME 2.22:

glade3
epiphany

In GNOME 2.24, there are even more (with ellipsis):

gucharmap
epiphany
gnome-terminal
gedit
glade3

Regarding the fancy Unicode double quotes, there are UI strings in GNOME 2.22 (same list for 2.24) in the following packages:

evince
cheese
epiphany
eog
gnome-doc-utils

What are the arguments against having non-ASCII characters in UI strings?

There might be systems that still use 8-bit legacy encodings. In this case, the UTF-8 encoded strings may not be displayed properly. However, when I tried to demonstrate this on my system (Ubuntu 8.04), I failed miserably. I downloaded a small GTK2 text editor (called tea), changed a source UI string to include “” and an ellipsis, then compiled and installed it. I then opened a shell, set LANG to POSIX (or C), and ran the text editor. The UI message was displayed as proper Unicode and I could even type non-ASCII characters in the text editor. I then resorted to changing a system locale (I picked en_IN) to ISO-8859-1 and logged out. The login screen did not show the 8-bit encoding. If someone has a proper legacy 8-bit encoding system with GNOME (OpenBSD, FreeBSD, etc), could you please try it out?

As Alan Cox mentioned in the thread, the canonical way to deal with UI strings in the source code should be to keep them as ASCII, and put any fancy Unicode characters in the translation files (even for en_US, get an en_US translation file).
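
The approach of keeping the source ASCII and moving the fancy characters into a translation file can be sketched in C. This is only an illustration and does not use the real gettext machinery: the catalog table and lookup_en_us() are hypothetical stand-ins for an en_US translation file and for gettext(). The Unicode quotes appear only in the catalog entry, written as UTF-8 hex escapes so that the sketch itself stays plain ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one entry of an en_US message catalog:
   msgid is the ASCII string found in the source code,
   msgstr is the typographically nicer replacement. */
struct catalog_entry {
    const char *msgid;
    const char *msgstr;
};

static const struct catalog_entry en_us_catalog[] = {
    { "Could not find file \"%s\"\n",
      "Could not find file \xE2\x80\x9C%s\xE2\x80\x9D\n" },  /* U+201C ... U+201D */
};

/* Hypothetical replacement for gettext(): returns the catalog string
   for msgid, or msgid itself when there is no entry. */
static const char *lookup_en_us(const char *msgid)
{
    assert(msgid != NULL);
    for (size_t i = 0; i < sizeof en_us_catalog / sizeof en_us_catalog[0]; i++) {
        if (strcmp(en_us_catalog[i].msgid, msgid) == 0)
            return en_us_catalog[i].msgstr;
    }
    return msgid;  /* untranslated strings fall back to the ASCII source text */
}
```

With this in place, the source code only ever contains the ASCII message, and a string that is missing from the table comes back unchanged.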

Is GNOME (or components) used in a legacy 7-bit/8-bit environment?

If there is any reason to keep UI strings in the source code as plain ASCII, speak now, or the Unicode flood gates are about to open.

Update 16 May 2008: There is a document at the ISO/IEC 9899 (C programming language) website that discusses the issue of character sets in C. It is http://www.open-std.org/jtc1/sc22/wg14/www/docs/C99RationaleV5.10.pdf.

On page 26, section 5.2.1, it says

The C89 Committee ultimately came to remarkable unanimity on the subject of character set requirements. There was strong sentiment that C should not be tied to ASCII, despite its heritage and despite the precedent of Ada being defined in terms of ASCII. Rather, an implementation is required to provide a unique character code for each of the printable graphics used by C, and for each of the control codes representable by an escape sequence. (No particular graphic representation for any character is prescribed; thus the common Japanese practice of using the glyph “¥” for the C character “\” is perfectly legitimate.) Translation and execution environments may have different character sets, but each must meet this requirement in its own way. The goal is to ensure that a conforming implementation can translate a C translator written in C.

For this reason, and for economy of description, source code is described as if it undergoes the same translation as text that is input by the standard library I/O routines: each line is terminated by some newline character regardless of its external representation.

With the concept of multibyte characters, “native” characters could be used in string literals and character constants, but this use was very dependent on the implementation and did not usually work in heterogenous environments. Also, this did not encompass identifiers.

It then goes on with an addition to C99:

A new feature of C99: C99 adds the concept of universal character name (UCN) (see §6.4.3) in order to allow the use of any character in a C source, not just English characters. The primary goal of the Committee was to enable the use of any “native” character in identifiers, string literals and character constants, while retaining the portability objective of C.

Both the C and C++ committees studied this situation, and the adopted solution was to introduce a new notation for UCNs. Its general forms are \unnnn and \Unnnnnnnn, to designate a given character according to its short name as described by ISO/IEC 10646. Thus, \unnnn can be used to designate a Unicode character. This way, programs that must be fully portable may use virtually any character from any script used in the world and still be portable, provided of course that if it prints the character, the execution character set has representation for it.

Of course the notation \unnnn, like trigraphs, is not very easy to use in everyday programming; so there is a mapping that links UCN and multibyte characters to enable source programs to stay readable by users while maintaining portability. Given the current state of multibyte encodings, this mapping is specified to be implementation-defined; but an implementation can provide the users with utility programs that do the conversion from UCNs to “native” multibytes or vice versa, thus providing a way to exchange source files between implementations using the UCN notation.
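
As a rough illustration of the UCN notation described in the quoted text, the following C sketch writes the left double quotation mark (U+201C) once as a UCN and once as hand-encoded UTF-8 bytes, then compares the two. The names are made up for this example. Whether the comparison succeeds depends on the compiler's execution character set, which is implementation-defined; it is assumed here to be UTF-8, which is the GCC default.

```c
#include <assert.h>
#include <string.h>

/* Left double quotation mark (U+201C), written two ways. */
static const char ucn_quote[] = "\u201C";        /* C99 universal character name */
static const char hex_quote[] = "\xE2\x80\x9C";  /* hand-encoded UTF-8 bytes */

/* Returns 1 when the compiler encoded the UCN as the same bytes as the
   hand-written UTF-8 sequence, 0 otherwise. The result is
   implementation-defined; 1 is expected only with a UTF-8 execution
   character set (an assumption of this sketch). */
static int ucn_is_utf8(void)
{
    return sizeof ucn_quote == sizeof hex_quote
        && memcmp(ucn_quote, hex_quote, sizeof hex_quote) == 0;
}

/* Aborts (in builds without NDEBUG) if the assumption does not hold. */
static void require_utf8_ucn(void)
{
    assert(ucn_is_utf8());
}
```

A program that passes UCN strings to a UTF-8 API could call require_utf8_ucn() once at startup to catch a compiler configuration where the two spellings differ.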

Update 7 Aug 2008: PEP 8, Style Guide for Python Code, says under Encodings:

    For Python 3.0 and beyond, the following policy is prescribed for
    the standard library (see PEP 3131): All identifiers in the Python
    standard library MUST use ASCII-only identifiers, and SHOULD use
    English words wherever feasible (in many cases, abbreviations and
    technical terms are used which aren't English). In addition,
    string literals and comments must also be in ASCII. The only
    exceptions are (a) test cases testing the non-ASCII features, and
    (b) names of authors. Authors whose names are not based on the
    latin alphabet MUST provide a latin transliteration of their
    names.

    Open source projects with a global audience are encouraged to
    adopt a similar policy.

(Emphasis mine)

Posted by simos
Filed in gnome

16 Responses to “Should UI strings in source code have non-ASCII characters?”

Davyd Says:

May 14, 2008 at 1:49 am
Personally I like to escape my Unicode as hex bytes. That way I can be 100% confident that no one will ever manage to totally stuff up my strings with some funny non-8-bit-clean program.

Of course, it does make your strings that little bit harder to read.

James Henstridge Says:

May 14, 2008 at 2:03 am
Strings displayed by GTK or printed to the terminal by g_print() or g_log and friends are expected to be UTF-8 (it will reencode output destined for the terminal if needed).

So for strings used in these contexts, using UTF-8 should be fine.

James Henstridge Says:

May 14, 2008 at 5:22 am
Another note: it may be worth escaping the non-ASCII characters in your source code.

e.g. "\xe2\x80\x9c%s\xe2\x80\x9d" instead of "“%s”"

Havoc Says:

May 14, 2008 at 8:28 am
Why don’t we just declare that the strings in the program code are “en” and if people want to do a “C” translation they can go nuts 😉

Johannes Says:

May 14, 2008 at 9:03 am
Do all compilers allow non-ASCII characters in strings? I would expect that some C compilers might break.

simos Says:

May 14, 2008 at 11:55 am
@Johannes: Even GNOME 2.20 had non-ASCII characters in the strings (for example, in evince), and there was no complaint as far as I know. Of course, evince is a high-level application.
It would be a better diagnostic if we put non-ASCII characters in strings in, let’s say, glib, then wait and see ;-).

@James, Davyd: If those strings are for the UI, it would be somewhat more difficult for the translators to figure out what the message says.

The current summary for this looks like: It appears to be OK to have UI strings with non-ASCII characters in GNOME applications, though it’s not clear yet if it is OK for libraries (such as glib, gtk+). This is because these libraries may be used on embedded systems, etc., where compilers may not like non-ASCII source strings.

Phil Says:

May 14, 2008 at 12:05 pm
It isn’t always a valid assumption that apps are written with american english strings, so why not just go the whole hog and use semi-symbolic strings by default?

printf(_("file not found: %s\n"));

might not be very friendly, but it’s direct and equally easy to translate, regardless of what quotes your region uses. There exist english translations of a lot of english software already (en_UK etc) so adding en_US isn’t creating a major new translation job.

Obviously I wouldn’t bother changing old strings, as long as compilers aren’t erroring anyway, but a gradual shift doesn’t seem to be a lot of work.

Sebastian Benitez Says:

May 14, 2008 at 3:55 pm
Like Phil says, languages like Spanish, German and French use different quotes than English. For Spanish it would be «these quotes».

Yevgen Muntyan Says:

May 14, 2008 at 7:58 pm
First fix all text editors, so they don’t screw up your Unicode (on the way there, remove ISO-8859-15 markers from all source files). Next, fix the C and C++ standards so all compilers understand UTF-8 source by default. Then use UTF-8 in C code 😉
Note that gcc is not the only C compiler for desktops, MS makes some too. UTF-8 in source code is a GNU-C-ism, which makes code less portable.

Alexander Jones Says:

May 15, 2008 at 10:42 am
UTF-8 translates to ISO-8859 fine, insofar that it remains valid, even if it is garbage. I don’t see why a compiler would screw up on parsing UTF-8 characters, as they just appear like a series of ISO-8859-x characters.

Maybe I’m missing something?

simos Says:

May 15, 2008 at 11:16 am
@Alexander: Some compilers complain when the source code has non-ASCII characters.

That is, bytes with the 8th bit set. Both ISO-8859-x and UTF-8 can have bytes whose value is >127.

Or, bytes with value <32 (control characters). That could be the case with UTF-8 when a character has codepoint value >127.

behdad Says:

May 16, 2008 at 3:07 am
I think the reason some think C source code should be 7-bit is that your *compiler* can screw up if run under a non-UTF-8 locale. And that may actually be required by the C standard. Not motivated enough to test it.

nona Says:

May 16, 2008 at 10:10 pm
What about non-Unicode multibyte character sets that might still be popular in some countries? What happens when there’s UTF-8 and, let’s say, SHIFT-JIS in the same PO file?

simos Says:

May 17, 2008 at 12:02 am
@nona: For the narrow scope of GNOME, it appears that all POT/PO files follow the UTF-8 encoding. Indeed, if some translation teams were to use another encoding such as SHIFT-JIS, it would make a bit of a mess.

SHIFT-JIS is almost backward-compatible with ASCII (two characters differ).

@behdad: C99 defines a super-portable way to encode non-ASCII strings (using UCNs, as described in the added section in the post above). This is what gcc says about UCNs:

$ cat t.c
int main(void)
{
char* str = "\u0399";

return 0;
}
$ gcc t.c -o t
t.c:3:14: warning: universal character names are only valid in C++ and C99
$ _

This means that UCNs work in gcc, but they produce a warning by default.

Using hand-encoded UTF-8 strings (such as "\xCE\x80") makes the code less portable to different encodings.

Yevgen Muntyan Says:

May 17, 2008 at 12:20 am
UTF-8 strings like "\xCE\x80" *are* portable. They are not “portable to different encodings”, but nobody needs that (whatever that means). We need UTF-8 in C strings, and that’s the way to have them. If you want nice characters in po files, make xgettext convert C escape sequences to nice UTF-8 symbols.

As to universal character names, it is implementation-defined what actually will be contained in the character array. I.e. if you have char *s = "\u…" then you have no idea how to display text pointed to by that variable in a gtk label. Also, once you have that line of code, it won’t by magic change “to different encodings”, it will be whatever byte sequences the compiler will put in there and that’s it. It’s pretty much the same as "abc": it won’t by magic be valid UTF-16, no matter how you compile the file.

And by the way, MS does not implement C99 (surprised?).

Alexander Jones Says:

May 19, 2008 at 11:43 pm
@simos:

UTF-8 is designed so that subsequences are unambiguous. You won’t get a byte less than 0x80 in any part of a multi-byte sequence. Bytes 0x00-0x7F map directly to 7-bit ASCII.

Some people are worried about string functions breaking. I really don’t see how this is the case, seeing as we’re doing g_some_function (_("Some ASCII string")) which is replaced with a UTF-8 string at runtime anyway.

Does anyone have any actual proof of UTF-8 in our translatable strings breaking C?
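
The UTF-8 property mentioned in the comment above (every byte of a multi-byte sequence is 0x80 or higher) can be checked with a short C sketch. The function names are made up for this illustration, and the checker only verifies the lead-byte/continuation-byte structure; it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if s has the lead-byte/continuation-byte structure of UTF-8,
   0 otherwise. Every byte accepted as part of a multi-byte sequence
   necessarily has its high bit set (value >= 0x80). */
static int utf8_is_well_formed(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    assert(s != NULL);
    while (*p != 0) {
        int extra;  /* number of continuation bytes expected */
        if (*p < 0x80)                extra = 0;  /* plain ASCII */
        else if ((*p & 0xE0) == 0xC0) extra = 1;
        else if ((*p & 0xF0) == 0xE0) extra = 2;
        else if ((*p & 0xF8) == 0xF0) extra = 3;
        else return 0;  /* stray continuation byte or invalid lead byte */
        p++;
        while (extra-- > 0) {
            /* Continuation bytes look like 10xxxxxx; a terminating NUL
               here means the sequence was truncated. */
            if ((*p & 0xC0) != 0x80)
                return 0;
            p++;
        }
    }
    return 1;
}

/* Counts bytes below 0x20 (ASCII control characters) in s. */
static size_t count_control_bytes(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s < 0x20)
            n++;
    }
    return n;
}
```

For a UI string with Unicode quotes and an ellipsis, such as "Could not find file \xE2\x80\x9C%s\xE2\x80\x9D\xE2\x80\xA6", the first function accepts the string and the second finds no control bytes, since the only bytes below 0x80 are the ASCII characters themselves.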