6. Making your programs Unicode aware
6.1 C/C++
The C `char' type is 8-bit and will stay 8-bit because it denotes the smallest addressable data unit. Various facilities are available:
For normal text handling
The ISO/ANSI C standard contains, in an amendment which was added in 1995, a "wide character" type `wchar_t', a set of functions like those found in <string.h> and <ctype.h> (declared in <wchar.h> and <wctype.h>, respectively), and a set of conversion functions between `char *' and `wchar_t *' (declared in <stdlib.h>).
Good references for this API are
- the GNU libc-2.1 manual, chapters 4 "Character Handling" and 6 "Character Set Handling",
- the manual pages man-mbswcs.tar.gz, now contained in ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz,
- the OpenGroup's introduction http://www.unix-systems.org/version2/whatsnew/login_mse.html,
- the OpenGroup's Single Unix specification http://www.UNIX-systems.org/online.html,
- the ISO/IEC 9899:1999 (ISO C 99) standard. The latest draft before it was adopted is called n2794. You find it at ftp://ftp.csn.net/DMK/sc22wg14/review/ or http://java-tutor.com/docs/c/.
- Clive Feather's introduction http://www.lysator.liu.se/c/na1.html,
- the Dinkumware C library reference http://www.dinkumware.com/htm_cl/.
Advantages of using this API:
- It's a vendor independent standard.
- The functions do the right thing, depending on the user's locale.
All a program needs to call is setlocale(LC_ALL, ""); (a short usage sketch follows after the list of drawbacks below).
Drawbacks of this API:
- Some of the functions are not multithread-safe, because they keep a hidden internal state between function calls.
- There is no first-class locale datatype. Therefore this API cannot reasonably be used for anything that needs more than one locale or character set at the same time.
- The OS support for this API is not good on most OSes.
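Here is a minimal sketch of this advantage, using only functions from the standard headers mentioned above; the sample string and buffer size are of course arbitrary:

  /* Compiles once; adapts to the user's locale at run time. */
  #include <locale.h>
  #include <stdlib.h>
  #include <wchar.h>
  #include <wctype.h>

  int main (void)
  {
    const char *mb = "hello, world";      /* multibyte string in the locale encoding */
    wchar_t wide[64];
    size_t n, i;

    setlocale (LC_ALL, "");               /* switch from the "C" locale to the user's locale */

    /* Convert the multibyte string to a wide character string. */
    n = mbstowcs (wide, mb, sizeof wide / sizeof wide[0]);
    if (n == (size_t)-1)
      return 1;                           /* invalid multibyte sequence */

    /* Classification and case mapping work one wide character at a time. */
    for (i = 0; i < n; i++)
      if (iswlower ((wint_t) wide[i]))
        wide[i] = towupper ((wint_t) wide[i]);

    wprintf (L"%ls\n", wide);             /* wide character output */
    return 0;
  }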
Portability notes
A `wchar_t' may or may not be encoded in Unicode; this is platform and sometimes also locale dependent. A multibyte sequence `char *' may or may not be encoded in UTF-8; this is platform and sometimes also locale dependent.
In detail, here is what the Single Unix specification says about the `wchar_t' type:
All wide-character codes in a given process consist of an equal number
of bits. This is in contrast to characters, which can consist of a
variable number of bytes. The byte or byte sequence that represents a
character can also be represented as a wide-character code.
Wide-character codes thus provide a uniform size for manipulating text
data. A wide-character code having all bits zero is the null
wide-character code, and terminates wide-character strings. The
wide-character value for each member of the Portable Character Set (i.e. ASCII) will equal its value when used as the lone character in an integer
character constant. Wide-character codes for other characters are
locale- and implementation-dependent. State shift bytes do not have a
wide-character code representation.
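To illustrate the guarantee for the Portable Character Set, the following small test succeeds on any conforming implementation, regardless of the locale; no comparable guarantee exists for non-ASCII characters:

  #include <assert.h>

  int main (void)
  {
    /* The wide-character code of an ASCII character equals the value of
       the corresponding integer character constant ... */
    assert (L'A' == 'A');
    assert (L'\n' == '\n');
    /* ... but nothing of the sort is guaranteed for, say, U+00E9. */
    return 0;
  }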
One particular consequence is that in portable programs you shouldn't use non-ASCII characters in string literals. That means, even though you know the Unicode double quotation marks have the codes U+201C and U+201D, you shouldn't write a string literal L"\u201cHello\u201d, he said" or "\xe2\x80\x9cHello\xe2\x80\x9d, he said" in C programs. Instead, use GNU gettext, write it as gettext("'Hello', he said"), and create a message database en.po which translates "'Hello', he said" to "\u201cHello\u201d, he said".
Here is a survey of the portability of the ISO/ANSI C facilities on various Unix flavours.
- GNU glibc-2.2.x
- <wchar.h> and <wctype.h> exist.
- Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
- Has five UTF-8 locales.
- mbrtowc works.
- GNU glibc-2.0.x, glibc-2.1.x
- <wchar.h> and <wctype.h> exist.
- Has wcs/mbs functions, but no fgetwc/fputwc/wprintf.
- No UTF-8 locale.
- mbrtowc returns EILSEQ for bytes >= 0x80.
- AIX 4.3
- <wchar.h> and <wctype.h> exist.
- Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
- Has many UTF-8 locales, one for every country.
- Needs -D_XOPEN_SOURCE=500 in order to define mbstate_t.
- mbrtowc works.
- Solaris 2.7
- <wchar.h> and <wctype.h> exist.
- Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
- Has the following UTF-8 locales: en_US.UTF-8, de.UTF-8, es.UTF-8, fr.UTF-8, it.UTF-8, sv.UTF-8.
- mbrtowc returns -1/EILSEQ (instead of -2) for bytes >= 0x80.
- OSF/1 4.0d
- <wchar.h> and <wctype.h> exist.
- Has wcs/mbs functions, fgetwc/fputwc/wprintf, everything.
- Has an add-on universal.utf8@ucs4 locale, see "man 5 unicode".
- mbrtowc does not know about UTF-8.
- Irix 6.5
- <wchar.h> and <wctype.h> exist.
- Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
- Has no multibyte locales.
- Has only a dummy definition for mbstate_t.
- Doesn't have mbrtowc.
- HP-UX 11.00
- <wchar.h> exists, <wctype.h> does not.
- Has wcs/mbs functions and fgetwc/fputwc, but not wprintf.
- Has a C.utf8 locale.
- Doesn't have mbstate_t.
- Doesn't have mbrtowc.
As a consequence, I recommend using the restartable and multithread-safe wcsr/mbsr functions, forgetting about those systems which don't have them (Irix, HP-UX, AIX), and using the UTF-8 locale plug-in libutf8_plug.so (see below) on those systems which let you compile programs that use these wcsr/mbsr functions (Linux, Solaris, OSF/1). A sketch of these functions follows below.
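Here is a rough sketch of how the restartable functions are used; it converts a multibyte string in the locale encoding into a freshly allocated wide character string, keeping the conversion state in an explicit mbstate_t instead of hidden library state:

  #include <stdlib.h>
  #include <string.h>
  #include <wchar.h>

  /* Returns a malloc'ed wide string, or NULL on an invalid sequence. */
  wchar_t *
  mb_to_wcs (const char *src)
  {
    mbstate_t state;
    const char *p;
    size_t len;
    wchar_t *result;

    memset (&state, 0, sizeof state);           /* initial conversion state */
    p = src;
    len = mbsrtowcs (NULL, &p, 0, &state);      /* first pass: count wide characters */
    if (len == (size_t)-1)
      return NULL;

    result = malloc ((len + 1) * sizeof (wchar_t));
    if (result == NULL)
      return NULL;

    memset (&state, 0, sizeof state);
    p = src;
    mbsrtowcs (result, &p, len + 1, &state);    /* second pass: convert */
    return result;
  }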
Similar advice, given by Sun in http://www.sun.com/software/white-papers/wp-unicode/, section "Internationalized Applications with Unicode", is:
To properly internationalize an application, use the following guidelines:
- Avoid direct access with Unicode. This is a task of the platform's internationalization framework.
- Use the POSIX model for multibyte and wide-character interfaces.
- Only call the APIs that the internationalization framework provides for language and cultural-specific operations.
- Remain code-set independent.
If, for some reason, in some piece of code, you really have to assume that `wchar_t' is Unicode (for example, if you want to do special treatment of some Unicode characters), you should make that piece of code conditional upon the result of is_locale_utf8(). Otherwise you will mess up your program's behaviour in other locales or on other platforms. The function is_locale_utf8 is declared in utf8locale.h and defined in utf8locale.c.
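I don't reproduce utf8locale.c here, but to give an idea of what such a test looks like, the following sketch relies on nl_langinfo(CODESET) from the Single Unix specification; the actual utf8locale.c may be implemented differently:

  #include <langinfo.h>
  #include <string.h>

  /* Nonzero if the current locale uses UTF-8 as its character encoding.
     Assumes setlocale(LC_ALL, "") has already been called. */
  int
  is_locale_utf8 (void)
  {
    const char *codeset = nl_langinfo (CODESET);
    return codeset != NULL
           && (strcmp (codeset, "UTF-8") == 0 || strcmp (codeset, "utf8") == 0);
  }

Code that needs Unicode semantics can then be guarded with if (is_locale_utf8 ()) { ... } and fall back to locale independent behaviour otherwise.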
The libutf8 library
A portable implementation of the ISO/ANSI C API, which supports 8-bit locales and UTF-8 locales, can be found in libutf8-0.7.3.tar.gz.
Advantages:
- Unicode UTF-8 support now, portably, even on OSes whose multibyte character support does not work or which don't have multibyte/wide character support at all.
- The same binary works in all OS supported 8-bit locales and in UTF-8 locales.
- When an OS vendor adds proper multibyte character support, you can take advantage of it by simply recompiling without the -DHAVE_LIBUTF8 compiler option (see the sketch below).
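In source code this typically amounts to a conditional include like the following sketch; I believe the header is called <libutf8.h>, but check the package's README for the exact convention:

  /* Compile with -DHAVE_LIBUTF8 and link with -lutf8 to use the portable
     implementation; recompile without the option to use the OS vendor's. */
  #ifdef HAVE_LIBUTF8
  # include <libutf8.h>
  #else
  # include <wchar.h>
  # include <wctype.h>
  #endif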
The Plan9 way
The Plan9 operating system, a variant of Unix, uses UTF-8 as the character encoding in all applications. Its wide character type is called `Rune', not `wchar_t'. Parts of its libraries, written by Rob Pike and Howard Trickey, are available at ftp://ftp.cdrom.com/pub/netlib/research/9libs/9libs-1.0.tar.gz. Another similar library, written by Alistair G. Crooks, is ftp://ftp.cdrom.com/pub/NetBSD/packages/distfiles/libutf-2.10.tar.gz. In particular, each of these libraries contains a UTF-8 aware regular expression matcher.
Drawback of this API:
- UTF-8 is compiled in, not optional. Programs compiled in this universe lose support for the 8-bit encodings which are still frequently used in Europe.
For graphical user interfaces
The Qt-2.0 library http://www.troll.no/ contains a fully-Unicode QString class. You can use the member functions QString::utf8 and QString::fromUtf8 to convert to/from UTF-8 encoded text. The QString::ascii and QString::latin1 member functions should not be used any more.
For advanced text handling
The previously mentioned libraries implement Unicode aware versions of the ASCII concepts. Here are libraries which deal with Unicode concepts, such as titlecase (a third letter case, different from uppercase and lowercase), distinction between punctuation and symbols, canonical decomposition, combining classes, canonical ordering and the like.
- ucdata-2.4
The ucdata library by Mark Leisher http://crl.nmsu.edu/~mleisher/ucdata.html deals with character properties, case conversion, decomposition, combining classes. The companion package ure-0.5 http://crl.nmsu.edu/~mleisher/ure-0.5.tar.gz is a Unicode regular expression matcher.
- ustring
The ustring C++ library by Rodrigo Reyes http://ustring.charabia.net/ deals with character properties, case conversion, decomposition, combining classes, and includes a Unicode regular expression matcher.
- ICU
International Components for Unicode http://oss.software.ibm.com/icu/. IBM's very comprehensive internationalization library featuring Unicode strings, resource bundles, number formatters, date/time formatters, message formatters, collation and more. Lots of supported locales. Portable to Unix and Win32, but compiles out of the box only on Linux libc6, not libc5.
- libunicode
The GNOME libunicode library http://cvs.gnome.org/lxr/source/libunicode/ by Tom Tromey and others. It covers character set conversion, character properties, decomposition.
For conversion
Two kinds of conversion libraries, which support UTF-8 and a large number of 8-bit character sets, are available:
iconv
The iconv implementation by Ulrich Drepper, contained in the GNU glibc-2.2. ftp://ftp.gnu.org/pub/gnu/glibc/glibc-2.2.tar.gz. The iconv manpages are now contained in ftp://ftp.win.tue.nl/pub/linux-local/manpages/man-pages-1.29.tar.gz.
The portable iconv implementation by Bruno Haible. ftp://ftp.ilog.fr/pub/Users/haible/gnu/libiconv-1.5.1.tar.gz
The portable iconv implementation by Konstantin Chuguev. <joy@urc.ac.ru> ftp://ftp.urc.ac.ru/pub/local/OS/Unix/converters/iconv-0.4.tar.gz
Advantages:
- iconv is POSIX standardized; programs using iconv to convert from/to UTF-8 will also run under Solaris. However, the names for the character sets differ between platforms. For example, "EUC-JP" under glibc is "eucJP" under HP-UX. (The official IANA name for this character set is "EUC-JP", so this is clearly an HP-UX deficiency.)
- On glibc-2.1 systems, no additional library is needed. On other systems, one of the two other iconv implementations can be used. (A usage sketch follows below.)
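Here is a small usage sketch; error handling is abbreviated, and as noted above the character set names may have to be spelled differently on some platforms:

  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  int main (void)
  {
    char in[] = "caf\xe9";                     /* "cafe" with e-acute, in ISO-8859-1 */
    char out[64];
    char *inptr = in, *outptr = out;
    size_t inleft = strlen (in), outleft = sizeof out;
    iconv_t cd;

    cd = iconv_open ("UTF-8", "ISO-8859-1");   /* to-encoding, from-encoding */
    if (cd == (iconv_t)(-1))
      return 1;                                /* conversion not supported */

    /* Note: on some platforms the second argument is declared const char **. */
    if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t)(-1))
      return 1;                                /* invalid or incomplete input */

    iconv_close (cd);
    fwrite (out, 1, sizeof out - outleft, stdout);
    putchar ('\n');
    return 0;
  }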
librecode
librecode by François Pinard ftp://ftp.gnu.org/pub/gnu/recode/recode-3.6.tar.gz.
Advantages:
- Support for transliteration, i.e. conversion of non-ASCII characters to sequences of ASCII characters in order to preserve readability by humans, even when a lossless transformation is impossible.
Drawbacks:
- Non-standard API.
- Slow initialization.
ICU
International Components for Unicode 1.7 http://oss.software.ibm.com/icu/. IBM's internationalization library also has conversion facilities, declared in `ucnv.h'.
Advantages:
- Comprehensive set of supported encodings.
Drawbacks:
- Non-standard API.
Other approaches
- libutf-8
libutf-8 by G. Adam Stanislav <adam@whizkidtech.net> contains a few functions for on-the-fly conversion from/to UTF-8 encoded `FILE*' streams. http://www.whizkidtech.net/i18n/libutf-8-1.0.tar.gz
Advantages:
- Very small.
Drawbacks:
- Non-standard API.
- UTF-8 is compiled in, not optional. Programs compiled with this library lose support for the 8-bit encodings which are still frequently used in Europe.
- Installation is nontrivial: the Makefile needs tweaking, and there is no autoconfiguration.
6.2 Java
Java has Unicode support built into the language. The type `char' denotes a Unicode character, and the `java.lang.String' class denotes a string built up from Unicode characters.
Java can display any Unicode characters through its windowing system AWT, provided that 1. you set the Java system property "user.language" appropriately, 2. the /usr/lib/java/lib/font.properties.language font set definitions are appropriate, and 3. the fonts specified in that file are installed. For example, in order to display text containing japanese characters, you would install japanese fonts and run "java -Duser.language=ja ...". You can combine font sets: In order to display western european, greek and japanese characters simultaneously, you would create a combination of the files "font.properties" (covers ISO-8859-1), "font.properties.el" (covers ISO-8859-7) and "font.properties.ja" into a single file. ??This is untested??
The interfaces java.io.DataInput and java.io.DataOutput have methods called `readUTF' and `writeUTF' respectively. But note that they don't use plain UTF-8; they use a modified UTF-8 encoding in which the NUL character is encoded as the two-byte sequence 0xC0 0x80 instead of a single 0x00 byte. Encoded this way, strings never contain embedded 0x00 bytes, so the C <string.h> functions like strlen() and strcpy() can still be used to manipulate the encoded bytes.
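To make the difference concrete, here is a sketch in C of how a single 16-bit character is encoded in this modified form; only characters in the Basic Multilingual Plane are handled, and the only departure from plain UTF-8 shown here is the overlong two-byte form for NUL:

  #include <stddef.h>

  /* Encode one 16-bit Unicode code unit in Java's modified UTF-8.
     Writes at most 3 bytes into out and returns the number written. */
  size_t
  java_modified_utf8 (unsigned int c, unsigned char *out)
  {
    if (c == 0x0000)
      {
        out[0] = 0xC0; out[1] = 0x80;                  /* NUL: never emit a 0x00 byte */
        return 2;
      }
    if (c < 0x80)
      {
        out[0] = (unsigned char) c;                    /* plain ASCII */
        return 1;
      }
    if (c < 0x800)
      {
        out[0] = (unsigned char) (0xC0 | (c >> 6));
        out[1] = (unsigned char) (0x80 | (c & 0x3F));
        return 2;
      }
    out[0] = (unsigned char) (0xE0 | (c >> 12));
    out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (c & 0x3F));
    return 3;
  }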
6.3 Lisp
The Common Lisp standard specifies two character types: `base-char' and `character'. It's up to the implementation to support Unicode or not. The language also specifies a keyword argument `:external-format' to `open', as the natural place to specify a character set or encoding.
Among the free Common Lisp implementations, only CLISP http://clisp.cons.org/ supports Unicode. You need a CLISP version from March 2000 or newer. ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz. The types `base-char' and `character' are both equivalent to 16-bit Unicode. The functions char-width and string-width provide an API comparable to wcwidth() and wcswidth().
The encoding used for file or socket/pipe I/O can be specified through the
`:external-format' argument. The encodings used for tty I/O and the default
encoding for file/socket/pipe I/O are locale dependent.
Among the commercial Common Lisp implementations:
LispWorks http://www.xanalys.com/software_tools/products/ supports Unicode. The type `base-char' is equivalent to ISO-8859-1, and the type `simple-char' (subtype of `character') contains all Unicode characters. The encoding used for file I/O can be specified through the `:external-format' argument, for example '(:UTF-8). Limitations: Encodings cannot be used for socket I/O. The editor cannot edit UTF-8 encoded files.
Eclipse http://www.elwood.com/eclipse/eclipse.htm supports Unicode. See http://www.elwood.com/eclipse/char.htm. The type `base-char' is equivalent to ISO-8859-1, and the type `character' contains all Unicode characters. The encoding used for file I/O can be specified through a combination of the `:element-type' and `:external-format' arguments to `open'. Limitations: Character attribute functions are locale dependent. Source and compiled source files cannot contain Unicode string literals.
The commercial Common Lisp implementation Allegro CL, in version 6.0, has Unicode support. The types `base-char' and `character' are both equivalent to 16-bit Unicode. The encoding used for file I/O can be specified through the `:external-format' argument, for example :external-format :utf8. The default encoding is locale dependent. More details are at http://www.franz.com/support/documentation/6.0/doc/iacl.htm.
6.4 Ada95
Ada95 was designed for Unicode support and the Ada95 standard library features special ISO 10646-1 data types Wide_Character and Wide_String, as well as numerous associated procedures and functions. The GNU Ada95 compiler (gnat-3.11 or newer) supports UTF-8 as the external encoding of wide characters. This allows you to use UTF-8 in both source code and application I/O. To activate it in the application, use "WCEM=8" in the FORM string when opening a file, and use compiler option "-gnatW8" if the source code is in UTF-8. See the GNAT ( ftp://cs.nyu.edu/pub/gnat/) and Ada95 ( ftp://ftp.cnam.fr/pub/Ada/PAL/userdocs/docadalt/rm95/index.htm) reference manuals for details.
6.5 Python
Python 2.0 ( http://www.python.org/2.0/, http://www.python.org/pipermail/python-announce-list/2000-October/000889.html, http://starship.python.net/crew/amk/python/writing/new-python/new-python.html) contains Unicode support. It has a new fundamental data type `unicode', representing a Unicode string, a module `unicodedata' for the character properties, and a set of converters for the most important encodings. See http://starship.python.net/crew/lemburg/unicode-proposal.txt, or the file Misc/unicode.txt in the distribution, for details.
6.6 JavaScript/ECMAscript
Since JavaScript version 1.3, strings are always Unicode. There is no character type, but you can use the \uXXXX notation for Unicode characters inside strings. No normalization is done internally, so strings are expected to already be in Unicode Normalization Form C, which the W3C recommends. See http://developer.netscape.com/docs/manuals/communicator/jsref/js13.html#Unicode for details and http://developer.netscape.com/docs/javascript/e262-pdf.pdf for the complete ECMAscript specification.
6.7 Tcl
Tcl/Tk started using Unicode as its base character set with version 8.1. Its internal representation for strings is UTF-8. It supports the \uXXXX notation for Unicode characters. See http://dev.scriptics.com/doc/howto/i18n.html.
6.8 Perl
Perl 5.6 stores strings internally in UTF-8 format, if you write `use utf8;' at the beginning of your script. length() returns the number of characters of a string. For details, see the Perl-i18n FAQ at http://rf.net/~james/perli18n.html.
Support for other (non-8-bit) encodings is available through the iconv interface module http://cpan.perl.org/modules/by-module/Text/Text-Iconv-1.1.tar.gz.
6.9 Related reading
Tomohiro Kubota has written an introduction to internationalization http://www.debian.org/doc/manuals/intro-i18n/. The emphasis of his document is on writing software that runs in any locale, using the locale's encoding.