Re: tlug: A couple of questions about Unicode
- To: tlug@example.com
- Subject: Re: tlug: A couple of questions about Unicode
- From: kls@example.com (Ken Schwarz)
- Date: Sat, 10 Jan 1998 02:53:17 +0900
- In-Reply-To: <199801091717.CAA03920@example.com> (jbyrne@example.com)
- Ncc: tlug@example.com
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
Here are some notes I wrote a while ago on the subject of Unicode and conversion to other Japanese encodings. I'd appreciate comments from others with experience in this. Hope it helps.

- Ken

------

Unicode is supposed to be the ultimate encoding, providing a uniform standard to handle every language in the world. Java is helping make this a reality. Unicode comes close to being everything for everyone in Japan, but it is flawed in some minor and not-so-minor ways.

The biggest problem with Unicode is that vendors have not implemented it the same way. For example, the Sun Java VM on Windows NT defines the SJIS<->Unicode mapping slightly differently from the way Windows NT itself handles it internally. So much for universality.

Because these differences appear not only across platforms but also between applications running on the same platform, it is impossible to provide comprehensive handling of Unicode conversion. What we propose is a canonical converter (which is correct for 99.99% of a user's needs) plus the ability to override the canonical converter in a generic way. We can provide information on override values in FAQs or documentation and update them as new issues arise.

Among the minor issues, the most important is that Unicode omits a range of SJIS characters found only in Microsoft Windows. Although these characters (such as circled roman numerals) are useful and fairly popular, the Unicode committees did not recognize them as legitimate characters, so they do not have Unicode mappings. These characters are deprecated everywhere, so we should see fewer and fewer of them as time goes on.

The second is that there are a handful of cases in which a pair of SJIS characters maps to a single character in Unicode. This class of exceptions is considered extremely minor in practice, and results from different editions (1983 and 1990) of the Japanese character standards having been used as the basis of SJIS and Unicode. It seems to have no practical impact on the use of Unicode in Japan and is objectionable primarily to some academics involved in the standards committees.

It's also worth pointing out that many people confuse "Unicode" and "ISO 10646". Unicode is the equivalent of a UCS-2 Level-3 ISO 10646 implementation. UCS-2 means all data is managed in 2-octet (16-bit) words, as opposed to the 4-octet (32-bit) words of UCS-4. However, Level-3 means that characters may be combined without restriction, so it is wrong to assume that every character is expressed in 16 bits. A letter followed by a combining accent, for example, is displayed as a single character but occupies two 16-bit words (4 octets). Unicode is not simply a wide-char version of 8-bit char data; it is effectively a multibyte encoding. Going with Unicode to avoid the complexities of multibyte handling is therefore misguided. In practice, though, it looks like 16-bit and Unicode are becoming synonymous, since Microsoft and Java treat them that way.

Since a Unicode character is a 16-bit quantity, byte order depends on platform architecture. On little-endian systems (Intel), the low-order byte comes first, whereas on big-endian systems (SPARC, HP, MIPS) the high-order byte comes first. Java Unicode is always big-endian, even on Windows machines!

---------------------------------------------------------------
Next TLUG Nomikai: 14 January 1998 19:15 Tokyo station Yaesu Chuo ticket gate.
Or go directly to Tengu TokyoEkiMae 19:30
Chuo-ku, Kyobashi 1-1-6, EchiZenYa Bld. B1/B2 03-3275-3691
Next Saturday Meeting: 14 February 1998 12:30 Tokyo Station Yaesu Chuo ticket gate.
---------------------------------------------------------------
a word from the sponsor:
TWICS - Japan's First Public-Access Internet System
www.twics.com info@example.com Tel:03-3351-5977 Fax:03-3353-6096
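
The "canonical converter plus generic override" idea proposed in the notes above could be sketched roughly as follows. This is only an illustration, not the converter the notes refer to: the function name `sjis_to_unicode` and its `overrides` parameter are hypothetical, and Python's built-in `shift_jis` codec stands in for the canonical mapping. The sample override uses one well-known vendor difference: SJIS 0x8160 maps to U+301C (WAVE DASH) in the JIS-based table but to U+FF5E (FULLWIDTH TILDE) in Microsoft's code page 932.

```python
from typing import Dict, Optional


def sjis_to_unicode(data: bytes,
                    overrides: Optional[Dict[bytes, str]] = None,
                    codec: str = "shift_jis") -> str:
    """Decode SJIS bytes one character at a time.

    `overrides` (hypothetical) maps raw SJIS byte sequences to the text
    the caller wants instead of the canonical codec's result.
    """
    overrides = overrides or {}
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Lead bytes 0x81-0x9F and 0xE0-0xFC start a two-byte SJIS character;
        # everything else (ASCII, half-width katakana) is a single byte.
        width = 2 if (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC) else 1
        chunk = data[i:i + width]
        if chunk in overrides:
            out.append(overrides[chunk])      # caller-supplied mapping wins
        else:
            out.append(chunk.decode(codec))   # fall back to the canonical table
        i += width
    return "".join(out)


# Hiragana "a" (0x82A0) followed by the character at SJIS 0x8160.
sample = b"\x82\xa0\x81\x60"

canonical = sjis_to_unicode(sample)
windows_style = sjis_to_unicode(sample, {b"\x81\x60": "\uff5e"})

print([hex(ord(c)) for c in canonical])
print([hex(ord(c)) for c in windows_style])
```

Keeping the overrides in a plain table means new vendor discrepancies can be published (in a FAQ, say) and applied without touching the converter itself.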
- Follow-Ups:
- Re: tlug: A couple of questions about Unicode
- From: "J. David Beutel" <jdb@example.com>
- References:
- tlug: A couple of questions about Unicode
- From: "Jonathan Byrne" <jbyrne@example.com>