
Re: [tlug] "UTF-8 & ISO-2022-JP"
- Date: Tue, 06 Dec 2005 14:46:31 +0900
- From: "Lyle (Hiroshi) Saxon" <ronfaxon@example.com>
- Subject: Re: [tlug] "UTF-8 & ISO-2022-JP"
- References: <4393C9A2.7000103@example.com>
- Organization: Images Through Glass
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511
Another bit of text from the (external) general discussion (I won't make
a habit of doing this, but think it's relevant in this case). - Lyle
[LHS] One non-technical observation. I've been asking my Japanese
friends about their experiences with mutated e-mail, and nearly all of
them say that they still have trouble with it from time to time -
although they're having fewer problems now than they were before. [LHS]
There are other reasons why unencoded mail breaks. SMTP doesn't
guarantee that white space is preserved - usually it survives, but
extra space characters can be added or removed in transit. That's
partly why Microsoft jumped on the HTML bandwagon: HTML is oblivious
to added or removed white space and newlines, so formatting isn't
destroyed. Before HTML, text formatting in email was a headache.
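Incidentally, this is exactly what content-transfer encodings are for.
A minimal Python sketch (the addresses are placeholders): wrapping the
body in base64 turns it into 7-bit ASCII, so relays that touch white
space or 8-bit bytes can't damage the decoded text.

    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "encoding test"
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    # cte="base64" wraps the UTF-8 body in 7-bit-safe base64, so a relay
    # that adds or strips white space can't corrupt the decoded text
    msg.set_content("こんにちは\n    indented line\n", cte="base64")
    print(msg.as_string())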
A big problem with Asian encodings is that the multibyte (i.e.
pre-Unicode) encodings use one or two shift (escape) sequences, and
that's a brittle idea, because a shift applies to all subsequent
characters until it's undone. If the part of the document containing a
shift is lost or garbled, then every subsequent symbol loses its
context and may be mis-decoded.
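The brittleness is easy to see in a quick Python sketch (assuming the
iso-2022-jp codec): encode a mixed string, then drop the one three-byte
escape sequence that switches into JIS X 0208, as a lossy relay might.

    # ISO-2022-JP switches character sets with escape sequences:
    # ESC $ B enters JIS X 0208, ESC ( B returns to ASCII
    raw = "Hello こんにちは".encode("iso-2022-jp")
    print(raw)   # b'Hello \x1b$B$3$s$K$A$O\x1b(B'

    # Simulate a relay corrupting the message: drop the shift-in sequence
    damaged = raw.replace(b"\x1b$B", b"")
    # Every following byte is now read in the wrong (ASCII) state
    print(damaged.decode("iso-2022-jp", errors="replace"))
    # -> 'Hello $3$s$K$A$O' - the Japanese is reduced to ASCII soup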
This gets particularly difficult when, say, Japanese text is mixed with
European text. The document might start in English and switch to
Japanese at some point, and software might not be careful about keeping
the text as-is - in English, adding an extra space isn't the end of the
world, and so on. In this sort of situation, software will often
misconstrue the Japanese as some extended Latin character set.
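That effect is easy to reproduce (another Python sketch): take Japanese
text in a pre-Unicode encoding and decode it as if it were Latin-1,
which is effectively what a careless mail client does.

    # Japanese bytes read as Latin-1 turn into accented-Latin mojibake
    jp = "日本語".encode("euc-jp")   # b'\xc6\xfc\xcb\xdc\xb8\xec'
    print(jp.decode("latin-1"))      # -> 'ÆüËÜ¸ì'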
[LHS] The fact that there are so bloody many different "standards" for
Japanese text is really horrible though! [LHS]
Indeed, but it's essentially a problem with the number of ideograms
available in the language. The character sets were designed for
different size/vocabulary tradeoffs, because in the early days it seemed
wasteful to reserve character codes for very rarely used symbols. So the
popular character sets are limited to save space, and that in turn means
that there needs to be another, more complete character set for
specialized applications. Worse, even when you restrict attention to the
most popular characters, two bytes is not enough to represent all the
major Chinese-derived Asian scripts _simultaneously_. So China, Korea,
etc. designed their own flavours, which are very close but not
identical. The Cyrillic-based character sets show the same proliferation
of flavours.
Of course, close but not identical means that automatic detection of the
character set is hard, because the most common characters occupy the
same code points in all the flavours.
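Here's the detection problem in miniature (a Python sketch; the byte
pairs are hand-picked to be structurally valid in more than one
encoding): the same bytes decode cleanly, to different characters, under
two Japanese encodings, so byte-level validity alone can't identify the
character set.

    # Hand-picked bytes that are valid in both encodings
    blob = b"\xe0\xa1\xe0\xa2"
    for enc in ("shift_jis", "euc_jp"):
        print(enc, "->", blob.decode(enc, errors="replace"))

    # And plain ASCII is identical everywhere, so short texts give
    # a detector nothing to work with:
    print(b"Hello".decode("shift_jis") == b"Hello".decode("euc_jp"))  # True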
It was all supposed to be fixed with Unicode, but it too has problems.
Full Chinese has a _lot_ of characters, and then the professional
organizations decided they wanted space for their own symbols etc. So
UTF-8 can take up to four bytes for a single character (the original
design allowed up to six), but with that many symbols it's not well
supported. At least Unicode doesn't use a shift sequence, but it uses up
a lot of space, and in its wide forms (UCS-2/UTF-16, UTF-32) it doesn't
play well with the C language, because most characters contain NUL
bytes. So Unicode is often mangled unless it's handled by specifically
Unicode-aware software.
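A two-line Python sketch shows why C string handling chokes on the wide
forms while UTF-8 stays safe:

    s = "Hi"
    print(s.encode("utf-16-le"))  # b'H\x00i\x00' - embedded NULs,
                                  # so C's strlen() would stop after 'H'
    print(b"\x00" in s.encode("utf-8"))  # False: UTF-8 never embeds
                                         # NUL for non-NUL characters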
Then there was Microsoft, who decided to do their own thing.