
Re: [tlug] "UTF-8 & ISO-2022-JP"



Another bit of text from the (external) general discussion (I won't make a habit of doing this, but I think it's relevant in this case). - Lyle

[LHS] One non-technical observation. I've been asking my Japanese friends about their experiences with mutated e-mail, and nearly all of them say that they still have trouble with it from time to time - although they say they're having fewer problems now than before. [LHS]
There are other reasons why unencoded mail breaks. SMTP doesn't guarantee 
that white space is preserved - usually it is, but extra space characters 
can be added or removed in transit. That's why Microsoft jumped on the 
HTML bandwagon: HTML is oblivious to added or removed white space and 
newlines, so formatting isn't destroyed. Before HTML, text formatting 
in email was a headache.
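The conventional fix on the mail side is a content-transfer-encoding 
such as quoted-printable. A minimal sketch in Python (my illustration, 
not from the original discussion), using the standard quopri module:

    import quopri

    # Trailing spaces are exactly what a relay may silently strip.
    body = "column one    column two   \n"
    print(quopri.encodestring(body.encode("ascii")))
    # b'column one    column two  =20\n'
    # The fragile trailing space is encoded as =20 and survives transit.
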
A big problem with the Asian encodings is that the multibyte (i.e. 
pre-Unicode) encodings use one or two shift (escape) sequences, and 
that's a brittle design, because a shift applies to all subsequent 
characters until it's undone. If the part of the document containing a 
shift is lost or garbled, then all subsequent characters lose their 
context and may be mis-decoded.
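Here's a minimal Python sketch of that brittleness (my example, not 
from the original post), using ISO-2022-JP, which shifts into the JIS 
character set with ESC $ B and back to ASCII with ESC ( B:

    # Encode three kanji; the shift sequences bracket the Japanese bytes.
    data = "日本語".encode("iso-2022-jp")
    print(data)  # b'\x1b$BF|K\\8l\x1b(B'

    # Simulate a transport garbling the 3-byte shift-in sequence:
    corrupted = data.replace(b"\x1b$B", b"", 1)
    print(corrupted.decode("iso-2022-jp"))  # 'F|K\8l' - kanji misread as ASCII
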
This gets particularly difficult when Japanese text is mixed with, say, 
European text. The document might start in English and switch to 
Japanese at some point, and software might not be careful about keeping 
the bytes as-is - in English, adding an extra space isn't the end of the 
world, and so on. In this sort of situation, software will often 
misconstrue Japanese as some extended Latin character set.
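For instance (a hypothetical sketch in Python, with Shift_JIS picked 
purely for illustration), a Latin-1 decoder accepts any byte sequence, 
so the misreading is silent:

    # Shift_JIS bytes for "日本語" misread as Latin-1: no error, just mojibake.
    data = "日本語".encode("shift_jis")
    print(data.decode("latin-1"))  # control codes plus 'ú', '{', 'ê' - not kanji
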
[LHS] The fact that there are so bloody many different "standards" for 
Japanese text is really horrible though! [LHS]
Indeed, but it's essentially a problem of how many ideograms the 
language has. The character sets were designed for different 
size/vocabulary tradeoffs, because in the early days it seemed wasteful 
to reserve character codes for very rarely used symbols. So the popular 
character sets are limited to save space, and that in turn means that 
there needs to be another, more complete character set for specialized 
applications. Worse, even when you restrict attention to the most 
popular characters, two bytes is not enough to represent all the major 
Chinese-derived Asian scripts _simultaneously_. So China, Korea, etc. 
designed their own flavours, which are very close but not identical. The 
Cyrillic-based character sets have the same proliferation of flavours. 
Of course, close but not identical means that automatic detection of the 
character set is hard, because all the most common language constructs 
are the same in all the flavours.
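A small Python illustration of why (mine, not from the thread): the 
very same two bytes are legal in the Japanese, Chinese and Korean 
EUC-style encodings, with three different meanings.

    raw = b"\xb0\xa1"  # a legal double-byte sequence in all three encodings below
    for enc in ("euc-jp", "gb2312", "euc-kr"):
        print(enc, raw.decode(enc))
    # euc-jp 亜 (Japanese), gb2312 啊 (Chinese), euc-kr 가 (Korean).
    # All three decodes succeed, so a detector can only guess from statistics.
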
It was all supposed to be fixed by Unicode, but it too has problems. 
Full Chinese has a _lot_ of characters, and then the professional 
organizations decided they wanted space for their own symbols etc. So 
UTF-8 uses up to four bytes for a single character (the original 
specification allowed up to six), but with that many symbols the full 
repertoire is not well supported. At least Unicode doesn't use shift 
sequences, but it uses up a lot of space, and the fixed-width forms 
(UCS-2/UCS-4) don't play well with the C language because most 
characters contain NUL bytes. So Unicode is often mangled unless it's 
handled by specifically Unicode-aware software.
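The NUL problem takes a few lines of Python to see (again just an 
illustrative sketch): the fixed-width forms put zero bytes inside 
almost every character, which C's string routines treat as 
end-of-string, while UTF-8 never does.

    text = "Ab"
    print(text.encode("utf-32-le"))  # b'A\x00\x00\x00b\x00\x00\x00' - NULs inside
    print(text.encode("utf-16-le"))  # b'A\x00b\x00'
    print(text.encode("utf-8"))      # b'Ab' - no embedded NULs; strlen() survives
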
Then there was Microsoft, who decided to do their own thing.



