tlug: Re: What decides Japanese file name encoding?

To: tlug@example.com
Subject: tlug: Re: What decides Japanese file name encoding?
From: "Stephen J. Turnbull" <turnbull@example.com>
Date: Fri, 6 Aug 1999 16:38:37 +0900 (JST)
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=iso-2022-jp
In-Reply-To: <37AA6443.AE21B911@example.com>
References: <199908050632.PAA23886@example.com><37AA6443.AE21B911@example.com>
Reply-To: tlug@example.com
Sender: owner-tlug@example.com
>>>>> "Jim" == Jim Blackson <blackson@example.com> writes:

    Jim> Thanks for the reply.  On Wed, 4 Aug 1999 15:02:40 +0900
    Jim> (JST) Stephen J. Turnbull wrote:
    >> ... (see my earlier post re Unicode implementation)

    Jim> I searched the tlug mailing list archive and found some posts
    Jim> from 9705 or 9801.  Anything more recent?

Yeah, August 4, 1999 ;-)   Sorry about the misleading subject:

    From: "Stephen J. Turnbull" <turnbull@example.com>
    To: tlug@example.com
    Message-ID: <14247.52981.74635.211486@example.com>
    Subject: Re: tlug: Re: pine, mutt, Chinese, Japanese
    Date: Wed, 4 Aug 1999 14:26:13 +0900 (JST)

The relevant part was:

    sjt> Unicode is going to require a certain amount of
    sjt> implementation of infrastructure.  The problem is that
    sjt> Unicode does not preserve collating orders and the like for
    sjt> anything except American English (and maybe British English).
    sjt> So sorts are going to have to be table-driven.  This is
    sjt> actually a good thing; JIS order isn't really all that
    sjt> interesting.  It would make it very easy to specify a sort
    sjt> like "kyouiku kanji by year, first, then jouyou kanji, then
    sjt> other Japanese kanji, then non-Japanese kanji, then other
    sjt> characters" by writing appropriate tables.  (Not to mention
    sjt> "unifying" zen and hankaku romaji, etc.)


The rest of that thread may be applicable to your situation, too.  
Continuing.

    Jim> How does this locale stuff really work?

    >> Really work?  Really badly, at least for Japanese.  That's
    >> required by JIS standard I believe.

I should also add that the locale APIs are completely unsuited for use
in multilingual applications, especially if threaded.  Anything plus
English is usually more or less OK because ASCII is a subset of
everything (so you can hope that the other locale won't mess up
English processing too much), but other combinations make things hairy.

    Jim> I wondered if (had hoped?) the Japanese kernel uses
    Jim> internally some standard encoding for filenames, etc., and
    Jim> that there would be some document/spec describing how it is
    Jim> (to be) implemented.

Japanese is standards-darake.  Unfortunately, except for Unicode,
which is politically incorrect (and as yet only sparsely implemented),
all of them have serious problems from the point of view of either the
kernel or some parts of the user community.  So the effect is that
there is no standard.  To be fair, it's hardly better outside of Japan
in terms of _internationalization_ standards, but Japan is the only
country where there is not a single de facto standard.

"Japanese kernel"?  Aaargh.

<RANT>
There should be no such thing as a Japanese Linux kernel.  Such
abominations (unfortunately) do in fact exist, but in the immortal
phrasing of xemacs/src/lisp.h,

/* Close your eyes now lest you vomit or spontaneously combust ... */

I will say no more about such things; Tipper Gore would molest me for
arresting children.
</RANT>

The point is that kernels should only worry about shuffling very
abstract things like byte streams and some simple very well-defined
things like IP packets and filesystem inodes around (and microkernel
advocates would say even IP packets and inodes belong in userland).
Everything else should be accomplished outside of the kernel, in user
processes (including some very privileged processes like `init').  If
you maintain such a clean interface, the kernel will not be language-
dependent at all (except for the unavoidable need to have static error 
messages, and even there it might be possible to do something for all
but the most primitive kernel functions---Heaven help log processing
software, though ;-).

I suppose you are aware that, despite (allegedly) using Unicode
internally, the only Microsoft-provided solution to using Chinese and
Japanese on the same box is dual-booting?  I'm pretty sure that one of 
the main obstacles is the fact that FAT filesystems (including VFAT)
use Shift-JIS and Big-5 encodings for Japanese and Taiwanese Chinese
respectively.  File name parsers have to be rather tricky as these are 
state-dependent encodings.  And neither one is compatible with
ISO-2022 mechanisms for shifting encodings (more state-dependence, and 
thus inappropriate anyway).

Embedding that kind of trickiness in the kernel is just not
acceptable, even for localized use.  Note that you can't be sure that
application programs will pass in acceptable file names; your kernel
needs to be able to handle putative file names that contain
non-characters!  (It's an error, of course, but you need to make sure
that you can handle errors like that.  So much easier if _all_ input
strings have interpretations as legal filenames, and you can just
return "no match".)  Not to mention that multilingual people (I'm not
one, I'm just a groupie, but there are several on this list) would
find a Japanization solution that rules out Chinese unacceptable.

So the bottom line is that the kernel _can't_ be responsible for
conversions.  The fact that there are no standards suggests that this
is hard to automate.  And "since using Microsoft products means never
having to say you're sorry" you can't depend on standards being
followed anyway; Outlook Express, for one, regularly lies about the
character set being used.  (Pine does, too, for that matter, but it
doesn't pretend to be well-internationalized.)

In fact, in turns out that it's pretty much every application for
itself, since different applications specify various restrictions.
Mail standards prohibit non-ASCII bytes in headers and deprecate them
in message bodies, but 7-bit encodings are inappropriate for file
systems precisely because they are too easy to confuse with ASCII.
The right way to go is with a universal encoding; for the US and
Europe that means ISO-8859-1, and you don't hear too many complaints
about internationalization there since the majority of computer users
worldwide still need only ISO-8859-1 to encode everything.  Even for
non-US users, a switch to UTF-8 for file system and network use could
be fairly transparent (how many users use binary editors on their
directory nodes or read their mail by attaching `less' as a listener
on port 25?)  So it seems pretty clear that Unicode is the way to go.

The big problem is that all of the helper conversion programs,
editors, etc are not yet Unicode-aware.  And unlike ISO-8859-1 where
UTF-8 collates the same way for the Basic Latin subset as ISO-8859-1
does, Unicode is effectively random compared to any of the Oriental
sets.  This will play hell with localized grep, awk, perl, and so on.
Big effort involved.  And some people don't like the idea in principle.

So Unicode is out.  Shift-JIS is out.  Plain JIS is out, since the
meaning is ambiguous (is bigendian 0x2331 the two characters "#1" or
is it the single character "$B#1(B"?)  What's left?

You _can_ Japanize your system by using EUC-JP everywhere.  This is
what commercial Unices do, and what Japanese free *nix distributions
do.  EUC is file system safe for the same reason that UTF-8 is (all
bytes in the range 0x00-0x7F are interpreted as single ASCII
characters, and never as components of a multibyte character).  But
this has a big problem with three aspects: it's a Japanization, not an
internationalization or multilingualization.

The first aspect is purely aesthetic.  I think Japanization is
inelegant and parochial.

The second may or may not be important depending on your application:
you will need to hack around the Japanization to extend your
application to other locales.  For example, if you are storing data
per person in subdirectories with their name, and somebody named
"Christian Nyb,Ax(B" shows up, you won't get their name right, since it
will show up as (broken) Japanese on rereading.  So you'll have to
test and filter out non-Japanese 8-bit character sets in that
application.  Bletch.  (That's not going to be easy, either, since
there is no way to distinguish between a well-formed string in EUC-JP
and one in ISO-8859-any, without using the semantic content.  There
are ISO-8859 strings which are ill-formed for EUC-JP, though, and
again you're going to have to worry about what your application is
going to do with that.  Linux file systems will be OK, but application
level mechanisms can bomb.  This can't happen with Unicode; you'll
just get "no match".)

Finally, if your product is going to be internationalized, Japanizing
it is probably a permanent commitment to maintaining a rather
different Japanese version, because the Japanese locales are all
currently broken in glibc (well, they were as of May), and glibc is
not coming along very quickly with respect to multibyte characters.
Ulrich Drepper, who maintains glibc, blames that on programmers who
submit unusable patches for the wctombs library functions.  Uli is
evidently capable of being a jerk, but he is the maintainer.  So
unless the people working on libwctombs come around to his point of
view, there isn't going to be a standard API to deal with both Asian
languages and European ones for a while.

And still you must deal with non-EUC stuff from outside of Known
Space.  For starters, mail and news messages will usually be in
ISO-2022-JP, but unless you have no Windows-using correspondents you
will surely receive some Shift-JIS messages.  You'll probably get a
few in EUC, too, which may break your mail software.  Not to mention
all the Godawful web pages generated with MS Weird etc.  And watch out
for filenames in Shift-JIS or whatever in mail attachments, say.

"The only standard around here is that we have no standards."


-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
-------------------------------------------------------------------
Next Technical Meeting: August 14 (Sat), 13:00  place: Temple Univ.
*** Special guest: Marc Christensen (Salt Lake Linux Users Group)
Next Nomikai: September 20 (Fri), 19:30 Tengu TokyoEkiMae 03-3275-3691
-------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan
References:
- tlug: Re: What decides Japanese file name encoding?
  - From: Jim Blackson <blackson@example.com>
Prev by Date: tlug: Re: What is UDF?
Next by Date: tlug: export DISPLAY / existing nonexisting files
Prev by thread: tlug: Re: What decides Japanese file name encoding?
Next by thread: tlug: Re: What is UDF?
Index(es):
- Date
- Thread
Home | Main Index | Thread Index