Re: SJIS & HTML - potential trouble?
- To: tlug@example.com
- Subject: Re: SJIS & HTML - potential trouble?
- From: turnbull@example.com (Stephen J. Turnbull)
- Date: Wed, 20 Nov 96 09:35 JST
- CC: tlug@example.com
- In-reply-to: <199611200000.LAA00937@example.com> (jwb@example.com)
- Reply-To: tlug@example.com
- Sender: owner-tlug
>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

    Jim> Why on earth would SJIS be dear to anyone's heart?? :-)

Bill Gates must love the wonderful joke he played on thousands of
hapless nihongo programmers.

    Jim> Well I can quite imagine some Americo-centric programmer
    Jim> stumbling on codes > 128. OTOH, do they really write parsers
    Jim> that could not handle the ISO-8859-1 codes which are very
    Jim> widely used in Europe?

Be a little fair; almost nobody writes code that isn't 8-bit clean
anymore; the big problem was that "8-bit-dirty" was embedded in lots
and lots of libc.a's.  Oriental languages which are inherently 2-byte
*must* by the RFC mix with single-byte ISO646 ("bare ASCII", you might
say), and that is surely hairy.

    Jim> Seriously, though, people writing parsers, etc, should be
    Jim> producing code which is: (a) configurable for a series of
    Jim> multi-byte codes with the MSB set and not set (b) able to
    Jim> handle the UTF codings of Unicode/ISO10646

You don't ask for much, do you?  I've looked at the source for Mule,
and it's hairy; no, let me say it's positively furry.

Let's at least say that Netscape 2.0 international beta regularly
choked on documents including JIS and EUC codes, both in auto-code
mode and in assume-JIS mode.  To its credit, it always (in my
experience) retrained to the correct mode after a couple of bytes, but
I lost a few paragraph markers (I forget what "<p" is in escapeless
JIS) and gained many extras that way.

I'm not sure what exactly is legal in HTML, but I suspect you need to
read RFC-MIME as well as RFC-HTML.  I wouldn't be surprised if a
strict reading of the RFCs led to the conclusion that each passage in
an oriental language needs to be embedded in a separate part of a MIME
multi-part document.

What really needs to be done is a solid GPL-(or freer)-license lexing
library which does all the above and also is extensible for national
standards which are old and incompatible with the Unicode standard.
This is not a project I'm willing to attempt at present, though.
Presumably the Mule internal routines could be adapted, or jcode.c
converted into a library (although the latter is just as Japan-centric
as 7-bit ASCII is Americo-centric).

>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

    Jim> "Proper" handling of SJIS (an oxymoron if there ever was one)
    Jim> involves a lot of checking for valid/invalid sequences, as
    Jim> you have to cater for the unspeakable hankaku katakana as

"Unspeakable?"  Is this a technical linguistics term?  :-)

    Jim> well. Trying to scan backwards, e.g. in a WP program, through
    Jim> some raw SJIS sends you grey. Usually developers do something
    Jim> like holding everything as 16-bit codes internally.

Mule uses 32-bit codes, mostly!!

    Jim> (No excuse for bad parsing, though.)

I've tried to write lex code to reproduce Ken Lunde's jcode.c; it's
not easy unless you're looking at Ken's source.  The author of xjdic
should know this, though :-)

Let's face it, the Japanese language is fundamentally just an attempt
to postpone the day when Turing's Test is passed.  What you're saying
is that all programmers need to learn Japanese *and* Japanese
character codesets?

Realistically, we're going to have to accept that Japanese is not
going to be well-handled by most software for some time to come, until
most authors are using Unicode-emitting tools.

Right.  You really expect that on your Sharp Internet-capable wapuro?
Maybe, but...

-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tennodai 1-1-1, Tsukuba, 305 JAPAN                      turnbull@example.com
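
A minimal sketch, in Python, of the kind of checking Jim describes for
raw SJIS: a forward pass that rejects invalid sequences (treating
hankaku katakana, 0xA1-0xDF, as single bytes) and a backward step that
finds where the previous character starts.  The byte ranges are the
commonly documented Shift_JIS ones; accepting 0xF0-0xFC as lead bytes
(vendor extensions) is an assumption, and the function names are made
up for illustration.  Nothing here is taken from jcode.c or Mule.

```python
from typing import List


def is_lead(b: int) -> bool:
    # First byte of a two-byte character (0xF0-0xFC assumed to be allowed).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_trail(b: int) -> bool:
    # Second byte of a two-byte character; note the overlap with ASCII 0x40-0x7E.
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def char_starts(data: bytes) -> List[int]:
    """Forward scan: return the start offset of every character, or raise."""
    starts = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead(b):
            if i + 1 >= len(data) or not is_trail(data[i + 1]):
                raise ValueError(f"bad or missing trail byte after offset {i}")
            starts.append(i)
            i += 2
        elif b <= 0x7F or 0xA1 <= b <= 0xDF:
            # ASCII / JIS-Roman, or single-byte hankaku katakana.
            starts.append(i)
            i += 1
        else:
            raise ValueError(f"invalid byte 0x{b:02X} at offset {i}")
    return starts


def prev_char_start(data: bytes, pos: int) -> int:
    """Backward step: start of the character that ends just before pos.

    Assumes data is valid SJIS and pos is a character boundary.
    """
    if pos <= 0:
        raise ValueError("no character before offset 0")
    last = pos - 1
    # Walk back over the run of lead-range bytes in front of the last byte.
    k = last
    while k > 0 and is_lead(data[k - 1]):
        k -= 1
    # Offset k is a boundary; the run pairs up from there.  An odd run
    # means the final lead byte owns data[last] as its trail byte.
    return last - 1 if (last - k) % 2 else last


# 0xB1 = hankaku katakana, 0x95 0x5C = a kanji whose trail byte is also '\'.
sample = b"\xb1A\x95\x5c<p>"
print(char_starts(sample))         # [0, 1, 2, 4, 5, 6]
print(prev_char_start(sample, 4))  # 2
```

The backward step cannot trust the byte just before the cursor, since
a trail byte such as 0x5C also reads as plain ASCII; it counts the run
of lead-range bytes in front of it and uses the parity instead.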