Re: SJIS & HTML - potential trouble?
- To: tlug@example.com
- Subject: Re: SJIS & HTML - potential trouble?
- From: turnbull@example.com (Stephen J. Turnbull)
- Date: Wed, 20 Nov 96 12:06 JST
- In-reply-to: <199611200134.MAA01272@example.com> (jwb@example.com)
- Reply-To: tlug@example.com
- Sender: owner-tlug
The following is munged into order of importance; I think it's still
readable.

>>>>> "Jim" == Jim Breen <jwb@example.com> writes:

    Jim> And how. I got enthusiastic some years ago, and wrote a
    Jim> state-driven detector which could reliably tell SJIS, EUC and
    Jim> UTF-8 apart. "Normal" techniques fail because there is so
    Jim> much overlap, so I did it by elimination. I can't imagine
    Jim> trying it in lex.

Is this available publicly?

You're right, you wouldn't want to do it in lex alone; you'd need a yacc
layer as well. (The lex part would be useful for creating character
classes; yacc is much more convenient for explicitly tracking states,
although not well-designed for this particular application.)

Once you got into trying to tease apart different languages
automatically, you'd need to use semantic content, I guess. Ah, AI....

[ The following is just chat.... ]

    ST> What you're saying is that all programmers need to learn
    ST> Japanese *and* Japanese character codesets?

    Jim> Well, codesets anyway. When Hongbo Ni sent me the pre-Alpha
    Jim> version of NJSTAR I looked at it and asked, among other

But this is specifically internationalized. True, programmers of
internationalized software need to learn it. But can we really expect
RMS and Larry Wall to put that much effort into Emacs and Perl? Mule is
a wonderful piece of code, although the comments left a lot to be
desired by GNU standards in 1994; JPerl (of the same vintage) was a
serious crock. The point is that neither RMS nor Wall should take
credit or blame for the Japanizations.

I guess I'm going to have to look into localization standards more
carefully.

    Jim> Well HTML standards are a mess, thanks to Netscape. HTML V2.0
    Jim> said just ISO646, in effect (actually it is a DTD within
    Jim> SGML.) Most browsers extend it to the Latin-1. HTML V3.0
    Jim> died, and V3.2 seems trapped in a welter of proprietary
    Jim> extensions. Is it an RFC at all? I think it's from the WWWC,
    Jim> not the IETF.

There were HTML RFCs, but they may have expired before becoming
standards. The HTTP stuff is all RFCs (written by W3C staffers, of
course).

    Jim> Don't be so sure about 8-bit clean.

Well, most Unix text filters work pretty well with Japanese; less, for
example, works fine for me in a kterm. But I take your point. And of
course tools using heuristics (like glimpse or even grep) will be very
language-dependent, and not 8-bit clean. I had forgotten about those.

    Jim> And it is even worse when you are not talking about ISO646
    Jim> but ISO646+Latin-x. Ask a Scandinavian trying to run Japanese
    Jim> applications on their localized versions of Windoze about
    Jim> colliding character sets.

Bjarne Stroustrup had a snippet of what C looks like in Danish in one
of the C++ books. Pretty funny if you don't have to deal with it.

Steve

-- 
                            Stephen J. Turnbull
Institute of Policy and Planning Sciences                    Yaseppochi-Gumi
University of Tsukuba                      http://turnbull.sk.tsukuba.ac.jp/
Tennodai 1-1-1, Tsukuba, 305 JAPAN                      turnbull@example.com
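
The "by elimination" idea Jim mentions near the top of this message can be
illustrated with a rough sketch. This is not Jim's detector, which was a
hand-written state machine; as a stand-in it uses Python's built-in strict
decoders as validators and simply drops every candidate encoding that
rejects the input. The function name `surviving_encodings` and the
candidate list are assumptions made for illustration only.

```python
# Sketch of encoding detection by elimination (an assumption-laden
# illustration, not the state-driven detector described in the message).
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def surviving_encodings(data: bytes, candidates=CANDIDATES):
    """Return the candidate encodings under which `data` decodes cleanly."""
    survivors = []
    for name in candidates:
        try:
            # Strict decoding raises on any byte sequence that is
            # invalid in this encoding.
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate is eliminated
        survivors.append(name)
    return survivors


if __name__ == "__main__":
    sample = "\u65e5\u672c\u8a9e"  # the word "nihongo" written in kanji
    for codec in CANDIDATES:
        print(codec, "->", surviving_encodings(sample.encode(codec)))
    # Pure ASCII is valid in all three, so nothing can be eliminated.
    print("ascii ->", surviving_encodings(b"plain ASCII text"))
```

More than one candidate can survive, which is the overlap problem: short
or mostly-ASCII input often decodes cleanly under several encodings, so a
real detector needs more evidence than validity alone.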
- References:
- Re: SJIS & HTML - potential trouble?
- From: jwb@example.com (Jim Breen)