Mailing List Archive
Re: [tlug] Re: Security question with grep/e...
- Date: Sat, 27 Mar 2004 09:59:50 +1100 (EST)
- From: Jim Breen <Jim.Breen@example.com>
- Subject: Re: [tlug] Re: Security question with grep/e...
"Stephen J. Turnbull" <stephen@example.com> wrote:
>>
>> Jim> At some time in the distant future I may get the whole
>> Jim> shebang migrated to UTF8 and I'll see if I can get wide-char
>> Jim> grepping set up then. Maybe POSIX will be doing multilingual.
>>
>> For your purpose, this should work fine. According to Uli Drepper
>> (glibc maintainer), the only real issue in doing byte-by-byte regexp
>> searches with UTF-8 is efficiency.

My problem at the moment on that score is that I have a heap of
mirrors, and lowest-common-denominator rules. Working MB regexes are
relatively new in glibc, and most mirror sites use old versions (as
does the Monash server). The only way I can do them reliably across
all mirrors is to run my own source out. I'd rather wait.

>> Same for EUC-JP, of course, main
>> problem is ensuring that you get the right flavor of bytes stuffed
>> into the regexp. People using 7-bit JIS, Shift-JIS, or a Unicode
>> variant will not get sane output searching an EUC-JP text.

Internally I use EUC-JP, both 2-byte and the 3-byte JIS X 0212 variety
(I'm probably the only person in the galaxy doing the latter). The
server pages dish out `charset="euc-jp"' (and the server does the
matching MIME header) by default. You can set the code to SJIS, UTF8
or ISO-2022-JP via a cookie, and the server does code-conversions on
the I/O boundary (the AIX mirror at UofVirginia dies on UTF8 as its
iconv tables aren't up to snuff). The point of all this is that at
present I don't have to handle anything apart from EUC-JP internally.

>> But you might be surprised---modern HTTP 1.1 with charset negotiation
>> between server and browser might get the right answer most of the time.

I'm not convinced that HTTP-level charset negotiation has much to
offer in the case of a slab of CGI code. It's all very well when a
server has a battery of pages available in various languages, and the
browser rocks up and says: "talk to me in Greek".
In the case of Japanese it's a simpler matter - which Japanese-capable
codeset/encapsulation shall we use? Since all browsers (theoretically)
support all of them, "server rules" is a reasonable working position.

(This got broken by the bloody i-mode option, as NTT/DoCoMo decided to
lock their HTML subset into Shit_JIS. I extended the break to include
the other charsets because [a] it was easy to do some more, and [b] if
you use the French and German files, the umlauts, etc. look better in
Unicode fonts than in the horrible zenkaku JISASCII ones.)

Cheers

Jim

-- 
Jim Breen                        http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,    Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia      Fax: +61 3 9905 5146
(Monash Provider No. 00008C)             ジム・ブリーン@モナシュ大学
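
For readers following the thread, the arrangement described in the message above (EUC-JP everywhere internally, a cookie choosing the external encoding, conversion only at the I/O boundary, and byte-level searching in between) might be sketched roughly as follows. This is an illustration in Python, not the server's actual code: the cookie values, the function names and the sample dictionary lines are all hypothetical, and only the four encodings come from the post.

```python
import re

# Hypothetical cookie values mapped to Python codec names. The post names
# only the encodings (EUC-JP, SJIS, UTF8, ISO-2022-JP), not the cookie format.
COOKIE_TO_CODEC = {
    "eucjp": "euc_jp",
    "sjis": "shift_jis",
    "utf8": "utf-8",
    "jis": "iso2022_jp",
}
INTERNAL = "euc_jp"  # everything inside the boundary is EUC-JP bytes


def to_internal(raw: bytes, cookie: str = "eucjp") -> bytes:
    """Convert request bytes from the client's encoding to EUC-JP."""
    codec = COOKIE_TO_CODEC.get(cookie, INTERNAL)  # unknown cookie: assume EUC-JP
    return raw.decode(codec).encode(INTERNAL)


def from_internal(data: bytes, cookie: str = "eucjp") -> bytes:
    """Convert EUC-JP bytes to the client's encoding on the way out."""
    codec = COOKIE_TO_CODEC.get(cookie, INTERNAL)
    return data.decode(INTERNAL).encode(codec)


def search_lines(key: bytes, lines: list[bytes]) -> list[bytes]:
    """Byte-by-byte search of EUC-JP lines for a literal EUC-JP key.

    Caveat: a byte-level match can in principle straddle two EUC-JP
    characters; this sketch does not guard against that.
    """
    rx = re.compile(re.escape(key))
    return [line for line in lines if rx.search(line)]


if __name__ == "__main__":
    # Made-up sample data, stored as EUC-JP bytes.
    lines = [
        "日本語 [にほんご] /Japanese language/".encode(INTERNAL),
        "英語 [えいご] /English language/".encode(INTERNAL),
    ]
    sjis_query = "日本語".encode("shift_jis")      # what an SJIS client would send
    hits = search_lines(to_internal(sjis_query, "sjis"), lines)
    for hit in hits:
        print(from_internal(hit, "utf8").decode("utf-8"))
    # Stuffing the raw SJIS bytes into the search finds nothing:
    print(search_lines(sjis_query, lines))
```

The last call illustrates the "right flavor of bytes" point from the quoted text: the same key in Shift-JIS simply never matches EUC-JP data, so the conversion has to happen before the search, not after.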
- Follow-Ups:
- [tlug] Re: Security question with grep/e...
- From: Tobias Diedrich