Mailing List Archive



Re: [tlug] Re: Security question with grep/e...



"Stephen J. Turnbull" <stephen@example.com> wrote:
>> 
>>     Jim> At some time in the distant future I may get the whole
>>     Jim> shebang migrated to UTF8 and I'll see if I can get wide-char
>>     Jim> grepping set up then. Maybe POSIX will be doing multilingual.
>> 
>> For your purpose, this should work fine.  According to Uli Drepper
>> (glibc maintainer), the only real issue in doing byte-by-byte regexp
>> searches with UTF-8 is efficiency.  

My problem at the moment on that score is that I have a heap of mirrors,
and lowest-common-denominator rules. Working MB regexes are relatively
new in glibc, and most mirror sites use old versions (as does the Monash
server.) The only way I can do them reliably across all mirrors
is to run my own source out. I'd rather wait.
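
A minimal sketch of the byte-by-byte approach discussed above, assuming the dictionary lines and the search key are both UTF-8. It needs no multibyte-aware regex library, only a bytes pattern. The function name and the sample lines are made up for illustration.

```python
import re

def grep_utf8_bytes(key, lines):
    """Return the lines (as bytes) that contain `key`, matched bytewise."""
    # Encode the key to UTF-8 and escape it so it is treated as a literal.
    pattern = re.compile(re.escape(key.encode("utf-8")))
    return [line for line in lines if pattern.search(line)]

# Hypothetical sample data, stored as raw UTF-8 bytes.
corpus = [
    "日本語 : Japanese language".encode("utf-8"),
    "辞書 : dictionary".encode("utf-8"),
    "本 : book".encode("utf-8"),
]

hits = grep_utf8_bytes("本", corpus)
for line in hits:
    print(line.decode("utf-8"))
```

Because UTF-8 lead bytes and continuation bytes occupy disjoint ranges, a bytewise match of a valid UTF-8 key always starts on a character boundary, so the remaining cost is efficiency rather than correctness.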

>> Same for EUC-JP, of course, main
>> problem is ensuring that you get the right flavor of bytes stuffed
>> into the regexp.  People using 7-bit JIS, Shift-JIS, or a Unicode
>> variant will not get sane output searching an EUC-JP text.
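
The "right flavor of bytes" point can be illustrated with a small sketch, assuming the text is held as EUC-JP bytes and the key arrives in some other encoding. The helper name and sample string are hypothetical.

```python
def find_key(key, text_bytes, key_encoding):
    """Encode `key` in `key_encoding` and look for those bytes in `text_bytes`."""
    return text_bytes.find(key.encode(key_encoding))

# Hypothetical sample text, stored as EUC-JP bytes.
euc_text = "漢字の辞書".encode("euc_jp")

same_flavor = find_key("辞書", euc_text, "euc_jp")     # key bytes match the text
sjis_flavor = find_key("辞書", euc_text, "shift_jis")  # wrong bytes, no match
utf8_flavor = find_key("辞書", euc_text, "utf-8")      # wrong bytes, no match
jis7_flavor = find_key("辞書", euc_text, "iso2022_jp") # 7-bit bytes, no match

print(same_flavor, sjis_flavor, utf8_flavor, jis7_flavor)
```

Only the key encoded the same way as the text is found; the others yield -1 even though they represent the same characters.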

Internally I use EUC-JP, both 2-byte and the 3-byte JIS X 0212 variety
(I'm probably the only person in the galaxy doing the latter). The
server pages dish out `charset="euc-jp"' (and the server does the
matching MIME header) by default. You can set the code to SJIS, UTF8 or
ISO-2022-JP via a cookie, and the server does code-conversions on the
I/O boundary (the AIX mirror at UofVirginia dies on UTF8 as its iconv
tables aren't up to snuff.)

The point of all this is that at present I don't have to handle
anything apart from EUC-JP internally.
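
A rough sketch of conversion at the I/O boundary, not the server's actual code: everything inside stays EUC-JP, and a cookie value picks the charset used on the wire. The cookie values, function names, and table below are assumptions for illustration only.

```python
INTERNAL_CODEC = "euc_jp"

# Hypothetical cookie values mapped to (Python codec, MIME charset label).
CHARSETS = {
    "euc-jp": ("euc_jp", "EUC-JP"),
    "sjis": ("shift_jis", "Shift_JIS"),
    "utf8": ("utf-8", "UTF-8"),
    "iso-2022-jp": ("iso2022_jp", "ISO-2022-JP"),
}

def pick(cookie_value):
    """Fall back to EUC-JP when the cookie is missing or unrecognised."""
    return CHARSETS.get(cookie_value or "euc-jp", CHARSETS["euc-jp"])

def to_internal(request_bytes, cookie_value=None):
    """Convert incoming form data to EUC-JP before any matching is done."""
    codec, _ = pick(cookie_value)
    if codec == INTERNAL_CODEC:
        return request_bytes
    text = request_bytes.decode(codec, errors="replace")
    return text.encode(INTERNAL_CODEC, errors="replace")

def to_client(internal_bytes, cookie_value=None):
    """Return (Content-Type header, body) in the charset the cookie asks for."""
    codec, label = pick(cookie_value)
    header = "Content-Type: text/html; charset=" + label
    if codec == INTERNAL_CODEC:
        return header, internal_bytes
    text = internal_bytes.decode(INTERNAL_CODEC, errors="replace")
    # errors="replace": characters the target charset lacks become "?".
    return header, text.encode(codec, errors="replace")

page = "辞書".encode("euc_jp")
header, body = to_client(page, "utf8")
print(header)
```

With this shape the search code only ever sees EUC-JP, and adding another output charset means adding one table entry.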

>> But you might be surprised---modern HTTP 1.1 with charset negotiation
>> between server and browser might get the right answer most of the time.

I'm not convinced that HTTP-level charset negotiation has much to offer
in the case of a slab of CGI code. It's all very well when a server
has a battery of pages available in various languages, and the browser
rocks up and says: "talk to me in Greek". In the case of Japanese
it's a simpler matter - which Japanese-capable codeset/encapsulation
shall we use? Since all browsers (theoretically) support all of them,
"server rules" is a reasonable working position. 

(This got broken by the bloody i-mode option, as NTT/DoCoMo decided to 
lock their HTML subset into Shit_JIS. I extended the break to include the 
other charsets because [a] it was easy to do some more, and [b] if you use 
the French and German files, the umlauts, etc. look better in Unicode 
fonts than in the horrible zenkaku JISASCII ones.)


Cheers

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,                Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学
