tlug Mailing List Archive
Re: [tlug] unicode and Perl- how to pass command line unicode arguments
- Date: Wed, 15 Feb 2006 18:11:17 +0900
- From: David Riggs <dariggs@example.com>
- Subject: Re: [tlug] unicode and Perl- how to pass command line unicode arguments
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US;rv:1.7.7) Gecko/20050420 Debian/1.7.7-2
Neil Bortnak said, about the perl invocation argument -C:

>You missed A. IMHO, you should just use -C127 (enables all of the above)
>in a kanji/unicode heavy program because it simply makes everything
>unicode aware (except for unicode in the script, for which you still
>need the utf pragma) and that will cut down on accidental encoding
>problems.

Yes, thanks, I did miss that, and -CSioA works well.

And also:

>s/日本語/英語/ m/日本語/
>seem to work fine for me in the middle of the program. I'm using "use
>utf8;" as per normal, so I'm in a bit of wonderment as to why it
>doesn't work for you.

They do work just fine, for MANY cases. But I think that perl is actually doing a byte-level comparison/replace, and the above strings would work just fine as bytes (assuming your script and data are in the same encoding).

But even at this level there are still problems. As I mentioned earlier, if I try to match a ☆ (star: U+2606, which is the bytes E2 98 86 in UTF-8)

    if (/^☆.*tw:(.).*jp:(.)/)

it just never works. But if I assign a star to a variable, either in the script or from the command line, and use that, it works fine. That really bothers me.

And the real problem comes if you try to do tr/// or more complex character sets, alternations and such in the regex: it all breaks down unless you are really doing unicode. I did a whole search thing with a character set skipping over punctuation, and actually it was just in byte mode. I never realized it until I started to get false misses and such, and finally realized that perl was just munching bytes. It was separately skipping over all three bytes of a unicode space character inside of a character class.

And of course the traditional tools like tr and grep work fine with unicode, it seems. But the results are wrong: they are just doing bytes (as Steven T pointed out to us some time ago).

(Sorry, you probably already know all this...)

Thanks for the tip about -CA (kinda wishing I were back in CA myself with this weather).

David Riggs
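The byte-versus-character behaviour described in this message can be reproduced with a short script. What follows is only a minimal sketch, not the program discussed in the thread. The sample string, the choice of U+3000 (ideographic space) as the "unicode space character", and all variable names are assumptions made for illustration. The test data is built from explicit UTF-8 bytes so that the result does not depend on how the script file itself is saved, and the core Encode module is used to get the same decoding that -CA is documented to apply to @ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data, written as raw UTF-8 bytes.
my $star_bytes  = "\xE2\x98\x86";    # U+2606 WHITE STAR
my $space_bytes = "\xE3\x80\x80";    # U+3000 IDEOGRAPHIC SPACE
my $bytes       = $star_bytes . $space_bytes . "tw:A";

# Byte semantics: this is what perl sees when input is never decoded.
my $byte_len = length $bytes;                       # 3 + 3 + 4 bytes
my $byte_hits = () = $bytes =~ /[\xE3\x80\x80]/g;   # class matches each byte on its own
my $byte_tr   = ($bytes =~ tr/\xE3\x80//);          # tr counts bytes as well
my ($first_byte) = $bytes =~ /^(.)/;                # "." captures a single byte

# Character semantics: decode the same data once, then work on characters.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;                       # 1 + 1 + 4 characters
my $char_hits = () = $chars =~ /[\x{3000}]/g;       # one wide space, one match
my $char_tr   = ($chars =~ tr/\x{3000}//);
my ($first_char) = $chars =~ /^(.)/;                # "." captures the whole star

printf "length:        bytes=%d chars=%d\n", $byte_len,  $char_len;
printf "class matches: bytes=%d chars=%d\n", $byte_hits, $char_hits;
printf "tr count:      bytes=%d chars=%d\n", $byte_tr,   $char_tr;
printf "first '.':     bytes=0x%X chars=0x%X\n", ord $first_byte, ord $first_char;

# Rough equivalent of running with -CA: decode each command line argument.
my @args = map { decode('UTF-8', $_) } @ARGV;
printf "first argument has %d character(s)\n", length $args[0] if @args;
```

Run without arguments, the script shows the character class and tr/// touching three separate bytes for a single wide space in the undecoded string, and exactly one character once the string has been decoded. The same applies to the "." captures in a pattern like the one above: on undecoded data each "(.)" picks up one byte rather than one character.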