Re: [tlug] unicode and Perl- how to pass command line unicode arguments



>>>>> "Ian" == Ian Wells <ijw@example.com> writes:

    Ian> So, the distinction being between an object that is a string
    Ian> of values representing text, and an object which is a string
    Ian> of values representing a string of values.

That's right.

    Ian> Both of which, I presume, work in most functions (otherwise
    Ian> the misuse you discuss wouldn't happen)

If you are programming for ASCII input, that will be true as long as
you're restricted to ASCII input.  The problem is that once you leave
that world, even for the upward compatible world of UTF-8, you are
going to have problems.
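
A minimal sketch of that failure mode, assuming Python 3 (where byte
strings and text strings are separate types) and a made-up sample word:
byte-oriented operations agree with text operations only while the data
stays within ASCII.

```python
# Hypothetical sample: one non-ASCII letter is enough to show the split.
word = "na\u00efve"               # "naive" spelled with U+00EF (i with diaeresis)
as_bytes = word.encode("utf-8")   # the same text as a UTF-8 byte string

char_count = len(word)            # counts characters
byte_count = len(as_bytes)        # counts bytes; U+00EF takes two in UTF-8

# bytes.upper() only touches ASCII letters, so the accented letter is skipped.
upper_text = word.upper()
upper_bytes = as_bytes.upper().decode("utf-8")

print(char_count, byte_count)
print(upper_text, upper_bytes)
```

For pure-ASCII input the two counts and the two uppercased results would
be identical, which is why the shortcut goes unnoticed for so long.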

    Ian> And the one you're most likely to use (u"", representing
    Ian> readable text) is the one that's harder to type.

That's right.  For backward compatibility reasons. :-(

    Ian> So Perl doesn't make the distinction and Python doesn't
    Ian> enforce it properly.

That's right.  Again, for backward compatibility, Python only enforces
it partly.

    Ian> Personally speaking,

Well, whatever floats your boat, of course.  If the programmer is
comfortable with a given discipline, why bother making a rule that
says you have to do it right when he already does?  The problem is
that when you deal with many programmers who prefer different
disciplines, or who are undisciplined in ways that don't hurt in the
original environment, you're going to have portability problems, and
POLA (principle of least astonishment) violations when the software
gets into users' hands.

    Ian> I'd argue that since binary data is actually fairly uncommon,

"Actually", it's all over the place.  The first couple dozen bytes of
most XML input should be considered binary, then reread.  RFC 2822
headers are binary (EBCDIC and UTF-16 not allowed! and "AW:" is not
the German translation of "Re:", "Re:" is the German translation of
"Re:").  SMTP, of course, NNTP, HTTP, the list goes on and on.
Basically, anything that is a wire protocol is binary in the relevant
sense.

So you can't simply say "we will represent strings of 8-bit values as
an array of 16-bit values" (well, you can, but it would be horribly
inefficient to map from memory buffers to WC string buffers all the
time).
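
As a rough illustration of the XML point, here is a sketch (Python 3
assumed) that treats the head of a document as bytes, reads the encoding
declaration, and only then decodes the whole thing as text.  The helper
name sniff_xml_encoding and the sample document are invented for the
example, and it deliberately ignores BOMs and non-ASCII-compatible
encodings such as UTF-16, which a real parser has to handle.

```python
import re

def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from the XML declaration, reading it as bytes.

    Hypothetical helper: covers ASCII-compatible encodings only.
    """
    head = data[:100]  # the declaration sits in the first few dozen bytes
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    return match.group(1).decode("ascii") if match else default

# A made-up document stored as ISO-8859-1 bytes.
raw = '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")

encoding = sniff_xml_encoding(raw)   # step 1: binary inspection
text = raw.decode(encoding)          # step 2: reread as text

print(encoding)
print(text)
```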

    Ian> Um.  I'm just saying that (in my head and in Perl) a string
    Ian> is a string is a string.  If you consider it mentally to be a
    Ian> list of numbers then it can contain either a language string
    Ian> or a binary data chunk without violating that assumption.

But you see, you *can't* think of a string as a list of numbers.  E.g.,
consider case-insensitive matching.  *This is nonsense in the binary
context.*  In the case of Unicode, a program *must* identify (for
collation purposes) U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE and
U+212B ANGSTROM SIGN, and both of those with the composition of U+0041
LATIN CAPITAL LETTER A plus U+030A COMBINING RING ABOVE.

I really don't see how to go from "a list of numbers is a list of
numbers" to DWIMming the case above.  This is a *real* case, reported
within the last couple of weeks here on TLUG (Kevin Hoang's post about
getting his Vietnamese accents decomposed).  My guess is that the Perl
community will spend the next few years fumbling about with "cut and
try".
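
For concreteness, a small sketch of the three spellings named above,
assuming Python 3 and its standard unicodedata module.  As lists of
numbers they are all different; they only compare equal after a
normalization step that knows they are text.

```python
import unicodedata

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"      # ANGSTROM SIGN
decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# Compared as raw code point sequences, no two of these are equal.
raw_equal = precomposed == angstrom == decomposed

# After canonical composition (NFC) all three collapse to one string.
nfc_forms = {unicodedata.normalize("NFC", s)
             for s in (precomposed, angstrom, decomposed)}

print(raw_equal)
print(len(nfc_forms))
```

Normalizing to NFD instead would collapse them to the decomposed form,
which is the kind of decomposition that surprises people when they
don't expect it.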

    Ian> in my experience the result is that you read a utf8 file by
    Ian> setting the utf8 flag, write it similarly and what you do
    Ian> inbetween Just Works because you're dealing with strings that
    Ian> can contain all unicode characters, not bytearrays.

That's not surprising.  Perl has had twenty years to work its way
through the pain of making byte arrays DWYM in string contexts.  But
they didn't DWDM because David didn't think like a Perl program.  My
suggestion was that Python might match his expectations better, and he
replied he's comfortable learning the Perl Way (or at least one of the
Perl万道 ;-).

    Ian> I suppose it depends on your expectations.  I like the 'my
    Ian> file's in unicode and my language understands that' approach;
    Ian> I don't see why you'd want a file you edit in unicode only
    Ian> for your language to consider it to be something else.

You wouldn't.  The problem is that there are binary protocols that
look like text, and there are binary protocols that represent text,
and DWIM is always a guess.  As David discovered.

    Ian> And that still seems to suggest that pretty much every string
    Ian> you're ever going to type into Python would need
    Ian> u"".toUnicode() (or whatever) when Perl would DWIM.

Of course not.  It's simply that Python gives you the option to do it
at the site (which I guess Perl does too, although the Python notation
allows you to use a string method, thus emphasizing the string literal
and not the coding method), and it doesn't allow you to do stuff like

    use utf8;
    $var = $_ . "ユニコードリテラル";

More precisely, in the equivalent Python expression the byte string is
coerced to Unicode according to the default codec, which is normally
ASCII-only.  What would Perl do if $_ happened to contain
KOI8-R-encoded Cyrillic?  Just glom them together and cause a utf8
error eventually?
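
A sketch of the KOI8-R scenario, assuming Python 3, which is stricter
than the coercion described above: it refuses to mix bytes and text at
all rather than trying an ASCII-only default codec.  The sample strings
are invented for the example.

```python
# Hypothetical input: Cyrillic text that arrived as KOI8-R bytes.
koi8_bytes = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8_r")
literal = "ユニコードリテラル"

# Mixing bytes and text directly is rejected outright.
try:
    koi8_bytes + literal
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# An ASCII-only codec cannot make sense of the bytes either.
try:
    koi8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the real encoding at the boundary is what makes it work.
combined = koi8_bytes.decode("koi8_r") + literal

print(mixing_allowed, ascii_ok)
print(combined)
```

Either way the failure shows up at the point where the two kinds of
string meet, not later when the glommed-together result is written out.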


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
