Mailing List Archive
[tlug] Re: Updating iconv tables
- Date: Fri, 13 Jun 2008 08:50:58 +1000
- From: "Jim Breen" <jimbreen@example.com>
- Subject: [tlug] Re: Updating iconv tables
- References: <5634e9210806102023p448a36bcw2d90f138cebb5597@mail.gmail.com>
A few days ago I wrote:

2008/6/11 Jim Breen <jimbreen@example.com>:
> I have struck a problem with missing mappings in
> iconv in several Linux distros. The problem has
> arisen initially with ㈱ (i.e. (株)), but is sure to crop
> up with others.
[...]
> I'll send a copy of this to bug-gnu-libiconv@example.com
> but is that enough?

I received the following reply, which also went to bug-gnu-libiconv@example.com, so I think I can relay it.

=========================================================
from Bruno Haible <bruno@example.com>
to Jim Breen <jimbreen@example.com>, bug-gnu-libiconv@example.com
date 12 June 2008 10:42
subject Re: [bug-gnu-libiconv] Updating iconv tables

Hi,

I'm not sure I understand it all right.

> When people have
> gone to convert the EDICT file to UTF8 for other
> systems, the iconv utility simply dies on that character

In summary, you are saying that you have a particular character in EUC-JP, that the iconv conversion from EUC-JP to UTF-8 does not grok?

Then the character is not EUC-JP.

I'm not sure which character you are talking about, because your mail had an encoding specification of ISO-2022-JP, which usually means ISO-2022-JP-2, but that particular character was invalid in ISO-2022-JP-2 (it was encoded as "ESC $ B - j"), the other character in that line was U+682A, and you were talking about U+3231.

> The problem, I conclude, is with the compiled-in tables
> in iconv in the Linux distros. It seems Sun has gone to
> the trouble of keeping theirs up-to-date, but the standard
> distros haven't.

You have a misconception of what EUC-JP is. EUC-JP is a character encoding scheme based on three standards: ASCII, JIS X 0208, and JIS X 0212. These are standards issued by Japanese authorities, and carved in stone. Anyone who thinks that EUC-JP tables have to be "kept up-to-date", is asking for deviation from standards, and is asking for interoperability problems!
The interoperability problem that you encountered is *precisely* due to your vendor having added "extensions" to their EUC-JP fonts, and you expect that everyone else has the same extensions in their fonts and tables! Take a look at
http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html
to see how many variants of EUC-JP already exist!

Bruno
=============================================================

Needless to say I couldn't let that pass. "Carved in stone" indeed! Vendor extension!

Anyway, my response:

================================================================
Hi Bruno,

Great to hear from you.

2008/6/12 Bruno Haible <bruno@example.com>:
> I'm not sure I understand it all right.
>
>> When people have
>> gone to convert the EDICT file to UTF8 for other
>> systems, the iconv utility simply dies on that character
>
> In summary, you are saying that you have a particular character in EUC-JP,
> that the iconv conversion from EUC-JP to UTF-8 does not grok?
>
> Then the character is not EUC-JP.

Wrong. I'll explain more below.

> I'm not sure which character you are talking about, because your mail
> had an encoding specification of ISO-2022-JP, which usually means
> ISO-2022-JP-2, but that particular character was invalid in ISO-2022-JP-2
> (it was encoded as "ESC $ B - j"), the other character in that line was
> U+682A, and you were talking about U+3231.

This is a bit of a side issue. My email was indeed in ISO-2022-JP, since I have gmail set to use the default for the language, and my email contained Japanese. The code-point in question converts and displays correctly in compliant mailers. Nothing illegal about it.

>> The problem, I conclude, is with the compiled-in tables
>> in iconv in the Linux distros. It seems Sun has gone to
>> the trouble of keeping theirs up-to-date, but the standard
>> distros haven't.
>
> You have a misconception of what EUC-JP is. EUC-JP is a character encoding
> scheme based on three standards: ASCII, JIS X 0208, and JIS X 0212. These
> are standards issued by Japanese authorities, and carved in stone. Anyone
> who thinks that EUC-JP tables have to be "kept up-to-date", is asking for
> deviation from standards, and is asking for interoperability problems!

You are out-of-date there. EUC-JP also includes JIS X 0213, which was released in 2000 and updated in 2004. The codepoint I raised arrived in JIS X 0213. You can think of JIS X 0213 as an enhancement/replacement for JIS X 0208. It added a heap of additional characters, *all* of which have been included in Unicode, and all of which have EUC codings, since EUC-JP is simply a transformation of the ku-ten codes in the Japanese standards. Of course EUC-JP tables need to be kept up-to-date. See:
http://en.wikipedia.org/wiki/JIS_X_0213
for an overview.

> The interoperability problem that you encountered is *precisely* due to
> your vendor having added "extensions" to their EUC-JP fonts, and you
> expect that everyone else has the same extensions in their fonts and tables!
> Take a look at
> http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html
> to see how many variants of EUC-JP already exist!

Sadly your WWW page omits any mention of JIS X 0213. In other words it is lacking all the characters added to the standard Japanese codings in the last decade. Sun has simply kept up with the developments in Japanese coding. These are *not* vendor extensions.

In case you think I am talking through my hat, I must point out that I am one of only a handful of non-Japanese people who have participated in the development of the Japanese standards. You will find my name among the respondents at the back of JIS X 0208-1997, along with people like Ken Lunde and Martin Duerst. (I assume you have a copy.) Ask Ken if he has heard of me.

I am happy to work with you in getting the full set of current Japanese codes into iconv.
As it stands at the moment, the GNU issue does not adequately handle all the standard Japanese codes.

Best wishes
===============================================================

I'll keep the TLUG list informed of developments (if any). I think it would help a lot if one or two more people, e.g. Linux users in Japan, chipped in with emails to bug-gnu-libiconv@example.com on this matter.

Cheers

Jim

--
Jim Breen
Honorary Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/
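
For readers following the thread, the ku-ten arithmetic that both messages rely on can be sketched in a few lines of Python. This is a minimal illustration, not iconv's implementation: the helper names kuten_to_euc, kuten_to_iso2022 and try_decode are hypothetical, and Python's built-in euc_jp and euc_jisx0213 codecs merely stand in for whatever tables a given iconv build carries. It assumes ㈱ sits at row 13, cell 74, which is what the "ESC $ B - j" sequence quoted above implies (0x2D - 0x20 = 13, 0x6A - 0x20 = 74).

```python
# Illustrative sketch only; helper names are hypothetical, not part of iconv.

def kuten_to_euc(ku: int, ten: int) -> bytes:
    """Map a plane-1 ku-ten (row, cell) pair to its two-byte EUC-JP form."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return bytes([0xA0 + ku, 0xA0 + ten])


def kuten_to_iso2022(ku: int, ten: int) -> bytes:
    """Map the same pair to the 7-bit bytes used after ESC $ B in ISO-2022-JP."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return bytes([0x20 + ku, 0x20 + ten])


def try_decode(data: bytes, codec: str):
    """Return the decoded string, or None if the codec's table lacks the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# Assumption: U+3231 is at row 13, cell 74, as the quoted escape sequence implies.
kabu_mark = kuten_to_euc(13, 74)
print(kuten_to_iso2022(13, 74))  # the two bytes that follow ESC $ B in the mail
print(kabu_mark.hex())           # the EUC-JP byte pair for the same code point

# Compare a JIS X 0208/0212-based table with a JIS X 0213-based one.
for codec in ("euc_jp", "euc_jisx0213"):
    print(codec, repr(try_decode(kabu_mark, codec)))
```

Whether the first codec accepts the bytes depends on the tables it was built with, which is exactly the point in dispute in the exchange above.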