Mailing List Archive



Re: [tlug] Search MySQL for Japanese Names



Hi Jim. Overseas conference? ii ne~~

>> Jim, curious question: how many names in ENAMDICT resolve to just one
>> reading? Even an I-would-have-thought-surefire candidate for uniqueness
>> such as 田中(tanaka) resolves to ten different readings in ENAMDICT
>> (tanata, tanka, danaka, nunoka, ....). 鈴木(suzuki) has seven.

> So those ~74k merged entries come from ~205k "raw" entries, i.e. approx.
> 2.8 readings per entry for the 74k.

Okay, I got stuck into enamdict with awk and sort. This is the spread
I found across the 280,677 entries. (Place names, organization names,
full names of famous individuals, and names that don't start with kanji
are stripped out. I treat given names and surnames as one.)

No. of readings   Count
1                 224,453 (80%)
2                  36,602 (13%)
3                  10,486 (4%)
4                   4,261 (2%)
5                   1,904 (<1%)
6                   1,071 (<1%)
..
(high counts omitted. If anyone's interested the 'wa'-as-in-wafu kanji
stands out as the most ridiculously overloaded kanji name with 54
readings.)
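
For anyone who wants to reproduce a tally like this without awk, here is
a minimal Python sketch. It assumes EDICT-style lines such as
"田中 [たなか] /(s) Tanaka/". The regex, the excluded type codes (p, o, h),
the rule for mixed-type entries, and the kanji test are all assumptions
for illustration, not the exact filter behind the figures above.

```python
import re
from collections import Counter, defaultdict

# Assumed EDICT-style line: "田中 [たなか] /(s) Tanaka/"
# Kana-only entries have no [reading] part, so they never match.
LINE_RE = re.compile(r"^(\S+) \[([^\]]+)\] /\(([^)]*)\)")

# Assumed type codes to strip: p = place, o = organization,
# h = full name of a particular person.
EXCLUDED = {"p", "o", "h"}


def starts_with_kanji(name):
    # Approximation: only the main CJK Unified Ideographs block is checked.
    return "\u4e00" <= name[0] <= "\u9fff"


def readings_histogram(lines):
    """Map 'number of distinct readings' -> 'how many names have that many'."""
    readings = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        name, reading, types = m.groups()
        codes = {c.strip() for c in types.split(",")}
        # Assumed rule: drop an entry only if every type code is excluded.
        if codes <= EXCLUDED or not starts_with_kanji(name):
            continue
        # Given names and surnames are merged: only the kanji form is the key.
        readings[name].add(reading)
    return Counter(len(r) for r in readings.values())


if __name__ == "__main__":
    # "enamdict.utf8" is a hypothetical filename for a UTF-8 copy of the file.
    with open("enamdict.utf8", encoding="utf-8") as f:
        for n, count in sorted(readings_histogram(f).items()):
            print(n, count)
```

Readings are held in a set per kanji form, so the same reading listed
under both a surname and a given-name entry is counted once.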

That's a far higher percentage of single-reading names than I
expected  ^_^; Fewer than two percent have five or more readings, so
pulling Tanaka and Suzuki out of a hat as I did at the start was really
non-typical sampling :(

My next interest would be the spread of names in the real population.
Who knows how the results above would be weighted then...

>> ..Mecab/Chasen ... By design these parsers don't want multiple
>> readings for names. They just want the most likely one.
>
> Well, even then the coverage is poor.

No controversy there: the Mecab developer, Taku Kudo, said the same to
me last week when I happened to meet him.

Thanks Jim!

Akira

