Mailing List Archive



Re: [tlug] [OT] Strip Kanji from a document for study purposes



Dave M G wrote:

There may be existing software that does what I'm looking for, but I
haven't seen it. If you know of a suitable Linux based application,
please let me know.

What I'd like to do is take a Japanese document and convert it into a
list of the kanji included, and a list of words. Ideally repetitions
would be removed, as would particles and other grammatical inflections.
Hiragana and katakana words could be dropped too.

Try Juman:

http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html

Here's a CGI to try it out:

http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman-form.html

It doesn't do everything you want out of the box, but it's pretty powerful, and with a bit of scripting and piping you should be able to get what you want. (It has a Perl module, I think.)
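
As a rough illustration of the "bit of scripting" part, here is a minimal Python sketch of the simplest piece of the request: pulling the distinct kanji, and the distinct runs of consecutive kanji, out of a text. It does not call Juman at all. It only assumes that kanji fall in the Unicode CJK Unified Ideographs blocks, and the function names are made up for this example. A kanji run is only a crude stand-in for a word; dropping particles and inflections properly would still need a morphological analyzer such as Juman.

```python
import re

# Assumption: "kanji" means CJK Unified Ideographs (U+4E00-U+9FFF)
# plus Extension A (U+3400-U+4DBF). Hiragana, katakana, punctuation
# and Latin text fall outside these ranges and are ignored.
KANJI = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")
KANJI_RUN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]+")


def unique_kanji(text):
    """Return each kanji once, in order of first appearance."""
    # dict.fromkeys removes repetitions while keeping insertion order.
    return list(dict.fromkeys(KANJI.findall(text)))


def unique_kanji_runs(text):
    """Return each run of consecutive kanji once, in order of first appearance.

    A run is a rough candidate for a kanji compound, not a real word
    boundary: okurigana and mixed kana/kanji words are cut off.
    """
    return list(dict.fromkeys(KANJI_RUN.findall(text)))


sample = "私は日本語を勉強します。日本の本を読みます。"
print(" ".join(unique_kanji(sample)))
print(" ".join(unique_kanji_runs(sample)))
```

To run this over a real document, read the file as UTF-8 and pass its contents to the two functions in place of the sample string.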

