Mailing List Archive



Re: [tlug] [OT] Strip Kanji from a document for study purposes



On Tue, 18 Jul 2006 12:28:03 -0400
Jim <jep200404@example.com> wrote:

> > What I'd like to do is take a Japanese document and convert it into a 
> > list of the kanji included, and a list of words. Ideally repetitions 
> > would be removed, as would particles and other grammatical inflections. 
> > Hiragana and katakana words could be dropped too.
 
> Removing particles and other grammatical inflections might be 
> a significant project in itself. 

Removing particles and inflections isn't that hard, because these are
written in hiragana following the kanji. Tokenizing is the tricky part:
a sentence can contain several kanji words in a row with no kana
between them to mark the boundaries.
Consider this string for example: 技術的課題
You need to split this into two words (技術的 and 課題) before you can
feed it to a dictionary.
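
To make that concrete, here is a rough Python sketch, not a finished tool.
It pulls out the kanji runs with a regex (which is the easy part, since
the hiragana simply falls away) and then splits a run by greedy
longest match against a word list. The function names and the tiny
LEXICON are made up for illustration; a real word list would have to come
from an actual dictionary file, and greedy matching will get some
splits wrong.

```python
import re

# Basic CJK Unified Ideographs block only; rarer kanji in the extension
# blocks are ignored in this sketch.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]+")

def kanji_runs(text):
    """Return kanji-only runs in order of first appearance, repeats removed."""
    return list(dict.fromkeys(KANJI_RUN.findall(text)))

def unique_kanji(text):
    """Return individual kanji in order of first appearance, repeats removed."""
    return list(dict.fromkeys("".join(KANJI_RUN.findall(text))))

def split_run(run, lexicon):
    """Greedy longest-match split of one kanji run against a set of words.

    A character that starts no known word is emitted on its own.
    """
    words, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            if run[i:j] in lexicon or j == i + 1:
                words.append(run[i:j])
                i = j
                break
    return words

# Hypothetical miniature lexicon, for illustration only.
LEXICON = {"技術", "技術的", "課題"}

text = "技術的課題を解決する。"
runs = kanji_runs(text)                 # kana and punctuation are dropped
words = split_run(runs[0], LEXICON)     # the compound is split by lookup
```

With this input the regex alone returns 技術的課題 as a single run, which
is exactly the problem: only the dictionary lookup in split_run can break
it into 技術的 and 課題.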


