Mailing List Archive
Re: [tlug] [OT] Strip Kanji from a document for study purposes
- Date: Wed, 19 Jul 2006 14:11:33 +1000 (EST)
- From: Jim Breen <Jim.Breen@example.com>
- Subject: Re: [tlug] [OT] Strip Kanji from a document for study purposes
Dave M G <martin@example.com> implored plaintively:
>>
>> (This message includes utf8 encoded Japanese text)

Which arrived in the digest version as ??????????

>> There may be existing software that does what I'm looking for, but I
>> haven't seen it. If you know of a suitable Linux based application,
>> please let me know.
>>
>> What I'd like to do is take a Japanese document and convert it into a
>> list of the kanji included, and a list of words. Ideally repetitions
>> would be removed, as would particles and other grammatical inflections.
>> Hiragana and katakana words could be dropped too.

What you want is <drumroll>WWWJDIC</drumroll>.

>> My ultimate goal would be to create a list that has definitions and
>> readings. But, if that's too complex, then the next best thing would be
>> to just have a list of words and individual kanji that I could look up
>> on my own (perhaps with some kind of clever use of regular expressions
>> or something?)

Here is what you get from the "Translate Words in Text" function:

Input:

今国会の会期延長がなくなったことで、教育基本法改正案や国民投票法案など
与党が重要視してきた法案は軒並み継続審議となる。

Output:

* 今国会 【こんこっかい】 (n) current Diet session; ED
* 会期 【かいき】 (n) session (of a legislature); (P); EP
* 延長 【えんちょう】 (n,vs) (1) extension; elongation; prolongation; lengthening; (n) (2) Enchou era (923.4.11-931.4.26); (P); EP
* なくなった (exp) not any more; .KD
* 教育基本法 【きょういくきほんほう】 (n) (Japanese) Education Act; Fundamental Law of Education; ED
* 改正案 【かいせいあん】 (n) reform bill; reform proposal; ED
* 国民投票 【こくみんとうひょう】 (n) national referendum; (P); EP
* 法案 【ほうあん】 (n) bill (law); (P); EP
* 与党 【よとう】 (n) government party; (ruling) party in power; government; (P); EP
* 重要視 【じゅうようし】 (n,vs) regarding highly; attaching importance to; ED
* 法案 【ほうあん】 (n) bill (law); (P); EP
* 軒並み 【のきなみ】 (n-adv,n) (1) row of houses; every door; (2) totally; altogether; across the board; (P); EP
* 継続 【けいぞく】 (n,vs,adj-no) continuation; (P); EP
* 審議 【しんぎ】 (n,vs) deliberation; (P); EP

WWWJDIC is a server running at a heap of places. I have never produced
a standalone version because it is being updated continually (the data
files grow daily and the last code change was two hours ago.)

If you are connected you can quickly knock up a script to drive its
"backdoor" API, e.g.
http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?9MGG%B6%B5%B0%E9....

Cheers

Jim

--
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                    ジム・ブリーン@モナシュ大学
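
A script along the lines suggested above might start like the following Python sketch. It is not Jim's code and it is not a documented client: the base URL and the "9MGG" prefix are copied from the example URL in the message, and the use of EUC-JP is an assumption inferred from the bytes %B6%B5%B0%E9 in that example, which look like EUC-JP for 教育. The function names are made up for illustration, the kanji range covers only the basic CJK block (so 々 and rarer characters are missed), and the server may no longer answer at that address. The sketch only builds the request URL and the de-duplicated kanji list; it does not fetch or parse anything.

```python
import re
from urllib.parse import quote

# Assumption: both values are copied from the example URL in the message.
BASE_URL = "http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi"
PREFIX = "9MGG"

# Basic CJK Unified Ideographs block only (an illustrative simplification).
KANJI = re.compile(r"[\u4e00-\u9fff]")


def unique_kanji(text):
    """Return the kanji in text in order of first appearance, without repeats."""
    return list(dict.fromkeys(KANJI.findall(text)))


def backdoor_url(text, encoding="euc_jp"):
    """Build a request URL of the same shape as the example in the message.

    Assumption: the text is percent-encoded from EUC-JP bytes.
    """
    return f"{BASE_URL}?{PREFIX}{quote(text, safe='', encoding=encoding)}"


if __name__ == "__main__":
    sample = "与党が重要視してきた法案は軒並み継続審議となる。"
    print("".join(unique_kanji(sample)))
    print(backdoor_url("教育"))
```

Fetching the URL (for example with urllib.request.urlopen) and pulling the entries out of the returned page is left out, since the response format is not described in the message.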
- Follow-Ups:
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Dave M G
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: GMO Unix Erin D. Hughes
- [tlug] MFIR Department
- From: Stephen J. Turnbull
- Re: [tlug] [OT] Strip Kanji from a document for study purposes
- From: Jim Tittsler