Mailing List Archive
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Mon, 07 Aug 2006 10:23:47 +1000 (EST)
- From: Jim Breen <Jim.Breen@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
Dave M G <martin@example.com> wrote:

>> I want to divide the first line into three variables, $word, $reading,
>> and $meaning. And I want to divide the second line into two variables,
>> $word and $meaning.

The output from WWWJDIC's text glosser usually comes in two types:

歴史 【れきし】 (n) history; (P); EP

and

ヨーロッパ (n) Europe; (P); EP

If I had to parse this stuff for the purposes you state, I'd use a
simple state machine. Assuming you are using WWWJDIC's default
"glossdic", which has about 800,000 words/expressions, you can assume:

(a) the occurrence of a 【 】 encapsulates a reading, and after that
    you are into the translation region.

(b) once you reach a space followed by an ASCII character (usually
    alphabetic or a "("), you are into the translation region. If you
    didn't encounter a 【 】 pair along the way, the Japanese can be
    assumed to be kana-only.

The exception to the above is Japanese names, where you get stuff like

寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA

as 寿康 can be read several ways. Again it can be parsed quite
deterministically. The trigger is the multiple occurrences of 【 】.

Stephen suggested going to the XML sources. That really doesn't work,
as the glossdic file is built from 24 different files, only two of
which are available as XML. Also you'd miss out on the work WWWJDIC
puts into parsing the text, ducking and weaving around verb and
adjective inflections, etc.

HTH

Jim

--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology, Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia Fax: +61 3 9905 5146
(Monash Provider No. 00008C) ジム・ブリーン@モナシュ大学
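
For illustration only, here is one way rules (a) and (b) above, plus the repeated 【 】 case for names, might be sketched in Python. The function name parse_gloss_line and its return shape are made up for this example and are not part of WWWJDIC. It assumes one glosser entry per line, formatted like the samples above, and leaves trailing codes such as "EP" and "NA" inside the meaning text.

```python
def parse_gloss_line(line):
    """Split one glosser line into (word, [(reading, meaning), ...]).

    Hypothetical helper: reading is None when no 【 】 pair was seen,
    i.e. the headword is assumed to be kana-only.
    """
    line = line.strip()
    i, n = 0, len(line)

    # State 1: collect the headword until rule (a) or rule (b) fires.
    while i < n:
        ch = line[i]
        if ch == "【":
            break  # rule (a): a reading starts here
        if ch == " " and i + 1 < n and line[i + 1].isascii():
            break  # rule (b): space followed by an ASCII character
        i += 1
    word = line[:i].strip()
    rest = line[i:].strip()

    # No 【 】 pair along the way: everything left is the translation.
    if not rest.startswith("【"):
        return word, [(None, rest)]

    # State 2: one or more 【reading】 translation groups (names have several).
    entries = []
    while rest.startswith("【"):
        close = rest.find("】")
        if close == -1:
            break  # malformed input, stop rather than guess
        reading = rest[1:close]
        rest = rest[close + 1:]
        nxt = rest.find("【")
        if nxt == -1:
            meaning, rest = rest.strip(), ""
        else:
            meaning, rest = rest[:nxt].strip(), rest[nxt:]
        entries.append((reading, meaning))
    return word, entries
```

A line with a single reading yields one (reading, meaning) pair, a kana-only line yields one pair whose reading is None, and a name line yields one pair per 【 】 group, so the caller can tell the three cases apart by the length of the list and whether the reading is None.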