Mailing List Archive



Re: [tlug] [OT] Regular Expressions to find Japanese Text



 Dave M G <martin@example.com> wrote:
>> Jim said:
>> > (a) the occurrence of a 【 】 encapsulates a reading, and after that
>> > you are into the translation region.
>> > (b) once you reach a space followed by an ASCII character (usually alphabetic
>> > or a "("), you are into the translation region. If you didn't encounter
>> > a 【 】 pair along the way, the Japanese can be assumed to be kana-only.
   
>> There seem to be other issues, such as where it starts out by saying
>> "possible inflected verb", and "partial match". Is it the case that
>> sometimes there might be some kind of English text before a Japanese word?

I think the "Possible inflected verb or adjective" note is the only case
where an entry starts with non-Japanese text. The [Partial Match] tag is
appended at the end when the text and the entry do not match exactly.

>> Or is the issue with my parser? In order to pull out definitions, I've
>> selected text that begins with <li> and ends with <br>, as this seems to
>> account for all words extracted from a WWWJDIC search.

That <br> is redundant. I may remove it at some stage. Better to extract
between <li> and the next <li> or the terminal </ul>.
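
As a rough illustration of that extraction (a sketch only, not code from
WWWJDIC itself), the following Python pulls out the text between each <li>
and the next <li> or the closing </ul>. The function name extract_entries
and the sample HTML are hypothetical, and the real server output may
contain markup this does not anticipate.

```python
import re

# Capture everything after an <li> up to (but not including) the next <li>
# or the terminal </ul>.
ITEM = re.compile(r"<li>(.*?)(?=<li>|</ul>)", re.IGNORECASE | re.DOTALL)

def extract_entries(html):
    """Return the plain text of each list item in a WWWJDIC-style result."""
    entries = []
    for raw in ITEM.findall(html):
        text = re.sub(r"<[^>]+>", "", raw)     # drop leftover tags such as <br>
        entries.append(" ".join(text.split()))  # collapse whitespace/newlines
    return entries

# Hypothetical sample input, made up for illustration.
sample = (
    "<ul><li>今日 【きょう】 (n-t) today<br>\n"
    "<li>バヤイ (n-adv,n) case; situation<br>\n</ul>"
)
entries = extract_entries(sample)
```

Because the match stops at the next <li> or </ul>, the result does not
depend on the redundant <br> being present.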

>> > The exception to the above is Japanese names, where you get
>> > stuff like 
>> >  寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA
>> 
>> Is it only Japanese names that have multiple readings? I would have
>> thought there would also be regular words with multiple readings,
>> especially with verbs with multiple inflections.

Several entries with kanji headwords have multiple readings. They will
be in the 【 】 region with ";" between them. 
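
Rules (a) and (b) quoted earlier, plus the ";" separator, could be sketched
in Python roughly as follows. This is an assumption-laden example rather
than a definitive parser: split_entry is a hypothetical name, and it does
not handle the "Possible inflected verb or adjective" prefix or the
trailing [Partial Match] tag.

```python
import re

# Rule (a): a 【 】 pair encloses the reading(s).
READING = re.compile(r"【(.*?)】")
# Rule (b): a space followed by an ASCII letter or "(" starts the translation.
TRANSLATION_START = re.compile(r" (?=[A-Za-z(])")

def split_entry(entry):
    """Split one entry into (headword, readings, translation)."""
    m = READING.search(entry)
    if m:
        headword = entry[:m.start()].strip()
        readings = [r.strip() for r in m.group(1).split(";")]
        translation = entry[m.end():].strip()
        return headword, readings, translation
    # No 【 】 pair found: treat the Japanese part as kana-only.
    t = TRANSLATION_START.search(entry)
    if t is None:
        return entry.strip(), [], ""
    return entry[:t.start()].strip(), [], entry[t.end():].strip()

with_kanji = split_entry("今日は 【きょうは;こんにちは】 (1) (n-t) today")
kana_only = split_entry("バヤイ (n-adv,n) case; situation; (slangy version of 場合)")
```

Searching for 【 】 first means Japanese text inside the translation, as in
the second call, does not confuse the split.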

*In General* the entries in EDICT with multiple headwords/readings are
broken up into their combinations, and the glossdic only has the most common
one. Where the reading affects the meaning, I use a special file
which has hybrids like:

	今日は [きょうは;こんにちは] /(1) (n-t) (きょうは) today/this day/
	(2) (int) (こんにちは) hello/good day (daytime greeting)/

Note the [..] gets formatted as 【...】. Note also that there may be
Japanese text in the translation, e.g.

	バヤイ (n-adv,n) case; situation; (slangy version of 場合)

I do names differently because "glossdic" uses a special version of the
names file in which the readings/transliterations are strung out like that.
Also the merge attempts to put the common readings at the front.
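
For name entries strung out in that way, one possible approach (again only
a sketch, with the hypothetical name split_name_entry) is to collect each
【 】 reading together with the text that follows it, up to the next 【:

```python
import re

# Each reading in 【 】 is followed by its transliteration and type code,
# which run until the next 【 or the end of the entry.
NAME_PAIR = re.compile(r"【(.*?)】\s*([^【]*)")

def split_name_entry(entry):
    """Return (headword, [(reading, transliteration), ...]) for a name entry."""
    headword = entry.split("【", 1)[0].strip()
    pairs = [(r.strip(), t.strip()) for r, t in NAME_PAIR.findall(entry)]
    return headword, pairs

name = split_name_entry(
    "寿康 【としやす】 Toshiyasu (g) 【じゅこう】 Jukou (g) 【ひさやす】 Hisayasu (u) NA"
)
```

The pairs keep their original order, so if the merge puts the common
readings first, the first pair is the most likely one. Any trailing text
(such as the "NA" in the example) stays attached to the last pair.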

>> Here's a question that has relevance to the flash card program that I am
>> importing data into:
>> 
>> What word (or name) in the WWWJDIC server has the most readings and
>> definitions, and how many does it have?

Well, 付ける;着ける [つける] has 25 meanings grouped in 11 senses.
Readings are a bit harder to count, but I think there are entries with 
5 or 6.

>> Botond said:
>> > You should also consider the fact that there are edict dictionary files
>> > in other languages also, not just Japanese-English.

>> That is a good consideration for a more generally adopted application.
>> However, even though I'd share the source with anyone who might find it
>> useful, what I'm working on now is for my own purposes and so I can
>> guarantee that I'm only going to be using the Japanese-English dictionaries.

Some people use that function of WWWJDIC with the French and/or
German dictionaries. The same parsing suggestions apply.

Cheers

Jim

-- 
Jim Breen                                http://www.csse.monash.edu.au/~jwb/
Clayton School of Information Technology,               Tel: +61 3 9905 9554
Monash University, VIC 3800, Australia                  Fax: +61 3 9905 5146
(Monash Provider No. 00008C)                ジム・ブリーン@モナシュ大学

