Mailing List Archive
Re: [tlug] [OT] Regular Expressions to find Japanese Text
- Date: Mon, 07 Aug 2006 01:19:29 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] [OT] Regular Expressions to find Japanese Text
- References: <44D5FB0A.6090605@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b27 (linux)
>>>>> Dave M G writes:

    Dave> The next step is to parse the words and definitions in such
    Dave> a way as to cleanly insert them into a database, which I can
    Dave> then use to create more personalized study lists.

The preferred way to do this (as Jim may have already remarked) is to
suck it out of Jim's XML sources, which means you'll have a preparsed
form (that's just the way XML works) ready for stuffing into the
database du jour. Look for references to "expat", "libxml2", and/or
"libneon" in the PHP docs.

    Dave> But it seems like it would be a lot more sophisticated if I
    Dave> could determine if a word was Japanese by testing its
    Dave> Unicode value or some similar method. That way I would be
    Dave> less vulnerable to slight variabilities in positioning of
    Dave> words in the source material.

Look for "ICU" (originally "IBM Classes for Unicode", since renamed to
the less corporate "International Components for Unicode"). If there
is a PHP module that wraps ICU, it should provide functions and/or
regexps for detecting "Unicode blocks".

Another alternative is to try to convert the character to JIS X 0208
or JIS X 0212. If those fail, you either don't have Japanese or you
have an exceedingly rare word.

Both ICU and the XML-related functionality are likely to be packaged
as add-on modules for PHP, rather than being part of the PHP
distribution.

    Dave> But this seems unwieldy, as I think, if I understand it
    Dave> correctly, I'd have to test each individual character. I
    Dave> could use it to test if there was any Japanese at all in a
    Dave> string, but I'm not confident I could use it to extract
    Dave> words.

Extracting Japanese words is a hard problem, unless you're lucky
enough to have them already broken out for you.

    Dave> In any case, if anyone has any tips for how I might create a
    Dave> logical way of looking at a string and selecting the
    Dave> Japanese words, that would be awesome.

I would start by trying to get my hands on the XML; that's the
RightThang[tm].
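To illustrate why the XML route is attractive, here is a minimal sketch, written in Python rather than PHP (the idea carries over to any XML parser). Everything specific in it is an assumption made up for the example: the element names `entry`, `word`, `reading`, and `gloss`, the sample data, and the `vocab` table layout. The real element names have to be taken from the actual XML files.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real schema will differ.
SAMPLE_XML = """<dictionary>
  <entry><word>猫</word><reading>ねこ</reading><gloss>cat</gloss></entry>
  <entry><word>犬</word><reading>いぬ</reading><gloss>dog</gloss></entry>
</dictionary>"""


def load_entries(xml_text):
    """Return (word, reading, gloss) tuples from the assumed schema above."""
    root = ET.fromstring(xml_text)
    return [
        (e.findtext("word"), e.findtext("reading"), e.findtext("gloss"))
        for e in root.iter("entry")
    ]


def store_entries(conn, entries):
    """Insert the parsed tuples into a simple, assumed table layout."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vocab (word TEXT, reading TEXT, gloss TEXT)"
    )
    conn.executemany("INSERT INTO vocab VALUES (?, ?, ?)", entries)
    conn.commit()


# In-memory database, just for the demonstration.
conn = sqlite3.connect(":memory:")
store_entries(conn, load_entries(SAMPLE_XML))
rows = conn.execute("SELECT word, gloss FROM vocab ORDER BY gloss").fetchall()
```

Because the parser hands back each field already separated, nothing here depends on where a word happens to sit in a line of text.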
If that doesn't work or seems like more annoyance than you want to go
to, trust Jim, and report any failures of the "whitespace delimits
fields in the returned string" rule to him as bugs. :-)

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
        Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
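For the two character-testing ideas in this message (Unicode blocks, and conversion to JIS X 0208 or JIS X 0212), a rough Python sketch follows. It is not the ICU API and not PHP. The code point ranges are an assumed, non-exhaustive selection (Hiragana, Katakana, CJK Unified Ideographs, plus the iteration mark U+3005), and the function names `japanese_runs` and `in_jis` are invented for the example. Python's `iso2022_jp_1` codec is used as a stand-in for "JIS X 0208 or JIS X 0212".

```python
import re

# Assumed ranges: U+3005 (iteration mark), Hiragana, Katakana,
# and CJK Unified Ideographs. Other blocks are deliberately ignored.
JAPANESE_RUN = re.compile(r"[\u3005\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+")


def japanese_runs(text):
    """Return maximal runs of characters from the assumed ranges."""
    return JAPANESE_RUN.findall(text)


def in_jis(ch, codec="iso2022_jp_1"):
    """True if a non-ASCII character converts to the given JIS-based codec."""
    if ord(ch) < 0x80:
        return False  # ASCII always converts, so it tells us nothing
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


print(japanese_runs("neko 猫 (ねこ) cat"))
print(in_jis("語"), in_jis("한"), in_jis("a"))
```

Both tests are coarse. The ideograph range also matches Chinese text, JIS X 0208 also contains Greek, Cyrillic, and symbol characters, and a run of matching characters is not necessarily one word, which is the hard problem noted above.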
- References:
- [tlug] [OT] Regular Expressions to find Japanese Text
- From: Dave M G