Mailing List Archive
Re: [tlug] searching for kanji strings, ignore punctuation and endof lines
- Date: Mon, 16 Jan 2006 18:56:27 +0900
- From: "Stephen J. Turnbull" <stephen@example.com>
- Subject: Re: [tlug] searching for kanji strings, ignore punctuation and endof lines
- References: <43CB4F48.1060200@example.com>
- Organization: The XEmacs Project
- User-agent: Gnus/5.1007 (Gnus v5.10.7) XEmacs/21.5-b24 (dandelion, linux)
>>>>> "David" == David Riggs <dariggs@example.com> writes:

    David> The line numbers are easy to ignore, they are a fixed set
    David> of [-0-9()pabc], and the output of grep will include the
    David> file name, also a fixed set of 0-9 and ascii letters. BUT,
    David> I need to get that file name and the line number!

egrep will emit both the file name and the line number on each line if
you give it multiple files and the -n flag.

Since the line numbering and newlines are ASCII, in perl (python, ruby,
elisp) you could do

    [[:kanji1:]][\000-\177[:maru:]]*[[:kanji2:]][\000-\177[:maru:]]*[[:kanji3:]]

where [:xyz:] is pseudo-code for a specific named character. This will
wrap around lines, since \012 (newline) is in the ASCII range. Writing
the perl to take a string and convert it to a regexp like the above is
beyond me, though (perl is a 4-letter word, that's why I use egrep and
elisp).

    David> But, from the silence on the second part of my question, I
    David> guess there is no pre-index program that would handle this
    David> kind of thing and do it in a flash? Even a simple search on
    David> my data set takes a while.

namazu and FreeWAIS come to mind. namazu is pretty common, and Frank
Bennett at Nagoya U is an expert on FreeWAIS. Be aware that your
indices are likely to be bigger than your corpus unless you're very
slick with data structures.

You might also look at agrep, but last I checked it didn't know about
multibyte characters.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
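The string-to-regexp conversion described in the message can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the function names `build_pattern` and `find_lines` and the punctuation set `CJK_PUNCT` are hypothetical, and the set of ignorable non-ASCII punctuation would need adjusting for the actual corpus. The pattern allows any run of ASCII characters (line numbers, newlines, spaces) or listed punctuation between consecutive characters of the query, and the line number of each hit is recovered by counting newlines before the match.

```python
import re

# Assumed set of ignorable non-ASCII punctuation (hypothetical; extend as needed):
# ideographic comma, ideographic full stop, corner brackets, middle dot.
CJK_PUNCT = "\u3001\u3002\u300c\u300d\u30fb"


def build_pattern(query, extra=CJK_PUNCT):
    """Compile a regexp matching the characters of `query` in order,
    allowing ASCII characters and `extra` punctuation between them."""
    filler = "[\\x00-\\x7f" + re.escape(extra) + "]*"
    return re.compile(filler.join(re.escape(ch) for ch in query))


def find_lines(pattern, text):
    """Return the 1-based line number where each match starts."""
    return [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "12a \u6771\u4eac\n(13b) \u5927\u3002\u5b66 foo\n\u6771\u4eac\u5927\u5b66\n"
    pat = build_pattern("\u6771\u4eac\u5927\u5b66")
    print(find_lines(pat, sample))
```

Because newline is part of the ASCII filler class, a match may span several lines; reading each file whole and reporting the file name alongside the numbers from `find_lines` would give output comparable to what `egrep -n` prints for single-line matches.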
- References:
- [tlug] searching for kanji strings, ignore punctuation and end of lines
- From: David Riggs