
[tlug] Re: tlug-digest Digest V2006 #28
- Date: Thu, 19 Jan 2006 09:49:20 +0900
- From: David Riggs <dariggs@example.com>
- Subject: [tlug] Re: tlug-digest Digest V2006 #28
- References: <200601181740.k0IHegS1024608@example.com>
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050420 Debian/1.7.7-2
> [tlug-digest] Re: [tlug] searching for kanji strings, ignore punctuation
> "Stephen J. Turnbull" <stephen@example.com>
>>>>>>"David" == David Riggs <dariggs@example.com> writes:
> Perl probably has a split function; make the kanji string a variable
> (see below for why), and split it on "" which will give you an array
> of characters. Then do a join with "\$w".
>
> (defun mung-run-perl (kanji)--snip--
--
Thanks for the lisp. Just the trick!
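
For the archive, here is a minimal sketch of the split-and-join idea quoted above, written in Python rather than Perl or Lisp. It is not the snipped mung-run-perl function. The names NOISE and noisy_pattern are made up for illustration, and it assumes that "noise" means anything outside the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which may be too narrow for texts that use extension characters.

```python
import re

# Assumption: "noise" is any run of characters that are not basic CJK
# ideographs (punctuation such as the "maru", spaces, digits, markup).
NOISE = r"[^\u4e00-\u9fff]*"

def noisy_pattern(kanji: str) -> re.Pattern:
    """Split the kanji into characters and join them with the noise pattern."""
    return re.compile(NOISE.join(re.escape(ch) for ch in kanji))

# Sample line with a line number, punctuation, and a line break as noise.
text = "012 如是我聞。一時、佛在\n013 舍衛國"
match = noisy_pattern("一時佛在舍衛國").search(text)
print(repr(match.group(0)) if match else None)
```

The pattern is scanned across the whole text on every search, which is why it gets slow on something the size of a canon.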
>
> But 60 seconds is a long time. You really should find some way to get
> this indexed. Is there any restriction on the strings, or are they
> basically arbitrary sequences of CJK ideographs?
No restriction on the CJK ideographs, which I want to treat as simply a
sequence of UTF-8 characters from the CJK region. The key thing is that
all the added noise (which in another context is extremely valuable
markup) must be ignored. I typically have a "quote" from an unknown
text, which my guy (writing in the mid-Edo period) is commenting on. He
just plops down the string of kanji, the way it really was, i.e. with no
punctuation, breaks, or anything else that is an addition to the "real"
text. And I need to find that same string. The kindly added "maru",
spaces, and line numbers are just noise for this purpose.
It seems likely that a tool for this already exists somewhere? Of
course I expect the index to be bigger than the data, but what's a
gigabyte or two when you are searching a canon? The additional speed
would be worth the space, and the all-night run to index it. The
Buddhist canon does not change very rapidly.
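
To make the idea concrete, here is a rough sketch of one way such an index could work, assuming (as above) that only basic CJK ideographs count as text. It strips the noise once, remembers where each kept character sat in the original, and records the position of every two-character sequence. The names is_cjk, build_index, and find are hypothetical, and everything is held in memory; a real canon-sized index would presumably need to live on disk.

```python
from collections import defaultdict

def is_cjk(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is "real" text.
    return "\u4e00" <= ch <= "\u9fff"

def build_index(text: str):
    """Strip noise, keep original offsets, and index every bigram position."""
    chars, offsets = [], []
    for pos, ch in enumerate(text):
        if is_cjk(ch):
            chars.append(ch)
            offsets.append(pos)
    clean = "".join(chars)
    bigrams = defaultdict(list)
    for i in range(len(clean) - 1):
        bigrams[clean[i:i + 2]].append(i)
    return clean, offsets, bigrams

def find(query: str, clean: str, offsets, bigrams):
    """Return (start, end) slices of the original text for each hit."""
    if len(query) < 2:
        raise ValueError("query must be at least two ideographs")
    hits = []
    # Only positions where the first bigram occurs need to be checked.
    for i in bigrams.get(query[:2], []):
        if clean.startswith(query, i):
            hits.append((offsets[i], offsets[i + len(query) - 1] + 1))
    return hits

text = "012 如是我聞。一時、佛在\n013 舍衛國"
clean, offsets, bigrams = build_index(text)
for start, end in find("佛在舍衛", clean, offsets, bigrams):
    print(start, end, repr(text[start:end]))
```

The bigram table is indeed larger than the data it indexes, but it only has to be built once, and each search then touches just the places where the first two characters of the query occur.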
Thanks,
David Riggs