Mailing List Archive
[tlug] [tlug-digest] searching for kanji strings, ignore punctuation and end of lines. Text indexing and retrival in unicode.
- Date: Sat, 14 Jan 2006 09:52:42 +0900
- From: David Riggs <dariggs@example.com>
- Subject: [tlug] [tlug-digest] searching for kanji strings, ignore punctuation and end of lines. Text indexing and retrival in unicode.
- References: <200601130511.k0D5BxWg015897@example.com>
- User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050420 Debian/1.7.7-2
I need to find short kanji strings in a giant haystack of texts. Grep does not work because the haystack (the CBETA canon of Buddhist texts) adds punctuation characters, and inserts newline characters and line numbers.

One approach is to make a pipeline that greps for the first kanji, strips out the punctuation characters with sed, and searches again for the rest of the kanji:

    grep first-ji * | sed -e 's/[,;]//g' | grep later-kanjis

This is OK, but it does not work for a string which spans a newline, and anyway, I am not sure that sed is really doing a character replacement (the real punctuation is Unicode two-byte maru and space). If it is doing a byte-by-byte replacement, it could mangle kanji by taking the second byte of one and the first byte of the following ji.

Is there a way to do this, preferably a fast way? My haystack is hundreds of megabytes and I have to do it a lot.

On the other hand, instead of searching each time, is there a text indexing and search system which works with Unicode? All I find googling around is commercial stuff which seems orientated towards western languages.

Thanks,
David Riggs
Kyoto
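One way to sidestep both the byte-mangling worry and the newline problem described above is to match on decoded characters rather than bytes, and to let the pattern itself tolerate interruptions between kanji. What follows is only a minimal Python sketch under stated assumptions, not a tested solution for the CBETA files: the names `build_pattern` and `find_all` are hypothetical, and the set of ignorable characters in `SKIP` (whitespace, a few ASCII and CJK punctuation marks, and ASCII digits standing in for line numbers) is a guess that would need adjusting to the real data.

```python
import re
from typing import List, Tuple

# Assumption: characters that may interrupt a phrase in the haystack.
# For str patterns, \s matches Unicode whitespace, which covers newlines
# and the ideographic space (U+3000). The digits are a stand-in for
# inserted line numbers; the real line-number format may differ.
SKIP = r"[\s、。，；：,;:.0-9]*"


def build_pattern(needle: str) -> "re.Pattern":
    """Compile a regex that matches the characters of `needle` in order,
    allowing any run of SKIP characters between consecutive characters."""
    return re.compile(SKIP.join(re.escape(ch) for ch in needle))


def find_all(needle: str, text: str) -> List[Tuple[int, str]]:
    """Return (character offset, matched span) for each occurrence."""
    pattern = build_pattern(needle)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    # Made-up sample: a phrase split by a newline and a fake line number.
    sample = "如是我聞。一時\n0002 佛在舍衛國，祇樹給孤獨園。"
    for offset, span in find_all("一時佛在", sample):
        print(offset, repr(span))
```

Because Python 3 strings are sequences of code points, a file read with an explicit encoding (for example `open(path, encoding="utf-8")`) cannot have half of one kanji joined to half of the next. The trade-off in this sketch is that anything in `SKIP` is ignored everywhere, so digits or punctuation that are genuinely part of the text would also be skipped over.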
- Follow-Ups:
- Re: [tlug] Searching for kanji strings: Use UTF-8
- From: Jim
- Re: [tlug] [tlug-digest] searching for kanji strings, ignore punctuation and end of lines. Text indexing and retrival in unicode.
- From: Josh Glover
- Re: [tlug] [tlug-digest] Regex Efficiency
- From: Jim
- Re: [tlug] Nasty Problem: searching for strings that span newlines
- From: Jim
- Avoid Premature Optimization (Re: [tlug] searching for kanjistrings)
- From: Jim