TLUG Mailing List

Mailing List Archive

[tlug] [tlug-digest] searching for kanji strings, ignore punctuation and end of lines. Text indexing and retrival in unicode.

Date: Sat, 14 Jan 2006 09:52:42 +0900

From: David Riggs <dariggs@example.com>

Subject: [tlug] [tlug-digest] searching for kanji strings, ignore punctuation and end of lines. Text indexing and retrival in unicode.

References: <200601130511.k0D5BxWg015897@example.com>

User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.7) Gecko/20050420 Debian/1.7.7-2
I need to find short kanji strings in a giant haystack of texts. A plain
grep does not work because the haystack (the CBETA canon of Buddhist
texts) has punctuation characters, newline characters, and line numbers
inserted into the text.

One approach is to make a pipeline to grep for the first kanji, strip 
out the punctuation characters with sed, and search again for the rest 
of the kanji:

grep first-ji * | sed -e 's/[,;]//g' | grep later-kanjis

This is OK, but it does not work for a string that spans a line break.
Also, I am not sure that sed is really doing a character-by-character
replacement (the real punctuation is the Unicode two-byte maru and
space). If it is doing a byte-by-byte replacement, it could mangle kanji
by matching the second byte of one and the first byte of the following ji.
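
A minimal sketch of a character-based search that tolerates inserted
punctuation, line breaks, and line numbers. It is not a method from this
thread. It assumes the text has already been decoded to a Python str (so
matching works on code points, never on raw bytes) and that anything
which is not a CJK ideograph may be skipped between needle characters.
The names GAP, build_pattern, and find_all are made up for illustration.

```python
import re

# Assumption: every character outside these CJK ideograph ranges
# (punctuation, spaces, newlines, ASCII line numbers) is skippable.
GAP = "[^\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\U00020000-\U0002fa1f]*"


def build_pattern(needle):
    """Compile a regex that matches the needle's characters in order,
    allowing a run of skippable characters between each pair."""
    return re.compile(GAP.join(re.escape(ch) for ch in needle))


def find_all(needle, text):
    """Return (start offset, matched span) for every hit in text."""
    return [(m.start(), m.group(0))
            for m in build_pattern(needle).finditer(text)]


if __name__ == "__main__":
    sample = "T01p0001a01 如是、我\nT01p0001a02 聞。一時"
    for start, span in find_all("如是我聞", sample):
        print(start, repr(span))
```

Because the pattern is applied to the whole decoded text rather than
line by line, a match can span a line break. For a haystack of hundreds
of megabytes this means holding each file in memory while it is
searched, which may or may not be fast enough.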


Is there a way to do this, preferably a fast one? My haystack is
hundreds of megabytes and I have to search it often.

On the other hand, instead of searching each time, is there a text
indexing and search system which works with Unicode? All I find googling
around is commercial stuff which seems oriented towards Western languages.
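
To illustrate what indexing could look like here, a rough sketch of a
character-bigram inverted index. It is an illustration under stated
assumptions, not a recommendation of any existing system: texts are
given as decoded Python strings, only CJK ideographs are indexed, and
the names is_kanji, normalize, build_index, and search are hypothetical.

```python
from collections import defaultdict


def is_kanji(ch):
    """Assumption: only these CJK ideograph ranges count as searchable."""
    cp = ord(ch)
    return (0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF
            or 0xF900 <= cp <= 0xFAFF or 0x20000 <= cp <= 0x2FA1F)


def normalize(text):
    """Drop punctuation, whitespace, newlines, and line numbers."""
    return "".join(ch for ch in text if is_kanji(ch))


def build_index(docs):
    """Map each two-kanji sequence to the set of documents containing it.

    docs is a dict of {document name: raw text}.
    """
    index = defaultdict(set)
    cleaned = {}
    for name, text in docs.items():
        kanji = normalize(text)
        cleaned[name] = kanji
        for i in range(len(kanji) - 1):
            index[kanji[i:i + 2]].add(name)
    return index, cleaned


def search(needle, index, cleaned):
    """Return the sorted names of documents that contain the needle."""
    needle = normalize(needle)
    if not needle:
        return []
    grams = [needle[i:i + 2] for i in range(len(needle) - 1)]
    if not grams:
        # A single kanji has no bigram, so fall back to a full scan.
        return sorted(n for n, t in cleaned.items() if needle in t)
    # Only documents holding every bigram can contain the whole needle.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Confirm the bigrams actually occur as one contiguous string.
    return sorted(n for n in candidates if needle in cleaned[n])
```

The index is built once and each later search only touches the
documents that share all of the needle's bigrams. This sketch keeps
everything in memory, so a haystack of hundreds of megabytes would need
the index stored on disk instead.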

Thanks,

David Riggs

Kyoto
Follow-Ups:

Re: [tlug] Searching for kanji strings: Use UTF-8
From: Jim

Re: [tlug] [tlug-digest] searching for kanji strings, ignore punctuation and end of lines. Text indexing and retrival in unicode.
From: Josh Glover

Re: [tlug] [tlug-digest] Regex Efficiency
From: Jim

Re: [tlug] Nasty Problem: searching for strings that span newlines
From: Jim

Avoid Premature Optimization (Re: [tlug] searching for kanjistrings)
From: Jim
