RE: tlug: namazu
- To: <tlug@example.com>
- Subject: RE: tlug: namazu
- From: "Frank Bennett" <bennett@example.com>
- Date: Mon, 7 Feb 2000 13:34:52 +0900
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain;charset="iso-8859-1"
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
>Was thinking of installing namazu, but this sounds scary.
>But wait a minute, a 368M index from a 111M source? Can't be 386K
>either as you have over 836K keywords. Some misprint somewhere?
>> files with 836,439
>> keywords.
>
>Wow ! Didn't know there could be so many keywords, let alone *key*words
>in this whole world :)
...

The Namazu indexer (like all Japanese full-text search utilities) pipes all input files through a filter that splits the Japanese text stream into "word" units before indexing. From a casual look at the output of the same utility (Kakasi, which is also used by the WAIS-sf-jp indexer that I've installed here), it looks as though it's quite good on kanji, but interprets a lot of trailing hiragana as whole-word clusters. These strings could account for a lot of the bloat. It is also possible (pure speculation on my part) that Namazu does not have a facility for stopping high-frequency words such as particles.

The size of indexes is pretty startling, but WAIS-sf routinely produces indexes just a little larger than the original data. Something is obviously awry with Namazu, but it looks like it's an order-of-magnitude thing, not a many-orders-of-magnitude thing. :-)

I wonder how WAIS-sf-jp would have performed on the same data set (I don't think I want to ask Tony to give it a try, I just *wonder*, you know). Running in English mode, it returns an index almost immediately on a small data set of many small files. In Japanese mode, it is much slower, because everything gets passed through nkf and kakasi, which run as external processes connected by pipes. Both are binaries; one wonders whether better performance couldn't be obtained by incorporating them fully into the modified WAIS binary itself.
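To make the stop-word speculation above a bit more concrete, here is a minimal Python sketch of an inverted index built with and without a stop list. It is only an illustration, not how Namazu or WAIS-sf-jp actually work: the romanized sample tokens stand in for the output of a segmenter such as Kakasi, and the names STOP_WORDS, build_index and posting_count are hypothetical.

```python
from collections import defaultdict

# Hypothetical stop list of high-frequency particles (romanized).
# This is an assumption for illustration, not a list taken from Namazu.
STOP_WORDS = frozenset({"wa", "ga", "wo", "ni", "no", "de", "to", "mo"})


def build_index(docs, stop_words=frozenset()):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            if token not in stop_words:
                index[token].add(doc_id)
    return dict(index)


def posting_count(index):
    """Total number of (token, document) pairs stored in the index."""
    return sum(len(doc_ids) for doc_ids in index.values())


# Pre-segmented sample documents (assumed segmenter output).
DOCS = {
    1: ["watashi", "wa", "tokyo", "ni", "sunde", "imasu"],
    2: ["tokyo", "no", "tenki", "wa", "hare", "desu"],
    3: ["tenki", "ga", "yoi", "no", "de", "sanpo", "ni", "ikimasu"],
}

full = build_index(DOCS)
filtered = build_index(DOCS, STOP_WORDS)

print("keywords:", len(full), "->", len(filtered))
print("postings:", posting_count(full), "->", posting_count(filtered))
```

On this toy input the particles make up a third of the keywords and an even larger share of the postings, because they occur in nearly every document. Whether anything like this explains the size of the Namazu index is, again, a guess.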
Cheers,
Frank B

--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting: March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp
Sponsor: Global Online Japan
- Follow-Ups:
- RE: tlug: namazu
- From: "Stephen J. Turnbull" <turnbull@example.com>