Mailing List Archive

RE: tlug: namazu



>Was thinking of installing namazu, but this sounds scary.
>But wait a minute, a 368M index from a 111M source? Can't be 386K
>either as you have over 836K keywords. Some misprint somewhere?

>>    files with 836,439
>>    keywords.
>
>Wow ! Didn't know  there could be so many keywords, let alone *key*words
>in this whole world :) ...

The Namazu indexer (like all Japanese full-text search utilities) pipes all
input files through a filter that splits the Japanese text stream into "word"
units before indexing.  From a casual look at the output of the same utility
(Kakasi -- it's also used by the WAIS-sf-jp indexer that I've installed here),
it looks as though it's quite good on kanji, but interprets a lot of trailing
hiragana as whole-word clusters.  These strings could account for a lot
of the bloat.  It is also possible (pure speculation on my part) that Namazu
does not have a facility for stopping high-frequency words such as particles.
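
To make the stop-word point concrete, here is a minimal sketch in Python of
an inverted index built with and without a stop list.  It is not Namazu's
code: the function names, the stop list and the sample sentences are all
made up for illustration, and the input is assumed to be already split into
space-separated "word" units, as a segmenter would emit.

```python
from collections import defaultdict

# Hypothetical stop list: a handful of common Japanese particles.
STOP_WORDS = {"は", "が", "を", "に", "の", "で", "と", "も"}


def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of document ids that contain it.

    `docs` maps a document id to text that is already split into
    space-separated "word" units.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            if term not in stop_words:
                index[term].add(doc_id)
    return dict(index)


def posting_count(index):
    """Total number of (term, document) pairs stored in the index."""
    return sum(len(ids) for ids in index.values())


# Made-up sample documents, pre-segmented by hand.
docs = {
    1: "私 は 東京 に 住んで います",
    2: "東京 の 天気 は 晴れ です",
}

full = build_index(docs)
trimmed = build_index(docs, STOP_WORDS)
print(len(full), posting_count(full))        # terms and postings, no stop list
print(len(trimmed), posting_count(trimmed))  # terms and postings, with stop list
```

On real data the particles show up in nearly every file, so their posting
lists are the longest ones in the index; dropping them is where most of the
saving would come from.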

The size of indexes is pretty startling, but WAIS-sf routinely produces
indexes just a little larger than the original data.  Something is obviously
awry with Namazu, but it looks like it's an order-of-magnitude, not a
many-orders-of-magnitude, thing.  :-)

I wonder how WAIS-sf-jp would have performed on the same data set
(I don't think I want to ask Tony to give it a try, I just *wonder*, you know).
Running in English mode, it returns an index almost immediately on a small
data set of many small files.  In Japanese mode, it is much slower, because
everything gets piped through nkf and kakasi, both run externally through
pipes.  Both are binaries; one wonders whether better performance
couldn't be obtained by incorporating them fully into the modified WAIS
binary itself.
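
For what it's worth, the external-filter arrangement looks roughly like the
Python sketch below.  This is an illustration only, not the WAIS-sf-jp code:
run_filters is a made-up helper, and STAND_IN is a placeholder command (a
tiny Python one-liner that upper-cases its input) used so the example runs
without nkf or kakasi installed.  A real setup would put the nkf and kakasi
command lines, with whatever options the indexer actually passes, in the
list instead.

```python
import subprocess
import sys

# Placeholder filter: reads stdin, writes an upper-cased copy to stdout.
# It stands in for an external binary such as nkf or kakasi.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def run_filters(data, commands):
    """Feed `data` (bytes) through each command in turn, like a shell pipeline.

    Every command is started as a separate process, so a call with two
    filters costs two process launches.
    """
    for argv in commands:
        result = subprocess.run(
            argv, input=data, stdout=subprocess.PIPE, check=True
        )
        data = result.stdout
    return data


# One input "file" pushed through a two-stage pipeline.
print(run_filters(b"namazu", [STAND_IN, STAND_IN]))
```

With many small files, the process launches are paid once per file per
filter, which is one plausible reason the Japanese mode feels so much
slower; linking the filters into the indexer would avoid that cost.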

Cheers,
Frank B


--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting:  March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan

