Re: tlug: namazu
- To: tlug@example.com
- Subject: Re: tlug: namazu
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 7 Feb 2000 13:41:33 +0900 (JST)
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <Pine.LNX.3.95.1000207124120.7577A-100000@example.com>
- References: <Pine.LNX.4.10.10002071151130.29766-100000@example.com><Pine.LNX.3.95.1000207124120.7577A-100000@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
>>>>> "Selva" == Selva Nair <selva@example.com> writes:

    Selva> On Mon, 7 Feb 2000, Tony Laszlo wrote:

    >> Namazu finished indexing 2,644 files (111M).
    >> It took 43.4 hours and created 368M of index

    Selva> Was thinking of installing namazu, but this sounds scary.
    Selva> But wait a minute, a 368M index from a 111M source?  Can't
    Selva> be 386K either as you have over 836K keywords.  Some
    Selva> misprint somewhere?

No.  A dictionary with 50,000 words at 10 bytes/word should fill maybe
0.5MB?  Uh-uh, a dictionary with no definitions is useless.  The index
should be able to compress things somewhat, so simply mentioning a path
and a byte index for each reference isn't going to take quite as much
space as a definition would, but the keyword itself is still the
smallest part of the reference.  And many keywords will have several
references.

    >> files with 836,439 keywords.

    Selva> Wow!  Didn't know there could be so many keywords, let
    Selva> alone *key*words in this whole world :)

Actually, they're key phrases, and they probably also include
information that allows partial matches or fuzzy matches, and multiple
criteria.  No?

    Selva> Does the time scale exponentially with data volume?

Algorithmically, it shouldn't: with reasonable data structures the
file reading should be a single linear-time pass, and constructing the
index is probably n log n.  But from Tony's travails you should recall
that Namazu very quickly outstripped his memory space; I would bet the
process started doing serious page thrashing at that point.  I rather
doubt that the Namazu programmers bothered to review Vol. 3 of Knuth
before starting this project.[1]

"Get More RAM!" solves all external search problems by turning them
into internal search.  :-)

Footnotes:
[1] For a serious attack on this problem, what you probably want to do
is have a data structure that limits the "top" of the index structure
to about half of available memory, then pages in "leaf nodes" from
files.
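As a back-of-the-envelope illustration of the point above (a hypothetical sketch, not Namazu's actual on-disk format): an inverted index stores, for every keyword, a posting list of (document, offset) references, so the references, not the keywords themselves, dominate the index size.

```python
# Hypothetical inverted-index sketch: shows why posting lists, not the
# keywords, account for most of an index's bulk.  Not Namazu's format.
from collections import defaultdict

def build_index(docs):
    """docs: {path: text}.  Returns {word: [(path, byte_offset), ...]}."""
    index = defaultdict(list)
    for path, text in docs.items():
        offset = 0
        for word in text.split():
            index[word.lower()].append((path, offset))
            offset += len(word) + 1  # crude byte offset (+1 for the space)
    return index

docs = {
    "a.txt": "namazu indexes files fast",
    "b.txt": "namazu creates a big index",
}
index = build_index(docs)

# "namazu" occurs in both documents, so its posting list holds two
# (path, offset) references, each already larger than the word itself.
assert len(index["namazu"]) == 2
```

A common keyword with thousands of occurrences carries thousands of such references, which is why a 368M index from a 111M source is plausible rather than a misprint.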
--
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?                        "XEmacs rules."
--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting: March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan
- Follow-Ups:
- Re: tlug: namazu
- From: Selva Nair <selva@example.com>
- References:
- tlug: namazu
- From: Tony Laszlo <laszlo@example.com>
- Re: tlug: namazu
- From: Selva Nair <selva@example.com>