Re: tlug: namazu
- To: tlug@example.com
- Subject: Re: tlug: namazu
- From: "Stephen J. Turnbull" <turnbull@example.com>
- Date: Mon, 7 Feb 2000 13:41:33 +0900 (JST)
- Content-Transfer-Encoding: 7bit
- Content-Type: text/plain; charset=us-ascii
- In-Reply-To: <Pine.LNX.3.95.1000207124120.7577A-100000@example.com>
- References: <Pine.LNX.4.10.10002071151130.29766-100000@example.com><Pine.LNX.3.95.1000207124120.7577A-100000@example.com>
- Reply-To: tlug@example.com
- Sender: owner-tlug@example.com
>>>>> "Selva" == Selva Nair <selva@example.com> writes:

    Selva> On Mon, 7 Feb 2000, Tony Laszlo wrote:

    >> Namazu finished indexing 2,644 files (111M).
    >> It took 43.4 hours and created 368M of index

    Selva> Was thinking of installing namazu, but this sounds scary.
    Selva> But wait a minute, a 368M index from a 111M source?  Can't
    Selva> be 386K either as you have over 836K keywords.  Some
    Selva> misprint somewhere?

No.  A dictionary with 50,000 words at 10 bytes/word should fill maybe
0.5MB?  Uh-uh, a dictionary with no definitions is useless.  The index
should be able to compress things somewhat, so simply mentioning a path
and a byte index for each reference isn't going to take quite as much
space as a definition would, but the keyword itself is still the
smallest part of the reference.  And many keywords will have several
references.

    >> files with 836,439 keywords.

    Selva> Wow!  Didn't know there could be so many keywords, let
    Selva> alone *key*words in this whole world :)

Actually, they're key phrases, and they probably also include
information that allows partial matches or fuzzy matches, and multiple
criteria.  No?

    Selva> Does the time scale exponentially with data volume?

Algorithmically, it shouldn't: with reasonable data structures the
file reading should be a single linear-time pass, and constructing the
index is probably n log n.  But from Tony's travails you should recall
that Namazu very quickly outstripped his memory space; I would bet the
process started doing serious page thrashing at that point.  I rather
doubt that the Namazu programmers bothered to review Vol. 3 of Knuth
before starting this project.[1]

"Get More RAM!" solves all external search problems by turning them
into internal search.  :-)

Footnotes:
[1] For a serious attack on this problem, what you probably want to do
is have a data structure that limits the "top" of the index structure
to about half of available memory, then pages in "leaf nodes" from
files.
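As a back-of-the-envelope illustration of the point above (a hypothetical sketch, not Namazu's actual on-disk format): an inverted index stores, for every keyword, a posting list of (document, offset) references, so the references, not the keywords themselves, dominate the index size.

```python
# Hypothetical inverted-index sketch: shows why posting lists, not the
# keywords, account for most of an index's bulk.  Not Namazu's format.
from collections import defaultdict

def build_index(docs):
    """docs: {path: text}.  Returns {word: [(path, byte_offset), ...]}."""
    index = defaultdict(list)
    for path, text in docs.items():
        offset = 0
        for word in text.split():
            index[word.lower()].append((path, offset))
            offset += len(word) + 1  # crude byte offset (+1 for the space)
    return index

docs = {
    "a.txt": "namazu indexes files fast",
    "b.txt": "namazu creates a big index",
}
index = build_index(docs)

# "namazu" occurs in both documents, so its posting list holds two
# (path, offset) references, each already larger than the word itself.
assert len(index["namazu"]) == 2
```

A common keyword with thousands of occurrences carries thousands of such references, which is why a 368M index from a 111M source is plausible rather than a misprint.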
--
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?                        "XEmacs rules."
--------------------------------------------------------------------
Next Nomikai Meeting: February 18 (Fri) 19:00 Tengu TokyoEkiMae
Next Technical Meeting: March 11 (Sat) 13:00 Temple University Japan
* Topic: TBD
--------------------------------------------------------------------
more info: http://www.tlug.gr.jp        Sponsor: Global Online Japan
- Follow-Ups:
- Re: tlug: namazu
- From: Selva Nair <selva@example.com>
- References:
- tlug: namazu
- From: Tony Laszlo <laszlo@example.com>
- Re: tlug: namazu
- From: Selva Nair <selva@example.com>