tlug Mailing List Archive
Re: [tlug] Limits on file numbers in sort -m
- Date: Wed, 28 May 2014 15:21:17 +1000
- From: Jim Breen <jimbreen@example.com>
- Subject: Re: [tlug] Limits on file numbers in sort -m
- References: <CABHGxq7jYkDDLkF8uzzNK8WeU+37t1wgpVhk6VD2HQKyEi7wBw@mail.gmail.com> <CAJMSLH618MfmhL9ufAOfLXxw52i4STpF8dsc_+xe-2GRB3JM8g@mail.gmail.com> <87bnui8sky.fsf@uwakimon.sk.tsukuba.ac.jp>
On 28 May 2014 13:55, Stephen J. Turnbull <stephen@example.com> wrote:

> 黒鉄章 writes:
>
> > A small precursor to consider is if the filename expansion
> > (i.e. from *.interim to all the separate files) will exceed the
> > size of ARG_MAX. On my system you'd be OK (it's ~2M, i.e. more
> > than ~10k * 20 chars = ~200k)
>
> There are shell limits as well, even if ARG_MAX is huge. Jim probably
> wants to use xargs.

I can't see where xargs would assist with "sort -m ....", as by
definition it wants all the files at once.

> > > I'm gearing up for a merging of a very large number of
> > > sorted text files(*). Does anyone know if there is an upper
> > > limit on how many sorted files can be merged using something
> > > like: "sort -m *.interim > final".
>
> I don't know about upper limits, but you might consider whether you
> wouldn't get much better performance from a multipass approach.
>
> > > Also, is it worth fiddling with the "--batch-size=NMERGE" option?
>
> Pretty much what I had in mind. Specifically, assuming 100-byte
> lines, merging 10 files at a time means 4GB in the first pass,
> comfortably fitting in your memory and allowing very efficient I/O.
> I'll bet that this is a big win (on the first pass only). On later
> passes, the performance analysis is non-trivial, but the I/O
> efficiency of having a big buffer for each file in the batch may
> outweigh the additional passes.

But does "sort -m ..." pull everything into RAM? If I were implementing
it, I'd keep a heap of open input files and pop lines from the
individual files as needed. Last night I did a test with ~150 files. I
don't know how it went about it, but it only used a moderate amount of
RAM, so I expect it's doing a classical file merge.

> Do you expect the output file to be ~= 40x10^9 lines!? Or is some
> uniquification going to be applied? If so, I suspect that
> interleaving merge and uniquification passes will be a lot faster.

Yes, I'll be doing uniquification, in which identical lines are counted
and tagged with their frequency, so that

    this
    this

will become

    this\t2

I can't get sort to do that, and rather than worry about adding a
multi-file merge to my uniquification utility, I'll do it in a separate
pass.

> For quad core, see the --parallel option. This is better documented
> in the Info manual for coreutils than in the man page.

I can't see that option in either the man page or the info/coreutils
manual. I see when I run sort (but not sort -m) that it goes parallel
by default: "top" shows the processor load going to 300+%.

Cheers

Jim

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
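[Editor's note: a minimal sketch of the merge-then-uniquify pipeline discussed
above, assuming GNU coreutils. The file names, the batch size of 1000, and the
uniq/awk count-and-tag step are illustrative assumptions, not what Jim actually
ran; --batch-size and --files0-from are GNU sort options.]

    # Merge the already-sorted chunks without re-sorting them.
    # --batch-size caps how many input files GNU sort merges at once;
    # sort performs any extra passes over temporary files itself.
    sort -m --batch-size=1000 *.interim > merged.txt

    # If the *.interim expansion ever bumps into ARG_MAX, GNU sort can
    # read the file list from stdin instead of the command line:
    #   find . -maxdepth 1 -name '*.interim' -print0 |
    #       sort -m --files0-from=- --batch-size=1000 > merged.txt

    # "Uniquification" pass: the merged output is sorted, so duplicate
    # lines are adjacent. Count each run with uniq -c, then rewrite its
    # "   N line" output as "line<TAB>N", matching the this -> this\t2
    # example above.
    uniq -c merged.txt |
        awk '{n = $1; sub(/^[ \t]*[0-9]+ /, ""); print $0 "\t" n}' > final.txt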
- References:
  - [tlug] Limits on file numbers in sort -m
    - From: Jim Breen
  - Re: [tlug] Limits on file numbers in sort -m
    - From: 黒鉄章
  - Re: [tlug] Limits on file numbers in sort -m
    - From: Stephen J. Turnbull