TLUG Mailing List

Mailing List Archive

Re: [tlug] Limits on file numbers in sort -m

Date: Fri, 30 May 2014 03:52:58 +0900

From: "Stephen J. Turnbull" <stephen@example.com>

Subject: Re: [tlug] Limits on file numbers in sort -m

References: <CABHGxq7jYkDDLkF8uzzNK8WeU+37t1wgpVhk6VD2HQKyEi7wBw@mail.gmail.com> <CAJMSLH618MfmhL9ufAOfLXxw52i4STpF8dsc_+xe-2GRB3JM8g@mail.gmail.com> <87bnui8sky.fsf@uwakimon.sk.tsukuba.ac.jp> <CABHGxq4NEBMVR8jndiEvcgsGkc_B0f-qcrs2sFjqaAdWH3n9sw@mail.gmail.com> <CAJMSLH6SdSUmvHsjmZBZP-g1graNuPV51vdwLzpPf7ipmz7+zA@mail.gmail.com> <CABHGxq7eCk9Pk1JtNrZuqK_8yv4bt7ftoWwyXqf5P+GKYQH=5w@mail.gmail.com> <87sins7mhy.fsf@uwakimon.sk.tsukuba.ac.jp> <CAJA1Y2b6XyFNsFhDbK+ktgWk0cE5Lzfv9OrhimBH8RyN78yzLQ@mail.gmail.com>
Bruno Raoult writes:

 > Could you precise again "which kind of application"?

One that reads the entire contents of each of several thousand files
each of which is 4 million lines long.

 > A syscall is difficult to track, except when following them (which
 > is very difficult, but possible).  Using buffered I/O is to avoid
 > syscalls.

Exactly my point.

 > uniq -c does it.

It does.  The problem is that if Jim merges 100 files 100 times
(that's 10,000 files) and then runs uniq -c on each of the 100 merged
files, he now has 100 files each of which has a count for each line.
If the same "real" line appears in two of those files, then he gets two
lines for it, each with its own count.  That's what Jim meant by

    3 this                <- result of merge 1
    4 this                <- result of merge 2

Then he wants to merge the two files and get

    7 this

because there were 7 lines like that in all the different files.
(Actually he wants the count after the "real" text of the line, but
that's not a big deal.)  But he can't, because uniq doesn't understand
its own output format.  (You can use the -f option to uniq and the -k
option to sort to skip the count field when comparing, but you're not
quite there, because uniq won't add up the counts for identical lines
from the first-pass merges for you.)
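
For what it's worth, the missing second-pass step (merge the counted
files and add up the counts for identical lines) might look something
like the sketch below.  It is only an illustration, not anything Jim
is actually running: the names parse_count_line and merge_counted are
made up here, and it assumes every input is `uniq -c` output that is
already sorted on the text part in plain code-point order (what you
get from sort under LC_ALL=C).

```python
import heapq
import itertools


def parse_count_line(line):
    """Split one line of `uniq -c` output into (text, count).

    `uniq -c` writes a right-justified count, one space, then the line.
    """
    count, _, text = line.lstrip().partition(" ")
    return text.rstrip("\n"), int(count)


def merge_counted(*streams):
    """Merge several sorted `uniq -c` outputs, summing counts per line.

    Assumption: each stream is sorted by its text part in code-point
    order.  Only one pending line per stream is held in memory.
    """
    parsed = [map(parse_count_line, stream) for stream in streams]
    # k-way merge on the text, ignoring the leading count
    merged = heapq.merge(*parsed, key=lambda pair: pair[0])
    # identical texts are now adjacent, so group them and add the counts
    for text, group in itertools.groupby(merged, key=lambda pair: pair[0]):
        yield text, sum(count for _, count in group)
```

Fed the two first-pass results above, this yields ("this", 7).  Since
it hands back the text first and the count second, printing it in the
order Jim prefers is trivial.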