TLUG Mailing List

Re: [tlug] Limits on file numbers in sort -m

On Thu, May 29, 2014 at 8:52 PM, Stephen J. Turnbull <stephen@example.com> wrote:

Bruno Raoult writes:

> Could you specify again "which kind of application"?

One that reads the entire contents of each of several thousand files
each of which is 4 million lines long.

> Syscalls are difficult to track, except by following them (which
> is very difficult, but possible). The point of buffered I/O is to
> avoid syscalls.

Exactly my point.

> uniq -c does it.

It does. The problem is that if Jim merges 100 files 100 times
(that's 10,000 files) and then runs uniq -c on each of the 100 merged
files, he now has 100 files each of which has a count for each line.
If the "real" line in two such files is a duplicate, then he gets two
such lines. That's what Jim meant by

3 this <- result of merge 1
4 this <- result of merge 2

Then he wants to merge the two files and get

7 this

because there were 7 lines like that in all the different files.
(Actually he wants the count after the "real" text of the line, but
that's not a big deal.) But he can't, because uniq doesn't know about
its own output format. (You can use uniq's -f flag and sort's -k flag
to ignore the counts for uniquifying and sorting, but you're not quite
there because uniq won't add up the counts for identical lines from
the 1st pass merges for you.)
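
To make that last step concrete, here is a minimal sketch of the missing "add up the counts" pass. It assumes the first-pass files are in uniq -c format (optional leading blanks, a count, one space, then the text). The name sum_counts is a hypothetical helper, not an existing tool, and awk is swapped in for the summing because uniq itself cannot do it.

```shell
# Hypothetical helper (not a standard tool): sums the counts of
# identical lines in "uniq -c"-style input read from stdin.
sum_counts() {
    awk '{
        count = $1                      # the leading count from uniq -c
        sub(/^[ \t]*[0-9]+ /, "")       # strip the count, keep the original text
        total[$0] += count              # accumulate per distinct text
    }
    END { for (text in total) print total[text], text }' | sort -k2
}

# The two first-pass results from the example above:
printf '3 this\n4 this\n' | sum_counts
# prints: 7 this
```

This keeps one array entry per distinct line in memory, which may or may not be acceptable for data of this size; it is only meant to show the shape of the second pass.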

I kept your entire post on purpose...

So "uniq *" was able to read files, but "sort -m *" was not, right?

And a "uniq | sort | uniq" is not possible???

I am stupid, I don't understand the issue at all :-(, and I would like
to understand clearly, with output of commands if possible...

br.

--
2 + 2 = 5, for very large values of 2.