Randomly draw a certain number of lines from a data file

I have a data list, like



12345
23456
67891
-20000
200
600
20
...


Assume the size of this data set (i.e. the number of lines in the file) is N. I want to randomly draw m lines from this data file. The output should therefore be two files: one containing the m drawn lines, and the other containing the remaining N-m lines.



Is there a way to do that using a Linux command?










linux shell text-processing






asked Jan 22 '12 at 13:44 by user288609, edited Jan 22 '12 at 13:49 by sr_







  • Are you concerned about the sequence of lines? E.g. do you want to maintain the source order, or do you want that sequence to be itself random as well as the choice of lines being random? – Peter.O, Jan 22 '12 at 14:04













5 Answers

















Accepted answer (score 18)










This might not be the most efficient way but it works:



shuf <file> > tmp
head -n $m tmp > out1
tail -n +$(( m + 1 )) tmp > out2


With $m containing the number of lines to draw.
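
For scripting, a small self-contained wrapper around the same idea might look like this (a sketch assuming GNU coreutils; data.txt, sample.txt and rest.txt are placeholder names, not from the answer):

m=100
tmp=$(mktemp)                            # temporary file for the shuffled copy
shuf data.txt > "$tmp"                   # put the lines in random order
head -n "$m" "$tmp" > sample.txt         # the m drawn lines
tail -n +"$((m + 1))" "$tmp" > rest.txt  # the remaining N-m lines
rm -f "$tmp"                             # clean up the temporary file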






answered Jan 22 '12 at 13:52 by Rob Wouters, edited Jan 23 '12 at 0:49 by Gilles






















  • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first. – Rob Wouters, Jan 22 '12 at 14:31

  • Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file. – Gilles, Jan 23 '12 at 0:45

  • Why not shuf <file> | head -n $m? – emanuele, Jun 19 '14 at 16:56

  • @emanuele: Because we need both the head and the tail in two separate files. – Rob Wouters, Jun 20 '14 at 7:39
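
Putting the comments above together (shuf for unbiased shuffling, and no temporary file), the split can also be done in one pass over the shuffled stream (a sketch, not part of the original answer; out1 and out2 are placeholder names):

shuf < file | awk -v m="$m" '{ print > (NR <= m ? "out1" : "out2") }'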

















Answer (score 5)













This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 \
 'BEGIN { srand()
          # pre-select m distinct random line numbers and remember them in R[]
          do { lnb = 1 + int(rand()*N)
               if ( !(lnb in R) ) {
                    R[lnb] = 1
                    ct++ }
             } while (ct<m)
        }
  # dispatch each input line according to whether its number was drawn
  { if (R[NR]==1) print > out1
    else          print > out2
  }' file
cat /tmp/out1
echo ========
cat /tmp/out2


Output, based on the data in the question.



12345
23456
200
600
========
67891
-20000
20
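
If GNU shuf is available, a similar order-preserving split can be sketched without the rejection loop, by letting shuf pick the m distinct line numbers up front (an alternative sketch, not part of this answer; data.txt and the /tmp names are placeholders):

m=4; file=data.txt
shuf -i 1-"$(wc -l < "$file")" -n "$m" > /tmp/picked      # m distinct random line numbers
awk 'NR==FNR { pick[$1] = 1; next }                       # first file: remember the picked numbers
     FNR in pick { print > "/tmp/out1"; next }            # picked lines, kept in original order
                 { print > "/tmp/out2" }' /tmp/picked "$file"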





answered Jan 22 '12 at 15:15 by Peter.O, edited Jan 22 '12 at 16:05





























Answer (score 4)













    As with all things Unix, There's a Utility for That™.



    Program of the day: split
    split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



    Now, the actual code. It's quite simple, really:



    sort -R input_file | split -l $m - output_prefix


    This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
    Make sure $m is the size of the larger of the two pieces you want, or you'll get several files of length m (and one of length N % m).



    If you want to ensure that you use the correct size, here's a little code to do that:



    m=10                           # size you want one file to be
    N=$(wc -l < input_file)        # total number of lines
    m=$(( m > N/2 ? m : N - m ))   # use the larger of the two piece sizes
    sort -R input_file | split -l $m - output_prefix


    Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.
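
    If you prefer friendlier names than split's default suffixes, the two pieces can simply be renamed afterwards (a small usage sketch; sample.txt and rest.txt are made-up names):

    sort -R input_file | split -l "$m" - part_
    mv part_aa sample.txt   # the m randomly chosen lines
    mv part_ab rest.txt     # the remaining N-m lines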






    answered Jan 22 '12 at 16:37 by Kevin, edited Apr 13 '17 at 12:36 by Community


















    • Unfortunately, sort -R appears to only be in some versions of sort (probably the GNU version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.) – fluffy, Jan 22 '12 at 18:49

    • Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split. – Gilles, Jan 23 '12 at 0:48

















Answer (score 3)













    If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



    There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



    <input shuf | awk -v m=$m '{ if (NR <= m) print > "output1"; else print > "output2" }'


    You can use sed, which is obscure but possibly faster for large files: the w command copies the first m lines to output1, and the d command deletes them, so only the remaining lines reach output2.



    <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


    Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



    <input shuf | { tee /dev/fd/3 | head -n $m >output1; } 3>&1 | tail -n +$(($m+1)) >output2


    Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system within a one-second period.



    <input awk -v N=$(wc -l <input) -v m=3 '
        BEGIN {srand()}
        {
            if (rand() * N < m) {--m; print >"output1"} else {print >"output2"}
            --N;
        }
    '
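
    The seeding caveat can be worked around by passing a seed in from the shell (an assumption on my part, not something the original answer does; it relies on od and /dev/urandom being available):

    seed=$(od -An -N4 -tu4 /dev/urandom)    # 32-bit random seed from the kernel
    <input awk -v N=$(wc -l <input) -v m=3 -v seed="$seed" '
        BEGIN {srand(seed)}
        {
            if (rand() * N < m) {--m; print >"output1"} else {print >"output2"}
            --N;
        }
    '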


    If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently (m is passed as the script's argument, 42 in this example).



    <input perl -e '
        open OUT1, ">", "output1" or die $!;
        open OUT2, ">", "output2" or die $!;
        my $N = `wc -l <input`;
        my $m = $ARGV[0];
        while (<STDIN>) {
            if (rand($N) < $m) { --$m; print OUT1 $_; } else { print OUT2 $_; }
            --$N;
        }
        close OUT1 or die $!;
        close OUT2 or die $!;
    ' 42





    answered Jan 23 '12 at 0:43 by Gilles, edited Jan 24 '12 at 15:09






















    • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print" – Peter.O, Jan 23 '12 at 4:49

    • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code. – Gilles, Jan 23 '12 at 10:12

    • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example. – Peter.O, Jan 23 '12 at 23:03

    • A buffering problem? Am I missing something? The head/cat combo causes loss of data in the second of the following tests (3-4). TEST 1-2: { for i in {00001..10000}; do echo $i; done; } | { head -n 5000 >out1; cat >out2; } .. TEST 3-4: for i in {00001..10000}; do echo $i; done >input; cat input | { head -n 5000 >out3; cat >out4; } ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good). The difference varies depending on the file sizes involved... Here is a link to my test code – Peter.O, Jan 24 '12 at 4:00

    • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions. – Gilles, Jan 24 '12 at 15:10

















Answer (score 2)













    Assuming m = 7 and N = 21:



    cp ints ints.bak
    for i in {1..7}
    do
        rnd=$((RANDOM%(21-i)+1))
        # echo $rnd
        sed -n "${rnd}{p;q}" ints >> mlines
        sed -i "${rnd}d" ints
    done


    Note:
    If you replace 7 with a variable like $1 or $m, you have to use seq, not the {from..to} notation, which doesn't do variable expansion.
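
    For example, parameterising both m and N (a sketch of the same loop, not part of the original answer):

    m=7
    N=$(wc -l < ints)
    for i in $(seq 1 "$m")
    do
        rnd=$((RANDOM % (N - i) + 1))
        sed -n "${rnd}{p;q}" ints >> mlines
        sed -i "${rnd}d" ints
    done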



    It works by deleting lines one by one from the file, which gets shorter and shorter, so the range from which the next line number is drawn has to get smaller and smaller.



    This should not be used for long files or many lines, since for every drawn number, on average, half the file needs to be read by the first sed, and the whole file has to be rewritten by the second.






    answered Jan 22 '12 at 14:19 by user unknown, edited Jan 24 '12 at 15:40






















    • He needs a file with the lines that are removed too. – Rob Wouters, Jan 22 '12 at 14:36

    • I thought "including these m lines of data" should mean including them but the original lines as well - therefore including, not consisting of, and not using only - but I guess your interpretation is what user288609 meant. I will adjust my script accordingly. – user unknown, Jan 22 '12 at 14:39

    • Looks good. – Rob Wouters, Jan 22 '12 at 14:52

    • @user unknown: You have the +1 in the wrong place. It should be rnd=$((RANDOM%(N-i)+1)) where N=21 in your example. It currently causes sed to crash when rnd is evaluated to 0. Also, it doesn't scale very well with all that file re-writing, e.g. 123 seconds to extract 5,000 random lines from a 10,000-line file vs. 0.03 seconds for a more direct method. – Peter.O, Jan 23 '12 at 12:04

    • @Peter.O: You're right (corrected) and you're right. – user unknown, Jan 23 '12 at 12:38










    Your Answer







    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "106"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f29709%2frandomly-draw-a-certain-number-of-lines-from-a-data-file%23new-answer', 'question_page');

    );

    Post as a guest






























    5 Answers
    5






    active

    oldest

    votes








    5 Answers
    5






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    18
    down vote



    accepted










    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.






    share|improve this answer






















    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39














    up vote
    18
    down vote



    accepted










    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.






    share|improve this answer






















    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39












    up vote
    18
    down vote



    accepted







    up vote
    18
    down vote



    accepted






    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.






    share|improve this answer














    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Jan 23 '12 at 0:49









    Gilles

    511k12010141543




    511k12010141543










    answered Jan 22 '12 at 13:52









    Rob Wouters

    51635




    51635











    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39
















    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39















    @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
    – Rob Wouters
    Jan 22 '12 at 14:31





    @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
    – Rob Wouters
    Jan 22 '12 at 14:31





    2




    2




    Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
    – Gilles
    Jan 23 '12 at 0:45





    Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
    – Gilles
    Jan 23 '12 at 0:45













    why not shuf <file> |head -n $m?
    – emanuele
    Jun 19 '14 at 16:56




    why not shuf <file> |head -n $m?
    – emanuele
    Jun 19 '14 at 16:56












    @emanuele: Because we need both the head and the tail in two separate files.
    – Rob Wouters
    Jun 20 '14 at 7:39




    @emanuele: Because we need both the head and the tail in two separate files.
    – Rob Wouters
    Jun 20 '14 at 7:39












    up vote
    5
    down vote













    This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



    awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
    'BEGIN srand()
    do lnb = 1 + int(rand()*N)
    if ( !(lnb in R) )
    R[lnb] = 1
    ct++
    while (ct<m)
    if (R[NR]==1) print > out1
    else print > out2
    ' file
    cat /tmp/out1
    echo ========
    cat /tmp/out2


    Output, based ont the data in the question.



    12345
    23456
    200
    600
    ========
    67891
    -20000
    20





    share|improve this answer


























      up vote
      5
      down vote













      This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



      awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
      'BEGIN srand()
      do lnb = 1 + int(rand()*N)
      if ( !(lnb in R) )
      R[lnb] = 1
      ct++
      while (ct<m)
      if (R[NR]==1) print > out1
      else print > out2
      ' file
      cat /tmp/out1
      echo ========
      cat /tmp/out2


      Output, based ont the data in the question.



      12345
      23456
      200
      600
      ========
      67891
      -20000
      20





      share|improve this answer
























        up vote
        5
        down vote










        up vote
        5
        down vote









        This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



        awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
        'BEGIN srand()
        do lnb = 1 + int(rand()*N)
        if ( !(lnb in R) )
        R[lnb] = 1
        ct++
        while (ct<m)
        if (R[NR]==1) print > out1
        else print > out2
        ' file
        cat /tmp/out1
        echo ========
        cat /tmp/out2


        Output, based ont the data in the question.



        12345
        23456
        200
        600
        ========
        67891
        -20000
        20





        share|improve this answer














        This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



        awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
        'BEGIN srand()
        do lnb = 1 + int(rand()*N)
        if ( !(lnb in R) )
        R[lnb] = 1
        ct++
        while (ct<m)
        if (R[NR]==1) print > out1
        else print > out2
        ' file
        cat /tmp/out1
        echo ========
        cat /tmp/out2


        Output, based ont the data in the question.



        12345
        23456
        200
        600
        ========
        67891
        -20000
        20






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Jan 22 '12 at 16:05

























        answered Jan 22 '12 at 15:15









        Peter.O

        18.4k1688143




        18.4k1688143




















            up vote
            4
            down vote













            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.






            share|improve this answer


















            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48














            up vote
            4
            down vote













            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.






            share|improve this answer


















            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48












            up vote
            4
            down vote










            up vote
            4
            down vote









            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.






            share|improve this answer














            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Apr 13 '17 at 12:36









            Community♦

            1




            1










            answered Jan 22 '12 at 16:37









            Kevin

            26k95797




            26k95797







            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48












            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48







            1




            1




            Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
            – fluffy
            Jan 22 '12 at 18:49




            Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
            – fluffy
            Jan 22 '12 at 18:49




            1




            1




            Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
            – Gilles
            Jan 23 '12 at 0:48




            Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
            – Gilles
            Jan 23 '12 at 0:48










            up vote
            3
            down vote













            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42





            share|improve this answer






















            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10














            up vote
            3
            down vote













            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42





            share|improve this answer






















            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10












            up vote
            3
            down vote










            up vote
            3
            down vote









            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42





            share|improve this answer














            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Jan 24 '12 at 15:09

























            answered Jan 23 '12 at 0:43









            Gilles

            511k12010141543




            511k12010141543











            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10
















            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10















            @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
            – Peter.O
            Jan 23 '12 at 4:49





            @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
            – Peter.O
            Jan 23 '12 at 4:49













            @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
            – Gilles
            Jan 23 '12 at 10:12




            @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
            – Gilles
            Jan 23 '12 at 10:12












            All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
            – Peter.O
            Jan 23 '12 at 23:03




            All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
            – Peter.O
            Jan 23 '12 at 23:03












            A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
            – Peter.O
            Jan 24 '12 at 4:00




            A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
            – Peter.O
            Jan 24 '12 at 4:00












            @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
            – Gilles
            Jan 24 '12 at 15:10




            @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
            – Gilles
            Jan 24 '12 at 15:10










            up vote
            2
            down vote













            Assuming m = 7 and N = 21:



            cp ints ints.bak
            for i in 1..7
            do
            rnd=$((RANDOM%(21-i)+1))
            # echo $rnd;
            sed -n "$rndp,q" 10k.dat >> mlines
            sed -i "$rndd" ints
            done


            Note:
            If you replace 7 with a variable like $1 or $m, you have to use seq, not the from..to-notation, which doesn't do variable expansion.



            It works by deleting line by line from the file, which gets shorter and shorter, so the line number, which can be removed, has to get smaller and smaller.



            This should not be used for longer files, and many lines, since for every number, on average, the half file needs to be read for the 1st, and the whole file for the 2nd sed code.






            share|improve this answer






















            edited Jan 24 '12 at 15:40

























            answered Jan 22 '12 at 14:19









            user unknown

            7,02912148















            • He needs a file with the lines that are removed too.
              – Rob Wouters
              Jan 22 '12 at 14:36










            • I thought "including these m lines of data" should mean including them but the original lines as well - therefore including, not consisting of, and not using only, but I guess your interpretation is, what user288609 meant. I will adjust my script accordingly.
              – user unknown
              Jan 22 '12 at 14:39










            • Looks good.
              – Rob Wouters
              Jan 22 '12 at 14:52










            • @user unknown: You have the +1 in the wrong place. It should be rnd=$((RANDOM%(N-i)+1)) where N=21 in your example.. It currently causes sed to crash when rnd is evaluated to 0. .. Also, it doesn't scale very well with all that file re-writing. eg 123 seconds to extract 5,000 random lines from a 10,000 line file vs. 0.03 seconds for a more direct method...
              – Peter.O
              Jan 23 '12 at 12:04










            • @Peter.O: You're right (corrected) and you're right.
              – user unknown
              Jan 23 '12 at 12:38















