Randomly draw a certain number of lines from a data file
I have a data list, like
12345
23456
67891
-20000
200
600
20
...
Assume the size of this data set (i.e. lines of the file) is N. I want to randomly draw m lines from this data file. Therefore, the output should be two files: one containing those m lines of data, and the other containing the remaining N-m lines.
Is there a way to do that using a Linux command?
linux shell text-processing
asked Jan 22 '12 at 13:44 by user288609, edited Jan 22 '12 at 13:49 by sr_
Are you concerned about the sequence of lines? e.g. Do you want to maintain the source order, or do you want that sequence to be itself random as well as the choice of lines being random?
– Peter.O, Jan 22 '12 at 14:04
5 Answers
Answer (accepted, 18 votes), answered Jan 22 '12 at 13:52 by Rob Wouters, edited Jan 23 '12 at 0:49 by Gilles
This might not be the most efficient way but it works:
shuf <file> > tmp
head -n $m tmp > out1
tail -n +$(( m + 1 )) tmp > out2
With $m
containing the number of lines.
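For example, to draw 3 of the lines from the sample data above (the file name data.txt and the output names are illustrative, not part of the answer):
m=3
shuf data.txt > tmp                  # shuffle all N lines once
head -n "$m" tmp > out1              # the m randomly drawn lines
tail -n +$(( m + 1 )) tmp > out2     # the remaining N-m lines
rm tmp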
@userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
– Rob Wouters, Jan 22 '12 at 14:31
Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
– Gilles, Jan 23 '12 at 0:45
Why not shuf <file> | head -n $m ?
– emanuele, Jun 19 '14 at 16:56
@emanuele: Because we need both the head and the tail in two separate files.
– Rob Wouters, Jun 20 '14 at 7:39
Answer (5 votes), answered Jan 22 '12 at 15:15 by Peter.O, edited Jan 22 '12 at 16:05
This bash/awk script chooses lines at random, and maintains the original sequence in both output files.
awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 '
  BEGIN {
    srand()
    do {
      lnb = 1 + int(rand()*N)
      if ( !(lnb in R) ) {
        R[lnb] = 1
        ct++
      }
    } while (ct<m)
  }
  { if (R[NR]==1) print > out1
    else          print > out2
  }' file
cat /tmp/out1
echo ========
cat /tmp/out2
Output, based on the data in the question.
12345
23456
200
600
========
67891
-20000
20
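Not part of the original answer, but a sketch of a shorter route to the same order-preserving split, assuming GNU shuf is available: let shuf -i pick the m distinct line numbers and let awk dispatch against them.
m=4
shuf -i 1-"$(wc -l <file)" -n "$m" |
awk 'NR==FNR { R[$1]=1; next }
     { if (FNR in R) print > "/tmp/out1"; else print > "/tmp/out2" }' - file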
Answer (4 votes), answered Jan 22 '12 at 16:37 by Kevin
As with all things Unix, There's a Utility for That™.
Program of the day: split
split will split a file in many different ways: -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.
Now, the actual code. It's quite simple, really:
sort -R input_file | split -l $m - output_prefix
This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
Make sure m is the larger of the two sizes you want, or you'll get several files of length m (and one with N % m lines).
If you want to ensure that you use the correct size, here's a little code to do that:
m=10 # size you want one file to be
N=$(wc -l < input_file)
m=$(( m > N/2 ? m : N - m ))
sort -R input_file | split -l $m - output_prefix
Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.
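Putting that substitution into the pipeline above (and feeding split from standard input), a sketch would be:
perl -e 'use List::Util qw/shuffle/; print shuffle <>;' input_file | split -l "$m" - output_prefix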
Unfortunately, sort -R appears to only be in some versions of sort (probably the GNU version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
– fluffy, Jan 22 '12 at 18:49
Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
– Gilles, Jan 23 '12 at 0:48
Answer (3 votes), answered Jan 23 '12 at 0:43 by Gilles, edited Jan 24 '12 at 15:09
If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf ("shuffle") reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.
There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:
<input shuf | awk -v m=$m '{ if (NR <= m) {print >"output1"} else {print} }'
You can use sed, which is obscure but possibly faster for large files.
<input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2
Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:
<input shuf | { tee /dev/fd/3 | head -n $m >output1; } 3>&1 | tail -n +$(($m+1)) >output2
Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is definitely not suitable for cryptography and not even very good for numerical simulations. The seed will be the same for all awk invocations on any system within a one-second period.
<input awk -v N=$(wc -l <input) -v m=3 '
    BEGIN { srand() }
    { if (rand() * N < m) { --m; print > "output1" } else { print > "output2" }
      --N;
    }
'
If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.
<input perl -e '
    open OUT1, ">", "output1" or die $!;
    open OUT2, ">", "output2" or die $!;
    my $N = `wc -l <input`;
    my $m = $ARGV[0];
    while (<STDIN>) {
        if (rand($N) < $m) { --$m; print OUT1 $_; }
        else               { print OUT2 $_; }
        --$N;
    }
    close OUT1 or die $!;
    close OUT2 or die $!;
' 42
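(The trailing 42 is the value of m, picked up inside the Perl script as $ARGV[0].) A quick way to sanity-check either the awk or the Perl variant is to run it over a numbered input and count the two outputs; a sketch with illustrative sizes and file names:
seq 1000 > input
m=100
<input awk -v N=$(wc -l <input) -v m="$m" '
    BEGIN { srand() }
    { if (rand() * N < m) { --m; print > "output1" } else { print > "output2" }
      --N }
'
wc -l output1 output2    # expect 100 and 900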
@Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
– Peter.O, Jan 23 '12 at 4:49
@Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
– Gilles, Jan 23 '12 at 10:12
All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
– Peter.O, Jan 23 '12 at 23:03
A buffering problem? Am I missing something? The head / cat combo causes loss of data in the following second test 3-4 .... TEST 1-2: { for i in {00001..10000}; do echo $i; done; } | { head -n 5000 >out1; cat >out2; } .. TEST 3-4: { for i in {00001..10000}; do echo $i; done; } >input; cat input | { head -n 5000 >out3; cat >out4; } ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good). The difference varies depending on the file sizes involved... Here is a link to my test code
– Peter.O, Jan 24 '12 at 4:00
@Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
– Gilles, Jan 24 '12 at 15:10
Answer (2 votes), answered Jan 22 '12 at 14:19 by user unknown, edited Jan 24 '12 at 15:40
Assuming m = 7 and N = 21:
cp ints ints.bak
for i in {1..7}
do
    rnd=$((RANDOM%(21-i)+1))
    # echo $rnd;
    sed -n "${rnd}{p;q}" ints >> mlines
    sed -i "${rnd}d" ints
done
Note:
If you replace 7 with a variable like $1 or $m, you have to use seq, not the {from..to} notation, which doesn't do variable expansion.
It works by deleting line by line from the file, which gets shorter and shorter, so the line number that can be removed has to get smaller and smaller.
This should not be used for long files or many lines, since for every drawn number, on average, half the file needs to be read for the first sed and the whole file has to be rewritten for the second.
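A sketch of the same loop with a variable count, using seq as the note suggests (N is read from the file instead of hardcoding 21; file names as above):
m=7
cp ints ints.bak
N=$(wc -l <ints)
for i in $(seq 1 "$m")
do
    rnd=$((RANDOM%(N-i)+1))
    sed -n "${rnd}{p;q}" ints >> mlines
    sed -i "${rnd}d" ints
done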
He needs a file with the lines that are removed too.
– Rob Wouters, Jan 22 '12 at 14:36
I thought "including these m lines of data" should meanincluding them
but the original lines as well - thereforeincluding
, notconsisting of
, and not usingonly
, but I guess your interpretation is, what user288609 meant. I will adjust my script accordingly.
â user unknown
Jan 22 '12 at 14:39
Looks good.
– Rob Wouters, Jan 22 '12 at 14:52
@user unknown: You have the +1 in the wrong place. It should be rnd=$((RANDOM%(N-i)+1)) where N=21 in your example. It currently causes sed to crash when rnd is evaluated to 0. Also, it doesn't scale very well with all that file re-writing, e.g. 123 seconds to extract 5,000 random lines from a 10,000 line file vs. 0.03 seconds for a more direct method...
– Peter.O, Jan 23 '12 at 12:04
@Peter.O: You're right (corrected) and you're right.
– user unknown, Jan 23 '12 at 12:38