Randomly draw a certain number of lines from a data file

I have a data list, like



12345
23456
67891
-20000
200
600
20
...


Assume the size of this data set (i.e. the number of lines in the file) is N. I want to randomly draw m lines from this data file. The output should therefore be two files: one containing those m lines of data, and the other containing the remaining N-m lines.



Is there a way to do that using a Linux command?










linux shell text-processing

asked Jan 22 '12 at 13:44 by user288609, edited Jan 22 '12 at 13:49 by sr_

  • 1




    Are you concerned about the sequence of lines? eg. Do you want to maintain the source order, or do you want that sequence to be itself random as well as the choice of lines being random?
    – Peter.O
    Jan 22 '12 at 14:04















5 Answers

















18 votes, accepted









This might not be the most efficient way but it works:



shuf <file> > tmp
head -n $m tmp > out1
tail -n +$(( m + 1 )) tmp > out2


With $m containing the number of lines.
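
For example, a complete run on sample data (a minimal sketch; the file name data.txt and m=3 are arbitrary choices, not part of the original answer):

m=3
shuf data.txt > tmp                 # shuffle all N lines into a temporary file
head -n "$m" tmp > out1             # first m shuffled lines -> one file
tail -n +$(( m + 1 )) tmp > out2    # remaining N-m lines -> the other file
rm tmp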






answered Jan 22 '12 at 13:52 by Rob Wouters, edited Jan 23 '12 at 0:49 by Gilles






















  • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
    – Rob Wouters
    Jan 22 '12 at 14:31







  • 2




    Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
    – Gilles
    Jan 23 '12 at 0:45











  • why not shuf <file> |head -n $m?
    – emanuele
    Jun 19 '14 at 16:56










  • @emanuele: Because we need both the head and the tail in two separate files.
    – Rob Wouters
    Jun 20 '14 at 7:39

















5 votes













This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 \
'BEGIN{ srand()
        # draw m distinct random line numbers in 1..N (requires m <= N)
        do{ lnb = 1 + int(rand()*N)
            if ( !(lnb in R) ) {
                R[lnb] = 1
                ct++ }
        } while (ct<m)
      }
      { if (R[NR]==1) print > out1
        else          print > out2
      }' file
cat /tmp/out1
echo ========
cat /tmp/out2


Output, based on the data in the question:



12345
23456
200
600
========
67891
-20000
20
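
A quick sanity check (a hedged sketch, assuming bash and the paths used above): out1 should hold exactly m lines, and the two output files together should hold exactly the original lines.

wc -l < /tmp/out1                                     # expect 4, the value of m
cat /tmp/out1 /tmp/out2 | sort | cmp - <(sort file)   # silence means nothing was lost or duplicated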





answered Jan 22 '12 at 15:15 by Peter.O, edited Jan 22 '12 at 16:05





























    4 votes













    As with all things Unix, There's a Utility for That™.



    Program of the day: split
    split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



    Now, the actual code. It's quite simple, really:



    sort -R input_file | split -l $m - output_prefix   # '-' tells split to read from the pipe


    This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
    Make sure m is the larger of the two sizes you want, or you'll get several files of length m (and one with N % m lines).



    If you want to ensure that you use the correct size, here's a little code to do that:



    m=10 # size you want one file to be
    N=$(wc -l < input_file)            # '<' so wc prints only the number
    m=$(( m > N/2 ? m : N - m ))       # use the larger of m and N-m
    sort -R input_file | split -l $m - output_prefix


    Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.
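
    Putting that fallback together (a hedged sketch, reusing the same input_file, $m and output_prefix names as above):

    perl -e 'use List::Util qw/shuffle/; print shuffle <>;' input_file | split -l "$m" - output_prefix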






    answered Jan 22 '12 at 16:37 by Kevin, edited Apr 13 '17 at 12:36 by Community♦


















    • 1




      Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
      – fluffy
      Jan 22 '12 at 18:49






    • 1




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
      – Gilles
      Jan 23 '12 at 0:48

















    3 votes













    If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



    There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



    <input shuf | awk -v m=$m '{ if (NR <= m) {print >"output1"} else {print} }'


    You can use sed, which is obscure but possibly faster for large files.



    <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


    Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



    <input shuf | { tee /dev/fd/3 | head -n $m >output1; } 3>&1 | tail -n +$(($m+1)) >output2


    Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system within a one-second period.



    <input awk -v N=$(wc -l <input) -v m=3 '
        BEGIN {srand()}
        {
            if (rand() * N < m) {--m; print >"output1"} else {print >"output2"}
            --N;
        }
    '


    If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



    <input perl -e '
        open OUT1, ">", "output1" or die $!;
        open OUT2, ">", "output2" or die $!;
        my $N = `wc -l <input`;
        my $m = $ARGV[0];
        while (<STDIN>) {
            if (rand($N) < $m) {--$m; print OUT1 $_;} else {print OUT2 $_;}
            --$N;
        }
        close OUT1 or die $!;
        close OUT2 or die $!;
    ' 42   # the trailing argument (here 42) is m, the number of lines sent to output1
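
    As an end-to-end check, here is a minimal sketch that runs the portable awk dispatcher above on the sample data from the question (the names input, output1 and output2 match the snippets above; m=3 is an arbitrary choice). A line goes to output1 with probability m/N computed over the lines still unread, which is why output1 always ends up with exactly m lines.

    printf '%s\n' 12345 23456 67891 -20000 200 600 20 > input
    <input awk -v N=$(wc -l <input) -v m=3 '
        BEGIN {srand()}
        {
            if (rand() * N < m) {--m; print >"output1"} else {print >"output2"}
            --N
        }
    '
    wc -l output1 output2   # output1: 3 lines, output2: 4 lines, original order preserved in each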





    answered Jan 23 '12 at 0:43 by Gilles, edited Jan 24 '12 at 15:09






















    • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
      – Peter.O
      Jan 23 '12 at 4:49











    • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
      – Gilles
      Jan 23 '12 at 10:12










    • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
      – Peter.O
      Jan 23 '12 at 23:03










    • A buffering problem? Am I missing something? The head/cat combo causes loss of data in the second test (3-4) below .... TEST 1-2: for i in {00001..10000}; do echo $i; done | { head -n 5000 >out1; cat >out2; } .. TEST 3-4: for i in {00001..10000}; do echo $i; done >input; cat input | { head -n 5000 >out3; cat >out4; } ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The difference varies depending on the file sizes involved... Here is a link to my test code
      – Peter.O
      Jan 24 '12 at 4:00










    • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
      – Gilles
      Jan 24 '12 at 15:10

















    2 votes













    Assuming m = 7 and N = 21:



    cp ints ints.bak
    for i in {1..7}
    do
        rnd=$((RANDOM%(21-i)+1))
        # echo $rnd;
        sed -n "$rnd{p;q}" 10k.dat >> mlines
        sed -i "${rnd}d" ints
    done


    Note:
    If you replace 7 with a variable like $1 or $m, you have to use seq, not the {from..to} notation, which doesn't do variable expansion (see the sketch below).
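
    A sketch of that seq variant (hedged: m and N are shell variables standing in for the hard-coded 7 and 21, and it reads and deletes from the same working copy ints so the extracted and remaining lines stay consistent):

    m=7; N=21
    for i in $(seq 1 "$m")
    do
        rnd=$((RANDOM%(N-i)+1))
        sed -n "$rnd{p;q}" ints >> mlines   # copy the chosen line out
        sed -i "${rnd}d" ints               # then delete it from the working copy
    done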



    It works by deleting one line at a time from the file, which therefore gets shorter and shorter, so the range from which the line number to remove is drawn has to get smaller and smaller.



    This should not be used for long files or for drawing many lines, since for every drawn number the first sed reads, on average, half the file, and the second sed rewrites the whole file.






    answered Jan 22 '12 at 14:19 by user unknown, edited Jan 24 '12 at 15:40






















    • He needs a file with the lines that are removed too.
      – Rob Wouters
      Jan 22 '12 at 14:36










    • I thought "including these m lines of data" meant including them along with the original lines as well - therefore including, not consisting of, and not using only - but I guess your interpretation is what user288609 meant. I will adjust my script accordingly.
      – user unknown
      Jan 22 '12 at 14:39










    • Looks good.
      – Rob Wouters
      Jan 22 '12 at 14:52










    • @user unknown: You have the +1 in the wrong place. It should be rnd=$((RANDOM%(N-i)+1)) where N=21 in your example.. It currently causes sed to crash when rnd is evaluated to 0. .. Also, it doesn't scale very well with all that file re-writing. eg 123 seconds to extract 5,000 random lines from a 10,000 line file vs. 0.03 seconds for a more direct method...
      – Peter.O
      Jan 23 '12 at 12:04










    • @Peter.O: You're right (corrected) and you're right.
      – user unknown
      Jan 23 '12 at 12:38










    Your Answer







    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "106"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f29709%2frandomly-draw-a-certain-number-of-lines-from-a-data-file%23new-answer', 'question_page');

    );

    Post as a guest






























    5 Answers
    5






    active

    oldest

    votes








    5 Answers
    5






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    18
    down vote



    accepted










    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.






    share|improve this answer






















    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39














    up vote
    18
    down vote



    accepted










    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.






    share|improve this answer






















    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39












    up vote
    18
    down vote



    accepted







    up vote
    18
    down vote



    accepted






    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.






    share|improve this answer














    This might not be the most efficient way but it works:



    shuf <file> > tmp
    head -n $m tmp > out1
    tail -n +$(( m + 1 )) tmp > out2


    With $m containing the number of lines.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Jan 23 '12 at 0:49









    Gilles

    511k12010141543




    511k12010141543










    answered Jan 22 '12 at 13:52









    Rob Wouters

    51635




    51635











    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39
















    • @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
      – Rob Wouters
      Jan 22 '12 at 14:31







    • 2




      Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
      – Gilles
      Jan 23 '12 at 0:45











    • why not shuf <file> |head -n $m?
      – emanuele
      Jun 19 '14 at 16:56










    • @emanuele: Because we need both the head and the tail in two separate files.
      – Rob Wouters
      Jun 20 '14 at 7:39















    @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
    – Rob Wouters
    Jan 22 '12 at 14:31





    @userunknown, sort -R takes care of the randomness. Not sure if you downvoted the answer for that, but look it up in the manpage first.
    – Rob Wouters
    Jan 22 '12 at 14:31





    2




    2




    Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
    – Gilles
    Jan 23 '12 at 0:45





    Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you don't need a temporary file.
    – Gilles
    Jan 23 '12 at 0:45













    why not shuf <file> |head -n $m?
    – emanuele
    Jun 19 '14 at 16:56




    why not shuf <file> |head -n $m?
    – emanuele
    Jun 19 '14 at 16:56












    @emanuele: Because we need both the head and the tail in two separate files.
    – Rob Wouters
    Jun 20 '14 at 7:39




    @emanuele: Because we need both the head and the tail in two separate files.
    – Rob Wouters
    Jun 20 '14 at 7:39












    up vote
    5
    down vote













    This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



    awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
    'BEGIN srand()
    do lnb = 1 + int(rand()*N)
    if ( !(lnb in R) )
    R[lnb] = 1
    ct++
    while (ct<m)
    if (R[NR]==1) print > out1
    else print > out2
    ' file
    cat /tmp/out1
    echo ========
    cat /tmp/out2


    Output, based ont the data in the question.



    12345
    23456
    200
    600
    ========
    67891
    -20000
    20





    share|improve this answer


























      up vote
      5
      down vote













      This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



      awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
      'BEGIN srand()
      do lnb = 1 + int(rand()*N)
      if ( !(lnb in R) )
      R[lnb] = 1
      ct++
      while (ct<m)
      if (R[NR]==1) print > out1
      else print > out2
      ' file
      cat /tmp/out1
      echo ========
      cat /tmp/out2


      Output, based ont the data in the question.



      12345
      23456
      200
      600
      ========
      67891
      -20000
      20





      share|improve this answer
























        up vote
        5
        down vote










        up vote
        5
        down vote









        This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



        awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
        'BEGIN srand()
        do lnb = 1 + int(rand()*N)
        if ( !(lnb in R) )
        R[lnb] = 1
        ct++
        while (ct<m)
        if (R[NR]==1) print > out1
        else print > out2
        ' file
        cat /tmp/out1
        echo ========
        cat /tmp/out2


        Output, based ont the data in the question.



        12345
        23456
        200
        600
        ========
        67891
        -20000
        20





        share|improve this answer














        This bash/awk script chooses lines at random, and maintains the original sequence in both output files.



        awk -v m=4 -v N=$(wc -l <file) -v out1=/tmp/out1 -v out2=/tmp/out2 
        'BEGIN srand()
        do lnb = 1 + int(rand()*N)
        if ( !(lnb in R) )
        R[lnb] = 1
        ct++
        while (ct<m)
        if (R[NR]==1) print > out1
        else print > out2
        ' file
        cat /tmp/out1
        echo ========
        cat /tmp/out2


        Output, based ont the data in the question.



        12345
        23456
        200
        600
        ========
        67891
        -20000
        20






        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Jan 22 '12 at 16:05

























        answered Jan 22 '12 at 15:15









        Peter.O

        18.4k1688143




        18.4k1688143




















            up vote
            4
            down vote













            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.






            share|improve this answer


















            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48














            up vote
            4
            down vote













            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.






            share|improve this answer


















            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48












            up vote
            4
            down vote










            up vote
            4
            down vote









            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.






            share|improve this answer














            As with all things Unix, There's a Utility for ThatTM.



            Program of the day: split
            split will split a file in many different ways, -b bytes, -l lines, -n number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.



            Now, the actual code. It's quite simple, really:



            sort -R input_file | split -l $m output_prefix


            This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab.
            Make sure m is the larger file you want or you'll get several files of length m (and one with N % m).



            If you want to ensure that you use the correct size, here's a little code to do that:



            m=10 # size you want one file to be
            N=$(wc -l input_file)
            m=$(( m > N/2 ? m : N - m ))
            sort -R input_file | split -l $m output_prefix


            Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;'.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Apr 13 '17 at 12:36









            Community♦

            1




            1










            answered Jan 22 '12 at 16:37









            Kevin

            26k95797




            26k95797







            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48












            • 1




              Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
              – fluffy
              Jan 22 '12 at 18:49






            • 1




              Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
              – Gilles
              Jan 23 '12 at 0:48







            1




            1




            Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
            – fluffy
            Jan 22 '12 at 18:49




            Unfortunately, sort -R appears to only be in some versions of sort (probably the gnu version). For other platforms I wrote a tool called 'randline' which does nothing but randomize stdin. It's at beesbuzz.biz/code for anyone who needs it. (I tend to shuffle file contents quite a lot.)
            – fluffy
            Jan 22 '12 at 18:49




            1




            1




            Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
            – Gilles
            Jan 23 '12 at 0:48




            Note that sort -R doesn't exactly sort its input randomly: it groups identical lines. So if the input is e.g. foo, foo, bar, bar and m=2, then one file will contain both foos and the other will contain both bars. GNU coreutils also has shuf, which randomizes the input lines. Also, you can choose the output file names by using head and tail instead of split.
            – Gilles
            Jan 23 '12 at 0:48










            up vote
            3
            down vote













            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42





            share|improve this answer






















            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10














            up vote
            3
            down vote













            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42





            share|improve this answer






















            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10












            up vote
            3
            down vote










            up vote
            3
            down vote









            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42





            share|improve this answer














            If you don't mind reordering the lines and you have GNU coreutils (i.e. on non-embedded Linux or Cygwin, not too ancient since shuf appeared in version 6.0), shuf (“shuffle”) reorders the lines of a file randomly. So you can shuffle the file and dispatch the first m lines into one file and the rest into another.



            There's no ideal way to do that dispatch. You can't just chain head and tail because head would buffer ahead. You can use split, but you don't get any flexibility with respect to the output file names. You can use awk, of course:



            <input shuf | awk -v m=$m ' if (NR <= m) print >"output1" else print '


            You can use sed, which is obscure but possibly faster for large files.



            <input shuf | sed -e "1,$m w output1" -e "1,$m d" >output2


            Or you can use tee to duplicate the data, if your platform has /dev/fd; that's ok if m is small:



            <input shuf | head -n $m >output1; 3>&1 | tail -n +$(($m+1)) >output2


            Portably, you can use awk to dispatch each line in turn. Note that awk is not very good at initializing its random number generator; the randomness is not only definitely not suitable for cryptography, but not even very good for numerical simulations. The seed will be the same for all awk invocations on any system withing a one-second period.



            <input awk -v N=$(wc -l <input) -v m=3 '
            BEGIN srand()

            if (rand() * N < m) --m; print >"output1" else print >"output2"
            --N;
            '


            If you need better randomness, you can do the same thing in Perl, which seeds its RNG decently.



            <input perl -e '
            open OUT1, ">", "output1" or die $!;
            open OUT2, ">", "output2" or die $!;
            my $N = `wc -l <input`;
            my $m = $ARGV[0];
            while (<STDIN>)
            if (rand($N) < $m) --$m; print OUT1 $_; else print OUT2 $_;
            --$N;

            close OUT1 or die $!;
            close OUT2 or die $!;
            ' 42






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Jan 24 '12 at 15:09

























            answered Jan 23 '12 at 0:43









            Gilles

            511k12010141543




            511k12010141543











            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10
















            • @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
              – Peter.O
              Jan 23 '12 at 4:49











            • @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
              – Gilles
              Jan 23 '12 at 10:12










            • All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
              – Peter.O
              Jan 23 '12 at 23:03










            • A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
              – Peter.O
              Jan 24 '12 at 4:00










            • @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
              – Gilles
              Jan 24 '12 at 15:10















            @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
            – Peter.O
            Jan 23 '12 at 4:49





            @Gilles: For the awk example: -v N=$(wc -l <file) -v m=4 ... and it only prints a "random" line when the random value is less than $m, rather than printing $m random lines... It seems that perl may be doing the same thing with rand, but I don't know perl well enough to get past a compilation error: syntax error at -e line 7, near ") print"
            – Peter.O
            Jan 23 '12 at 4:49













            @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
            – Gilles
            Jan 23 '12 at 10:12




            @Peter.O Thanks, that's what comes from typing in a browser and carelessly editing. I've fixed the awk and perl code.
            – Gilles
            Jan 23 '12 at 10:12












            All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
            – Peter.O
            Jan 23 '12 at 23:03




            All 3 methods working well and fast.. thanks (+1) ... I'm slowly getting my head around perl... and that's a particularly interesting and useful file split in the shuf example.
            – Peter.O
            Jan 23 '12 at 23:03












            A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
            – Peter.O
            Jan 24 '12 at 4:00




            A buffereing problem?. Am I missing something? The head cat combo causes loss of data in the following second test 3-4 .... TEST 1-2 for i in 00001..10000 ;do echo $i; done; | head -n 5000 >out1; cat >out2; .. TEST 3-4 for i in 00001..10000 ;do echo $i; done; >input; cat input | head -n 5000 >out3; cat >out4; ... wc -l results for the outputs of TEST 1-2 are 5000 5000 (good), but for TEST 3-4 are 5000 4539 (not good).. The differnece varies depending on the file sizes involved... Here is a link to my test code
            – Peter.O
            Jan 24 '12 at 4:00












            @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
            – Gilles
            Jan 24 '12 at 15:10




            @Peter.O Right again, thanks. Indeed, head reads ahead; what it reads ahead and doesn't print out is discarded. I've updated my answer with less elegant but (I'm reasonably sure) correct solutions.
            – Gilles
            Jan 24 '12 at 15:10










            up vote
            2
            down vote













            Assuming m = 7 and N = 21:



            cp ints ints.bak
            for i in 1..7
            do
            rnd=$((RANDOM%(21-i)+1))
            # echo $rnd;
            sed -n "$rndp,q" 10k.dat >> mlines
            sed -i "$rndd" ints
            done


            Note:
            If you replace 7 with a variable like $1 or $m, you have to use seq, not the from..to-notation, which doesn't do variable expansion.



            It works by deleting line by line from the file, which gets shorter and shorter, so the line number, which can be removed, has to get smaller and smaller.



            This should not be used for longer files, and many lines, since for every number, on average, the half file needs to be read for the 1st, and the whole file for the 2nd sed code.






            edited Jan 24 '12 at 15:40

























            answered Jan 22 '12 at 14:19









            user unknown

            7,02912148















            • He needs a file with the lines that are removed too.
              – Rob Wouters
              Jan 22 '12 at 14:36










            • I thought "including these m lines of data" meant including them along with the original lines as well – hence including, not consisting of or using only – but I guess your interpretation is what user288609 meant. I will adjust my script accordingly.
              – user unknown
              Jan 22 '12 at 14:39










            • Looks good.
              – Rob Wouters
              Jan 22 '12 at 14:52










            • @user unknown: You have the +1 in the wrong place. It should be rnd=$((RANDOM%(N-i)+1)) where N=21 in your example.. It currently causes sed to crash when rnd is evaluated to 0. .. Also, it doesn't scale very well with all that file re-writing. eg 123 seconds to extract 5,000 random lines from a 10,000 line file vs. 0.03 seconds for a more direct method...
              – Peter.O
              Jan 23 '12 at 12:04










            • @Peter.O: You're right (corrected) and you're right.
              – user unknown
              Jan 23 '12 at 12:38




















             
