How to randomly sample a subset of a file












23 votes, 6 favorites












Is there a Linux command to sample a subset of a file? For instance, a file contains one million lines, and we want to randomly sample only one thousand lines from that file.



By random I mean that every line has the same probability of being chosen and no line is chosen more than once.



head and tail can pick a subset of the file, but not randomly. I know I can always write a Python script to do this, but I am wondering whether there is a command for it.

































  • lines in random order, or a random block of 1000 consecutive lines of that file?
    – frostschutz
    Jan 9 '14 at 17:49










  • Every line has the same probability of being chosen. The lines don't need to be consecutive, although there is a tiny probability that a consecutive block of lines is chosen together. I've updated my question to be clearer about that. Thanks.
    – clwen
    Jan 9 '14 at 18:08















command-line files command














edited Jan 12 '14 at 15:03









Timo











asked Jan 9 '14 at 16:24









clwen























10 Answers

















44 votes, accepted










The shuf command (part of coreutils) can do this:



shuf -n 1000 file
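As a quick sanity check that shuf -n picks lines without repeating any, here is a minimal sketch on generated data. The wrapper name sample_lines and the temporary demo file are my own assumptions for illustration, not part of the answer; it assumes GNU coreutils (shuf, seq, mktemp).

```shell
# Sketch only: sample_lines is a hypothetical wrapper around shuf -n.
# Usage: sample_lines COUNT FILE
sample_lines() {
    shuf -n "$1" -- "$2"
}

# Demo on generated data: one million numbered lines, sample 1000 of them.
demo=$(mktemp)
seq 1000000 > "$demo"
# Count the distinct lines in the sample; since every input line is unique,
# any repeated pick would lower this count below the requested sample size.
sample_lines 1000 "$demo" | sort -u | wc -l
rm -f "$demo"
```

Without the -r option, shuf outputs a random permutation of the input, so -n simply truncates that permutation and no line is picked twice.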

























  • According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/…
    – Ketan
    Jan 9 '14 at 19:17










  • @Ketan, doesn't seem that way
    – frostschutz
    Jan 9 '14 at 19:44






  • @Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input.
    – derobert
    Jan 9 '14 at 19:49











  • Yes, true. I tried the command, works well.
    – Ketan
    Jan 9 '14 at 19:56






  • shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-|
    – offby1
    Nov 18 '14 at 19:59


















6 votes













If you have a very large file (which is a common reason to take a sample) you will find that:




  1. shuf exhausts memory

  2. Using $RANDOM won't work correctly if the file exceeds 32767 lines

If you don't need "exactly" n sampled lines you can sample a ratio like this:



cat input.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0 }' > sample.txt



This uses constant memory and samples about 1% of the non-blank lines (the !/^$/ pattern skips empty lines). If you know the number of lines in the file, you can adjust the factor to get close to a target number of lines. It works with files of any size, but it does not return a precise number of lines, just a statistical ratio.
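To illustrate adjusting the factor from the line count, here is a rough sketch. The function name sample_approx and its arguments are hypothetical; it assumes a non-empty file and, unlike the one-liner above, it does not skip blank lines.

```shell
# Sketch only: sample_approx is a hypothetical helper, not from the answer.
# Usage: sample_approx FILE WANTED
# Prints roughly WANTED lines of FILE, in their original order.
# Assumes FILE is non-empty (otherwise the ratio divides by zero).
sample_approx() {
    total=$(wc -l < "$1")
    awk -v want="$2" -v total="$total" '
        BEGIN { srand(); ratio = want / total }
        rand() < ratio      # keep each line with probability want/total
    ' "$1"
}
```

For example, sample_approx input.txt 1000 > sample.txt should give about 1000 lines from a file of any size, though the exact count varies from run to run. Note that srand() is seeded from the clock, so two runs within the same second may return the same sample.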



Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix




























  • If a user wants approximately 1% of the non-blank lines, this is a pretty good answer. But if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails. As the answer you got it from says, it yields only a statistical estimate. And do you understand the answer well enough to see that it is ignoring blank lines? This might be a good idea, in practice, but undocumented features are, in general, not a good idea.
    – G-Man
    Dec 5 '16 at 21:47






  • P.S.  Simplistic approaches using $RANDOM won’t work correctly for files larger than 32767 lines.  The statement “Using $RANDOM doesn’t reach the entire file” is a bit broad.
    – G-Man
    Dec 5 '16 at 21:48










  • @G-Man The question seems to talk about getting 10k lines from a million as an example. None of the other answers worked for me (because of the size of the files and hardware limitations), so I propose this as a reasonable compromise. It won't get you exactly 10k lines out of a million, but it might be close enough for most practical purposes. I've clarified it a bit more following your advice. Thanks.
    – Txangel
    Dec 6 '16 at 18:32











  • This is the best answer, the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition awk is more resource friendly than shuf
    – Polymerase
    Apr 15 at 18:42

















2 votes













I'm not aware of any single command that does what you ask, but here is a loop I put together that can do the job:



for i in `seq 1000`; do sed -n `echo $RANDOM % 1000000 | bc`p alargefile.txt; done > sample.txt


sed picks one random line on each of the 1000 passes. There are probably more efficient solutions.
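One possible way to keep the "pick random line numbers" idea while reading the file only once is sketched below. It is not the loop above: it swaps $RANDOM for shuf -i (GNU coreutils), which draws distinct numbers from a range, and the function name sample_by_line_number is made up for illustration.

```shell
# Sketch only: sample_by_line_number is a hypothetical helper.
# Usage: sample_by_line_number FILE COUNT
# shuf -i draws COUNT distinct numbers from 1..total, so no line repeats.
# awk reads those numbers from stdin first (NR == FNR), then prints the
# lines of FILE whose line number was drawn, in file order.
sample_by_line_number() {
    total=$(wc -l < "$1")
    shuf -i "1-$total" -n "$2" |
        awk 'NR == FNR { want[$1]; next } (FNR in want)' - "$1"
}
```

For example, sample_by_line_number alargefile.txt 1000 > sample.txt. Only the drawn line numbers are held in memory, not the file itself.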


























  • Is it possible to get the same line multiple times in this approach?
    – clwen
    Jan 9 '14 at 18:11






  • Yes, it is quite possible to get the same line number more than once. Additionally, $RANDOM only ranges from 0 to 32767, so the line numbers will not be well spread across the file.
    – Ketan
    Jan 9 '14 at 18:21










  • does not work - random is called once
    – Bohdan
    Aug 20 '14 at 5:20

















2 votes













You can save the following code in a file (for example, randextract.sh) and run it as:



randextract.sh file.txt


---- BEGIN FILE ----



#!/bin/sh -xv

# Configuration: MAX_LINES is the number of lines to extract
MAX_LINES=10

# Number of lines in the file (used as the upper limit)
NUM_LINES=$(wc -l < "$1")

# Generate a random number.
# In bash, $RANDOM returns a different value on each reference;
# in shells without $RANDOM both sides are empty, so fall back to the clock.
if [ "$RANDOM." != "$RANDOM." ]
then
# Bigger number (0 to 3276732767)
RAND=$RANDOM$RANDOM
else
RAND=$(date +'%s')
fi

# The start line
START_LINE=$(expr "$RAND" % '(' "$NUM_LINES" - "$MAX_LINES" ')')

tail -n "+$START_LINE" "$1" | head -n "$MAX_LINES"


---- END FILE ----
























  • I'm not sure what you're trying to do here with RAND, but $RANDOM$RANDOM does not generate random numbers in the whole range “0 to 3276732767” (for example, it will generate 1000100000 but not 1000099999).
    – Gilles
    Jan 9 '14 at 22:37










  • The OP says, “Every line gets the same probability to be chosen.  … there is a tiny probability that a consecutive block of lines be chosen together.” I also find this answer to be cryptic, but it looks like it is extracting a 10-line block of consecutive lines from a random starting point. That is not what the OP is asking for.
    – G-Man
    Dec 5 '16 at 21:19

















2 votes













If shuf -n runs out of memory on a large file, you still need a fixed-size sample, and you can install an external utility, then try sample:



$ sample -N 1000 < FILE_WITH_MILLIONS_OF_LINES 


The caveat is that the sample (1000 lines in the example) must fit into memory.
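For readers who cannot install anything, one well-known way to get a fixed-size sample in a single pass while holding only the sample in memory is reservoir sampling. The sketch below is an awk illustration of that general technique; I am not claiming it is how the sample utility is implemented, and the function name reservoir_sample is hypothetical.

```shell
# Sketch only: reservoir sampling (Algorithm R) in awk.
# Usage: reservoir_sample COUNT [FILE...]   (reads stdin if no FILE is given)
# Only COUNT lines are kept in memory at any time.
reservoir_sample() {
    k=$1
    shift
    awk -v k="$k" '
        BEGIN { srand() }
        # Fill the reservoir with the first k lines.
        NR <= k { r[NR] = $0; next }
        # Line NR replaces a random slot with probability k/NR.
        { j = int(rand() * NR) + 1; if (j <= k) r[j] = $0 }
        # Print the reservoir (fewer than k lines if the input was shorter).
        END { n = (NR < k) ? NR : k; for (i = 1; i <= n; i++) print r[i] }
    ' "$@"
}
```

For example, reservoir_sample 1000 < FILE_WITH_MILLIONS_OF_LINES. The output is in reservoir-slot order, not in the original file order.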



Disclaimer: I am the author of the recommended software.

































    1 vote













    Or like this:



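    NUM_LINES=$(wc -l < file)
    # $RANDOM ranges from 0 to 32767; add 1 so the offset is never 0
    RANDLINE=$(( RANDOM % NUM_LINES + 1 ))
    # Print the line that is RANDLINE lines from the end of the file
    tail -n "$RANDLINE" < file | head -n 1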


    From the bash man page:




    RANDOM Each time this parameter is referenced, a random integer
    between 0 and 32767 is generated. The sequence of random
    numbers may be initialized by assigning a value to RANDOM.
    If RANDOM is unset, it loses its special properties, even if
    it is subsequently reset.



























    • This fails badly if the file has fewer than 32767 lines.
      – offby1
      Nov 18 '14 at 20:00










    • This will output one line from the file. (I guess your idea is to execute the above commands in a loop?) If the file has more than 32767 lines, then these commands will choose only from the first 32767 lines.  Aside from possible inefficiency, I don’t see any big problem with this answer if the file has fewer than 32767 lines.
      – G-Man
      Dec 5 '16 at 21:27

















    1 vote













    If your file isn't huge, you can use sort -R (random sort). This takes a little longer than shuf, but it randomizes the entire file, so you can simply pipe the result to head:



    sort -R input | head -n 1000 > output


    This sorts the file randomly and gives you the first 1000 lines. Note that GNU sort -R sorts by a random hash of each line, so identical lines end up next to each other.

































      1 vote













      If you know the number of lines in the file (like 1e6 in your case), you can do:



      awk -v n=1e6 -v p=1000 '
      BEGIN {srand()}
      rand() * n-- < p {p--; print}' < file


      If not, you can always do



      awk -v n="$(wc -l < file)" -v p=1000 '
      BEGIN {srand()}
      rand() * n-- < p {p--; print}' < file


      That would do two passes in the file, but still avoid storing the whole file in memory.



      Another advantage over GNU shuf is that it preserves the order of the lines in the file.
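To make the two claims above easier to check (exact count, original order kept), here is a small sketch that wraps the same selection-sampling idea in a function. The name sample_exact and its argument order are assumptions for illustration only.

```shell
# Sketch only: sample_exact is a hypothetical wrapper.
# Usage: sample_exact FILE COUNT
# n = lines not yet seen, p = lines still to pick; each line is kept
# with probability p/n, which yields exactly COUNT lines in file order
# (or the whole file if COUNT is at least the number of lines).
sample_exact() {
    awk -v n="$(wc -l < "$1")" -v p="$2" '
        BEGIN { srand() }
        rand() * n-- < p { p--; print }
    ' "$1"
}
```

For example, sample_exact file 1000 > sample.txt. Once the number of remaining lines equals the number still to pick, the test is always true, which is why the count comes out exact.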



      Note that it assumes n is the number of lines in the file. If you want to print p out of the first n lines of the file (which has potentially more lines), you'd need to stop awk at the nth line like:



      awk -v n=1e6 -v p=1000 '
      BEGIN {srand()}
      rand() * n-- < p {p--; print}
      !n {exit}' < file


































        1 vote













        I like using awk for this when I want to preserve a header row, and when the sample can be an approximate percentage of the file. Works for very large files:



        awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR == 1) print $0 }' data.txt
































          0 votes













          Similar to @Txangel's probabilistic solution but approaching 100x faster.



          perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv


          If you need high performance and an exact sample size, and are happy to live with a sample gap at the end of the file, you can do something like the following (samples 1000 lines from a 1M-line file):



          perl -ne 'print if (rand() < .0012)' huge_file.csv | head -n 1000 > sample.csv


          .. or indeed chain a second sample method instead of head.




























            answered Jan 9 '14 at 18:57 by derobert (the accepted shuf answer)









            answered Dec 5 '16 at 20:23 by Txangel, edited Dec 6 '16 at 18:35 (the awk ratio-sampling answer)

            16112




            16112











            • If a user wants approximately 1% of the non-blank lines, this is a pretty good answer. But if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails. As the answer you got it from says, it yields only a statistical estimate. And do you understand the answer well enough to see that it is ignoring blank lines? This might be a good idea, in practice, but undocumented features are, in general, not a good idea.
              – G-Man
              Dec 5 '16 at 21:47






            • 1




              P.S.  Simplistic approaches using $RANDOM won’t work correctly for files larger than 32767 lines.  The statement “Using $RANDOM doesn’t reach the entire file” is a bit broad.
              – G-Man
              Dec 5 '16 at 21:48










            • @G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around did work for me (because of the size of the files and hardware limitations) and I propose this as a reasonable compromise. It won't get you 10k lines out of a million but it might be close enough for most practical purposes. I've clarified it a bit more following your advise. Thanks.
              – Txangel
              Dec 6 '16 at 18:32











            • This is the best answer: the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition, awk is more resource-friendly than shuf.
              – Polymerase
              Apr 15 at 18:42


























            up vote
            2
            down vote













            I'm not aware of any single command that does what you ask, but here is a loop I put together that can do the job:



            for i in $(seq 1000); do sed -n "$(( RANDOM % 1000000 + 1 ))p" alargefile.txt; done > sample.txt


            sed prints one randomly chosen line on each of the 1000 passes over the file. There are probably more efficient solutions.
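A variation that reads the file only twice and cannot return the same line twice is to draw distinct line numbers first and then print those lines in a single awk pass. This is only a sketch: sample_distinct is a hypothetical name, and it assumes GNU shuf (for the -i range option) is installed.

```shell
# sample_distinct FILE COUNT: print COUNT distinct lines of FILE, in original order.
# Hypothetical helper; relies on GNU shuf for picking the line numbers.
sample_distinct() {
    total=$(wc -l < "$1")
    # shuf -i LO-HI -n N prints N distinct integers from the range LO..HI
    shuf -i "1-$total" -n "$2" |
        awk 'NR == FNR { pick[$1]; next }   # first input (stdin): remember chosen numbers
             FNR in pick                    # second input (the file): print chosen lines
            ' - "$1"
}
```

Only the chosen line numbers are held in memory, so the cost grows with the sample size rather than with the file size.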
















            answered Jan 9 '14 at 16:47









            Ketan












            • Is it possible to get the same line multiple times in this approach?
              – clwen
              Jan 9 '14 at 18:11






            • Yes, it is quite possible to get the same line number more than once. Additionally, $RANDOM has a range between 0 and 32767, so you will not get well-spread line numbers.
              – Ketan
              Jan 9 '14 at 18:21










            • does not work - random is called once
              – Bohdan
              Aug 20 '14 at 5:20


























            up vote
            2
            down vote













            You can save the following code in a file (for example, randextract.sh) and run it as:



            randextract.sh file.txt


            ---- BEGIN FILE ----



            #!/bin/sh

            # Configuration: MAX_LINES is the number of lines to extract
            MAX_LINES=10

            # Number of lines in the file (upper limit for the start line)
            NUM_LINES=`wc -l < "$1"`

            # Generate a random number.
            # In bash, the variable $RANDOM returns a different value on each reference.
            if [ "$RANDOM." != "$RANDOM." ]
            then
            # Bigger number (two values concatenated, up to 3276732767)
            RAND=$RANDOM$RANDOM
            else
            # $RANDOM is not supported by this shell: fall back to the current time
            RAND=`date +'%s'`
            fi

            # The start line (1-based; assumes the file has more than MAX_LINES lines)
            START_LINE=`expr $RAND % '(' $NUM_LINES - $MAX_LINES ')' + 1`

            # Print MAX_LINES consecutive lines beginning at the start line
            tail -n +$START_LINE "$1" | head -n $MAX_LINES


            ---- END FILE ----














            edited Jan 9 '14 at 17:17

























            answered Jan 9 '14 at 17:00









            razzek








            • I'm not sure what you're trying to do here with RAND, but $RANDOM$RANDOM does not generate random numbers in the whole range “0 to 3276732767” (for example, it will generate 1000100000 but not 1000099999).
              – Gilles
              Jan 9 '14 at 22:37










            • The OP says, “Every line gets the same probability to be chosen.  … there is a tiny probability that a consecutive block of lines be chosen together.” I also find this answer to be cryptic, but it looks like it is extracting a 10-line block of consecutive lines from a random starting point. That is not what the OP is asking for.
              – G-Man
              Dec 5 '16 at 21:19






















            up vote
            2
            down vote













            If the shuf -n trick runs out of memory on large files, you still need a fixed-size sample, and you are able to install an external utility, then try sample:



            $ sample -N 1000 < FILE_WITH_MILLIONS_OF_LINES 


            The caveat is that the sample (1000 lines in the example) must fit into memory.



            Disclaimer: I am the author of the recommended software.
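For readers who cannot install extra tools, the same memory trade-off (only the sample is held in memory, not the file) can be approximated with reservoir sampling in awk. This is an illustration of the general technique, not a description of how the sample utility works internally; reservoir is a hypothetical helper name, and the output is not in the original file order.

```shell
# reservoir FILE K: print K lines chosen uniformly at random from FILE.
# Hypothetical helper; memory use is proportional to K, not to the file size.
reservoir() {
    awk -v k="$2" '
        BEGIN { srand() }
        NR <= k { keep[NR] = $0; next }      # fill the reservoir with the first k lines
        {
            j = int(rand() * NR) + 1         # uniform integer in 1..NR
            if (j <= k) keep[j] = $0         # replace a slot with probability k/NR
        }
        END {
            n = (NR < k) ? NR : k            # the file may have fewer than k lines
            for (i = 1; i <= n; i++) print keep[i]
        }
    ' "$1"
}
```

Usage would look like reservoir FILE_WITH_MILLIONS_OF_LINES 1000 > sample.txt. The file is read once and its length does not need to be known in advance.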
















                answered Jun 11 '17 at 4:03









                hroptatyr





















                    up vote
                    1
                    down vote













                    Or like this:



                    NUM_LINES=$(wc -l < file)
                    RANDLINE=$(( RANDOM % NUM_LINES + 1 ))
                    tail -n "$RANDLINE" < file | head -n 1


                    From the bash man page:




                    RANDOM Each time this parameter is referenced, a random integer
                    between 0 and 32767 is generated. The sequence of random
                    numbers may be initialized by assigning a value to RANDOM.
                    If RANDOM is unset, it loses its special properties,
                    even if it is subsequently reset.
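Because $RANDOM tops out at 32767, a single line can also be picked without it. A minimal sketch using awk, where random_line is a hypothetical helper name: each line replaces the previously kept one with probability 1/NR, which leaves every line equally likely to be the one printed, whatever the file size.

```shell
# random_line FILE: print one line of FILE, each line being equally likely.
# Hypothetical helper; single pass, no line count needed in advance.
random_line() {
    awk 'BEGIN { srand() }
         rand() * NR < 1 { line = $0 }   # keep the current line with probability 1/NR
         END { if (NR > 0) print line }' "$1"
}
```

Note that srand() seeds from the clock, so calling this several times within the same second may return the same line with some awk implementations.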













                    edited Jan 11 '14 at 9:51

























                    answered Jan 9 '14 at 16:49







                    user55518


















                    • This fails badly if the file has fewer than 32767 lines.
                      – offby1
                      Nov 18 '14 at 20:00










                    • This will output one line from the file. (I guess your idea is to execute the above commands in a loop?) If the file has more than 32767 lines, then these commands will choose only from the first 32767 lines.  Aside from possible inefficiency, I don’t see any big problem with this answer if the file has fewer than 32767 lines.
                      – G-Man
                      Dec 5 '16 at 21:27


























                    up vote
                    1
                    down vote













                    If your file isn't huge, you can use sort's random mode (sort -R). This takes a little longer than shuf, but it randomizes the entire data. So you can simply do the following, using head as you mentioned:



                    sort -R input | head -n 1000 > output


                    This sorts the file randomly and gives you the first 1000 lines.
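One caveat, stated here as a property of sort rather than something the answer claims: sort -R orders lines by a random hash of their content, so identical lines always end up next to each other. If the file contains repeated lines, the first 1000 lines of the output are therefore not a uniform sample of lines. A small demonstration on made-up input, assuming GNU coreutils:

```shell
# Identical lines hash to the same value, so sort -R keeps them adjacent;
# uniq -c therefore reports exactly one group per distinct line.
printf 'a\nb\na\nb\na\n' | sort -R | uniq -c

# shuf permutes lines regardless of their content.
printf 'a\nb\na\nb\na\n' | shuf -n 3
```

For files without duplicate lines the two approaches select equivalent samples.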
















                        answered Jun 16 '16 at 19:48









                        DomainsFeatured





















                            up vote
                            1
                            down vote













                            If you know the number of lines in the file (like 1e6 in your case), you can do:



                            awk -v n=1e6 -v p=1000 '
                            BEGIN {srand()}
                            rand() * n-- < p {p--; print}' < file


                            If not, you can always do:



                            awk -v n="$(wc -l < file)" -v p=1000 '
                            BEGIN {srand()}
                            rand() * n-- < p {p--; print}' < file


                            That would make two passes over the file, but still avoids storing the whole file in memory.



                            Another advantage over GNU shuf is that it preserves the order of the lines in the file.



                            Note that it assumes n is the number of lines in the file. If you want to print p out of the first n lines of the file (which has potentially more lines), you'd need to stop awk at the nth line like:



                            awk -v n=1e6 -v p=1000 '
                            BEGIN {srand()}
                            rand() * n-- < p {p--; print}
                            !n {exit}' < file
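The reason this yields exactly p lines is that each line is selected with probability (lines still to pick) / (lines not yet seen): once p reaches 0 nothing more is printed, and once the remaining lines equal the remaining picks every line is printed. The commands above can be wrapped as follows; sample_exact is a hypothetical name, and the sketch assumes COUNT does not exceed the number of lines in the file.

```shell
# sample_exact FILE COUNT: print exactly COUNT lines of FILE, in original order.
# Hypothetical wrapper around the awk program above; assumes COUNT <= line count.
sample_exact() {
    awk -v n="$(wc -l < "$1")" -v p="$2" '
        BEGIN { srand() }
        # n = lines not yet seen (including this one), p = lines still to pick
        rand() * n-- < p { p--; print }
    ' "$1"
}
```

For example, sample_exact file 1000 > sample.txt gives 1000 distinct lines without holding the file in memory.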













                                edited Jun 11 '17 at 7:51

























                                answered Jun 11 '17 at 7:46









                                Stéphane Chazelas


































                                    I like using awk for this when I want to preserve a header row, and when the sample can be an approximate percentage of the file. Works for very large files:



                                    awk 'BEGIN srand() !/^$/ ' data.txt
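The action part of the one-liner above is not shown in full. A minimal sketch consistent with the description could look like the following; the 1% rate, the output file name and the generated demo data are all assumptions, not the author's exact command.

```shell
# Demo input: a header row followed by 100,000 data rows.
{ echo "id,value"; seq 1 100000 | awk '{print $1 ",x"}'; } > demo_data.txt

# Skip empty lines; always keep line 1 (the header) and keep every
# other line with probability 0.01 (an assumed sampling rate).
awk 'BEGIN {srand()} !/^$/ && (FNR == 1 || rand() < 0.01)' demo_data.txt > demo_data_sample.txt

# The header survives, and the line count is roughly 1% of the input.
head -n 1 demo_data_sample.txt
wc -l < demo_data_sample.txt
```

Because each line is decided independently, the file is read once and never held in memory, but the sample size varies from run to run.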















                                        answered Jan 11 at 20:53









                                        Merlin


































                                            Similar to @Txangel's probabilistic solution but approaching 100x faster.



                                            perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv


                                            If you need high performance and an exact sample size, and are happy to live with a sampling gap at the end of the file, you can oversample slightly and truncate with head (this samples 1000 lines from a 1M-line file):



                                            perl -ne 'print if (rand() < .0012)' huge_file.csv | head -1000 > sample.csv


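                                            Because head stops after 1000 lines, which at this rate happens on average about five-sixths of the way through the file, the lines after that point are never sampled. To avoid that gap, chain a second sampling method instead of head.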
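One way to chain a second sampling step, sketched under the assumption that GNU `shuf` is installed (the `demo_*` file names and the generated input are placeholders): the perl filter thins the file to about 1200 lines, and `shuf -n 1000` then draws exactly 1000 of them, so lines from any part of the file can be chosen.

```shell
# Demo input: 1,000,000 lines (a stand-in for huge_file.csv).
seq 1 1000000 > demo_huge.csv

# Stage 1: keep roughly 0.12% of the lines (about 1200 on average).
# Stage 2: let shuf pick exactly 1000 of the surviving lines at random.
perl -ne 'print if rand() < .0012' demo_huge.csv | shuf -n 1000 > demo_huge_sample.csv

# Number of lines in the final sample.
wc -l < demo_huge_sample.csv
```

Unlike the head version, the output of shuf is in random order; pipe it through `sort -n` or similar if the original order matters.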














                                                edited Aug 16 at 18:05

























                                                answered Aug 16 at 17:57









                                                geotheory




























                                                     
