How to randomly sample a subset of a file

23 votes
Is there any Linux command one can use to sample a subset of a file? For instance, a file contains one million lines, and we want to randomly sample only one thousand lines from that file.
By random I mean that every line has the same probability of being chosen and that no line is chosen more than once.
head and tail can pick a subset of the file, but not randomly. I know I can always write a Python script to do so, but I am wondering whether there is a command for this.
Tags: command-line, files, command
edited Jan 12 '14 at 15:03 by Timo
asked Jan 9 '14 at 16:24 by clwen
lines in random order, or a random block of 1000 consecutive lines of that file? – frostschutz, Jan 9 '14 at 17:49
Every line gets the same probability to be chosen. Don't need to be consecutive although there is a tiny probability that a consecutive block of lines be chosen together. I've updated my question to be clearer about that. Thanks. – clwen, Jan 9 '14 at 18:08
10 Answers
44 votes, accepted
The shuf command (part of coreutils) can do this:
    shuf -n 1000 file
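As a quick check of the two properties the question asks for (an exact count and no repeated lines), shuf can be run on generated input. This is only a sketch assuming GNU coreutils; the function name sample_lines and the seq-generated input are illustrative and not part of the answer.

```shell
#!/bin/bash
# sample_lines N [FILE...]: print N lines chosen at random without replacement.
# Hypothetical helper name; it only wraps GNU shuf and reads stdin when no FILE is given.
sample_lines() {
    local n="$1"
    shift
    shuf -n "$n" -- "$@"
}

# Demo: draw 1000 of 1,000,000 generated lines, then report how many lines
# were sampled and how many of them are distinct.
picked=$(seq 1 1000000 | sample_lines 1000)
printf '%s\n' "$picked" | wc -l
printf '%s\n' "$picked" | sort -u | wc -l
```

Because the generated lines are all different, equal counts from the two wc calls mean no line was picked twice.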
According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/… – Ketan, Jan 9 '14 at 19:17
@Ketan, doesn't seem that way – frostschutz, Jan 9 '14 at 19:44
@Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input. – derobert, Jan 9 '14 at 19:49
Yes, true. I tried the command, works well. – Ketan, Jan 9 '14 at 19:56
shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-| – offby1, Nov 18 '14 at 19:59
6 votes
If you have a very large file (which is a common reason to take a sample) you will find that:
- shuf exhausts memory
- Using $RANDOM won't work correctly if the file exceeds 32767 lines
If you don't need exactly n sampled lines, you can sample a ratio like this:
    cat input.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > sample.txt
This uses constant memory, samples 1% of the file (if you know the number of lines in the file, you can adjust this factor to get close to a target number of lines), and works with any size of file. It will not return a precise number of lines, only a statistical ratio.
Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix
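If the line count is known, the sampling factor mentioned above can be derived as target / total. The following is a rough sketch of that adjustment, not the answer's own code; the function names sample_ratio and approx_sample are made up for illustration, and the result is still only approximately the target size.

```shell
#!/bin/bash
# sample_ratio RATE: keep each non-blank input line independently with probability RATE.
sample_ratio() {
    awk -v rate="$1" 'BEGIN { srand() } !/^$/ { if (rand() < rate) print }'
}

# approx_sample TARGET FILE: aim for roughly TARGET lines by using RATE = TARGET / line count.
approx_sample() {
    local target="$1" file="$2" total rate
    total=$(wc -l < "$file")
    # An empty file has nothing to sample (and would cause a division by zero).
    [ "$total" -gt 0 ] || return 0
    rate=$(awk -v k="$target" -v n="$total" 'BEGIN { print k / n }')
    sample_ratio "$rate" < "$file"
}
```

For example, approx_sample 1000 input.txt > sample.txt makes one extra pass to count lines and then keeps each line with probability 1000 / total, so the output size varies from run to run around 1000.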
If a user wants approximately 1% of the non-blank lines, this is a pretty good answer. But if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails. As the answer you got it from says, it yields only a statistical estimate. And do you understand the answer well enough to see that it is ignoring blank lines? This might be a good idea, in practice, but undocumented features are, in general, not a good idea. – G-Man, Dec 5 '16 at 21:47
P.S. Simplistic approaches using $RANDOM won't work correctly for files larger than 32767 lines. The statement "Using $RANDOM doesn't reach the entire file" is a bit broad. – G-Man, Dec 5 '16 at 21:48
@G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around worked for me (because of the size of the files and hardware limitations) and I propose this as a reasonable compromise. It won't get you 10k lines out of a million but it might be close enough for most practical purposes. I've clarified it a bit more following your advice. Thanks. – Txangel, Dec 6 '16 at 18:32
This is the best answer, the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition, awk is more resource friendly than shuf. – Polymerase, Apr 15 at 18:42
2 votes
Not aware of any single command which could do what you ask, but here is a loop I put together which can do the job:
    for i in `seq 1000`; do sed -n `echo $RANDOM % 1000000 | bc`p alargefile.txt; done > sample.txt
sed will pick a random line on each of the 1000 passes. Possibly there are more efficient solutions.
Is it possible to get the same line multiple times in this approach? – clwen, Jan 9 '14 at 18:11
Yes, quite possible to get the same line number more than once. Additionally, $RANDOM has a range between 0 and 32767. So, you will not get well-spread line numbers. – Ketan, Jan 9 '14 at 18:21
does not work - random is called once – Bohdan, Aug 20 '14 at 5:20
2 votes
You can save the following code in a file (for example randextract.sh) and execute it as:
    randextract.sh file.txt
---- BEGIN FILE ----
#!/bin/sh
# configuration: MAX_LINES is the number of lines to extract
MAX_LINES=10
# number of lines in the file (upper limit for the start line)
NUM_LINES=$(wc -l < "$1")
# generate a random number
# in bash the variable $RANDOM returns a different value on each reference
if [ "$RANDOM." != "$RANDOM." ]
then
# bigger number (0 to 3276732767)
RAND=$RANDOM$RANDOM
else
# $RANDOM is not available in this shell, so fall back to the current time
RAND=$(date +'%s')
fi
# the start line, from 1 to NUM_LINES - MAX_LINES + 1 so the block fits in the file
START_LINE=$(expr "$RAND" % '(' "$NUM_LINES" - "$MAX_LINES" + 1 ')' + 1)
tail -n +"$START_LINE" "$1" | head -n "$MAX_LINES"
---- END FILE ----
I'm not sure what you're trying to do here with RAND, but $RANDOM$RANDOM does not generate random numbers in the whole range "0 to 3276732767" (for example, it will generate 1000100000 but not 1000099999). – Gilles, Jan 9 '14 at 22:37
The OP says, "Every line gets the same probability to be chosen. … there is a tiny probability that a consecutive block of lines be chosen together." I also find this answer to be cryptic, but it looks like it is extracting a 10-line block of consecutive lines from a random starting point. That is not what the OP is asking for. – G-Man, Dec 5 '16 at 21:19
2 votes
If the shuf -n trick runs out of memory on large files, you still need a fixed-size sample, and an external utility can be installed, then try sample:
    $ sample -N 1000 < FILE_WITH_MILLIONS_OF_LINES
The caveat is that the sample (1000 lines in the example) must fit into memory.
Disclaimer: I am the author of the recommended software.
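A fixed-size sample whose only memory cost is the sample itself is the profile of reservoir sampling (Algorithm R). Whether the sample utility works this way is not stated in the answer, so treat the following awk sketch purely as an illustration of the technique; the function name reservoir_sample is hypothetical.

```shell
#!/bin/bash
# reservoir_sample K: read lines from stdin and print K of them, each input line
# being equally likely to end up in the output. Only K lines are held in memory.
reservoir_sample() {
    awk -v k="$1" '
        BEGIN { srand() }
        # The first k lines fill the reservoir.
        NR <= k { r[NR] = $0; next }
        {
            # Pick a uniform integer j in 1..NR; the current line replaces
            # slot j when j <= k, which happens with probability k/NR.
            j = int(rand() * NR) + 1
            if (j <= k) r[j] = $0
        }
        END {
            # Print the whole input if it had fewer than k lines.
            m = (NR < k) ? NR : k
            for (i = 1; i <= m; i++) print r[i]
        }'
}

# Demo: five lines out of one million, in a single pass.
seq 1 1000000 | reservoir_sample 5
```

The output order follows the reservoir slots rather than the file order, and the quality of the sample depends on the resolution of awk's rand().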
1 vote
Or like this:
    NUM_LINES=$(wc -l < file)
    RANDLINE=$(( RANDOM % NUM_LINES + 1 ))
    tail -n "$RANDLINE" < file | head -n 1
From the bash man page:
    RANDOM  Each time this parameter is referenced, a random integer
            between 0 and 32767 is generated. The sequence of random
            numbers may be initialized by assigning a value to RANDOM.
            If RANDOM is unset, it loses its special properties,
            even if it is subsequently reset.
This fails badly if the file has fewer than 32767 lines. – offby1, Nov 18 '14 at 20:00
This will output one line from the file. (I guess your idea is to execute the above commands in a loop?) If the file has more than 32767 lines, then these commands will choose only from the first 32767 lines. Aside from possible inefficiency, I don't see any big problem with this answer if the file has fewer than 32767 lines. – G-Man, Dec 5 '16 at 21:27
1 vote
If your file isn't huge, you can use sort's random option. This takes a little longer than shuf, but it randomizes the entire data. So you could simply do the following to use head as you requested:
    sort -R input | head -n 1000 > output
This would sort the file randomly and give you the first 1000 lines.
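One caveat worth checking before relying on this: GNU sort -R orders lines by a random hash of the sort key, so identical lines always end up next to each other, unlike with shuf. A small sketch of the difference follows; random_head is an illustrative name, not a standard command.

```shell
#!/bin/bash
# random_head N: print the first N lines of stdin after a sort -R pass.
random_head() {
    sort -R | head -n "$1"
}

# With repeated lines, sort -R keeps equal lines together, so uniq -c
# reports a single run for each distinct value.
printf 'a\nb\na\nb\na\nb\n' | sort -R | uniq -c
```

For files without duplicate lines this makes no practical difference; with duplicates, the first N lines are not a uniform sample of line positions.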
1 vote
If you know the number of lines in the file (like 1e6 in your case), you can do:
    awk -v n=1e6 -v p=1000 '
      BEGIN {srand()}
      rand() * n-- < p {p--; print}' < file
If not, you can always do
    awk -v n="$(wc -l < file)" -v p=1000 '
      BEGIN {srand()}
      rand() * n-- < p {p--; print}' < file
That would do two passes over the file, but still avoid storing the whole file in memory.
Another advantage over GNU shuf is that it preserves the order of the lines in the file.
Note that it assumes n is the number of lines in the file. If you want to print p out of the first n lines of the file (which has potentially more lines), you'd need to stop awk at the nth line like:
    awk -v n=1e6 -v p=1000 '
      BEGIN {srand()}
      rand() * n-- < p {p--; print}
      !n {exit}' < file
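To convince yourself that this selection scheme returns exactly p lines in their original order, it can be wrapped in a function and checked on generated data. This is a sketch built on the commands above; the name exact_sample is hypothetical, and it assumes the file ends with a newline so that wc -l matches the number of lines.

```shell
#!/bin/bash
# exact_sample P FILE: print exactly P lines of FILE (all of them if FILE is
# shorter), each line equally likely, in their original order.
exact_sample() {
    local p="$1" file="$2"
    awk -v n="$(wc -l < "$file")" -v p="$p" '
        BEGIN { srand() }
        # p lines are still wanted from the n lines not yet read,
        # so take the current line with probability p/n.
        rand() * n-- < p { p--; print }' "$file"
}

# Demo on a temporary file of 5000 numbered lines: count the sampled lines.
demo=$(mktemp)
seq 1 5000 > "$demo"
exact_sample 50 "$demo" | wc -l
rm -f "$demo"
```

When p equals the number of remaining lines the condition is always true, and once p reaches 0 it is never true, which is why the count comes out exact.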
1 vote
I like using awk for this when I want to preserve a header row, and when the sample can be an approximate percentage of the file. Works for very large files:
    # assumed reconstruction: keep line 1 (the header) plus about 1% of the other non-blank lines
    awk 'BEGIN {srand()} !/^$/ {if (FNR == 1 || rand() <= .01) print}' data.txt
0 votes
Similar to @Txangel's probabilistic solution, but approaching 100x faster.
    perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv
If you need high performance and an exact sample size, and are happy to live with a sample gap at the end of the file, you can do something like the following (samples 1000 lines from a 1m line file):
    perl -ne 'print if (rand() < .0012)' huge_file.csv | head -n 1000 > sample.csv
... or indeed chain a second sample method instead of head.
answered Jan 9 '14 at 18:57 by derobert (the accepted shuf answer)
edited Dec 6 '16 at 18:35, answered Dec 5 '16 at 20:23 by Txangel (the awk ratio-sampling answer)
answered Jan 9 '14 at 16:47 by Ketan (the sed loop answer)
Is it possible to get the same line multiple times in this approach?
â clwen
Jan 9 '14 at 18:11
1
Yes, quite possible to get the same line number more than once. Additionally,$RANDOMhas a range between 0 and 32767. So, you will not get a well spread line numbers.
â Ketan
Jan 9 '14 at 18:21
does not work - random is called once
â Bohdan
Aug 20 '14 at 5:20
up vote
2
down vote
You can save the following code in a file (for example, randextract.sh) and run it as:
randextract.sh file.txt
---- BEGIN FILE ----
#!/bin/sh -xv
#configuration MAX_LINES is the number of lines to extract
MAX_LINES=10
#number of lines in the file (is a limit)
NUM_LINES=`wc -l $1 | cut -d' ' -f1`
#generate a random number
#in bash the variable $RANDOM returns different values on each call
if [ "$RANDOM." != "$RANDOM." ]
then
#bigger number (0 to 3276732767)
RAND=$RANDOM$RANDOM
else
RAND=`date +'%s'`
fi
#The start line
START_LINE=`expr $RAND % '(' $NUM_LINES - $MAX_LINES ')'`
tail -n +$START_LINE $1 | head -n $MAX_LINES
---- END FILE ----
edited Jan 9 '14 at 17:17
answered Jan 9 '14 at 17:00
razzek
3
I'm not sure what you're trying to do here with RAND, but $RANDOM$RANDOM does not generate random numbers in the whole range “0 to 3276732767” (for example, it will generate 1000100000 but not 1000099999).
– Gilles
Jan 9 '14 at 22:37
The OP says, “Every line gets the same probability to be chosen. … there is a tiny probability that a consecutive block of lines be chosen together.” I also find this answer to be cryptic, but it looks like it is extracting a 10-line block of consecutive lines from a random starting point. That is not what the OP is asking for.
– G-Man
Dec 5 '16 at 21:19
up vote
2
down vote
If the shuf -n trick runs out of memory on large files, you still need a fixed-size sample, and you can install an external utility, then try sample:
$ sample -N 1000 < FILE_WITH_MILLIONS_OF_LINES
The caveat is that the sample (1000 lines in the example) must fit into memory.
Disclaimer: I am the author of the recommended software.
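For readers who cannot install an extra tool, the same memory profile (only the sample is held in memory) can be had with reservoir sampling. The sketch below is a plain awk version of that standard technique; it is not a description of how the sample utility itself works, and the function name reservoir is made up for illustration.

```shell
#!/bin/sh
# Sketch: reservoir sampling (Algorithm R) in awk; keeps only k lines in memory.
reservoir() {
    awk -v k="$1" '
        BEGIN { srand() }                      # seeded from the clock, per second
        NR <= k { r[NR] = $0; next }           # fill the reservoir first
        {
            j = int(rand() * NR) + 1           # uniform integer in 1..NR
            if (j <= k) r[j] = $0              # replace a slot with probability k/NR
        }
        END {
            m = (NR < k) ? NR : k              # short input: print what we have
            for (i = 1; i <= m; i++) print r[i]
        }
    '
}

# Demo on generated input: 1000 lines out of 100000, read from stdin.
out=$(seq 100000 | reservoir 1000)
```

Because srand() with no argument seeds from the time of day, two runs within the same second can return the same sample; pass an explicit seed to srand() if that matters.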
answered Jun 11 '17 at 4:03
hroptatyr
up vote
1
down vote
Or like this:
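# NLINES is used instead of LINES, which bash may manage itself (terminal height)
NLINES=$(wc -l < file)
# +1 keeps the value in 1..NLINES; tail -n 0 would print nothing
RANDLINE=$(( RANDOM % NLINES + 1 ))
tail -n "$RANDLINE" < file | head -n 1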
From the bash man page:
RANDOM Each time this parameter is referenced, a random integer
between 0 and 32767 is generated. The sequence of random
numbers may be initialized by assigning a value to RANDOM.
If RANDOM is unset, it loses its special properties,
even if it is subsequently reset.
edited Jan 11 '14 at 9:51
answered Jan 9 '14 at 16:49
user55518
This fails badly if the file has fewer than 32767 lines.
– offby1
Nov 18 '14 at 20:00
This will output one line from the file. (I guess your idea is to execute the above commands in a loop?) If the file has more than 32767 lines, then these commands will choose only from the first 32767 lines. Aside from possible inefficiency, I don't see any big problem with this answer if the file has fewer than 32767 lines.
– G-Man
Dec 5 '16 at 21:27
up vote
1
down vote
If your file isn't huge, you can use sort's random mode (sort -R). This takes a little longer than shuf, but it randomizes the entire data set. So you can simply do the following and use head, as you mentioned:
sort -R input | head -n 1000 > output
This would sort the file randomly and give you the first 1000 lines.
answered Jun 16 '16 at 19:48
DomainsFeatured
up vote
1
down vote
If you know the number of lines in the file (like 1e6 in your case), you can do:
awk -v n=1e6 -v p=1000 '
  BEGIN {srand()}
  rand() * n-- < p {p--; print}' < file
If not, you can always do
awk -v n="$(wc -l < file)" -v p=1000 '
  BEGIN {srand()}
  rand() * n-- < p {p--; print}' < file
That would do two passes in the file, but still avoid storing the whole file in memory.
Another advantage over GNU shuf is that it preserves the order of the lines in the file.
Note that it assumes n is the number of lines in the file. If you want to print p out of the first n lines of the file (which has potentially more lines), you'd need to stop awk at the nth line like:
awk -v n=1e6 -v p=1000 '
  BEGIN {srand()}
  rand() * n-- < p {p--; print}
  !n {exit}' < file
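A small demo of the two properties claimed above (exactly p lines come out, and they keep their file order), wrapped in a function so it can be tried on generated input. The function name pick and the 20-line demo file are illustrative only. Each line is kept with probability (lines still wanted) / (lines still left), so once the two counts are equal every remaining line is printed.

```shell
#!/bin/sh
# Sketch: selection sampling wrapped in a function; pick is a made-up name.
pick() {
    # $1 = number of lines to keep, $2 = file to sample from
    awk -v n="$(wc -l < "$2")" -v p="$1" '
        BEGIN { srand() }
        # n counts lines left (including this one), p counts lines still wanted
        rand() * n-- < p { p--; print }' < "$2"
}

# Demo on generated input: 5 lines out of 20, still in ascending order.
demo=$(mktemp)
seq 20 > "$demo"
picked=$(pick 5 "$demo")
rm -f "$demo"
printf '%s\n' "$picked"
```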
edited Jun 11 '17 at 7:51
answered Jun 11 '17 at 7:46
Stéphane Chazelas
up vote
1
down vote
I like using awk for this when I want to preserve a header row, and when the sample can be an approximate percentage of the file. Works for very large files:
awk 'BEGIN srand() !/^$/ ' data.txt
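Written out in full, a command of this shape might look like the following sketch. The 1% rate, the FNR == 1 test for the header row, and the function name sample_with_header are assumptions made for illustration; they are not taken from the answer.

```shell
#!/bin/sh
# Sketch: keep the header row plus roughly rate * N of the non-empty lines.
sample_with_header() {
    # $1 = sampling rate between 0 and 1, $2 = file with a header on line 1
    awk -v rate="$1" '
        BEGIN { srand() }
        FNR == 1 { print; next }       # always keep the header row
        !/^$/ && rand() < rate         # keep other non-empty lines at the given rate
    ' "$2"
}

# Demo on generated input: a header followed by 10000 data lines, sampled at 1%.
demo=$(mktemp)
{ echo "id,value"; seq 10000; } > "$demo"
out=$(sample_with_header 0.01 "$demo")
rm -f "$demo"
```

The output size is only approximate (about 100 data lines here), which is the trade-off for a single pass with no memory use beyond the current line.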
answered Jan 11 at 20:53
Merlin
up vote
0
down vote
Similar to @Txangel's probabilistic solution but approaching 100x faster.
perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv
If you need high performance and an exact sample size, and can live with a sampling gap at the end of the file, you can oversample slightly and truncate. The following samples 1000 lines from a 1M-line file: a rate of .0012 yields about 1200 candidate lines, and head keeps the first 1000, so lines after the 1000th candidate are never chosen:
perl -ne 'print if (rand() < .0012)' huge_file.csv | head -n 1000 > sample.csv
... or chain a second sampling method instead of head.
edited Aug 16 at 18:05
answered Aug 16 at 17:57
geotheory