How to randomly sample a subset of a file

Score: 23

Is there any Linux command one can use to sample a subset of a file? For instance, a file contains one million lines, and we want to randomly sample only one thousand lines from that file.

By random I mean that every line has the same probability of being chosen and no line is chosen more than once.

head and tail can pick a subset of the file, but not randomly. I know I can always write a Python script to do so, but I am wondering whether there is a command for this.

asked Jan 9 '14 at 16:24 by clwen
edited Jan 12 '14 at 15:03 by Timo

lines in random order, or a random block of 1000 consecutive lines of that file?
– frostschutz
Jan 9 '14 at 17:49

Every line gets the same probability of being chosen. They don't need to be consecutive, although there is a tiny probability that a consecutive block of lines is chosen together. I've updated my question to be clearer about that. Thanks.
– clwen
Jan 9 '14 at 18:08

command-line files command


10 Answers

Score: 44 (accepted)

The shuf command (part of coreutils) can do this:

shuf -n 1000 file
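
A quick way to check that shuf -n draws each line at most once is to run it on numbered lines and count the distinct values. This is only a sketch: it assumes GNU coreutils (shuf, seq, sort, wc), and the temporary files created with mktemp are hypothetical stand-ins for the real input and output.

```shell
# Hypothetical 1,000,000-line input where each line is its own line number.
input=$(mktemp)
seq 1000000 > "$input"

# Draw 1000 lines; shuf samples without replacement.
sample=$(mktemp)
shuf -n 1000 "$input" > "$sample"

# Count the lines drawn and how many of them are distinct.
total=$(wc -l < "$sample")
distinct=$(sort -u "$sample" | wc -l)
echo "lines drawn: $total, distinct: $distinct"
```

Both counts should be 1000. Note that the output comes out in shuffled order, not in the order of the input file.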

answered Jan 9 '14 at 18:57 by derobert

According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/…
– Ketan
Jan 9 '14 at 19:17

@Ketan, doesn't seem that way
– frostschutz
Jan 9 '14 at 19:44

@Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input.
– derobert
Jan 9 '14 at 19:49

Yes, true. I tried the command, works well.
– Ketan
Jan 9 '14 at 19:56

shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-|
– offby1
Nov 18 '14 at 19:59

Score: 6

If you have a very large file (which is a common reason to take a sample), you may find that:

shuf can exhaust memory

Using $RANDOM won't work correctly if the file exceeds 32767 lines

If you don't need exactly n sampled lines, you can sample a ratio like this:

cat input.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > sample.txt

This uses constant memory, samples about 1% of the non-blank lines (if you know the number of lines in the file, you can adjust this factor to get close to a target number of lines), and works with any file size. It will not return a precise number of lines, just a statistical ratio.

Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix
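
Adjusting the factor to a target line count could look roughly like the sketch below. It is not part of the linked code: the variable names (want, total, ratio) and the mktemp files are made up for illustration, and unlike the one-liner above it does not skip blank lines.

```shell
# Hypothetical input: 100,000 numbered lines standing in for a large file.
input=$(mktemp)
seq 100000 > "$input"

want=1000                      # approximate number of lines wanted
total=$(wc -l < "$input")      # first pass: count the lines

# Second pass: keep each line independently with probability want/total.
sample=$(mktemp)
awk -v want="$want" -v total="$total" '
  BEGIN { srand(); ratio = want / total }
  rand() < ratio' "$input" > "$sample"

got=$(wc -l < "$sample")
echo "asked for about $want lines, got $got"
```

The result is still only close to the target (here typically within a few dozen lines of 1000), and the selected lines stay in their original order. Since awk's srand() seeds from the clock in seconds, two runs within the same second can produce the same selection.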

answered Dec 5 '16 at 20:23 by Txangel
edited Dec 6 '16 at 18:35

If a user wants approximately 1% of the non-blank lines, this is a pretty good answer. But if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails. As the answer you got it from says, it yields only a statistical estimate. And do you understand the answer well enough to see that it is ignoring blank lines? This might be a good idea, in practice, but undocumented features are, in general, not a good idea.
– G-Man
Dec 5 '16 at 21:47

P.S. Simplistic approaches using $RANDOM won’t work correctly for files larger than 32767 lines. The statement “Using $RANDOM doesn’t reach the entire file” is a bit broad.
– G-Man
Dec 5 '16 at 21:48

@G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around worked for me (because of the size of the files and hardware limitations), and I propose this as a reasonable compromise. It won't get you 10k lines out of a million, but it might be close enough for most practical purposes. I've clarified it a bit more following your advice. Thanks.
– Txangel
Dec 6 '16 at 18:32

This is the best answer: the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition, awk is more resource-friendly than shuf.
– Polymerase
Apr 15 at 18:42

Score: 2

I'm not aware of any single command that does what you ask, but here is a loop I put together that can do the job:

for i in $(seq 1000); do sed -n "$((RANDOM % 1000000 + 1))p" alargefile.txt; done > sample.txt

sed prints one random line on each of the 1000 passes, so the file is read 1000 times. There are likely more efficient solutions.

answered Jan 9 '14 at 16:47 by Ketan

Is it possible to get the same line multiple times in this approach?
– clwen
Jan 9 '14 at 18:11

Yes, it is quite possible to get the same line number more than once. Additionally, $RANDOM has a range between 0 and 32767, so you will not get well-spread line numbers.
– Ketan
Jan 9 '14 at 18:21

does not work - random is called once
– Bohdan
Aug 20 '14 at 5:20

Score: 2

You can save the following code in a file (for example, randextract.sh) and run it as:

randextract.sh file.txt

---- BEGIN FILE ----

#!/bin/sh -xv

#configuration: MAX_LINES is the number of lines to extract
MAX_LINES=10

#number of lines in the file (upper limit for the start line)
NUM_LINES=`wc -l < "$1"`

#generate a random number
#in bash, $RANDOM returns a different value on each reference;
#in shells without it, both sides are equal and the clock is used instead
if [ "$RANDOM." != "$RANDOM." ]
then
 #bigger number: two values concatenated (up to 3276732767)
 RAND=$RANDOM$RANDOM
else
 RAND=`date +'%s'`
fi

#the start line (1-based)
START_LINE=`expr $RAND % '(' $NUM_LINES - $MAX_LINES ')' + 1`

tail -n +"$START_LINE" "$1" | head -n "$MAX_LINES"

---- END FILE ----

answered Jan 9 '14 at 17:00 by razzek
edited Jan 9 '14 at 17:17

I'm not sure what you're trying to do here with RAND, but $RANDOM$RANDOM does not generate random numbers in the whole range “0 to 3276732767” (for example, it will generate 1000100000 but not 1000099999).
– Gilles
Jan 9 '14 at 22:37

The OP says, “Every line gets the same probability to be chosen. … there is a tiny probability that a consecutive block of lines be chosen together.” I also find this answer to be cryptic, but it looks like it is extracting a 10-line block of consecutive lines from a random starting point. That is not what the OP is asking for.
– G-Man
Dec 5 '16 at 21:19

Score: 2

If shuf -n runs out of memory on a large file, you still need a fixed-size sample, and you can install an external utility, then try sample:

$ sample -N 1000 < FILE_WITH_MILLIONS_OF_LINES

The caveat is that the sample (1000 lines in the example) must fit into memory.

Disclaimer: I am the author of the recommended software.
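
If installing a tool is not possible, a similar memory profile (only the sample itself is held in memory) can be had with reservoir sampling, often called Algorithm R. To be clear, this is a swapped-in technique written in awk, not a description of how the sample utility works internally. The file names and the variable k are hypothetical.

```shell
# Hypothetical input: 200,000 numbered lines.
input=$(mktemp)
seq 200000 > "$input"

k=1000
sample=$(mktemp)
awk -v k="$k" '
  BEGIN { srand() }
  NR <= k { res[NR] = $0; next }        # fill the reservoir with the first k lines
  {
    j = int(rand() * NR) + 1            # uniform integer in 1..NR
    if (j <= k) res[j] = $0             # replace a slot with probability k/NR
  }
  END {
    n = (NR < k) ? NR : k               # files shorter than k yield every line
    for (i = 1; i <= n; i++) print res[i]
  }' "$input" > "$sample"

total=$(wc -l < "$sample")
distinct=$(sort -u "$sample" | wc -l)
echo "reservoir holds $total lines, $distinct distinct"
```

This reads the input once and never needs to know the line count in advance, so it also works on a pipe. The output is not in the original file order.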

answered Jun 11 '17 at 4:03 by hroptatyr

Score: 1

Or like this:

NUM_LINES=$(wc -l < file)
RANDLINE=$(( RANDOM % NUM_LINES + 1 ))
tail -n "$RANDLINE" < file | head -n 1

From the bash man page:

 RANDOM Each time this parameter is referenced, a random integer
 between 0 and 32767 is generated. The sequence of random
 numbers may be initialized by assigning a value to RANDOM.
 If RANDOM is unset, it loses its special properties,
 even if it is subsequently reset.

answered Jan 9 '14 at 16:49 by user55518
edited Jan 11 '14 at 9:51

This fails badly if the file has fewer than 32767 lines.
– offby1
Nov 18 '14 at 20:00

This will output one line from the file. (I guess your idea is to execute the above commands in a loop?) If the file has more than 32767 lines, then these commands will choose only from the first 32767 lines. Aside from possible inefficiency, I don’t see any big problem with this answer if the file has fewer than 32767 lines.
– G-Man
Dec 5 '16 at 21:27

Score: 1

If your file isn't huge, you can use sort's random option (sort -R). It takes a little longer than shuf, but it randomizes the entire file, so you can then use head as you mentioned:

sort -R input | head -n 1000 > output

This sorts the file randomly and gives you the first 1000 lines.
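
One caveat worth checking before relying on this: GNU sort -R orders lines by a random hash of their content, so identical lines always end up next to each other, whereas shuf permutes lines independently. The sketch below (assuming GNU sort and shuf, with a hypothetical temporary input) makes the difference visible by counting runs of equal lines.

```shell
# Hypothetical input: the values 1..100, each appearing three times.
input=$(mktemp)
{ seq 100; seq 100; seq 100; } > "$input"

# uniq collapses adjacent identical lines, so the count is the number of runs.
runs_sort=$(sort -R "$input" | uniq | wc -l)
runs_shuf=$(shuf "$input" | uniq | wc -l)
echo "runs after sort -R: $runs_sort, after shuf: $runs_shuf"
```

With sort -R there are exactly 100 runs because the three copies of each value are grouped; with shuf the count is far higher. If the input has many repeated lines, taking the head of sort -R output therefore does not give every line the same chance.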

answered Jun 16 '16 at 19:48 by DomainsFeatured

Score: 1

If you know the number of lines in the file (like 1e6 in your case), you can do:

awk -v n=1e6 -v p=1000 '
 BEGIN {srand()}
 rand() * n-- < p {p--; print}' < file

If not, you can always do

awk -v n="$(wc -l < file)" -v p=1000 '
 BEGIN {srand()}
 rand() * n-- < p {p--; print}' < file

That would do two passes over the file, but still avoid storing the whole file in memory.

Another advantage over GNU shuf is that it preserves the order of the lines in the file.

Note that it assumes n is the number of lines in the file. If you want to print p out of the first n lines of the file (which potentially has more lines), you'd need to stop awk at the nth line, like:

awk -v n=1e6 -v p=1000 '
 BEGIN {srand()}
 rand() * n-- < p {p--; print}
 !n {exit}' < file
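
To spell out why this yields exactly p lines: the condition rand() * n-- < p accepts the current line with probability p/n, where n is the number of lines not yet read and p is the number of picks still needed. Once p reaches 0 nothing more is accepted, and if the remaining lines equal the remaining picks, every line is accepted. The following check is a small sketch on hypothetical numbered input, with smaller numbers than above.

```shell
# Hypothetical input: 50,000 numbered lines.
input=$(mktemp)
seq 50000 > "$input"

sample=$(mktemp)
awk -v n="$(wc -l < "$input")" -v p=100 '
  BEGIN { srand() }
  rand() * n-- < p { p--; print }' "$input" > "$sample"

count=$(wc -l < "$sample")
# sort -c exits non-zero if its input is not already sorted.
if sort -n -c "$sample" 2>/dev/null; then ordered=yes; else ordered=no; fi
echo "picked $count lines, original order kept: $ordered"
```

Because the lines here are their own line numbers, a sorted sample confirms that the original order is preserved.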

answered Jun 11 '17 at 7:46 by Stéphane Chazelas
edited Jun 11 '17 at 7:51

Score: 1

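I like using awk for this when I want to preserve a header row and the sample can be an approximate percentage of the file. It works for very large files. For example, to keep the first line plus roughly 1% of the remaining non-blank lines:

awk 'BEGIN {srand()} !/^$/ { if (FNR == 1 || rand() <= .01) print }' data.txt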

answered Jan 11 at 20:53 by Merlin

Score: 0

Similar to @Txangel's probabilistic solution, but approaching 100x faster:

perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv

If you need high performance and an exact sample size, and can live with a sampling gap at the end of the file (head stops reading once it has 1000 lines, so later lines are never considered), you can do something like the following, which samples 1000 lines from a 1M-line file:

perl -ne 'print if (rand() < .0012)' huge_file.csv | head -n 1000 > sample.csv

Or chain a second sampling method instead of head.

answered Aug 16 at 17:57 by geotheory
edited Aug 16 at 18:05

add a commentÂ |Â

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f108581%2fhow-to-randomly-sample-a-subset-of-a-file%23new-answer', 'question_page');

);

Post as a guest

Name

10 Answers
10

active

oldest

votes

10 Answers
10

active

oldest

votes

up vote
44
down vote

accepted

The shuf command (part of coreutils) can do this:

shuf -n 1000 file

answered Jan 9 '14 at 18:57

derobert

69.2k8150206

According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/â€¦
â€“Â Ketan
Jan 9 '14 at 19:17

@Ketan, doesn't seem that way
â€“Â frostschutz
Jan 9 '14 at 19:44

2

@Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input.
â€“Â derobert
Jan 9 '14 at 19:49

Yes, true. I tried the command, works well.
â€“Â Ketan
Jan 9 '14 at 19:56

1

shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-|
â€“Â offby1
Nov 18 '14 at 19:59

Â |Â
show 1 more comment

up vote
44
down vote

accepted

The shuf command (part of coreutils) can do this:

shuf -n 1000 file

answered Jan 9 '14 at 18:57

derobert

69.2k8150206

According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/â€¦
â€“Â Ketan
Jan 9 '14 at 19:17

@Ketan, doesn't seem that way
â€“Â frostschutz
Jan 9 '14 at 19:44

2

@Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input.
â€“Â derobert
Jan 9 '14 at 19:49

Yes, true. I tried the command, works well.
â€“Â Ketan
Jan 9 '14 at 19:56

1

shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-|
â€“Â offby1
Nov 18 '14 at 19:59

Â |Â
show 1 more comment

up vote
44
down vote

accepted

The shuf command (part of coreutils) can do this:

shuf -n 1000 file

answered Jan 9 '14 at 18:57

derobert

69.2k8150206

The shuf command (part of coreutils) can do this:

shuf -n 1000 file

answered Jan 9 '14 at 18:57

derobert

69.2k8150206

answered Jan 9 '14 at 18:57

derobert

69.2k8150206

answered Jan 9 '14 at 18:57

derobert

69.2k8150206

answered Jan 9 '14 at 18:57

derobert

69.2k8150206

According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/â€¦
â€“Â Ketan
Jan 9 '14 at 19:17

@Ketan, doesn't seem that way
â€“Â frostschutz
Jan 9 '14 at 19:44

2

@Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input.
â€“Â derobert
Jan 9 '14 at 19:49

Yes, true. I tried the command, works well.
â€“Â Ketan
Jan 9 '14 at 19:56

1

shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-|
â€“Â offby1
Nov 18 '14 at 19:59

Â |Â
show 1 more comment

According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/â€¦
â€“Â Ketan
Jan 9 '14 at 19:17

@Ketan, doesn't seem that way
â€“Â frostschutz
Jan 9 '14 at 19:44

2

@Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input.
â€“Â derobert
Jan 9 '14 at 19:49

Yes, true. I tried the command, works well.
â€“Â Ketan
Jan 9 '14 at 19:56

1

shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-|
â€“Â offby1
Nov 18 '14 at 19:59

According to documentation, it needs a sorted file as input: gnu.org/software/coreutils/manual/â€¦
â€“Â Ketan
Jan 9 '14 at 19:17

@Ketan, doesn't seem that way
â€“Â frostschutz
Jan 9 '14 at 19:44

@Ketan it's just in the wrong section of the manual, I believe. Note that even the examples in the manual are not sorted. Note also that sort is in the same section, and it clearly doesn't require sorted input.
â€“Â derobert
Jan 9 '14 at 19:49

Yes, true. I tried the command, works well.
â€“Â Ketan
Jan 9 '14 at 19:56

shuf was introduced to coreutils in version 6.0 (2006-08-15), and believe it or not, some reasonably-common systems (CentOS 6.5 in particular) don't have that version :-|
â€“Â offby1
Nov 18 '14 at 19:59

Â |Â
show 1 more comment

up vote
6
down vote

If you have a very large file (which is a common reason to take a sample) you will find that:

shuf exhausts memory

Using $RANDOM won't work correctly if the file exceeds 32767 lines

If you don't need "exactly" n sampled lines you can sample a ratio like this:

cat input.txt | awk 'BEGIN srand() !/^$/ if (rand() <= .01) print $0' > sample.txt

Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix

edited Dec 6 '16 at 18:35

answered Dec 5 '16 at 20:23

Txangel

16112

If a user wants approximately 1% of the non-blank lines, this is a pretty good answer.Ã¢Â€ÂƒBut if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails.Ã¢Â€ÂƒAs the answer you got it from says, it yields only a statistical estimate.Ã¢Â€ÂƒAnd do you understand the answer well enough to see that it is ignoring blank lines?Ã¢Â€ÂƒThis might be a good idea, in practice, but undocumented features are, in general, not a good idea.
â€“Â G-Man
Dec 5 '16 at 21:47

1

P.S.Ã¢Â€Â¯Ã¢Â€Â¯Simplistic approaches using $RANDOM wonÃ¢Â€Â™t work correctly for files larger than 32767 lines.Ã¢Â€Â‚ The statement Ã¢Â€ÂœUsing $RANDOM doesnÃ¢Â€Â™t reach the entire fileÃ¢Â€Â is a bit broad.
â€“Â G-Man
Dec 5 '16 at 21:48

@G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around did work for me (because of the size of the files and hardware limitations) and I propose this as a reasonable compromise. It won't get you 10k lines out of a million but it might be close enough for most practical purposes. I've clarified it a bit more following your advise. Thanks.
â€“Â Txangel
Dec 6 '16 at 18:32

This is the best answer, the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition awk is more resource friendly than shuf
â€“Â Polymerase
Apr 15 at 18:42

add a commentÂ |Â

up vote
6
down vote

If you have a very large file (which is a common reason to take a sample) you will find that:

shuf exhausts memory

Using $RANDOM won't work correctly if the file exceeds 32767 lines

If you don't need "exactly" n sampled lines you can sample a ratio like this:

cat input.txt | awk 'BEGIN srand() !/^$/ if (rand() <= .01) print $0' > sample.txt

Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix

edited Dec 6 '16 at 18:35

answered Dec 5 '16 at 20:23

Txangel

16112

If a user wants approximately 1% of the non-blank lines, this is a pretty good answer.Ã¢Â€ÂƒBut if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails.Ã¢Â€ÂƒAs the answer you got it from says, it yields only a statistical estimate.Ã¢Â€ÂƒAnd do you understand the answer well enough to see that it is ignoring blank lines?Ã¢Â€ÂƒThis might be a good idea, in practice, but undocumented features are, in general, not a good idea.
â€“Â G-Man
Dec 5 '16 at 21:47

1

P.S.Ã¢Â€Â¯Ã¢Â€Â¯Simplistic approaches using $RANDOM wonÃ¢Â€Â™t work correctly for files larger than 32767 lines.Ã¢Â€Â‚ The statement Ã¢Â€ÂœUsing $RANDOM doesnÃ¢Â€Â™t reach the entire fileÃ¢Â€Â is a bit broad.
â€“Â G-Man
Dec 5 '16 at 21:48

@G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around did work for me (because of the size of the files and hardware limitations) and I propose this as a reasonable compromise. It won't get you 10k lines out of a million but it might be close enough for most practical purposes. I've clarified it a bit more following your advise. Thanks.
â€“Â Txangel
Dec 6 '16 at 18:32

This is the best answer, the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition awk is more resource friendly than shuf
â€“Â Polymerase
Apr 15 at 18:42

add a commentÂ |Â

up vote
6
down vote

If you have a very large file (which is a common reason to take a sample) you will find that:

shuf exhausts memory

Using $RANDOM won't work correctly if the file exceeds 32767 lines

If you don't need "exactly" n sampled lines you can sample a ratio like this:

cat input.txt | awk 'BEGIN srand() !/^$/ if (rand() <= .01) print $0' > sample.txt

Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix

edited Dec 6 '16 at 18:35

answered Dec 5 '16 at 20:23

Txangel

16112

If you have a very large file (which is a common reason to take a sample) you will find that:

shuf exhausts memory

Using $RANDOM won't work correctly if the file exceeds 32767 lines

If you don't need "exactly" n sampled lines you can sample a ratio like this:

cat input.txt | awk 'BEGIN srand() !/^$/ if (rand() <= .01) print $0' > sample.txt

Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix

edited Dec 6 '16 at 18:35

answered Dec 5 '16 at 20:23

Txangel

16112

edited Dec 6 '16 at 18:35

answered Dec 5 '16 at 20:23

Txangel

16112

answered Dec 5 '16 at 20:23

Txangel

16112

answered Dec 5 '16 at 20:23

Txangel

16112

If a user wants approximately 1% of the non-blank lines, this is a pretty good answer.Ã¢Â€ÂƒBut if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails.Ã¢Â€ÂƒAs the answer you got it from says, it yields only a statistical estimate.Ã¢Â€ÂƒAnd do you understand the answer well enough to see that it is ignoring blank lines?Ã¢Â€ÂƒThis might be a good idea, in practice, but undocumented features are, in general, not a good idea.
â€“Â G-Man
Dec 5 '16 at 21:47

1

P.S.Ã¢Â€Â¯Ã¢Â€Â¯Simplistic approaches using $RANDOM wonÃ¢Â€Â™t work correctly for files larger than 32767 lines.Ã¢Â€Â‚ The statement Ã¢Â€ÂœUsing $RANDOM doesnÃ¢Â€Â™t reach the entire fileÃ¢Â€Â is a bit broad.
â€“Â G-Man
Dec 5 '16 at 21:48

@G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around did work for me (because of the size of the files and hardware limitations) and I propose this as a reasonable compromise. It won't get you 10k lines out of a million but it might be close enough for most practical purposes. I've clarified it a bit more following your advise. Thanks.
â€“Â Txangel
Dec 6 '16 at 18:32

This is the best answer, the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition awk is more resource friendly than shuf
â€“Â Polymerase
Apr 15 at 18:42

add a commentÂ |Â

If a user wants approximately 1% of the non-blank lines, this is a pretty good answer.Ã¢Â€ÂƒBut if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails.Ã¢Â€ÂƒAs the answer you got it from says, it yields only a statistical estimate.Ã¢Â€ÂƒAnd do you understand the answer well enough to see that it is ignoring blank lines?Ã¢Â€ÂƒThis might be a good idea, in practice, but undocumented features are, in general, not a good idea.
â€“Â G-Man
Dec 5 '16 at 21:47

1

P.S.Ã¢Â€Â¯Ã¢Â€Â¯Simplistic approaches using $RANDOM wonÃ¢Â€Â™t work correctly for files larger than 32767 lines.Ã¢Â€Â‚ The statement Ã¢Â€ÂœUsing $RANDOM doesnÃ¢Â€Â™t reach the entire fileÃ¢Â€Â is a bit broad.
â€“Â G-Man
Dec 5 '16 at 21:48

@G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around did work for me (because of the size of the files and hardware limitations) and I propose this as a reasonable compromise. It won't get you 10k lines out of a million but it might be close enough for most practical purposes. I've clarified it a bit more following your advise. Thanks.
â€“Â Txangel
Dec 6 '16 at 18:32

This is the best answer, the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition awk is more resource friendly than shuf
â€“Â Polymerase
Apr 15 at 18:42

If a user wants approximately 1% of the non-blank lines, this is a pretty good answer.Ã¢Â€ÂƒBut if the user wants an exact number of lines (e.g., 1000 out of a 1000000-line file), this fails.Ã¢Â€ÂƒAs the answer you got it from says, it yields only a statistical estimate.Ã¢Â€ÂƒAnd do you understand the answer well enough to see that it is ignoring blank lines?Ã¢Â€ÂƒThis might be a good idea, in practice, but undocumented features are, in general, not a good idea.
â€“Â G-Man
Dec 5 '16 at 21:47

P.S.Ã¢Â€Â¯Ã¢Â€Â¯Simplistic approaches using $RANDOM wonÃ¢Â€Â™t work correctly for files larger than 32767 lines.Ã¢Â€Â‚ The statement Ã¢Â€ÂœUsing $RANDOM doesnÃ¢Â€Â™t reach the entire fileÃ¢Â€Â is a bit broad.
â€“Â G-Man
Dec 5 '16 at 21:48

@G-Man The question seems to talk about getting 10k lines from a million as an example. None of the answers around did work for me (because of the size of the files and hardware limitations) and I propose this as a reasonable compromise. It won't get you 10k lines out of a million but it might be close enough for most practical purposes. I've clarified it a bit more following your advise. Thanks.
â€“Â Txangel
Dec 6 '16 at 18:32

This is the best answer, the lines are picked randomly while respecting the chronological order of the original file, in case this is a requirement. In addition awk is more resource friendly than shuf
â€“Â Polymerase
Apr 15 at 18:42

add a commentÂ |Â

up vote
2
down vote

Not aware of any single command which could do what you ask but here is a loop I put together which can do the job:

for i in `seq 1000`; do sed -n `echo $RANDOM % 1000000 | bc`p alargefile.txt; done > sample.txt

sed will pick up a random line on each of the 1000 passes. Possibly there are more efficient solutions.

answered Jan 9 '14 at 16:47

Ketan

5,43942741

Is it possible to get the same line multiple times in this approach?
â€“Â clwen
Jan 9 '14 at 18:11

1

Yes, quite possible to get the same line number more than once. Additionally, $RANDOM has a range between 0 and 32767. So, you will not get a well spread line numbers.
â€“Â Ketan
Jan 9 '14 at 18:21

does not work - random is called once
â€“Â Bohdan
Aug 20 '14 at 5:20

add a commentÂ |Â

up vote
2
down vote

Not aware of any single command which could do what you ask but here is a loop I put together which can do the job:

for i in `seq 1000`; do sed -n `echo $RANDOM % 1000000 | bc`p alargefile.txt; done > sample.txt

sed will pick up a random line on each of the 1000 passes. Possibly there are more efficient solutions.

answered Jan 9 '14 at 16:47

Ketan

5,43942741

Is it possible to get the same line multiple times in this approach?
â€“Â clwen
Jan 9 '14 at 18:11

1

Yes, quite possible to get the same line number more than once. Additionally, $RANDOM has a range between 0 and 32767. So, you will not get a well spread line numbers.
â€“Â Ketan
Jan 9 '14 at 18:21

does not work - random is called once
â€“Â Bohdan
Aug 20 '14 at 5:20

add a commentÂ |Â

up vote
2
down vote

Not aware of any single command which could do what you ask but here is a loop I put together which can do the job:

for i in `seq 1000`; do sed -n `echo $RANDOM % 1000000 | bc`p alargefile.txt; done > sample.txt

sed will pick up a random line on each of the 1000 passes. Possibly there are more efficient solutions.

answered Jan 9 '14 at 16:47

Ketan

up vote
2
down vote

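You can save the following code in a file (for example randextract.sh) and execute it as:

randextract.sh file.txt

It prints MAX_LINES consecutive lines starting at a random line of the file.

---- BEGIN FILE ----

#!/bin/sh

# configuration: MAX_LINES is the number of lines to extract
MAX_LINES=10

# number of lines in the file (upper limit for the start line)
NUM_LINES=$(wc -l < "$1")

# generate a random number
# in bash the variable $RANDOM returns a different value on each reference;
# in shells without $RANDOM both sides are equal and the clock is used instead
if [ "$RANDOM." != "$RANDOM." ]
then
 # bigger number: two values concatenated (at most 3276732767,
 # but not every number in between can occur)
 RAND=$RANDOM$RANDOM
else
 RAND=$(date +'%s')
fi

# the start line, from 1 to NUM_LINES - MAX_LINES + 1
# (assumes the file has at least MAX_LINES lines)
START_LINE=$(expr $RAND % '(' $NUM_LINES - $MAX_LINES + 1 ')' + 1)

tail -n "+$START_LINE" "$1" | head -n "$MAX_LINES"

---- END FILE ----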

edited Jan 9 '14 at 17:17

answered Jan 9 '14 at 17:00

razzek

I'm not sure what you're trying to do here with RAND, but $RANDOM$RANDOM does not generate random numbers in the whole range “0 to 3276732767” (for example, it will generate 1000100000 but not 1000099999).
– Gilles
Jan 9 '14 at 22:37

The OP says, “Every line gets the same probability to be chosen. … there is a tiny probability that a consecutive block of lines be chosen together.” I also find this answer to be cryptic, but it looks like it is extracting a 10-line block of consecutive lines from a random starting point. That is not what the OP is asking for.
– G-Man
Dec 5 '16 at 21:19

up vote
2
down vote

If the shuf -n trick runs out of memory on large files, you still need a fixed-size sample, and an external utility can be installed, then try sample:

$ sample -N 1000 < FILE_WITH_MILLIONS_OF_LINES

The caveat is that the sample (1000 lines in the example) must fit into memory.

Disclaimer: I am the author of the recommended software.
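How sample works internally is not described here. One well-known technique with the same memory profile (a single pass, holding only the sample in memory) is reservoir sampling. The following is a minimal sketch of that technique in awk, assuming a POSIX awk; reservoir_sample is a hypothetical helper name and is not part of the sample utility:

```shell
# reservoir_sample K [FILE...]: one pass, keeps at most K lines in memory.
# Reads standard input when no FILE is given.
reservoir_sample() {
    k=$1
    shift
    awk -v k="$k" '
        BEGIN { srand() }
        NR <= k { r[NR] = $0; next }     # fill the reservoir first
        {
            j = int(rand() * NR) + 1     # uniform integer in 1..NR
            if (j <= k) r[j] = $0        # replace a slot with probability k/NR
        }
        END {
            n = (NR < k) ? NR : k
            for (i = 1; i <= n; i++) print r[i]
        }
    ' "$@"
}
```

For example, `reservoir_sample 1000 < FILE_WITH_MILLIONS_OF_LINES`. The output is not in file order, and because srand() with no argument seeds from the clock in many awk implementations, two runs within the same second may return the same sample.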

answered Jun 11 '17 at 4:03

hroptatyr

up vote
1
down vote

Or like this, to print a single random line:

NLINES=$(wc -l < file)
RANDLINE=$(( RANDOM % NLINES + 1 ))
tail -n "$RANDLINE" < file | head -n 1

From the bash man page:

 RANDOM Each time this parameter is referenced, a random integer
 between 0 and 32767 is generated. The sequence of random
 numbers may be initialized by assigning a value to RANDOM.
 If RANDOM is unset, it loses its special properties,
 even if it is subsequently reset.

edited Jan 11 '14 at 9:51

answered Jan 9 '14 at 16:49

user55518

This fails badly if the file has fewer than 32767 lines.
– offby1
Nov 18 '14 at 20:00

This will output one line from the file. (I guess your idea is to execute the above commands in a loop?) If the file has more than 32767 lines, then these commands will choose only from the first 32767 lines. Aside from possible inefficiency, I don’t see any big problem with this answer if the file has fewer than 32767 lines.
– G-Man
Dec 5 '16 at 21:27

up vote
1
down vote

If your file isn't huge, you can use sort's random sort (sort -R). It takes a little longer than shuf, but it randomizes the entire data set, so you can then use head as you requested:

sort -R input | head -n 1000 > output

This sorts the file randomly and gives you the first 1000 lines. Note that GNU sort -R orders lines by a random hash of their content, so identical lines end up next to each other.

answered Jun 16 '16 at 19:48

DomainsFeatured

up vote
1
down vote

If you know the number of lines in the file (like 1e6 in your case), you can do:

awk -v n=1e6 -v p=1000 '
 BEGIN {srand()}
 rand() * n-- < p {p--; print}' < file

If not, you can always do:

awk -v n="$(wc -l < file)" -v p=1000 '
 BEGIN {srand()}
 rand() * n-- < p {p--; print}' < file

That makes two passes over the file, but still avoids storing the whole file in memory.

Another advantage over GNU shuf is that it preserves the order of the lines in the file.

This assumes n is the number of lines in the file. To pick p lines out of only the first n lines of a longer file, stop awk after the n-th line:

awk -v n=1e6 -v p=1000 '
 BEGIN {srand()}
 rand() * n-- < p {p--; print}
 !n {exit}' < file
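One practical caveat with this approach: in many awk implementations srand() without an argument seeds from the time of day in seconds, so two runs within the same second can return the same sample. A minimal sketch that passes an explicit seed instead, assuming /dev/urandom and a POSIX od are available; ordered_sample is a hypothetical wrapper name:

```shell
# ordered_sample K FILE: print exactly K lines of FILE in their original order.
# Assumes FILE has at least K lines and that /dev/urandom exists.
ordered_sample() {
    k=$1
    file=$2
    # Read 4 random bytes as an unsigned integer to use as the awk seed.
    seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
    awk -v n="$(wc -l < "$file")" -v p="$k" -v seed="$seed" '
        BEGIN { srand(seed) }
        # n = lines not yet read, p = lines still to pick.
        # Picking each line with probability p/n yields exactly p lines.
        rand() * n-- < p { p--; print }
    ' "$file"
}
```

For example, `ordered_sample 1000 file > sample.txt`.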

edited Jun 11 '17 at 7:51

answered Jun 11 '17 at 7:46

up vote
1
down vote

I like using awk for this when I want to preserve a header row, and when the sample can be an approximate percentage of the file. Works for very large files:

awk 'BEGIN srand() !/^$/ ' data.txt
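The one-liner above appears to be missing its braces and action block, so what follows is only a sketch of the idea as described (keep the header row, keep roughly a given fraction of the other non-empty lines), not the author's exact command. The sampling rate is passed in as an argument because the original value is not known; sample_with_header and rate are hypothetical names:

```shell
# sample_with_header RATE [FILE...]: keep line 1 of each file, then keep each
# other non-empty line with probability RATE (a number between 0 and 1).
# The sample size is therefore only approximately RATE times the line count.
sample_with_header() {
    rate=$1
    shift
    awk -v rate="$rate" '
        BEGIN { srand() }
        FNR == 1 { print; next }     # always keep the header row
        !/^$/ && rand() < rate       # keep other non-empty lines at random
    ' "$@"
}
```

For example, `sample_with_header 0.01 data.txt > sample.txt` would keep the header plus about 1% of the remaining lines. Since each line is decided independently, nothing has to be held in memory.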

answered Jan 11 at 20:53

Merlin

up vote
0
down vote

Similar to @Txangel's probabilistic solution, but approaching 100x faster:

perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv

If you need high performance and an exact sample size, and are happy to live with a sample gap at the end of the file, you can do something like the following (samples 1000 lines from a 1M-line file):

perl -ne 'print if (rand() < .0012)' huge_file.csv | head -n 1000 > sample.csv

The rate .0012 deliberately oversamples (about 1200 lines expected from 1M), and head then cuts the result to exactly 1000, which is why lines near the end of the file are underrepresented. You can also chain a second sampling method instead of head.

edited Aug 16 at 18:05

answered Aug 16 at 17:57

geotheory
