Using a single command-line command, how would I search every text file in a database to find the 10 most used words?
This answered question explains how to search and sort a specific file, but how would you accomplish this for an entire directory tree? I have 1 million text files I need to search to find the ten most frequently used words.

database = /data/000/0000000/s##_date/*.txt - /data/999/0999999/s##_data/*txt

Everything I have attempted results in sorted filenames, paths, or directory errors.

I have made some progress with grep, but parts of the filenames appear in my results:

grep -r . * | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10

output:

 1145
  253 txt
  190 s01
  132 is
  126 of
  116 the
  108 and
  104 test
   92 with
   84 in

The 'txt' and 's01' come from the file names, not from the text inside the files. I know there are ways of excluding common words like "the", but I would rather not sort and count file names at all.
tags: command-line, sort, search, database
When you say "command line" that suggests bash, which is probably not best suited to your task. – jdwolf, Jan 22 at 1:41

The link doesn't work. Are you just looking for the top ten file names, or words in the files? – Guy, Jan 22 at 3:45

idownvotedbecau.se/nocode – Murphy, Jan 22 at 13:18

I added another link. It worked when I tested it. – dpoiesz, Jan 22 at 16:54

grep --no-filename; or use cat instead of grep, like cat /data/*/*/s*/*txt – drewbenn, Jan 22 at 17:06
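A sketch of what that last comment suggests (assuming GNU grep, where -h is the short form of --no-filename), applied to the pipeline from the question so the file-name prefix never enters the word counts:

grep -rh . * | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10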
asked Jan 22 at 1:36 by dpoiesz; edited Apr 12 at 0:27 by Jeff Schaller
1 Answer
grep will show the filename of each file that matches the pattern, along with the line that contains the match, whenever more than one file is searched, which is what's happening in your case.
Instead of using grep (which is an inspired but slow workaround for not being able to cat all the files on the command line in one go), you may actually cat all the text files together and process them as one big document, like this:

find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | head
I've added -s to tr so that multiple consecutive newlines are squeezed into one, and I change all non-alphanumeric characters to newlines ([\n*] made little sense to me). The head command produces ten lines of output by default, so -10 (or -n 10) is not needed.
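To see what -s buys you (a small illustration, not part of the original answer; the sample string is made up):

# without -s: every space or punctuation character becomes its own newline, leaving blank lines
printf 'foo, bar!! baz' | tr -c '[:alnum:]' '\n'
# with -s: runs of newlines are squeezed into one, so you get exactly one word per line
printf 'foo, bar!! baz' | tr -cs '[:alnum:]' '\n'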
The find command finds all regular files (-type f) anywhere under /data whose filenames match the pattern *.txt. For as many of those files at a time as possible, cat is invoked to concatenate them (this is what -exec cat {} + does). cat may be invoked many times if you have a huge number of files, but that does not affect the rest of the pipeline, which simply reads the output stream from find + cat.
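For comparison (a sketch; both forms are standard find syntax), the + terminator is what keeps the number of cat processes small:

# one cat invocation per file -- roughly a million processes here
find /data -type f -name '*.txt' -exec cat {} \;
# as few cat invocations as possible -- what the pipeline above uses
find /data -type f -name '*.txt' -exec cat {} +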
To avoid counting empty lines, you may want to insert sed '/^ *$/d' just before or just after the first sort in the pipeline.
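Put together, that would look like this (a sketch of that suggestion; placing the filter right after tr means empty tokens never reach sort):

find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sed '/^ *$/d' |
sort | uniq -c | sort -nr | head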
answered Feb 14 at 21:30 by Kusalananda (accepted); edited Feb 15 at 21:34