Using a single command-line command, how would I search every text file in a database to find the 10 most used words?

This answered question explains how to search and sort a specific file, but how would you accomplish this for an entire directory? I have 1 million text files I need to search for the ten most frequently used words.



database= /data/000/0000000/s##_date/*.txt - /data/999/0999999/s##_data/*txt



Everything I have attempted results in sorting filenames, paths, or directory errors.



I have made some progress with grep, but parts of filenames seem to appear in my results.



grep -r . * | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10
output:
1145
253 txt
190 s01
132 is
126 of
116 the
108 and
104 test
92 with
84 in


The 'txt' and 's01' come from file names and not from the text inside the text files. I know there are ways of excluding common words like "the", but I would rather not sort and count file names at all.







edited Apr 12 at 0:27 by Jeff Schaller

asked Jan 22 at 1:36 by dpoiesz


  • When you say "command line" that suggests bash which is probably not best suited to your task. – jdwolf, Jan 22 at 1:41

  • The link doesn’t work. Are you just looking for the top ten file names, or words in the files? – Guy, Jan 22 at 3:45

  • idownvotedbecau.se/nocode – Murphy, Jan 22 at 13:18

  • I added another link. It worked when I tested it. – dpoiesz, Jan 22 at 16:54

  • grep --no-filename; or use cat instead of grep, like cat /data/*/*/s*/*txt – drewbenn, Jan 22 at 17:06
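
For reference, a minimal sketch of what that last comment suggests, assuming GNU grep (whose -h/--no-filename option suppresses the file-name prefix and whose --include option limits the recursive search to *.txt files):

# Match every non-empty line (pattern '.') in the *.txt files under /data,
# without printing file names, then count and rank the words.
grep -rh --include='*.txt' . /data |
    tr -cs '[:alnum:]' '\n' |
    sort | uniq -c | sort -nr | head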
















1 Answer

When more than one file is searched, grep prefixes each matching line with the name of the file the match came from, which is what's happening in your case.



Instead of using grep (an inspired but slow workaround for not being able to cat all the files on one command line in one go), you may actually cat all the text files together and process them as one big document, like this:



find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | head


I've added -s to tr so that multiple consecutive newlines are squeezed into one, and I changed all non-alphanumerics to plain newlines (the '[\n*]' notation made little sense to me). The head command produces ten lines of output by default, so -10 (or -n 10) is not needed.
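
For instance, a small illustration of the squeeze behaviour (assuming a tr that understands the \n escape, as POSIX tr does):

printf 'the cat,,, the dog... and the cat\n' | tr -cs '[:alnum:]' '\n'
# prints each word on its own line: the, cat, the, dog, and, the, cat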



The find command finds all regular files (-type f) anywhere under /data whose filename matches the pattern *.txt. For as many of those files as possible at a time, cat is invoked to concatenate them (this is what -exec cat {} + does). cat may be invoked many times if you have a huge number of files, but that does not affect the rest of the pipeline, as it simply reads the output stream from find+cat.




To avoid counting empty lines, you may want to insert sed '/^ *$/d' just before or just after the first sort in the pipeline.
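
Putting it together, a minimal sketch of the full pipeline with that empty-line filter in place (same /data layout as in the question):

# Concatenate every *.txt under /data, split the text into one word per line,
# drop empty lines, then count and rank the words.
find /data -type f -name '*.txt' -exec cat {} + |
    tr -cs '[:alnum:]' '\n' |
    sed '/^ *$/d' |
    sort | uniq -c | sort -nr | head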






edited Feb 15 at 21:34

answered Feb 14 at 21:30 by Kusalananda