Using a single command-line command, how would I search every text file in a database to find the 10 most used words?

This answered question explains how to search and sort within a specific file, but how would you accomplish this for an entire directory tree? I have 1 million text files I need to search for the ten most frequently used words.



database= /data/000/0000000/s##_date/*.txt - /data/999/0999999/s##_data/*txt



Everything I have attempted results in sorting filenames, paths, or directory errors.



I have made some progress with grep, but parts of filenames seem to appear in my results.



grep -r . * | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10
output:
1145
253 txt
190 s01
132 is
126 of
116 the
108 and
104 test
92 with
84 in


The 'txt' and 's01' come from the file names, not from the text inside the files. I know there are ways of excluding common words like "the", but I would rather not sort and count file names at all.
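(For illustration, one way such common words could be filtered out is a stop-word list. A sketch, assuming a hypothetical hand-made stopwords.txt with one word per line and a placeholder input file:

# -v drops matches, -w matches whole words, -F treats the stop words literally, -f reads them from stopwords.txt
tr -c '[:alnum:]' '[\n*]' < somefile.txt | grep -vwFf stopwords.txt | sort | uniq -c | sort -nr | head -10

Here somefile.txt and stopwords.txt are only placeholders; the actual problem is the file names leaking into the counts.)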







asked Jan 22 at 1:36 by dpoiesz (edited Apr 12 at 0:27 by Jeff Schaller)







  • When you say "command line" that suggests bash which is probably not best suited to your task. – jdwolf, Jan 22 at 1:41
  • The link doesn’t work. Are you just looking for the top ten file names, or words in the files? – Guy, Jan 22 at 3:45
  • idownvotedbecau.se/nocode – Murphy, Jan 22 at 13:18
  • I added another link. It worked when I tested it. – dpoiesz, Jan 22 at 16:54
  • grep --no-filename; or use cat instead of grep, like cat /data/*/*/s*/*txt – drewbenn, Jan 22 at 17:06
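Following drewbenn's suggestion, a sketch of the question's pipeline with file names suppressed (grep's -h / --no-filename option), searching under /data:

# -r recurses into /data, -h omits the file name from each matched line
grep -rh . /data | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10

or, replacing grep with cat and drewbenn's glob:

cat /data/*/*/s*/*txt | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10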













1 Answer























Accepted answer










When more than one file is searched, grep shows the filename of each matching file along with the matching line, which is what's happening in your case.



Instead of using grep (an inspired but slow workaround for not being able to cat all the files on the command line in one go), you may actually cat all the text files together and process the result as one big document, like this:



find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | head


I've added -s to tr so that multiple consecutive newlines are compressed into one, and I change all non-alphanumerics to newlines ('[\n*]' made little sense to me). The head command produces ten lines of output by default, so -10 (or -n 10) is not needed.



The find command finds all regular files (-type f) anywhere under /data whose filenames match the pattern *.txt. For as many of those files as possible at a time, cat is invoked to concatenate them (this is what -exec cat {} + does). cat may be invoked several times if you have a huge number of files, but that does not affect the rest of the pipeline, which just reads the combined output stream from find + cat.




To avoid counting empty lines, you may want to insert sed '/^ *$/d' just before or just after the first sort in the pipeline.
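For example, a sketch of the same pipeline with that filter added before the first sort:

# sed '/^ *$/d' drops lines that are empty or contain only spaces
find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sed '/^ *$/d' | sort | uniq -c | sort -nr | head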






answered Feb 14 at 21:30 by Kusalananda (edited Feb 15 at 21:34)






















             
