I can't figure out how to cut this file and find unique words of a particular section

So there's an access log file named access_log, and I'm supposed to find all of the unique files that were accessed on the web server. access_log is formatted like this (this is just an excerpt):



66.249.75.4 - - [14/Dec/2015:08:25:18 -0600] "GET /robots.txt HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.4 - - [14/Dec/2015:08:25:18 -0600] "GET /~robert/class2.cgi HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.4 - - [14/Dec/2015:08:30:19 -0600] "GET /~robert/class3.cgi HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
202.46.61.93 - - [14/Dec/2015:09:07:34 -0600] "GET / HTTP/1.1" 200 5208 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"


The files, for example "robots.txt" on the first line, come after the word GET, HEAD, or POST. I've tried the cut command using " as the delimiter, which hasn't worked. I honestly have no idea how to separate the fields in a file like this so I can compare them. If anyone could point me in the right direction, I'd really appreciate it.



Edit: Figured it out; you were right, @MichaelHomer. My syntax was off, which is why cut wasn't working for me. I used space as the delimiter and it worked.
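
For reference, a minimal sketch of what that presumably looked like (assuming the request path is the seventh space-separated field, as in the excerpt above):

cut -d' ' -f7 access_log | sort -u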







asked Apr 10 at 2:19, edited Apr 10 at 2:56 – Michael Kiroff

  • Space seems like the obvious delimiter here; is there a reason you can't use it?
    – Michael Homer
    Apr 10 at 2:29










  • What exactly are you trying to cut from each line? As has been stated, the delimiter looks to me like space. From there, it's just a matter of using awk (which I'd recommend) to print the name of the file or whatever else you need.
    – Nasir Riley
    Apr 10 at 2:33










  • @NasirRiley I'm trying to print out the file or directory it's accessing, like /robots.txt, /~robert/class2.cgi, and /~robert/class3.cgi. Then I need to find how many unique files there are. I don't know a lot about awk; I'm new to this. Could you point me in the right direction?
    – Michael Kiroff
    Apr 10 at 2:46














2 Answers
Here's a walk-through on the sample that you've provided.



awk prints out the columns and lines that you specify; I suggest reviewing the man page for more detail. In your case the delimiter is space, which separates each column. The positioning will vary, because each line you've provided so far has different text, which shifts the columns around; but for your first three lines, you can begin with the following:



cat access_log | awk 'NR==1,NR==3 {print $7}' | sort -u


NR==1,NR==3 selects lines 1 through 3.



{print $7} prints out the seventh column, which is the file name that you need. Keep in mind that it won't always be the seventh column, because the text in each line may be different.



sort -u prints out only the unique values.



The output is:



/robots.txt
/~robert/class2.cgi
/~robert/class3.cgi


The sort at the end won't have any effect on your sample because there are no duplicates, but if the rest of your file has any, it will print each value in that column only once.



If you just want to print the file name, you can use the substr function with the awk command:



cat access_log | awk 'NR==1 {print substr($7,2,10)} NR==2,NR==3 {print substr($7,10,10)}'


The output will be:



robots.txt
class2.cgi
class3.cgi


To explain:



NR==1 {print substr($7,2,10)}: for the first line, it prints 10 characters of field 7, starting at the 2nd character.



NR==2,NR==3 {print substr($7,10,10)}: for the second and third lines, it prints 10 characters of field 7, starting at the 10th character.



You'll probably have to modify the columns and values, since the rest of your file is probably different and won't always line up in the same positions, but that should get you started. It seems like quite a bit to take in, but a little research will get you going in the right direction.
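
If you'd rather not depend on the request always landing in the seventh field, here's a sketch of a more position-independent variant (assuming the log keeps the Common Log Format shown above, where the request is the first double-quoted field): split each line on the double quote, then take the second word of the request field.

# With " as the field separator, $2 is e.g.: GET /robots.txt HTTP/1.1
# split() breaks it on spaces, so req[2] is the path.
awk -F'"' '{split($2, req, " "); print req[2]}' access_log | sort -u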






answered Apr 10 at 3:52 – Nasir Riley

An alternative that will give you a count of each unique file hit:



awk '{print $7}' access_log | sort | uniq -c | sort -rn



Or, if you want hits on a specific day, you can grep for the date first:



    fgrep "14/Dec/2015" access_log | awk 'print $7' | sort | uniq -c | sort -rn



Somewhat relevant: you can use the above to also find unique visitors (at least unique IPs, anyway) to your site by changing the print from $7 to $1. I personally use the same commands when my sites are being DoS'd to find which IPs to block from the network.
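
For example, a sketch of that variant counting requests per client IP (the head -5 is just there to keep the output short):

awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -5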






answered Apr 10 at 13:22 – RobotJohnny






















                       
