I can't figure out how to cut this file and find unique words of a particular section
So there's an access log file named access_log and I'm supposed to find all of the unique files that were accessed on the web server. access_log is formatted like this; this is just an excerpt:
66.249.75.4 - - [14/Dec/2015:08:25:18 -0600] "GET /robots.txt HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.4 - - [14/Dec/2015:08:25:18 -0600] "GET /~robert/class2.cgi HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.4 - - [14/Dec/2015:08:30:19 -0600] "GET /~robert/class3.cgi HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
202.46.61.93 - - [14/Dec/2015:09:07:34 -0600] "GET / HTTP/1.1" 200 5208 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
The files, for example "robots.txt" on the first line, come after the word GET, HEAD, or POST. I've tried using the cut command with " as the delimiter, which hasn't worked. I honestly have no idea how to separate the fields in a file like this so I can compare them. If anyone could point me in the right direction, I'd really appreciate it.
Edit: Figured it out, you were right @MichaelHomer. My syntax was off, so that's why cut wasn't working for me. I used space as the delimiter and it worked.
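For reference, a minimal sketch of that cut approach; this assumes the requested path is always the seventh space-separated field, which holds for the sample lines above:
# hypothetical reconstruction: extract field 7 (the path) and de-duplicate
cut -d' ' -f7 access_log | sort -u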
linux cut uniq
asked Apr 10 at 2:19, edited Apr 10 at 2:56 – Michael Kiroff
Space seems like the obvious delimiter here, is there a reason you can't use it?
– Michael Homer
Apr 10 at 2:29
What exactly are you trying to cut from each line? As has been stated, the delimiter looks to me like space. From there, it's just a matter of using awk (which I'd recommend) to print out the name of the file or whatever else you need.
– Nasir Riley
Apr 10 at 2:33
@NasirRiley I'm trying to print out the file or directory it's accessing, like /robots.txt, /~robert/class2.cgi, and /~robert/class3.cgi. Then I need to find how many unique files there are. I don't know a lot about awk; I'm new to this. Could you point me in the right direction?
– Michael Kiroff
Apr 10 at 2:46
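As a quick illustration of the field numbering the commenters are referring to: awk splits each line on whitespace by default and exposes the pieces as $1, $2, and so on.
# prints "two", the second whitespace-separated field
echo 'one two three' | awk '{print $2}'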
2 Answers
Here's a walk-through on the sample that you've provided.
awk prints out the columns and lines that you specify; I suggest reviewing the man page and searching online for more reference. In your case the delimiter is space, which separates the columns. The exact positions will vary because each line you've shown contains different text, which shifts the columns around, but for your first three lines you can begin with the following:
cat access_log | awk 'NR==1,NR==3 {print $7}' | sort -u
NR==1,NR==3
Prints out lines 1 through 3
{print $7}
Prints out the seventh column, which is the file name that you need. Keep in mind that it won't always be the seventh column, because the text in each line may differ.
sort -u
Prints out unique values
The output is:
/robots.txt
/~robert/class2.cgi
/~robert/class3.cgi
The last part, sort -u, won't have any effect on your sample because there are no duplicates, but if the rest of your file has any then it will only print the unique values in that column.
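As a variation not in the original answer: if you want to cover the whole file rather than just the first three lines, you can key on the request method instead of hard-coded line numbers. This assumes the method is always field 6, with its leading quote attached, as in the sample:
# match lines whose sixth field is "GET, "HEAD, or "POST, then de-duplicate the paths
awk '$6 ~ /^"(GET|HEAD|POST)$/ {print $7}' access_log | sort -u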
If you just want to print the filename, you can use the substr function with the awk command:
cat access_log | awk 'NR==1 {print substr($7,2,10)} NR==2,NR==3 {print substr($7,10,10)}'
The output will be:
robots.txt
class2.cgi
class3.cgi
To explain:
NR==1 {print substr($7,2,10)}
For the first line in field 7, starting at the 2nd position, it prints out 10 characters.
NR==2,NR==3 {print substr($7,10,10)}
For the second through third lines in field 7, starting at the tenth position, it prints out 10 characters.
You'll probably have to modify the columns and values, as the rest of your file is probably different and won't always line up in the same position, but that should get you started. It seems like quite a bit to take in, but a little research will get you going in the right direction.
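Since those hard-coded substr offsets only fit this exact sample, a rough sketch (my addition, not part of the original answer) that strips everything up to the last slash instead:
# split the path on "/" and print the last component, skipping bare "/" requests
awk '{n = split($7, parts, "/"); if (parts[n] != "") print parts[n]}' access_log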
answered Apr 10 at 3:52 – Nasir Riley
An alternative that will give you a count of each unique file hit:
awk '{print $7}' access_log | sort | uniq -c | sort -rn
Or, if you want hits on a specific day, you can grep for the date first:
fgrep "14/Dec/2015" access_log | awk '{print $7}' | sort | uniq -c | sort -rn
Somewhat relevant: you can also use the above to find unique visitors (at least unique IPs, anyway) to your site by changing the print from $7 to $1. I personally use the same commands when my sites are being DoS'd to find which IPs to block from the network.
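And if you only need the number of distinct files rather than a per-file hit count (an addition on my part, not in the original answer), pipe the unique list into wc -l:
# count how many distinct paths were requested
awk '{print $7}' access_log | sort -u | wc -l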
answered Apr 10 at 13:22 – RobotJohnny