I can't figure out how to cut this file and find unique words of a particular section
So there's an access log file named access_log and I'm supposed to find all of the unique files that were accessed on the web server. access_log is formatted like this; this is just an excerpt:
66.249.75.4 - - [14/Dec/2015:08:25:18 -0600] "GET /robots.txt HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.4 - - [14/Dec/2015:08:25:18 -0600] "GET /~robert/class2.cgi HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.75.4 - - [14/Dec/2015:08:30:19 -0600] "GET /~robert/class3.cgi HTTP/1.1" 404 1012 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
202.46.61.93 - - [14/Dec/2015:09:07:34 -0600] "GET / HTTP/1.1" 200 5208 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"
The files, for example "robots.txt" on the first line, come after the word GET, HEAD, or POST. I've tried using the cut command with " as the delimiter, which hasn't worked. I honestly have no idea how to separate the fields in a file like this so I can compare them. If anyone could point me in the right direction, I'd really appreciate it.
Edit: Figured it out, you were right @MichaelHomer. My syntax was off, so that's why cut wasn't working for me. I used space as the delimiter and it worked.
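For reference, a minimal sketch of that cut approach; this assumes the requested path is always the seventh space-separated field, which holds for the sample lines above:
# hypothetical reconstruction: extract field 7 (the path) and de-duplicate
cut -d' ' -f7 access_log | sort -u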
linux cut uniq
asked Apr 10 at 2:19, edited Apr 10 at 2:56 – Michael Kiroff
Space seems like the obvious delimiter here, is there a reason you can't use it?
– Michael Homer
Apr 10 at 2:29
What exactly are you trying to cut from each line? As has been stated, the delimiter looks to me like space. From there, it's just a matter of using awk (which I'd recommend) to print out the name of the file or whatever else you need.
– Nasir Riley
Apr 10 at 2:33
@NasirRiley I'm trying to print out the file or directory it's accessing, like /robots.txt, /~robert/class2.cgi, and /~robert/class3.cgi. Then I need to find how many unique files there are. I don't know a lot about awk; I'm new to this. Could you point me in the right direction?
– Michael Kiroff
Apr 10 at 2:46
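As a quick illustration of the field numbering the commenters are referring to: awk splits each line on whitespace by default and exposes the pieces as $1, $2, and so on.
# prints "two", the second whitespace-separated field
echo 'one two three' | awk '{print $2}'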
2 Answers
Here's a walk-through on the sample that you've provided.
awk prints out the columns and lines that you specify; I suggest reviewing the man page and searching online for more reference. In your case the delimiter is space, which separates the columns. The exact positions will vary because each line you've shown contains different text, which shifts the columns around, but for your first three lines you can begin with the following:
cat access_log | awk 'NR==1,NR==3 {print $7}' | sort -u
NR==1,NR==3
Prints out lines 1 through 3
{print $7}
Prints out the seventh column, which is the file name that you need. Keep in mind that it won't always be the seventh column, because the text in each line may differ.
sort -u
Prints out unique values
The output is:
/robots.txt
/~robert/class2.cgi
/~robert/class3.cgi
The last part, sort -u, won't have any effect on your sample because there are no duplicates, but if the rest of your file has any then it will only print the unique values in that column.
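As a variation not in the original answer: if you want to cover the whole file rather than just the first three lines, you can key on the request method instead of hard-coded line numbers. This assumes the method is always field 6, with its leading quote attached, as in the sample:
# match lines whose sixth field is "GET, "HEAD, or "POST, then de-duplicate the paths
awk '$6 ~ /^"(GET|HEAD|POST)$/ {print $7}' access_log | sort -u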
If you just want to print the filename, you can use the substr function with the awk command:
cat access_log | awk 'NR==1 {print substr($7,2,10)} NR==2,NR==3 {print substr($7,10,10)}'
The output will be:
robots.txt
class2.cgi
class3.cgi
To explain:
NR==1 {print substr($7,2,10)}
For the first line in field 7, starting at the 2nd position, it prints out 10 characters.
NR==2,NR==3 {print substr($7,10,10)}
For the second through third lines in field 7, starting at the tenth position, it prints out 10 characters.
You'll probably have to modify the columns and values, as the rest of your file is probably different and won't always line up in the same position, but that should get you started. It seems like quite a bit to take in, but a little research will get you going in the right direction.
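Since those hard-coded substr offsets only fit this exact sample, a rough sketch (my addition, not part of the original answer) that strips everything up to the last slash instead:
# split the path on "/" and print the last component, skipping bare "/" requests
awk '{n = split($7, parts, "/"); if (parts[n] != "") print parts[n]}' access_log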
answered Apr 10 at 3:52 – Nasir Riley
An alternative that will give you a count of each unique file hit:
awk '{print $7}' access_log | sort | uniq -c | sort -rn
Or, if you want hits on a specific day, you can grep for the date first:
fgrep "14/Dec/2015" access_log | awk '{print $7}' | sort | uniq -c | sort -rn
Somewhat relevant: you can also use the above to find unique visitors (at least unique IPs, anyway) to your site by changing the print from $7 to $1. I personally use the same commands when my sites are being DoS'd to find which IPs to block from the network.
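And if you only need the number of distinct files rather than a per-file hit count (an addition on my part, not in the original answer), pipe the unique list into wc -l:
# count how many distinct paths were requested
awk '{print $7}' access_log | sort -u | wc -l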
answered Apr 10 at 13:22 – RobotJohnny