Count distinct values of a field in a file
Question (score 11):
I have a file containing around a million lines. Each line has a field called transactionid, which has repetitive values. What I need to do is count the distinct values:
no matter how many times a value is repeated, it should be counted only once.
text-processing awk
It would be easier if you could give a glimpse of the format of the file, not necessarily the data.
– Nikhil Mulley
Jan 11 '12 at 14:20
By the way, do you want each value to be counted as 1 irrespective of how many times it occurs, or do you want the number of occurrences/repetitions? If each value should be counted only once, then what is being counted is the number of distinct values. Can you please check my edit on your question and confirm that I have interpreted it correctly?
– Nikhil Mulley
Jan 11 '12 at 14:27
@Nikhil This is clear from the question: "...no matter how many times a value is repeated, it should be counted as 1..."
– user13742
Jan 11 '12 at 14:28
OK, then the answer from @hesse should do what you need.
– Nikhil Mulley
Jan 11 '12 at 14:30
Sorry for the latency, I was without an internet connection. The separator is '|' and the field is field 28. I used:
cat <file_name> | awk -F'|' '{ if (substr($2,0,8)=="20120110") print $28 }' | sort -u | wc -l
The if clause was an additional date check, as is probably obvious :)
– Olgun Kaya
Jan 12 '12 at 6:29
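A runnable sketch of the pipeline described in that comment, against a hypothetical pipe-delimited sample in which field 2 carries a date stamp and field 28 the transactionid (file name and record layout are made up here). Note that awk string indices start at 1, so `substr($2,1,8)` is the portable spelling of the date prefix:

```shell
# Build 28-field sample records: date stamp in field 2, transactionid in field 28.
mkline() {
  printf 'x|%s' "$1"                                                    # fields 1-2
  i=3; while [ "$i" -lt 28 ]; do printf '|f%d' "$i"; i=$((i+1)); done   # fields 3-27
  printf '|%s\n' "$2"                                                   # field 28
}
{ mkline 20120110120000 T1
  mkline 20120110130000 T2
  mkline 20120110140000 T1
  mkline 20120111090000 T3; } > sample.psv

# Distinct transactionids among the 2012-01-10 records (2 for this sample):
awk -F'|' 'substr($2,1,8) == "20120110" { print $28 }' sample.psv | sort -u | wc -l
```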
edited Nov 20 at 22:36
Rui F Ribeiro
asked Jan 11 '12 at 14:08
Olgun Kaya
3 Answers
Accepted answer (score 17):
OK, assuming that your file is a text file with fields separated by a comma (','), and that you know the position of the transactionid field. Assuming transactionid is the 7th field:
awk -F ',' '{ print $7 }' text_file | sort | uniq -c
This prints each distinct value of the 7th field along with the number of times it occurs.
answered Jan 11 '12 at 14:21 by Nikhil Mulley; edited May 1 '17 at 23:06 by phk
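Note that `uniq -c` reports a per-value occurrence count; if the goal is the single number of distinct transactionids, ending the pipeline with `sort -u | wc -l` gives it directly. A minimal sketch (the file name `data.csv` and its layout are made up):

```shell
# Hypothetical sample: field 7 is transactionid.
printf '%s\n' \
  'a,b,c,d,e,f,T100' \
  'a,b,c,d,e,f,T101' \
  'a,b,c,d,e,f,T100' > data.csv

# Occurrences per value (what `uniq -c` shows):
awk -F ',' '{ print $7 }' data.csv | sort | uniq -c

# Number of distinct values (2 for this sample):
awk -F ',' '{ print $7 }' data.csv | sort -u | wc -l
```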
Answer (score 3):
There is no need to sort the file (uniq requires sorted input; this approach avoids uniq entirely). This awk script assumes the field is the first whitespace-delimited field:
awk 'a[$1] == "" { a[$1] = "X" } END { print length(a) }' file
answered Jan 11 '12 at 14:30 by Peter.O; edited Jan 11 '12 at 14:57
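One portability note: `length()` applied to a whole array is a gawk extension rather than POSIX awk. A variant of the same idea that works in any awk keeps an explicit counter, incremented only on the first occurrence of each key (the sample file name is made up):

```shell
printf 'T100 x\nT101 y\nT100 z\n' > txns.txt   # hypothetical sample data

# !seen[$1]++ is true only the first time a key appears,
# so n ends up equal to the number of distinct values in field 1.
awk '!seen[$1]++ { n++ } END { print n }' txns.txt   # prints 2
```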
For a huge file (as in, one approaching the size of RAM), awk will consume a lot of memory. Most sort
implementations are designed to cope well with huge files.
– Gilles
Jan 12 '12 at 1:59
Answer (score 2):
Maybe not the sleekest method, but this should work:
awk '{ print $1 }' your_file | sort | uniq | wc -l
where $1 is replaced with the number of the field to be parsed.
answered Jan 11 '12 at 14:18 by user13742; edited Jan 11 '12 at 14:26