Count distinct values of a field in a file

11 votes
I have a file containing around a million lines. Each line has a field called transactionid, whose values repeat. I need to count the distinct values.

No matter how many times a value is repeated, it should be counted only once.

edited Nov 20 at 22:36 by Rui F Ribeiro
asked Jan 11 '12 at 14:08 by Olgun Kaya

It would be easier if you could give a glimpse of the file's format, not necessarily the data.
– Nikhil Mulley
Jan 11 '12 at 14:20

By the way, do you want each value counted as 1 regardless of how many times it occurs, or do you want the number of occurrences of each value? If each value should be counted only once, how should the distinct values be counted? Please check my edit to your question and confirm that I interpreted it correctly.
– Nikhil Mulley
Jan 11 '12 at 14:27

@Nikhil This is clear from the question: ... No matter how many times a value is repeated, it should be counted as 1. ...
– user13742
Jan 11 '12 at 14:28

OK, then the answer from @hesse should meet your need.
– Nikhil Mulley
Jan 11 '12 at 14:30

Sorry for the delay; I had no internet connection. The separator is '|' and the field is field 28. I used: cat <file_name> | awk -F"|" '{if (substr($2,1,8)=="20120110") print $28}' | sort -u | wc -l. The if clause is an additional check on the date, as is probably obvious :)
– Olgun Kaya
Jan 12 '12 at 6:29
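
The pipeline in the comment above can be tried on a small sample. This is only a sketch under assumptions: the four-field layout, the sample values, and the variable names `sample` and `filtered` are made up for illustration, with field 2 holding a timestamp and field 4 standing in for field 28.

```shell
# Hypothetical miniature of the asker's file: '|' separator,
# field 2 is a timestamp, field 4 stands in for transactionid.
sample=$(mktemp)
printf '%s\n' \
  'r1|20120110093000|x|tx1' \
  'r2|20120110101500|x|tx2' \
  'r3|20120110113000|x|tx1' \
  'r4|20120111080000|x|tx9' > "$sample"

# Keep only rows dated 20120110, print the id field,
# collapse duplicates with sort -u, then count the remaining lines.
filtered=$(awk -F '|' 'substr($2, 1, 8) == "20120110" { print $4 }' "$sample" \
  | sort -u | wc -l | tr -d ' ')
echo "$filtered"

rm -f "$sample"
```

Reading the file directly with awk also makes the leading cat unnecessary.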

text-processing awk


3 Answers

17 votes, accepted

Assuming your file is a text file whose fields are separated by commas, and that you know the position of the transactionid field (say it is the 7th field):

awk -F ',' '{print $7}' text_file | sort | uniq -c

This prints each distinct value of the 7th field, prefixed with the number of times it occurs. To get only the number of distinct values, pipe the result to wc -l.
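
As a rough illustration of the difference between per-value counts and the number of distinct values, here is a small sketch. The sample data and the variable names `sample` and `distinct` are assumptions made for the example; only the awk, sort, uniq and wc calls come from the answers on this page.

```shell
# Small comma-separated sample; field 7 plays the role of transactionid.
sample=$(mktemp)
printf '%s\n' \
  'a,b,c,d,e,f,tx1' \
  'a,b,c,d,e,f,tx2' \
  'a,b,c,d,e,f,tx1' \
  'a,b,c,d,e,f,tx3' \
  'a,b,c,d,e,f,tx2' > "$sample"

# Per-value occurrence counts: one output line per distinct value.
awk -F ',' '{print $7}' "$sample" | sort | uniq -c

# Number of distinct values: deduplicate with sort -u, then count lines.
# tr strips the padding that some wc implementations add.
distinct=$(awk -F ',' '{print $7}' "$sample" | sort -u | wc -l | tr -d ' ')
echo "$distinct"

rm -f "$sample"
```

Here sort -u does the work of sort | uniq in a single step.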

edited May 1 '17 at 23:06 by phk
answered Jan 11 '12 at 14:21 by Nikhil Mulley


3 votes

There is no need to sort the file (uniq requires its input to be sorted).

This awk script assumes the field is the first whitespace-delimited field.

awk 'a[$1] == "" { a[$1]="X" } END { print length(a) }' file
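
A possible variant of the same idea, sketched under assumptions: it takes the separator and field number as parameters and keeps its own counter instead of calling length() on the array, which some older awk implementations do not support. The names `seen`, `col`, `n`, `sample` and `count`, and the sample data, are illustrative and not from the original answer.

```shell
# Hypothetical '|'-separated sample; field 3 stands in for transactionid.
sample=$(mktemp)
printf '%s\n' \
  'r1|20120110|tx1' \
  'r2|20120110|tx2' \
  'r3|20120110|tx1' \
  'r4|20120111|tx3' > "$sample"

# seen[$col]++ is 0 (false) the first time a value appears, so the
# pattern !seen[$col]++ matches exactly once per distinct value.
# n + 0 forces a numeric 0 to be printed for an empty input.
count=$(awk -F '|' -v col=3 '!seen[$col]++ { n++ } END { print n + 0 }' "$sample")
echo "$count"

rm -f "$sample"
```

Like the script above, this holds every distinct value in memory, so it trades memory for skipping the sort.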

edited Jan 11 '12 at 14:57
answered Jan 11 '12 at 14:30 by Peter.O

For a huge file (as in, getting close to the size of RAM), awk will consume a lot of memory. Most sort implementations are designed to cope well with huge files.
– Gilles
Jan 12 '12 at 1:59


2 votes

Maybe not the sleekest method, but this should work:

awk '{print $1}' your_file | sort | uniq | wc -l

where the 1 in $1 is the number of the field to be counted.

edited Jan 11 '12 at 14:26
answered Jan 11 '12 at 14:18 by user13742
