Count by first column, count distinct by second column and group output by first column?
I am new to the Unix command awk. I need a Unix command that will read a CSV file (over 700M rows) like the sample below:
A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12
The command should count the number of occurrences of each entry in the first column and the number of distinct values in the second column for that entry, grouping the output by the first column, so that the output looks like this:
A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1
text-processing awk csv
looks like a tsort feed – mikeserv, Jul 28 at 8:54
asked Jul 27 at 15:48 by Dami Femi, edited Jul 28 at 1:08 by slm
4 Answers
To get the first two columns of the output:
$ cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }'
A,3
B,2
C,2
D,1
This extracts the first column of the original file, sorts it, and counts the number of duplicated entries. The awk at the end just swaps the columns and inserts a comma between them.
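For context, uniq -c prints each count before its key, padded with leading blanks, which is why the trailing awk has to swap the fields and re-delimit them. A minimal demonstration (the three-line printf input is made up for illustration):

```shell
# uniq -c emits "<count> <key>"; awk reorders that into "key,count"
printf 'A\nA\nB\n' | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }'
# A,2
# B,1
```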
The final column can be obtained with
$ sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }'
2
2
1
1
This sorts the original data and discards the duplicates. Then the first column is extracted and the number of duplicates of that is counted. The awk at the end extracts the counts only.
Combining these using bash and paste:
$ paste -d, <( cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
      <( sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1
If you pre-sort the data, this may be shortened slightly (and sped up considerably):
$ sort -o file file
$ paste -d, <( cut -d, -f1 <file | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
      <( uniq <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1
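For a file this large, the whole computation can also be sketched as a single awk pass, which avoids sorting the full 700M rows (a sketch, not part of the original answer; it assumes the two-field ", "-separated format shown in the question and enough memory for one array entry per distinct key/value pair):

```shell
# Build the sample input from the question
cat > sample.csv <<'EOF'
A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12
EOF

# One pass: cnt[k] counts rows per key; uniq[k] counts distinct
# column-2 values per key, using seen[] to skip repeated pairs
awk -F', ' '
  { cnt[$1]++
    if (!seen[$1 SUBSEP $2]++) uniq[$1]++ }
  END { for (k in cnt) print k ", " cnt[k] ", " uniq[k] }
' sample.csv | sort
# A, 3, 2
# B, 2, 2
# C, 2, 1
# D, 1, 1
```

Only per-key state is held in memory, so the cost scales with the number of distinct pairs rather than the row count; the final sort merely orders the handful of output lines.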
answered Jul 27 at 17:54 by Kusalananda, edited Jul 27 at 18:22
With a small sqlite3 script run from the command line, where input.csv is your input data:
sqlite3 -batch <<EOF
.mode csv
CREATE TABLE data (letter TEXT, number INTEGER);
.import input.csv data
SELECT letter, COUNT(*) AS lcount, COUNT(DISTINCT number) AS dcount
FROM data
GROUP BY letter ;
EOF
Saved as query.sqlite, this runs like so:
$ bash query.sqlite
A,3,2
B,2,2
C,2,1
D,1,1
answered Jul 30 at 21:09 by Sergiy Kolodyazhnyy, edited Jul 30 at 21:35
I wanted to see if this could be solved using a Perl one-liner, which I was able to figure out:
$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
      END { print "$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n" for sort keys %lcnt }' file
A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1
Breakdown
looping over a file
This one-liner might seem overly complicated, but it's actually pretty straightforward once you break it down. At the heart of it is this mechanism in Perl:
$ perl -F, -ane '...; END { ... }' file
This tells Perl to read the file file, looping over it and auto-splitting each line on the , separator given by -F,; when the input is exhausted, it runs the END { .. } block once and exits.
For example:
$ perl -F, -ane 'print "arg1: $F[0] arg2: $F[1]"; END { print "DONE\n" }' file
arg1: A arg2: 10
arg1: B arg2: 11
arg1: C arg2: 12
arg1: A arg2: 10
arg1: B arg2: 12
arg1: D arg2: 10
arg1: A arg2: 12
arg1: C arg2: 12
DONE
NOTE: Perl's auto-split feature automatically puts the columns into the array @F; here I'm using elements 1 & 2, $F[0] & $F[1].
Counting things
The next thing we need to do is count various bits of the input. For this we'll enlist the power of hashes in Perl. We're going to use two, %lcnt and %ccnt.
NOTE: One of the more annoying things with Perl is the switch of notation between defining a hash and accessing it: when we access it we switch from %lcnt to $lcnt{"A"}, but I digress.
$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
%lcnt - count of characters from the 1st column
%ccnt - 2-dimensional hash, whose 2 coordinates access a count for each (1st column, 2nd column) pair
NOTE: Counting things in this manner gives us the "distinct" behaviour for free, simply through how we're counting the bits.
For example, let's examine the contents of the %lcnt hash:
$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
      END { print "key: $_\n" for sort keys %lcnt }' file
key: A
key: B
key: C
key: D
If we want to see the value for each hash key:
$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
      END { print "key: $_ val: $lcnt{$_}\n" for sort keys %lcnt }' file
key: A val: 3
key: B val: 2
key: C val: 2
key: D val: 1
NOTE: Here we can see that $lcnt{$F[0]}++ has done all the hard work of counting each character as we've looped through the file and added them to the hash %lcnt.
This is the END
The last piece of the puzzle is to display all this collected information in a useful way. For that we'll use this in the END { ... }:
print "$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n" for sort keys %lcnt
This loops through the list of keys from %lcnt and prints the following line:
"$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n"
If it's hard to see the above's structure, here it is more generally:
A, 3, 2
      ^--- (keys %{ $ccnt{$_} })  ## count of unique column-2 values for each character ($_)
   ^------ $lcnt{$_}              ## count of each character
^--------- $_                     ## character
This produces a line containing the character ($_), the count value for this character ($lcnt{$_}), and then the count of unique values from the 2nd column for that character.
References
- How can I find the number of keys in a hash in Perl?
- How do I concatenate a string?
- The top 10 tricks of Perl one-liners
- Perl One-Liners Explained, Part I: File Spacing
- 9.4. Hashes of Hashes
- Multi dimensional hashes in Perl
- How can I get the unique keys from two hashes?
- slmingol/README.md
answered Jul 28 at 18:19 by slm
datamash -t, -s -g 1 count 1 countunique 2 < input.txt
Input
A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12
Output
A,3,2
B,2,2
C,2,1
D,1,1
answered Jul 30 at 20:16 by MiniMax