Count by first column, count distinct by second column and group output by first column?
I am new to the Unix command awk. I need a Unix command that will read a CSV file (over 700M rows) like the sample below:
A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12
The command should count the number of occurrences of each entry in the first column and the number of distinct values in the second column for that entry, grouping the output by the first column, so that the output looks like this:
A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1
text-processing awk csv
looks like a tsort feed – mikeserv, Jul 28 at 8:54
asked Jul 27 at 15:48 by Dami Femi, edited Jul 28 at 1:08 by slm
4 Answers
To get the first two columns of the output:
$ cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }'
A,3
B,2
C,2
D,1
This extracts the first column of the original file, sorts it, and counts the number of duplicated entries. The awk at the end just swaps the columns and inserts a comma between them.
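For context, uniq -c prints each count before its key, padded with leading blanks, which is why the trailing awk has to swap the fields and re-delimit them. A minimal demonstration (the three-line printf input is made up for illustration):

```shell
# uniq -c emits "<count> <key>"; awk reorders that into "key,count"
printf 'A\nA\nB\n' | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }'
# A,2
# B,1
```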
The final column can be obtained with
$ sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }'
2
2
1
1
This sorts the original data and discards the duplicates. Then the first column is extracted and the number of duplicates of that is counted. The awk at the end extracts the counts only.
Combining these using bash and paste:
$ paste -d, <( cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
      <( sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1
If you pre-sort the data, this may be shortened slightly (and sped up considerably):
$ sort -o file file
$ paste -d, <( cut -d, -f1 <file | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
      <( uniq <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1
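For a file this large, the whole computation can also be sketched as a single awk pass, which avoids sorting the full 700M rows (a sketch, not part of the original answer; it assumes the two-field ", "-separated format shown in the question and enough memory for one array entry per distinct key/value pair):

```shell
# Build the sample input from the question
cat > sample.csv <<'EOF'
A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12
EOF

# One pass: cnt[k] counts rows per key; uniq[k] counts distinct
# column-2 values per key, using seen[] to skip repeated pairs
awk -F', ' '
  { cnt[$1]++
    if (!seen[$1 SUBSEP $2]++) uniq[$1]++ }
  END { for (k in cnt) print k ", " cnt[k] ", " uniq[k] }
' sample.csv | sort
# A, 3, 2
# B, 2, 2
# C, 2, 1
# D, 1, 1
```

Only per-key state is held in memory, so the cost scales with the number of distinct pairs rather than the row count; the final sort merely orders the handful of output lines.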
answered Jul 27 at 17:54 by Kusalananda, edited Jul 27 at 18:22
With a small sqlite3 script run from the command line, where input.csv is your input data:
sqlite3 -batch <<EOF
.mode csv
CREATE TABLE data (letter TEXT, number INTEGER);
.import input.csv data
SELECT letter, COUNT(*) AS lcount, COUNT(DISTINCT number) AS dcount
FROM data
GROUP BY letter ;
EOF
Saved as query.sqlite, this runs like so:
$ bash query.sqlite
A,3,2
B,2,2
C,2,1
D,1,1
answered Jul 30 at 21:09 by Sergiy Kolodyazhnyy, edited Jul 30 at 21:35
I wanted to see if this could be solved using a Perl one-liner, which I was able to figure out:
$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
      END { print "$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n" for sort keys %lcnt }' file
A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1
Breakdown
looping over a file
This one-liner might seem overly complicated, but it's actually pretty straightforward once you break it down. At the heart of it is this mechanism in Perl:
$ perl -F, -ane '...; END { ... }' file
This tells Perl to read the file file, looping over it and auto-splitting each line on the , separator given by -F,; when the input is exhausted, it runs the END { .. } block once and exits.
For example:
$ perl -F, -ane 'print "arg1: $F[0] arg2: $F[1]"; END { print "DONE\n" }' file
arg1: A arg2: 10
arg1: B arg2: 11
arg1: C arg2: 12
arg1: A arg2: 10
arg1: B arg2: 12
arg1: D arg2: 10
arg1: A arg2: 12
arg1: C arg2: 12
DONE
NOTE: Perl's auto-split feature automatically puts the columns into the array @F; here I'm using elements 1 & 2, $F[0] & $F[1].
Counting things
The next thing we need to do is count various bits of the input. For this we'll enlist the power of hashes in Perl. We're going to use two, %lcnt and %ccnt.
NOTE: One of the more annoying things with Perl is the switch of notation between defining a hash and accessing it: when we access it we switch from %lcnt to $lcnt{"A"}, but I digress.
$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
%lcnt - count of characters from the 1st column
%ccnt - 2-dimensional hash, whose 2 coordinates access a count for each (1st column, 2nd column) pair
NOTE: Counting things in this manner gives us the "distinct" behaviour for free, simply through how we're counting the bits.
For example, let's examine the contents of the %lcnt hash:
$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
      END { print "key: $_\n" for sort keys %lcnt }' file
key: A
key: B
key: C
key: D
If we want to see the value for each hash key:
$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
      END { print "key: $_ val: $lcnt{$_}\n" for sort keys %lcnt }' file
key: A val: 3
key: B val: 2
key: C val: 2
key: D val: 1
NOTE: Here we can see that $lcnt{$F[0]}++ has done all the hard work of counting each character as we've looped through the file and added them to the hash %lcnt.
This is the END
The last piece of the puzzle is to display all this collected information in a useful way. For that we'll use this in the END { ... }:
print "$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n" for sort keys %lcnt
This loops through the list of keys from %lcnt and prints the following line:
"$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n"
If it's hard to see the above's structure, here it is more generally:
A, 3, 2
      ^--- (keys %{ $ccnt{$_} })  ## count of unique column-2 values for each character ($_)
   ^------ $lcnt{$_}              ## count of each character
^--------- $_                     ## character
This produces a line containing the character ($_), the count value for this character ($lcnt{$_}), and then the count of unique values from the 2nd column for that character.
References
- How can I find the number of keys in a hash in Perl?
- How do I concatenate a string?
- The top 10 tricks of Perl one-liners
- Perl One-Liners Explained, Part I: File Spacing
- 9.4. Hashes of Hashes
- Multi dimensional hashes in Perl
- How can I get the unique keys from two hashes?
- slmingol/README.md
answered Jul 28 at 18:19 by slm
datamash -t, -s -g 1 count 1 countunique 2 < input.txt
Input
A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12
Output
A,3,2
B,2,2
C,2,1
D,1,1
answered Jul 30 at 20:16 by MiniMax