Count by first column, count distinct by second column and group output by first column?

I am new to the Unix command awk. I need a Unix command that will read a CSV file (over 700M rows) with the sample format below:



A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12


The command should count the number of occurrences of each entry in the first column, then count the number of distinct values in column 2 for each of those entries, and group the output by the entries in column one, so that the output looks like this:



A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1






asked Jul 27 at 15:48 by Dami Femi, edited Jul 28 at 1:08 by slm♦

  • looks like a tsort feed
    – mikeserv
    Jul 28 at 8:54
















4 Answers






























To get the first two columns of the output:



$ cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }'
A,3
B,2
C,2
D,1


This extracts the first column of the original file, sorts it, and counts the number of occurrences of each entry. The awk at the end swaps the two columns and inserts a comma between them.
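
For reference, the intermediate output just before the final awk swap looks like this on the sample data (shown for illustration; not part of the original answer):

$ cut -d, -f1 <file | sort | uniq -c
      3 A
      2 B
      2 C
      1 D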



The final column may be had with



$ sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }'
2
2
1
1


This sorts the original data and discards the duplicate lines. Then the first column is extracted and the number of occurrences of each entry is counted. The awk at the end extracts only the counts.



Combining these using bash and paste:



$ paste -d, <( cut -d, -f1 <file | sort | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
            <( sort -u <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1


If you pre-sort the data, this may be shortened slightly (and sped up considerably):



$ sort -o file file

$ paste -d, <( cut -d, -f1 <file | uniq -c | awk -vOFS=, '{ print $2, $1 }' ) \
            <( uniq <file | cut -d, -f1 | uniq -c | awk -vOFS=, '{ print $1 }' )
A,3,2
B,2,2
C,2,1
D,1,1
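
For comparison, both counts can also be produced in a single pass with awk alone. This is a sketch, not part of the original answer; it keeps every distinct key/value pair in memory, which may matter with 700M rows:

# sketch: one pass, counting rows and distinct 2nd-column values per key
$ awk -F', ' '{ cnt[$1]++; if (!seen[$1 SUBSEP $2]++) distinct[$1]++ }
      END { for (k in cnt) print k "," cnt[k] "," distinct[k] }' file | sort
A,3,2
B,2,2
C,2,1
D,1,1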





answered Jul 27 at 17:54 by Kusalananda, edited Jul 27 at 18:22











































With a small sqlite3 script run from the command line, where input.csv is your input data:

sqlite3 -batch <<EOF
.mode csv

CREATE TABLE data (letter TEXT, number INTEGER);

.import input.csv data

SELECT letter, COUNT(*) AS lcount, COUNT(DISTINCT number) AS dcount
FROM data
GROUP BY letter;
EOF

Saved as a script, e.g. query.sqlite, it works like this:

$ bash query.sqlite
A,3,2
B,2,2
C,2,1
D,1,1
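
If column headers are wanted in the output as well, sqlite3's .headers dot-command can be switched on before the query (a standard sqlite3 shell setting; this variant is shown for illustration and is not part of the original answer):

sqlite3 -batch <<EOF
.mode csv
.headers on
CREATE TABLE data (letter TEXT, number INTEGER);
.import input.csv data
SELECT letter,
       COUNT(*) AS lcount,               -- occurrences of each letter
       COUNT(DISTINCT number) AS dcount  -- distinct numbers per letter
FROM data
GROUP BY letter;
EOF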





answered Jul 30 at 21:09 by Sergiy Kolodyazhnyy, edited Jul 30 at 21:35











































I wanted to see if this could be solved using a Perl one-liner, which I was able to figure out:

$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
   END { print "$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n" for sort keys %lcnt }' file
A, 3, 2
B, 2, 2
C, 2, 1
D, 1, 1

Breakdown

Looping over a file

This one-liner might seem uber complicated but it's actually pretty straightforward once you break it down. At the heart of it is this mechanism in Perl:

$ perl -F, -ane '...; END { ... }' file

This tells Perl to take the file file in, loop over it, auto-splitting each line using -F, as the separator character, and, when complete, run the END { ... } block once and exit.

For example:

$ perl -F, -ane 'print "arg1: $F[0] arg2: $F[1]"; END { print "DONE\n" }' file
arg1: A arg2: 10
arg1: B arg2: 11
arg1: C arg2: 12
arg1: A arg2: 10
arg1: B arg2: 12
arg1: D arg2: 10
arg1: A arg2: 12
arg1: C arg2: 12
DONE

NOTE: Perl's auto-split feature automatically puts the columns into the array @F; here I'm using elements 1 & 2, $F[0] and $F[1].

Counting things

The next thing we need to do is count various bits of the input. For this we'll enlist the power of hashes in Perl. We're going to use two, %lcnt and %ccnt.

NOTE: One of the more annoying things with Perl is the switch of notation between defining a hash and accessing it. When we access an element we switch from %lcnt to $lcnt{"A"}, but I digress.

$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;

  • %lcnt - count of characters from the 1st column

  • %ccnt - a two-dimensional hash, using two keys to reach a count for the 2nd column

NOTE: Counting things in this manner lets the "unique" part fall out of how we key the hashes.

For example, let's examine the contents of the %lcnt hash:

$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
   END { print "key: $_\n" for sort keys %lcnt }' file
key: A
key: B
key: C
key: D

If we want to see the value for each key:

$ perl -F, -ane '$lcnt{$F[0]}++; $ccnt{$F[0]}{$F[1]}++;
   END { print "key: $_ val: $lcnt{$_}\n" for sort keys %lcnt }' file
key: A val: 3
key: B val: 2
key: C val: 2
key: D val: 1

NOTE: Here we can see that $lcnt{$F[0]}++ has done all the hard work of counting each character as we've looped through the file and added them to the hash %lcnt.
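
The nested %ccnt hash can be inspected the same way. This sketch (not from the original answer) prints the per-key distinct counts that feed the final column:

# sketch: keys %{ $ccnt{$_} } in scalar context yields the number of distinct 2nd-column values
$ perl -F, -ane '$ccnt{$F[0]}{$F[1]}++;
   END { print "key: $_ distinct: " . (keys %{ $ccnt{$_} }) . "\n" for sort keys %ccnt }' file
key: A distinct: 2
key: B distinct: 2
key: C distinct: 1
key: D distinct: 1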



This is the END

The last piece of the puzzle is to display all this collected information in a useful way. For that we'll use this in the END { ... } block:

print "$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n" for sort keys %lcnt

This loops through the list of keys from %lcnt and prints the following line:

"$_, $lcnt{$_}, " . (keys %{ $ccnt{$_} }) . "\n"

If it's hard to see the above's structure, here it is more generally:

A, 3, 2
      ^--- (keys %{ $ccnt{$_} })   ## count of unique 2nd-column values for each character ($_)
   ^------ $lcnt{$_}               ## count of each character
^--------- $_                      ## character

This will produce a line that contains the character ($_), the count value for this character ($lcnt{$_}), and then the count of unique values from the 2nd column for each character.

References

  • How can I find the number of keys in a hash in Perl?

  • How do I concatenate a string?

  • The top 10 tricks of Perl one-liners

  • Perl One-Liners Explained, Part I: File Spacing

  • 9.4. Hashes of Hashes

  • Multi dimensional hashes in Perl

  • How can I get the unique keys from two hashes?

  • slmingol/README.md





answered Jul 28 at 18:19 by slm♦









































datamash -t, -s -g 1 count 1 countunique 2 < input.txt

Here -t, sets the field delimiter to a comma, -s sorts the input first, -g 1 groups by the first column, count 1 counts the rows in each group, and countunique 2 counts the distinct values in the second column.

Input

A, 10
B, 11
C, 12
A, 10
B, 12
D, 10
A, 12
C, 12

Output

A,3,2
B,2,2
C,2,1
D,1,1
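
The -s flag makes datamash sort the input itself, since grouping in datamash otherwise applies only to adjacent lines. An equivalent pipeline that sorts explicitly first (shown for illustration; not part of the original answer):

# sketch: sort by the first field, then let datamash group adjacent keys
sort -t, -k1,1 input.txt | datamash -t, -g 1 count 1 countunique 2
A,3,2
B,2,2
C,2,1
D,1,1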





answered Jul 30 at 20:16 by MiniMax




















