Count distinct values of a field in a file


I have a file with around a million lines. Each line has a field called transactionid, whose values repeat. I need to count the distinct values.

No matter how many times a value is repeated, it should be counted only once.

text-processing awk

edited Nov 20 at 22:36
Rui F Ribeiro

asked Jan 11 '12 at 14:08
Olgun Kaya

  • It would be easier if you could give a glimpse of the format of the file, not necessarily the data.
    – Nikhil Mulley
    Jan 11 '12 at 14:20

  • By the way, do you want each value to be counted as 1 irrespective of how many times it occurs, or do you want the number of occurrences of each value? If each value should be counted only once, how should the distinct values be counted? Please check my edit to your question and confirm whether I have interpreted it correctly.
    – Nikhil Mulley
    Jan 11 '12 at 14:27

  • @Nikhil This is clear from the question: ... No matter of how many times a value is repeated, it should be counted as 1. ...
    – user13742
    Jan 11 '12 at 14:28

  • OK, then the answer from @hesse would meet your need.
    – Nikhil Mulley
    Jan 11 '12 at 14:30

  • Sorry for the delay, I had no internet connection. The separator is '|' and the field is field 28. I used:
    cat <file_name> | awk -F"|" '{ if (substr($2,1,8) == "20120110") print $28 }' | sort -u | wc -l
    The if clause is an additional check on the date, as is probably obvious :)
    – Olgun Kaya
    Jan 12 '12 at 6:29
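
The command in the last comment filters on a date prefix before counting. A rough sketch of the same idea as a single awk program follows. The function name count_distinct_on_day is made up for illustration; the '|' separator, the timestamp in field 2 and the id in field 28 are taken from that comment.

```shell
#!/bin/sh
# count_distinct_on_day FILE YYYYMMDD
# Hypothetical helper: among lines whose 2nd '|'-separated field starts
# with the given 8-character date, count the distinct values of field 28.
count_distinct_on_day() {
    awk -F '|' -v day="$2" '
        # Count a line only if the date matches and its id is new.
        substr($2, 1, 8) == day && !seen[$28]++ { n++ }
        # n + 0 prints 0 instead of an empty line when nothing matched.
        END { print n + 0 }
    ' "$1"
}
```

It would be called as count_distinct_on_day some_file 20120110, where some_file stands for the real file name.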

















3 Answers

Assuming that your file is a text file with fields separated by commas (','), and that you know the position of the 'transactionid' field, say the 7th field:

awk -F ',' '{print $7}' text_file | sort | uniq -c

This prints each distinct value of the 7th field together with the number of times it occurs. To get just the number of distinct values, count the lines instead, for example with sort -u | wc -l in place of sort | uniq -c.
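
To make the difference between per-value counts and the number of distinct values concrete, here is a small sketch on made-up data. The sample lines, the temporary file and the variable names are assumptions for illustration only; the 7th comma-separated field plays the role of transactionid.

```shell
#!/bin/sh
# Hypothetical sample: the 7th comma-separated field is the transaction id.
sample=$(mktemp)
cat > "$sample" <<'EOF'
a,b,c,d,e,f,T1,x
a,b,c,d,e,f,T2,x
a,b,c,d,e,f,T1,x
a,b,c,d,e,f,T3,x
a,b,c,d,e,f,T2,x
EOF

# Occurrences per distinct value (what uniq -c reports).
per_value=$(awk -F ',' '{print $7}' "$sample" | sort | uniq -c)

# Number of distinct values: one output line per value, then count lines.
# tr strips the padding that some wc implementations add.
distinct=$(awk -F ',' '{print $7}' "$sample" | sort -u | wc -l | tr -d ' ')

printf '%s\n' "$per_value"
echo "distinct values: $distinct"
rm -f "$sample"
```

In this sample T1 and T2 each occur twice and T3 once, so the distinct count is 3.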







edited May 1 '17 at 23:06
phk

answered Jan 11 '12 at 14:21
Nikhil Mulley


There is no need to sort the file (uniq requires its input to be sorted).

This awk script assumes the field is the first whitespace-delimited field.

awk 'a[$1] == "" { a[$1]="X" } END { print length(a) }' file
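
A variant of the same idea, as a sketch: bump a counter the first time each key is seen and print the counter at the end. This avoids calling length() on an array, which not every awk supports. The function name count_distinct and its optional arguments are made up for illustration.

```shell
#!/bin/sh
# count_distinct FILE [FIELD [SEPARATOR]]
# Hypothetical helper: prints the number of distinct values in one field.
# Defaults: field 1, awk's default whitespace splitting.
count_distinct() {
    file=$1
    field=${2:-1}
    sep=${3:-}
    if [ -n "$sep" ]; then
        # seen[key]++ is 0 (false) only the first time a key appears.
        awk -F "$sep" -v n="$field" '!seen[$n]++ { c++ } END { print c + 0 }' "$file"
    else
        awk -v n="$field" '!seen[$n]++ { c++ } END { print c + 0 }' "$file"
    fi
}
```

For the layout described in the question's comments this would be count_distinct some_file 28 '|', with some_file standing for the real file name. Like the script above, it keeps every distinct value in memory.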






edited Jan 11 '12 at 14:57

answered Jan 11 '12 at 14:30
Peter.O

  • For a huge file (as in, getting close to the size of RAM), awk will consume a lot of memory. Most sort implementations are designed to cope well with huge files.
    – Gilles
    Jan 12 '12 at 1:59
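
Following that comment, a sort-based sketch for large inputs. It assumes a single-character separator so that cut can extract the field, and it sets LC_ALL=C so that sort compares raw bytes rather than applying locale collation rules. The function name and its arguments are illustrative assumptions.

```shell
#!/bin/sh
# count_distinct_sorted FILE FIELD SEPARATOR
# Hypothetical helper: extracts one field with cut, de-duplicates it with
# sort -u, then counts the remaining lines. Unlike an awk array, sort does
# not have to keep every distinct value in memory at once on typical
# implementations.
count_distinct_sorted() {
    cut -d "$3" -f "$2" "$1" | LC_ALL=C sort -u | wc -l | tr -d ' '
}
```

Note that cut passes through lines that contain no separator at all, so such lines would be counted as values of their own.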

















Maybe not the sleekest method, but this should work:

awk '{print $1}' your_file | sort | uniq | wc -l

where the 1 in $1 is the number of the field to be parsed.







edited Jan 11 '12 at 14:26

answered Jan 11 '12 at 14:18
user13742