Grep a range of values with specific starting characters
























I have 10 GB files in which I want to count occurrences of some specific text, i.e. TY[0-9].

The file format is like:



ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00


I want to get the count of all values starting with TY. I tried using egrep, but I am not able to get that count:



egrep "^TY[0-9]" Filename















      asked Jun 21 at 18:37









      Developer





















          3 Answers

















          Answer (3 votes, accepted)










          Using awk to count the number of times the second comma-delimited field in the file starts with the string TY followed by a digit:



          awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n }' filename
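As a minimal sketch of the awk approach on a small file: the sample data below mirrors the layout in the question, while the temporary file and the function name `count_ty_field2` are hypothetical names introduced only for illustration. The one change from the command above is `n+0`, which is an assumption about the desired output: it makes awk print 0 instead of an empty line when nothing matches.

```shell
# Hypothetical sample file mirroring the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00
EOF

# count_ty_field2 FILE (hypothetical helper): count the lines whose second
# comma-separated field starts with TY followed by a digit.
# n+0 forces a numeric result, so a file without matches prints 0.
count_ty_field2() {
    awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n+0 }' "$1"
}

count_ty_field2 "$sample"
```

On this sample the function prints 2, since only the BCD and EFG lines carry a TY value in the second field.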


          I'm wondering whether using cut in combination with grep would be quick? Cutting out the second column would give grep less data to work with, and so it may be quicker than just grep alone.



          cut -d, -f2 filename | grep -c '^TY[[:digit:]]'


          ... but I'm not sure.




          After some testing on my OpenBSD system, using a 1.1GB file, the cut+grep pipeline is actually almost 50% quicker than awk (8 seconds vs. 15 seconds). A pure grep solution (grep -Ec '\<TY[0-9]' filename, taken from glenn's solution) takes 13 seconds.

          So if the string is to be picked out of the second field only, one may gain some time by extracting only that field before matching.
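Before comparing timings, it may be worth confirming that the three approaches agree on the count. The sketch below generates a synthetic file and runs each one; the file contents, the row count of 100000, and the function names `by_awk`, `by_cut_grep` and `by_grep` are all assumptions made for illustration, and nothing here reproduces the 1.1GB measurement above.

```shell
# Build a hypothetical test file: every fourth row gets a TY<digit> value
# in the second field, all other rows get 2A.
big=$(mktemp)
awk 'BEGIN {
    for (i = 1; i <= 100000; i++)
        printf "R%d,%s,2018-07-06,2018-06-20 00:00:00\n", i,
               (i % 4 == 0 ? "TY" (i % 10) : "2A")
}' > "$big"

# The three approaches discussed above, wrapped as hypothetical helpers.
by_awk()      { awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n+0 }' "$1"; }
by_cut_grep() { cut -d, -f2 "$1" | grep -c '^TY[[:digit:]]'; }
by_grep()     { grep -Ec '\<TY[0-9]' "$1"; }

by_awk "$big"
by_cut_grep "$big"
by_grep "$big"
```

Each call can then be prefixed with the shell's `time` keyword to compare run times on your own system; the relative ranking may differ from the OpenBSD figures quoted above.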






          • In your second example, why not cut -d, -f2 inputfile | grep -c [...] rather than | grep | wc -l?
            – DopeGhoti
            Jun 21 at 18:59











          • @DopeGhoti Derrp. Yes. Thanks. Made it even quicker too.
            – Kusalananda
            Jun 21 at 19:02


















          Answer (2 votes)













          You want to use a word boundary instead of the start-of-line anchor:



          $ grep -Ec '\<TY[0-9]' file
          2


          Note: that is a count of all lines containing a "TY word". It is not a count of all "TY words". If you can have more than one per line, print each match on its own line with -o and count those lines:

          $ grep -Eo '\<TY[0-9]' file | wc -l
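The difference between counting lines and counting words only shows up when a line holds more than one match. A small sketch, assuming a grep that supports the `\<` word-boundary extension (GNU grep does); the sample data and the function names `count_lines` and `count_words` are hypothetical:

```shell
# Hypothetical sample: the first line carries two TY values, the second one.
sample=$(mktemp)
cat > "$sample" <<'EOF'
BCD,TY1,TY2,2018-06-20 00:00:00
EFG,TY3,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00
EOF

# Lines that contain at least one TY<digit> word.
count_lines() {
    grep -Ec '\<TY[0-9]' "$1"
}

# Every TY<digit> word: -o prints each match on its own line.
# The arithmetic expansion strips the padding some wc versions add.
count_words() {
    echo $(( $(grep -Eo '\<TY[0-9]' "$1" | wc -l) ))
}

count_lines "$sample"
count_words "$sample"
```

With GNU grep this prints 2 for the line count and 3 for the word count.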

































            Answer (1 vote)













            If you want to find the number of occurrences of a comma-delimited field that consists of TY followed by one or more decimal digits, you could do:



            <file perl -lne '$n += () = /(?<![^,])TY\d+(?![^,])/g; END{print 0+$n}'


            Which on an input like:



            TY1,TY2,TY,TYFOO
            TY213,X-TY2,TY4


            Would return 4 (TY1, TY2, TY213, TY4).



            (?<!...) and (?!...) are respectively the negative lookbehind and lookahead operators. So here, we're looking for TY followed by one or more (+) digits (\d), provided it's neither preceded nor followed by a character other than ,.

            Another way to do it would be to convert the commas to newlines and count the number of resulting lines that consist entirely of TY followed by one or more digits:

            <file tr , '\n' | LC_ALL=C grep -xEc 'TY[[:digit:]]+'


            (on my system, that's about 10 times as fast as the perl solution)
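To see what the `-x` option contributes in the tr + grep pipeline, the sketch below runs it on the sample input from this answer with and without `-x`. The temporary file and the function names `count_ty_fields` and `count_unanchored` are hypothetical, introduced only for the comparison.

```shell
# The two-line sample input used in the answer above.
sample=$(mktemp)
printf '%s\n' 'TY1,TY2,TY,TYFOO' 'TY213,X-TY2,TY4' > "$sample"

# One field per line, then count the lines that are exactly TY<digits>.
count_ty_fields() {
    tr , '\n' < "$1" | LC_ALL=C grep -xEc 'TY[[:digit:]]+'
}

# Same pipeline without -x: a field such as X-TY2 now also counts,
# because the pattern may match anywhere inside the line.
count_unanchored() {
    tr , '\n' < "$1" | LC_ALL=C grep -Ec 'TY[[:digit:]]+'
}

count_ty_fields "$sample"
count_unanchored "$sample"
```

The first call prints 4 (TY1, TY2, TY213, TY4), while the unanchored variant prints 5 because it also accepts X-TY2.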














              Accepted answer (awk, cut + grep): answered Jun 21 at 18:47 by Kusalananda, edited Jun 21 at 19:02






















                  Second answer (word-boundary grep): answered Jun 21 at 18:45 by glenn jackman




























                          Third answer (perl, tr + grep): answered Jun 21 at 18:51 by Stéphane Chazelas, edited Jun 21 at 19:03






















                               
