Grep a range of values with specific starting characters

I have 10GB files in which I want to count occurrences of some specific text, i.e. TY[0-9].

The file format is like:

ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00

I want to get the count of all text starting with TY. I tried using egrep, but I am not able to get that.

egrep "^TY[0-9]" Filename

asked Jun 21 at 18:37

Developer

15717


3 Answers
accepted

Using awk to count the number of times the second comma-delimited field in the file starts with the string TY followed by a digit:

awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n }' filename

I'm wondering whether using cut in combination with grep would be quick? Cutting out the second column would give grep less data to work with, and so it may be quicker than just grep alone.

cut -d, -f2 filename | grep -c '^TY[[:digit:]]'

... but I'm not sure.

After some testing on my OpenBSD system, using a 1.1GB file, the cut+grep pipeline is actually almost 50% quicker than awk (8 seconds vs. 15 seconds). And a pure grep solution (grep -Ec '\<TY[0-9]' filename, taken from glenn's solution) takes 13 seconds.

So if the string is to be picked out of the second field only, one may gain some time by extracting only that field before matching.
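As a quick sanity check before running either command on the 10GB file, the two approaches can be compared on the four sample rows from the question. This is only a sketch: the scratch file and the variable names are made up for illustration, and the awk program prints n + 0 so that it shows 0 rather than an empty line when nothing matches.

```shell
# Build a small sample file from the rows shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00
EOF

# awk: count rows whose second field starts with TY followed by a digit.
awk_count=$(awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n + 0 }' "$sample")

# cut + grep: extract field 2 first, so ^ now anchors at the start of that field.
cut_count=$(cut -d, -f2 "$sample" | grep -c '^TY[[:digit:]]')

echo "awk: $awk_count, cut+grep: $cut_count"   # awk: 2, cut+grep: 2
rm -f "$sample"
```

Both counts should agree; if they do not on your real data, the second field probably contains something the sample rows do not show.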

edited Jun 21 at 19:02

answered Jun 21 at 18:47

Kusalananda

101k13199312

In your second example, why not cut -d, -f2 inputfile | grep -c [...] rather than | grep | wc -l?
– DopeGhoti
Jun 21 at 18:59

@DopeGhoti Derrp. Yes. Thanks. Made it even quicker too.
– Kusalananda
Jun 21 at 19:02


You want to use a word boundary instead of the start-of-line anchor:

$ grep -Ec '\<TY[0-9]' file
2

Note: that is a count of all lines with a "TY word". It is not a count of all "TY word"s. If you can have more than one per line, then count the matches instead:

$ grep -Eo '\<TY[0-9]' file | wc -l
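To make the difference between the two counts concrete, here is a small sketch on made-up data where one line holds two TY fields. The sample rows and variable names are hypothetical, and it assumes a grep that supports the \< word-boundary operator (GNU grep does).

```shell
# Hypothetical sample: the first line holds two TY fields, the second holds one.
sample=$(mktemp)
printf '%s\n' 'A,TY1,TY2' 'B,2A,TY3' 'C,2A,2B' > "$sample"

# -c counts matching lines: lines 1 and 2 match, line 3 does not.
line_count=$(grep -Ec '\<TY[0-9]' "$sample")

# -o prints each match on its own line, so wc -l counts occurrences.
# tr strips the padding that some wc implementations add.
match_count=$(grep -Eo '\<TY[0-9]' "$sample" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"   # lines: 2, matches: 3
rm -f "$sample"
```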

answered Jun 21 at 18:45

glenn jackman

45.6k265100


If you want to find the number of occurrences of a comma-delimited field that consists of TY followed by one or more decimal digits, you could do:

<file perl -lne '$n += () = /(?<![^,])TY\d+(?![^,])/g; END{print 0+$n}'

Which on an input like:

TY1,TY2,TY,TYFOO
TY213,X-TY2,TY4

Would return 4 (TY1, TY2, TY213, TY4).

(?<!...) and (?!...) are respectively the negative lookbehind and lookahead operators. So here, we're looking for TY followed by one or more (+) digits (\d), provided it's neither preceded nor followed by a character other than ,.

Another way to do it would be to convert the commas to newlines and count the number of resulting lines that consist of TY followed by one or more digits:

<file tr , '\n' | LC_ALL=C grep -xEc 'TY[[:digit:]]+'

(on my system, that's about 10 times as fast as the perl solution)
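The tr approach can be checked against the two-line example input above. A minimal sketch, with the input held in a shell variable whose name is chosen only for illustration:

```shell
# The example input from this answer, one record per line.
input='TY1,TY2,TY,TYFOO
TY213,X-TY2,TY4'

# Put every field on its own line, then count the lines that are
# exactly TY followed by one or more digits (-x matches whole lines).
field_count=$(printf '%s\n' "$input" | tr , '\n' | LC_ALL=C grep -xEc 'TY[[:digit:]]+')

echo "$field_count"   # 4 (TY1, TY2, TY213, TY4)
```

Because -x requires the whole field to match, TY, TYFOO and X-TY2 are all excluded, which is the same result the lookaround version gives.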

edited Jun 21 at 19:03

answered Jun 21 at 18:51

Stéphane Chazelas

278k52513844
