Is there an easy way to count characters in words in file, from terminal?

8 votes

I have 100 million rows in my file.

Each row has only one column.

e.g.

aaaaa
bb
cc
ddddddd
ee

I would like to list how many words there are of each length, like this:

2 character words - 3
5 character words - 1
7 character words - 1

etc.

Is there any easy way to do this in terminal?

edited Oct 8 '17 at 17:59

ctrl-alt-delor

9,11431948

asked Oct 8 '17 at 15:38

user1091558

1464

see also Count line lengths in file using command line tools
– αғsнιη
Oct 8 '17 at 18:38

text-processing

4 Answers

20 votes, accepted

$ awk '{ print length }' file | sort -n | uniq -c | awk '{ printf("%d character words: %d\n", $2, $1) }'
2 character words: 3
5 character words: 1
7 character words: 1

The first awk filter will just print the length of each line in the file called file. I'm assuming that this file contains one word per line.

The sort -n (sort the lines from the output of awk numerically in ascending order) and uniq -c (count the number of times each line occurs consecutively) will then create the following output from that for the given data:

 3 2
 1 5
 1 7

This is then parsed by the second awk script which interprets each line as "X number of lines having Y characters" and produces the wanted output.

An alternative is to do it all in awk, keeping the count for each length in an array. Which solution is the "best" is a trade-off between efficiency and readability/ease of understanding (and therefore maintainability).

Alternative solution:

$ awk '{ len[length]++ } END { for (i in len) printf("%d character words: %d\n", i, len[i]) }' file
2 character words: 3
5 character words: 1
7 character words: 1
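
The order in which for (i in len) visits the array is unspecified in awk, so the all-in-awk output is not guaranteed to come out sorted by length. A minimal sketch that keeps the single counting pass and sorts only the short summary afterwards; the function name count_lengths and the file name sample.txt are made up for illustration:

```shell
# count_lengths FILE: one awk pass builds the histogram of line lengths,
# then sort -n orders the (small) summary by its leading number.
count_lengths() {
    awk '{ len[length($0)]++ }
         END { for (i in len) printf("%d character words: %d\n", i, len[i]) }' "$1" |
        sort -n
}

# Sample data from the question.
printf '%s\n' aaaaa bb cc ddddddd ee > sample.txt
count_lengths sample.txt
```

The sort here only sees one line per distinct length, so its cost should be negligible next to reading 100 million rows.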

edited Oct 9 '17 at 6:52

answered Oct 8 '17 at 15:43

Kusalananda

105k14209326

No need to sort in awk (numerically indexed arrays are sorted by default) (faster).
– Arrow
Oct 8 '17 at 18:14

@Arrow I know. I have that solution commented out in my answer because Sundeep beat me to it by a few seconds. I also allude to this with my last paragraph.
– Kusalananda
Oct 8 '17 at 18:18

I believe the comment should be useful to the users of the solutions (not included in your answer (or Sundeep's) :-) …). Otherwise: include a comment to the same effect in your answer and I will happily remove my comments. :-)
– Arrow
Oct 8 '17 at 18:25

11 votes

Another way to do it all with awk alone

$ awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' ip.txt
2 character words - 3
5 character words - 1
7 character words - 1

{words[length()]++} uses the length of the input line as the key under which the count is saved

END{for(k in words)print k " character words - " words[k]} after all lines are processed, prints the contents of the array in the desired format

Performance comparison, numbers selected are best of two runs

$ wc words.txt
 71813 71813 655873 words.txt
$ perl -0777 -ne 'print $_ x 1000' words.txt > long_file.txt
$ du -h --apparent-size long_file.txt
626M long_file.txt

$ time awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' long_file.txt > t1

real 0m20.632s
user 0m20.464s
sys 0m0.108s

$ time perl -lne '$h{length($_)}++ }{ [...]

$ time [...] | sort -n | uniq -c | awk '{ printf("%d character words - %d\n", $2, $1) }' > t3

real 1m23.294s
user 1m24.952s
sys 0m1.980s

$ diff -s <(sort t1) <(sort t2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -s <(sort t1) <(sort t3)
Files /dev/fd/63 and /dev/fd/62 are identical

If the file contains only ASCII characters:

$ time LC_ALL=C awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' long_file.txt > t1

real 0m15.651s
user 0m15.496s
sys 0m0.120s

Not sure why the time for perl didn't change much; probably the encoding has to be set some other way.
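
The ASCII-only condition matters because, in the C locale, awk counts bytes rather than characters, which is presumably also where the speed-up comes from. A small sketch of the effect, assuming UTF-8 encoded input; the file name accented.txt is made up for illustration:

```shell
# "né" encoded as UTF-8: the letter n followed by the two bytes \303\251 for é.
printf 'n\303\251\n' > accented.txt

# In the C locale every byte counts as one character.
bytes=$(LC_ALL=C awk '{ print length() }' accented.txt)
echo "C locale length: $bytes"
```

With a multibyte-aware awk in a UTF-8 locale the same line would be reported as 2 characters, so LC_ALL=C changes the histogram for any file that contains non-ASCII words.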

edited Oct 9 '17 at 8:10

answered Oct 8 '17 at 15:59

Sundeep

6,9611826

I just added that to my own solution. Deleted it when I saw yours though. :-)
– Kusalananda
Oct 8 '17 at 16:01

yeah I was debating whether to delete mine before I saw your edit again :)
– Sundeep
Oct 8 '17 at 16:02

No need to sort a numerically indexed array. It is always ordered with an increasing index. (Well, at least in awk :-) )
– Arrow
Oct 8 '17 at 18:09

length without () works perfectly fine here, so it might be redundant to add the parentheses. I'm using GNU awk, though.
– Sergiy Kolodyazhnyy
Oct 8 '17 at 20:14

@SergiyKolodyazhnyy yup, the GNU awk manual says: "In older versions of awk, the length() function could be called without any parentheses. Doing so is considered poor practice, although the 2008 POSIX standard explicitly allows it, to support historical practice. For programs to be maximally portable, always supply the parentheses."
– Sundeep
Oct 9 '17 at 3:08

Here's a perl equivalent (with - optional - sort):

$ perl -lne '
 $h{length($_)}++ }{ for $n (sort keys %h) { print "$n character words - $h{$n}" }
' file
2 character words - 3
5 character words - 1
7 character words - 1

answered Oct 8 '17 at 16:50

steeldriver

32.1k34979

If the key indexes are numerical, does the keys array need to be sorted in Perl?
– Arrow
Oct 8 '17 at 19:13

@Arrow: This answer is using a hash (i.e. associative array with string keys), and those have undefined key order, so yes. In fact, the answer is slightly buggy because it's sorting the keys as strings, not as numbers. Adding {$a<=>$b} after the sort would fix that. Alternatively, one could use a normal array with numerical keys and just skip any keys where the value is zero / undefined.
– Ilmari Karonen
Oct 8 '17 at 23:41

@IlmariKaronen Thanks, better now. What a difference curly braces make!!
– Arrow
Oct 8 '17 at 23:52

It would be more efficient to use an array instead of a hash. The OP wants millions of lines, so any overhead of checking and skipping zeros while printing is easily made up for by cheaper indexing.
– Peter Cordes
Oct 9 '17 at 8:47

5 votes

An alternative with a single call to GNU awk, using printf:

$ awk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }
       { c[length($0)]++ }
       END {
           for (i in c) printf("%s character words - %s\n", i, c[i])
       }' infile
2 character words - 3
5 character words - 1
7 character words - 1

The core algorithm just collects character counts in an array.
The end part prints the collected counts formatted with printf.

Fast, simple, one single call to awk.

To be precise: some more memory is used to keep the array.

But no sort is called (setting PROCINFO["sorted_in"] makes the for loop traverse the array indexes in ascending order), and only one external program, awk, is run instead of several.

edited Jul 17 at 23:53

Jeff Schaller

32.3k849109

answered Oct 8 '17 at 17:55

Arrow

2,400218

for in may happen to give numeric array indexes in numeric order at least for some values or in some awk implementations, but that is not required, not traditional, and definitely not universal. It does often happen for tiny sets like 2 or 3 or maybe 4; try 10 or 20 on every awk you have access to (without PROCINFO or WHINY_USERS in gawk) and I bet $50 at least one case isn't sorted.
– dave_thompson_085
Oct 8 '17 at 23:47

Thanks for your input. Using this: I believe it is sorted now. :-)
– Arrow
Oct 9 '17 at 0:04

@ind_str_asc sorts as strings, which will be correct for numbers only if they are all single-digit (as your example is); use @ind_num_asc if (any) values can be 10 or more. And although it's less of an issue now than it used to be, this feature is only gawk 4.0 up.
– dave_thompson_085
Oct 9 '17 at 4:27
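
Putting the last comment into practice, here is a hedged sketch that uses @ind_num_asc when gawk is available and otherwise falls back to a plain awk plus sort -n. The function name count_by_length and the file name mixed_lengths.txt are made up for illustration, and the gawk branch assumes version 4.0 or later:

```shell
# count_by_length FILE: histogram of line lengths, ordered numerically by length.
count_by_length() {
    if command -v gawk >/dev/null 2>&1; then
        # gawk 4.0+: traverse the array with indexes compared as numbers.
        gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
              { c[length($0)]++ }
              END { for (i in c) printf("%s character words - %s\n", i, c[i]) }' "$1"
    else
        # Other awks: traversal order is unspecified, so sort the summary instead.
        awk '{ c[length($0)]++ }
             END { for (i in c) printf("%s character words - %s\n", i, c[i]) }' "$1" |
            sort -n
    fi
}

# Includes a 10-character word, the case where string ordering goes wrong.
printf '%s\n' bb aaaaaaaaaa cc > mixed_lengths.txt
count_by_length mixed_lengths.txt
```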
