Is there an easy way to count characters in words in a file, from the terminal?

8 votes

I have 100 million rows in my file.



Each row has only one column.



e.g.



aaaaa
bb
cc
ddddddd
ee


I would like to count how many words there are of each length, like this:



2 character words - 3
5 character words - 1
7 character words - 1


etc.



Is there any easy way to do this in terminal?










text-processing














edited Oct 8 '17 at 17:59
ctrl-alt-delor

asked Oct 8 '17 at 15:38
user1091558







  • see also Count line lengths in file using command line tools
    – αғsнιη
    Oct 8 '17 at 18:38






















4 Answers

















20 votes, accepted










$ awk '{ print length }' file | sort -n | uniq -c | awk '{ printf("%d character words: %d\n", $2, $1) }'
2 character words: 3
5 character words: 1
7 character words: 1


The first awk filter will just print the length of each line in the file called file. I'm assuming that this file contains one word per line.



The sort -n (sort the lines from the output of awk numerically in ascending order) and uniq -c (count the number of times each line occurs consecutively) will then create the following output from that for the given data:



3 2
1 5
1 7


This is then parsed by the second awk script which interprets each line as "X number of lines having Y characters" and produces the wanted output.




The alternative is to do it all in awk, keeping the count for each length in an array. Which solution is "best" is a tradeoff between efficiency and readability/ease of understanding (and therefore maintainability).



Alternative solution:



$ awk '{ len[length]++ } END { for (i in len) printf("%d character words: %d\n", i, len[i]) }' file
2 character words: 3
5 character words: 1
7 character words: 1



























  • No need to sort in awk (numerically indexed arrays are sorted by default) (faster).
    – Arrow
    Oct 8 '17 at 18:14











  • @Arrow I know. I have that solution commented out in my answer because Sundeep beat me to it with a few seconds. I also allude to this with my last paragraph.
    – Kusalananda
    Oct 8 '17 at 18:18










  • I believe the comment should be useful to the users of the solutions (not included in your answer (or Sundeep's) :-) …). Otherwise: include a comment to the same effect in your answer and I happily will remove my comments. :-)
    – Arrow
    Oct 8 '17 at 18:25


















11 votes













Another way to do it all with awk alone



$ awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' ip.txt
2 character words - 3
5 character words - 1
7 character words - 1



  • {words[length()]++} uses the length of the input line as the key and increments the count saved under it
  • END{for(k in words)print k " character words - " words[k]} runs after all lines are processed and prints the contents of the array in the desired format
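
Spread over several lines, the same idea might look like the sketch below. The function name length_histogram is made up for illustration and is not part of the answer. The trailing sort -n is also an addition: POSIX awk does not specify the order in which for (k in words) visits the keys, so the sort is there only to make the output order predictable.

```shell
# length_histogram is a hypothetical helper name, not part of the answer.
length_histogram() {
    awk '
        # For every input line, bump the counter stored under its length.
        { words[length($0)]++ }
        # After the last line, print one row per distinct length.
        END {
            for (k in words)
                print k " character words - " words[k]
        }
    ' "$@" | sort -n   # for-in order is unspecified, so sort numerically
}

# Feed the sample data from the question through the function.
printf '%s\n' aaaaa bb cc ddddddd ee | length_histogram
```

With no arguments the function reads standard input; with file names it passes them straight to awk, e.g. length_histogram ip.txt.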



Performance comparison, numbers selected are best of two runs



$ wc words.txt
71813 71813 655873 words.txt
$ perl -0777 -ne 'print $_ x 1000' words.txt > long_file.txt
$ du -h --apparent-size long_file.txt
626M long_file.txt

$ time awk '{words[length()]++} END{for(k in words)print k " character words - " words[k]}' long_file.txt > t1

real 0m20.632s
user 0m20.464s
sys 0m0.108s

$ time perl -lne '$h{length($_)}++ } …






















  • I just added that to my own solution. Deleted it when I saw yours though. :-)
    – Kusalananda
    Oct 8 '17 at 16:01










  • yeah I was debating to delete mine before saw your edit again :)
    – Sundeep
    Oct 8 '17 at 16:02










  • No need to sort a numerically indexed array. It is always ordered with an increasing index. ( well, at least in awk :-) )
    – Arrow
    Oct 8 '17 at 18:09










  • length without () works perfectly fine here, so it might be redundant to add braces. I'm using GNU awk, though.
    – Sergiy Kolodyazhnyy
    Oct 8 '17 at 20:14






  • @SergiyKolodyazhnyy yup, gnu awk manual says: "In older versions of awk, the length() function could be called without any parentheses. Doing so is considered poor practice, although the 2008 POSIX standard explicitly allows it, to support historical practice. For programs to be maximally portable, always supply the parentheses."
    – Sundeep
    Oct 9 '17 at 3:08


























Here's a perl equivalent (with - optional - sort):



$ perl -lne '
$h{length($_)}++ }{ for $n (sort keys %h) {print "$n character words - $h{$n}"}
' file
2 character words - 3
5 character words - 1
7 character words - 1
















answered Oct 8 '17 at 16:50
steeldriver











  • If keys indexes are numerical: Does keys array need to be sorted in Perl?
    – Arrow
    Oct 8 '17 at 19:13






  • @Arrow: This answer is using a hash (i.e. associative array with string keys), and those have undefined key order, so yes. In fact, the answer is slightly buggy because it's sorting the keys as strings, not as numbers. Adding {$a<=>$b} after the sort would fix that. Alternatively, one could use a normal array with numerical keys and just skip any keys where the value is zero / undefined.
    – Ilmari Karonen
    Oct 8 '17 at 23:41











  • @IlmariKaronen Thanks, better now. What a difference curly braces make !!
    – Arrow
    Oct 8 '17 at 23:52











  • It would be more efficient to use an array instead of a hash. The OP wants millions of lines, so any overhead of checking and skipping zeros while printing is easily made up for by cheaper indexing.
    – Peter Cordes
    Oct 9 '17 at 8:47


























5 votes













An alternative one call to GNU awk, using printf:



$ awk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }
{ c[length($0)]++ }
END {
for (i in c) printf("%s character words - %s\n", i, c[i])
}
' infile
2 character words - 3
5 character words - 1
7 character words - 1


The core algorithm just collects character counts in an array.
The end part prints the collected counts formatted with printf.



Fast, simple, one single call to awk.



To be precise: some more memory is used to keep the array.

But no sort is called (with PROCINFO["sorted_in"] set, the array indexes are always traversed in ascending order), and only one external program is run: awk, instead of several.






edited Jul 17 at 23:53
Jeff Schaller

answered Oct 8 '17 at 17:55
Arrow







  • for in may happen to give numeric array indexes in numeric order at least for some values or in some awk implementations, but that is not required, not traditional, and definitely not universal. It does often happen for tiny sets like 2 or 3 or maybe 4; try 10 or 20 on every awk you have access to (without PROCINFO or WHINY_USERS in gawk) and I bet $50 at least one case isn't sorted.
    – dave_thompson_085
    Oct 8 '17 at 23:47










  • Thanks for your input. Using this: I believe it is sorted now. :-)
    – Arrow
    Oct 9 '17 at 0:04






  • @ind_str_asc sorts as strings, which will be correct for numbers only if they are all single-digit (as your example is); use @ind_num_asc if (any) values can be 10 or more. And although it's less of an issue now than it used to be, this feature is only gawk 4.0 up.
    – dave_thompson_085
    Oct 9 '17 at 4:27





























 
