Is there an easy way to count characters in words in file, from terminal?
I have 100 million rows in my file.
Each row has only one column.
e.g.
aaaaa
bb
cc
ddddddd
ee
I would like to list how many words there are of each length, like this:
2 character words - 3
5 character words - 1
7 character words - 1
etc.
Is there any easy way to do this in terminal?
text-processing
See also: Count line lengths in file using command line tools
(comment, Oct 8 '17 at 18:38)
edited Oct 8 '17 at 17:59 by ctrl-alt-delor
asked Oct 8 '17 at 15:38 by user1091558
4 Answers
20 votes, accepted
$ awk '{ print length }' file | sort -n | uniq -c | awk '{ printf("%d character words: %d\n", $2, $1) }'
2 character words: 3
5 character words: 1
7 character words: 1
The first awk filter just prints the length of each line in the file called file. I'm assuming that this file contains one word per line.
The sort -n (sort the lines from the output of awk numerically in ascending order) and uniq -c (count the number of times each line occurs consecutively) then create the following output for the given data:
3 2
1 5
1 7
This is then parsed by the second awk script, which interprets each line as "X lines having Y characters" and produces the wanted output.
The alternative solution is to do it all in awk, keeping counts of lengths in an array. Which solution is "best" is a tradeoff between efficiency and readability/ease of understanding (and therefore maintainability).
Alternative solution:
$ awk '{ len[length]++ } END { for (i in len) printf("%d character words: %d\n", i, len[i]) }' file
2 character words: 3
5 character words: 1
7 character words: 1
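POSIX awk leaves the order in which for (i in len) visits the indexes unspecified, so the single-pass version is not guaranteed to print lengths in ascending order for larger inputs. A minimal sketch of one portable way around that: sort only the short summary, not the 100 million input lines. The function name count_lengths is made up for illustration and is not part of the answer.

```shell
# count_lengths: hypothetical helper name, used only for this sketch.
# Counts line lengths in one awk pass, then sorts the short summary numerically.
count_lengths() {
    awk '{ len[length($0)]++ }
         END { for (i in len) printf("%d character words: %d\n", i, len[i]) }' "$@" |
        sort -n
}

# Sample data from the question, fed on standard input.
printf '%s\n' aaaaa bb cc ddddddd ee | count_lengths
```

For the sample data this prints the same three lines as above. The sort runs over one line per distinct length, so its cost is negligible compared with sorting the whole file.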
No need to sort in awk (numerically indexed arrays are sorted by default) (faster).
- Arrow
Oct 8 '17 at 18:14
@Arrow I know. I have that solution commented out in my answer because Sundeep beat me to it by a few seconds. I also allude to this in my last paragraph.
- Kusalananda
Oct 8 '17 at 18:18
I believe the comment should be useful to the users of the solutions (not included in your answer (or Sundeep's) :-) ...). Otherwise: include a comment to the same effect in your answer and I will happily remove my comments. :-)
- Arrow
Oct 8 '17 at 18:25
11 votes
Another way to do it all with awk alone:
$ awk '{ words[length()]++ } END { for (k in words) print k " character words - " words[k] }' ip.txt
2 character words - 3
5 character words - 1
7 character words - 1
words[length()]++ uses the length of the input line as the key under which the count is saved.
END { for (k in words) print k " character words - " words[k] } prints the contents of the array in the desired format after all lines are processed.
Performance comparison; the numbers shown are the best of two runs:
$ wc words.txt
71813 71813 655873 words.txt
$ perl -0777 -ne 'print $_ x 1000' words.txt > long_file.txt
$ du -h --apparent-size long_file.txt
626M long_file.txt
$ time awk '{ words[length()]++ } END { for (k in words) print k " character words - " words[k] }' long_file.txt > t1
real 0m20.632s
user 0m20.464s
sys 0m0.108s
$ time perl -lne '$h{length($_)}++ }[...]
[...] sort -n [...] awk '{ printf("%d character words - %d\n", $2, $1) }' > t3
real 1m23.294s
user 1m24.952s
sys 0m1.980s
$ diff -s <(sort t1) <(sort t2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -s <(sort t1) <(sort t3)
Files /dev/fd/63 and /dev/fd/62 are identical
If the file has only ASCII characters:
$ time LC_ALL=C awk '{ words[length()]++ } END { for (k in words) print k " character words - " words[k] }' long_file.txt > t1
real 0m15.651s
user 0m15.496s
sys 0m0.120s
Not sure why the time for perl didn't change much; probably the encoding has to be set some other way.
I just added that to my own solution. Deleted it when I saw yours though. :-)
- Kusalananda
Oct 8 '17 at 16:01
yeah I was debating whether to delete mine before I saw your edit again :)
- Sundeep
Oct 8 '17 at 16:02
No need to sort a numerically indexed array. It is always ordered with an increasing index. ( well, at least in awk :-) )
- Arrow
Oct 8 '17 at 18:09
length without () works perfectly fine here, so it might be redundant to add braces. I'm using GNU awk, though.
- Sergiy Kolodyazhnyy
Oct 8 '17 at 20:14
@SergiyKolodyazhnyy yup, the GNU awk manual says: "In older versions of awk, the length() function could be called without any parentheses. Doing so is considered poor practice, although the 2008 POSIX standard explicitly allows it, to support historical practice. For programs to be maximally portable, always supply the parentheses."
- Sundeep
Oct 9 '17 at 3:08
Here's a perl equivalent (with - optional - sort):
$ perl -lne '
$h{length($_)}++ }{ for $n (sort keys %h) {print "$n character words - $h{$n}"}
' file
2 character words - 3
5 character words - 1
7 character words - 1
answered Oct 8 '17 at 16:50 by steeldriver
If keys indexes are numerical: Does keys array need to be sorted in Perl?
- Arrow
Oct 8 '17 at 19:13
@Arrow: This answer is using a hash (i.e. associative array with string keys), and those have undefined key order, so yes. In fact, the answer is slightly buggy because it's sorting the keys as strings, not as numbers. Adding {$a<=>$b} after the sort would fix that. Alternatively, one could use a normal array with numerical keys and just skip any keys where the value is zero / undefined.
- Ilmari Karonen
Oct 8 '17 at 23:41
@IlmariKaronen Thanks, better now. What a difference curly braces make !!
- Arrow
Oct 8 '17 at 23:52
It would be more efficient to use an array instead of a hash. The OP wants millions of lines, so any overhead of checking and skipping zeros while printing is easily made up for by cheaper indexing.
- Peter Cordes
Oct 9 '17 at 8:47
5 votes
An alternative with a single call to GNU awk, using printf:
$ awk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }
{ c[length($0)]++ }
END {
for (i in c) printf("%s character words - %s\n", i, c[i])
}' infile
2 character words - 3
5 character words - 1
7 character words - 1
The core algorithm just collects character counts in an array.
The END part prints the collected counts formatted with printf.
Fast, simple, one single call to awk.
To be precise: some more memory is used to keep the array.
But no sort is called (with PROCINFO["sorted_in"] set, the array indexes are always traversed in ascending order), and only one external program, awk, is run instead of several.
for in may happen to give numeric array indexes in numeric order at least for some values or in some awk implementations, but that is not required, not traditional, and definitely not universal. It does often happen for tiny sets like 2 or 3 or maybe 4; try 10 or 20 on every awk you have access to (without PROCINFO or WHINY_USERS in gawk) and I bet $50 at least one case isn't sorted.
- dave_thompson_085
Oct 8 '17 at 23:47
Thanks for your input. Using this: I believe it is sorted now. :-)
- Arrow
Oct 9 '17 at 0:04
@ind_str_asc sorts as strings, which will be correct for numbers only if they are all single-digit (as your example is); use @ind_num_asc if (any) values can be 10 or more. And although it's less of an issue now than it used to be, this feature is only gawk 4.0 up.
- dave_thompson_085
Oct 9 '17 at 4:27
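For awk implementations without PROCINFO["sorted_in"], one portable way to get ascending output without an external sort is to remember the largest length seen and walk the indexes with a counting loop, skipping lengths that never occurred. A minimal sketch, assuming only POSIX awk; the function name length_histogram is made up for illustration.

```shell
# length_histogram: hypothetical helper name, used only for this sketch.
length_histogram() {
    awk '{ n = length($0); c[n]++; if (n > max) max = n }
         END {
             # Walk 0..max in numeric order; skip lengths that never occurred.
             for (i = 0; i <= max; i++)
                 if (i in c) printf("%d character words - %d\n", i, c[i])
         }' "$@"
}

# Sample data from the question, fed on standard input.
printf '%s\n' aaaaa bb cc ddddddd ee | length_histogram
```

The loop runs once per possible length up to the maximum, so it stays cheap as long as the longest line is not enormous.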
edited Jul 17 at 23:53 by Jeff Schaller
answered Oct 8 '17 at 17:55 by Arrow