Loop over a list of strings and increment letter count in a corresponding sublist
I have a 2D list as follows:
counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
   "M", "F", "P", "S", "T", "W", "Y", "V"},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, ...};
The first sub-list consists of a heading, and the following sub-lists contain counts, initialized at zero.
I need to loop over another list, sequences, that contains strings plus a heading, and access the corresponding sub-list in counts to increment the appropriate letter count.
For example, take a string from sequences:
MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEICDSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICI
Its corresponding sub-list in counts would be incremented to {31, 27, 45, 30, 18, 27, 25, 42, 11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}.
I obtained this via StringCount[sequences[[1]], #] & /@ counts[[1]] but am struggling to scale this code, and to make it update the sub-lists in counts instead of returning a new list.
list-manipulation numerics string-manipulation counting
edited Sep 20 at 6:16
Henrik Schumacher
asked Sep 19 at 20:06
briennakh
3 Answers
Answer (9 votes, accepted)
sequences = {"MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEIC
DSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNN
ESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQI
FLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRS
GKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIF
GAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEV
EGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHK
CDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKG
NIRCNICI"};
counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
   "M", "F", "P", "S", "T", "W", "Y", "V"}, {0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};
and the code:
new = Values[
  (CharacterCounts /@ sequences)[[All, First@counts]]
  ];
counts[[2 ;;]] += new;
counts
{{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M",
  "F", "P", "S", "T", "W", "Y", "V"}, {31, 27, 45, 30, 18, 27, 25, 42,
  11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}}
answered Sep 19 at 20:23
Kuba♦
Thank you, this works as well! – briennakh Sep 19 at 20:34
This is also much faster than kglr's solution (see my post for timing examples). – Henrik Schumacher Sep 19 at 20:55
@Kuba, as kglr pointed out, there might be occurrences of Missing["KeyAbsent", key] in your result. Using Lookup[CharacterCounts /@ sequences, First@counts, 0]; is not only a bit faster but also replaces the Missing["KeyAbsent", key] by 0. – Henrik Schumacher Sep 19 at 21:15
@Kuba ... and Lookup[CharacterCounts[sequences], letters, 0]; is even a bit faster still. – Henrik Schumacher Sep 19 at 21:25
I ended up using letterCounts = Lookup[CharacterCounts[sequences], letters, 0]; and, borrowing from kglr, letterCountsOutput = Join[{letters}, letterCounts]; letterCountsOutput // Grid – briennakh Sep 19 at 22:09
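The Lookup variant discussed in these comments can be tried on a toy input. The following is a minimal sketch, assuming a hypothetical three-letter heading and two short strings (not the data from the question); it relies only on CharacterCounts, Lookup, and the Part assignment already used in the answer.

```wolfram
(* Toy data, chosen for illustration only *)
letters = {"A", "R", "N"};
counts = {letters, {0, 0, 0}, {0, 0, 0}};
sequences = {"ANNA", "RAN"};

(* One row of counts per string; a letter absent from a string gives 0 *)
new = Lookup[CharacterCounts /@ sequences, letters, 0];

(* Add the new counts to the existing count rows in place *)
counts[[2 ;;]] += new;
counts
(* {{"A", "R", "N"}, {2, 0, 2}, {1, 1, 1}} *)
```

Because += adds to whatever is already stored, evaluating the update a second time would double the counts; reset the rows to zero first if the cell is re-run.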
Answer (7 votes)
I can propose two things that speed up the letter counting tremendously:
1.) Use ToCharacterCode to convert your strings to packed arrays of integers.
2.) Use a compiled function for additive matrix assembly.
Additive assembly of each row can be obtained with this little function:
cAssembleRow = Compile[{{a, _Integer, 1}, {max, _Integer}},
   Block[{b},
    b = Table[0, {max}];
    Do[b[[Compile`GetElement[a, i]]]++, {i, 1, Length[a]}];
    b
    ],
   CompilationTarget -> "C",
   RuntimeAttributes -> {Listable},
   Parallelization -> True,
   RuntimeOptions -> "Speed"
   ];
Borrowing a bit of code from kglr but cranking up the number of strings and their length:
sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {1000, 1000}];
letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
Here is how kglr's and Kuba's very elegant solutions perform. lcs2a is a modification of Kuba's code to cope with the Missing["KeyAbsent", key] entries that appear when one of the elements of letters does not occur in one of the elements of sequences (as kglr pointed out in a comment). It is also a bit faster.
lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0]; // RepeatedTiming // First
lcs2 = Values[(CharacterCounts /@ sequences)[[All, letters]]]; // RepeatedTiming // First
lcs2a = Lookup[CharacterCounts[sequences], letters, 0]; // RepeatedTiming // First
3.59
0.075
0.059
My version is a bit more clunky, but it does the job several times faster:
i0 = ToCharacterCode["A"][[1]] - 1;
letterpos = ToCharacterCode[StringJoin[letters]] - i0;
lcs3 = cAssembleRow[ToCharacterCode[sequences] - i0, 26][[All, letterpos]]; // RepeatedTiming // First
0.0094
When all letters occur in each element of sequences, all results are equal:
lcs == lcs2 == lcs2a == lcs3
True
edited Sep 19 at 21:24
answered Sep 19 at 20:50
Henrik Schumacher
Henrik, if some letters have 0 count in some sequences, Kubalcs will have Missing["KeyAbsent", key] instead of 0; so some additional processing is needed. – kglr Sep 19 at 21:05
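To see what the compiled function computes, the same additive assembly can be written without Compile for a single short string. This is only a sketch, assuming a hypothetical input "ABCA"; the names codes and row are illustrative and do not appear in the answer.

```wolfram
(* Offset so that "A" maps to position 1, as i0 does in the answer *)
i0 = ToCharacterCode["A"][[1]] - 1;

(* Character codes of a toy string, shifted into the range 1..26 *)
codes = ToCharacterCode["ABCA"] - i0;   (* {1, 2, 3, 1} *)

(* One counter per letter of the alphabet; bump the counter for each code *)
row = ConstantArray[0, 26];
Do[row[[c]]++, {c, codes}];

(* Pick out the counters for "A", "B", "C" *)
row[[{1, 2, 3}]]
(* {2, 1, 1} *)
```

Like the compiled version called with max = 26, this assumes the strings contain only the uppercase letters A to Z; any other character would index outside the 26 counters.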
Answer (6 votes)
You can use LetterCounts as follows:
letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L",
   "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {10, 100}];
lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0];
counts = Join[{letters}, lcs];
counts // Grid
answered Sep 19 at 20:22
kglr
I like the pretty output! – briennakh Sep 19 at 20:31
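The double replacement in this answer is easier to follow when split into its two steps for one string. A small sketch, assuming a hypothetical three-letter heading and the toy string "ANNA"; step1 and step2 are illustrative names, not part of the answer.

```wolfram
letters = {"A", "R", "N"};

(* Step 1: letters that occur in the string are replaced by their counts *)
step1 = letters /. LetterCounts["ANNA"];
(* {2, "R", 2} *)

(* Step 2: letters that did not occur are still strings; set them to 0 *)
step2 = step1 /. Thread[letters -> 0];
(* {2, 0, 2} *)

(* Prepend the heading row and display the result as a table *)
Join[{letters}, {step2}] // Grid
```

The second replacement is what keeps absent letters from being left behind as strings in the count rows.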