Loop over a list of strings and increment letter count in a corresponding sublist

I have a 2D list as follows:

counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
   "M", "F", "P", "S", "T", "W", "Y", "V"},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  ...};

The first sub-list consists of a heading, and the following sub-lists contain counts, initialized at zero.

I need to loop over another list, sequences, that contains strings plus a heading, and access the corresponding sub-list in counts to increment the appropriate letter count.

For example, take a string from sequences:

MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEICDSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICI

Its corresponding sub-list in counts would be incremented to {31, 27, 45, 30, 18, 27, 25, 42, 11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}.

I obtained this via StringCount[sequences[[1]], #] & /@ counts[[1]] but am struggling to scale this code, and to make it update the sub-lists in counts instead of returning a new list.

edited Sep 20 at 6:16 by Henrik Schumacher
asked Sep 19 at 20:06 by briennakh

list-manipulation numerics string-manipulation counting


3 Answers

9 votes, accepted

sequences = {"MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEIC
DSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNN
ESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQI
FLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRS
GKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIF
GAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEV
EGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHK
CDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKG
NIRCNICI"};

counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
   "M", "F", "P", "S", "T", "W", "Y", "V"}, {0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};

and the code:

new = Values[
 (CharacterCounts /@ sequences)[[All, First@counts]]
];

counts[[2 ;;]] += new;
counts

{{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M",
  "F", "P", "S", "T", "W", "Y", "V"}, {31, 27, 45, 30, 18, 27, 25, 42,
  11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}}

answered Sep 19 at 20:23 by Kuba♦

Thank you, this works as well!
– briennakh
Sep 19 at 20:34

This is also much faster than kglr's solution (see my post for timing examples).
– Henrik Schumacher
Sep 19 at 20:55

@Kuba, as kglr pointed out, there might be occurrences of Missing["KeyAbsent", key] in your result. Using Lookup[CharacterCounts /@ sequences, First@counts, 0]; is not only a bit faster but also replaces each Missing["KeyAbsent", key] by 0.
– Henrik Schumacher
Sep 19 at 21:15

@Kuba ... and Lookup[CharacterCounts[sequences], letters, 0]; is a bit faster still.
– Henrik Schumacher
Sep 19 at 21:25

I ended up using letterCounts = Lookup[CharacterCounts[sequences], letters, 0]; and, borrowing from kglr, letterCountsOutput = Join[{letters}, letterCounts]; letterCountsOutput // Grid
– briennakh
Sep 19 at 22:09
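
To make the difference discussed in these comments concrete, here is a minimal sketch on made-up toy data. The three-letter alphabet and the two short strings are illustrative assumptions, not the asker's data. It uses Lookup with a default of 0 and then adds the result to counts in place, as in the accepted answer.

```wolfram
(* Toy data (assumed for illustration): the second sequence contains no "C". *)
letters = {"A", "C", "G"};
sequences = {"GATTACA", "GAGA"};

(* Header row followed by one row of zeros per sequence. *)
counts = Join[{letters},
   ConstantArray[0, {Length[sequences], Length[letters]}]];

(* Part-based extraction: per the comments above, a letter that is absent
   from a sequence can show up as a Missing entry rather than 0. *)
Values[(CharacterCounts /@ sequences)[[All, letters]]]

(* Lookup with a third argument substitutes the default 0 instead. *)
new = Lookup[CharacterCounts /@ sequences, letters, 0]
(* {{3, 1, 1}, {2, 0, 2}} *)

(* Update the count rows in place, leaving the header row untouched. *)
counts[[2 ;;]] += new;
counts
(* {{"A", "C", "G"}, {3, 1, 1}, {2, 0, 2}} *)
```

Because the update goes through counts[[2 ;;]] +=, running it again with further batches of the same number of sequences accumulates counts instead of overwriting them.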

7 votes

I can propose two things that speed up the letter counting tremendously:

1.) Use ToCharacterCode to convert your strings to packed arrays of integers.

2.) Use a compiled function for additive matrix assembly.

Additive assembly of each row can be obtained with this little function.

cAssembleRow = Compile[{{a, _Integer, 1}, {max, _Integer}},
  Block[{b},
   b = Table[0, {max}];
   Do[b[[Compile`GetElement[a, i]]]++, {i, 1, Length[a]}];
   b
   ],
  CompilationTarget -> "C",
  RuntimeAttributes -> {Listable},
  Parallelization -> True,
  RuntimeOptions -> "Speed"
  ];

Borrowing a bit of code from kglr but cranking up the number of strings and their length:

sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {1000, 1000}];
letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"};

Here is how kglr's and Kuba's very elegant solutions perform. lcs2a is a modification of Kuba's code that copes with Missing["KeyAbsent", key], which can occur when some element of letters is absent from some element of sequences (as kglr pointed out in a comment). It is also a bit faster.

lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0]; // RepeatedTiming // First
lcs2 = Values[(CharacterCounts /@ sequences)[[All, letters]]]; // RepeatedTiming // First
lcs2a = Lookup[CharacterCounts[sequences], letters, 0]; // RepeatedTiming // First

3.59

0.075

0.059

My version is a bit more clunky, but it does the job several times faster:

i0 = ToCharacterCode["A"][[1]] - 1;
letterpos = ToCharacterCode[StringJoin[letters]] - i0;

lcs3 = cAssembleRow[ToCharacterCode[sequences] - i0, 26][[All,letterpos]]; // RepeatedTiming // First

0.0094

When all letters occur in each element of sequences, then all results are equal:

lcs == lcs2 == lcs2a == lcs3

True
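
The binning idea behind cAssembleRow can be traced by hand on a short string. The following is only an uncompiled illustration that assumes upper-case ASCII input; the names codes and bins and the test string are made up for the example.

```wolfram
(* Offset so that "A" maps to 1, "B" to 2, ..., "Z" to 26. *)
i0 = ToCharacterCode["A"][[1]] - 1;   (* 64 *)

(* Each character becomes the index of its bin. *)
codes = ToCharacterCode["GATTACA"] - i0
(* {7, 1, 20, 20, 1, 3, 1} *)

(* One pass over the codes, incrementing the matching bin each time. *)
bins = ConstantArray[0, 26];
Do[bins[[c]]++, {c, codes}];

(* Pick out the bins of the letters of interest, here "A", "C", "G". *)
bins[[ToCharacterCode["ACG"] - i0]]
(* {3, 1, 1} *)
```

The compiled version performs the same single pass per sequence. Any character outside "A" to "Z" would index outside the 26 bins, so the input needs to be restricted to that range or max enlarged accordingly.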

edited Sep 19 at 21:24
answered Sep 19 at 20:50 by Henrik Schumacher

Henrik, if some letters have 0 count in some sequences, Kuba's lcs will have Missing["KeyAbsent", key] instead of 0; so some additional processing is needed.
– kglr
Sep 19 at 21:05

6 votes

You can use LetterCounts as follows:

letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L",
   "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {10, 100}];
lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0];
counts = Join[{letters}, lcs];
counts // Grid

[image: the resulting grid]

answered Sep 19 at 20:22 by kglr

I like the pretty output!
– briennakh
Sep 19 at 20:31
