Loop over a list of strings and increment letter count in a corresponding sublist

I have a 2D list as follows:



counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
   "M", "F", "P", "S", "T", "W", "Y", "V"},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
  {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}, ...};


The first sub-list consists of a heading, and the following sub-lists contain counts, initialized at zero.



I need to loop over another list, sequences, that contains strings plus a heading, and access the corresponding sub-list in counts to increment the appropriate letter count.



For example, take a string from sequences:




MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEICDSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICI




Its corresponding sub-list in counts would be incremented to {31, 27, 45, 30, 18, 27, 25, 42, 11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}.



I obtained this via StringCount[sequences[[1]], #] & /@ counts[[1]] but am struggling to scale this code, and to make it update the sub-lists in counts instead of returning a new list.
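
For reference, the one-liner above extends to several strings with a nested Table. This is only a sketch on made-up toy data; it assumes that the first element of sequences is the heading (so Rest drops it) and that the rows of counts line up with the remaining strings.

```mathematica
(* Toy data; the real counts and sequences are much larger. *)
counts = {{"A", "R", "N"}, {0, 0, 0}, {0, 0, 0}};
sequences = {"heading", "ARNA", "NNR"};  (* assumed layout: heading first *)

(* One row of letter counts per string, in the order of the header row. *)
new = Table[StringCount[s, l], {s, Rest[sequences]}, {l, First[counts]}];

(* Add the new counts to the existing rows of counts in place. *)
counts[[2 ;;]] += new;
counts
(* {{"A", "R", "N"}, {2, 1, 1}, {0, 1, 2}} *)
```

Because the update uses += on a Part of counts, running it again on further strings keeps accumulating instead of overwriting.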










      list-manipulation numerics string-manipulation counting






      asked Sep 19 at 20:06 by briennakh
      edited Sep 20 at 6:16 by Henrik Schumacher




















          3 Answers



          Accepted answer (score 9):











          sequences = {"MKTIIALSYILCLVFAQKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQIEVTNATELVQSSSTGEICDSPHQILDGKNCTLIDALLGDPQCDGFQNKKWDLFVERSKAYSNCYPYDVPDYASLRSLVASSGTLEFNNESFNWTGVTQNGTSSACIRRSKNSFFSRLNWLTHLNFKYPALNVTMPNNEQFDKLYIWGVHHPGTDKDQIFLYAQASGRITVSTKRSQQTVSPNIGSRPRVRNIPSRISIYWTIVKPGDILLINSTGNLIAPRGYFKIRSGKSSIMRSDAPIGKCNSECITPNGSIPNDKPFQNVNRITYGACPRYVKQNTLKLATGMRNVPEKQTRGIFGAIAGFIENGWEGMVDGWYGFRHQNSEGRGQAADLKSTQAAIDQINGKLNRLIGKTNEKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFEKTKKQLRENAEDMGNGCFKIYHKCDNACIGSIRNGTYDHDVYRDEALNNRFQIKGVELKSGYKDWILWISFAISCFLLCVALLGFIMWACQKGNIRCNICI"};

          counts = {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K",
             "M", "F", "P", "S", "T", "W", "Y", "V"},
            {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}};


          and the code:



          new = Values[
            (CharacterCounts /@ sequences)[[All, First@counts]]
            ];

          counts[[2 ;;]] += new;
          counts



          {{"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M",
            "F", "P", "S", "T", "W", "Y", "V"}, {31, 27, 45, 30, 18, 27, 25, 42,
            11, 48, 44, 37, 8, 23, 20, 41, 34, 11, 19, 25}}






          • Thank you, this works as well!
            – briennakh
            Sep 19 at 20:34










          • This is also much faster than kglr's solution (see my post for timing examples).
            – Henrik Schumacher
            Sep 19 at 20:55










          • @Kuba, as kglr pointed out, there might be occurrences of Missing["KeyAbsent", key] in your result. Using Lookup[CharacterCounts /@ sequences, First@counts, 0]; is not only a bit faster but also replaces each Missing["KeyAbsent", key] by 0.
            – Henrik Schumacher
            Sep 19 at 21:15










          • @Kuba ... and Lookup[CharacterCounts[sequences], letters, 0]; is even a further bit faster.
            – Henrik Schumacher
            Sep 19 at 21:25






          • I ended up using letterCounts = Lookup[CharacterCounts[sequences], letters, 0]; and, borrowing from kglr, letterCountsOutput = Join[{letters}, letterCounts]; letterCountsOutput // Grid
            – briennakh
            Sep 19 at 22:09
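
What the default value 0 in Lookup buys shows up on a toy input in which one letter is absent from one string. The sketch below uses made-up data and maps over the strings explicitly instead of passing the whole list to CharacterCounts; apart from that it follows the comments above.

```mathematica
letters = {"A", "B", "C"};
sequences = {"AAB", "CAB"};   (* toy data; "C" is absent from the first string *)

(* Per-string character counts, then look up each letter with default 0. *)
letterCounts = Lookup[#, letters, 0] & /@ (CharacterCounts /@ sequences)
(* {{2, 1, 0}, {1, 1, 1}} *)

(* Prepend the header row, as in the last comment. *)
letterCountsOutput = Join[{letters}, letterCounts]
(* {{"A", "B", "C"}, {2, 1, 0}, {1, 1, 1}} *)
```

With an all-integer matrix like letterCounts, an in-place update such as counts[[2 ;;]] += letterCounts stays purely numeric.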


















          Answer (score 7):













          I can propose two things that speed up the letter counting tremendously:



          1.) Use ToCharacterCode to convert your strings to packed arrays of integers.



          2.) Use a compiled function for additive matrix assembly.



          Additive assembly of each row (incrementing the counter at position a[[i]] for every entry of the integer array a) can be done with this little function:



          cAssembleRow = Compile[{{a, _Integer, 1}, {max, _Integer}},
            Block[{b},
             b = Table[0, {max}];
             Do[b[[Compile`GetElement[a, i]]]++, {i, 1, Length[a]}];
             b
             ],
            CompilationTarget -> "C",
            RuntimeAttributes -> {Listable},
            Parallelization -> True,
            RuntimeOptions -> "Speed"
            ];
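
To see what the compiled function computes, here is an uncompiled equivalent run on a tiny input. The name assembleRow is hypothetical and introduced only for illustration; it is a sketch of the same counting idea, not a replacement for the compiled code.

```mathematica
(* Hypothetical reference version: b[[k]] counts how often k occurs in a. *)
assembleRow[a_List, max_Integer] := Module[{b = ConstantArray[0, max]},
  Do[b[[k]]++, {k, a}];
  b
  ];

assembleRow[{1, 3, 3, 2, 3}, 4]
(* {1, 1, 3, 0} *)
```

Every entry of a must lie between 1 and max. The compiled version is run with RuntimeOptions -> "Speed", so it should only be fed indices that are known to be in range.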


          Borrowing a bit of code from kglr but cranking up the number of strings and their length:



          sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {1000, 1000}];
          letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"};


          Here is how kglr's and Kuba's very elegant solutions perform. lcs2a is a modification of Kuba's code that copes with Missing["KeyAbsent", key], which appears when some element of letters is absent from one of the strings in sequences (as kglr pointed out in a comment). It is also a bit faster.



          lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0]; // RepeatedTiming // First
          lcs2 = Values[(CharacterCounts /@ sequences)[[All, letters]]]; // RepeatedTiming // First
          lcs2a = Lookup[CharacterCounts[sequences], letters, 0]; // RepeatedTiming // First



          3.59



          0.075



          0.059




          My version is a bit more clunky, but it does the job several times faster:



          i0 = ToCharacterCode["A"][[1]] - 1;
          letterpos = ToCharacterCode[StringJoin[letters]] - i0;

          lcs3 = cAssembleRow[ToCharacterCode[sequences] - i0, 26][[All,letterpos]]; // RepeatedTiming // First



          0.0094
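
The index shift can be checked by hand on a short string. In this sketch, Counts and Lookup stand in for the compiled histogram, the names codes, hist and pos are made up for the example, and every character is assumed to be an uppercase letter from "A" to "Z".

```mathematica
i0 = First[ToCharacterCode["A"]] - 1;     (* 64, so "A" -> 1, ..., "Z" -> 26 *)
codes = ToCharacterCode["BANANA"] - i0
(* {2, 1, 14, 1, 14, 1} *)

(* Histogram over positions 1..26, here with Counts instead of compiled code. *)
hist = Lookup[Counts[codes], Range[26], 0];

(* Pick the columns for the letters of interest, in the requested order. *)
pos = ToCharacterCode[StringJoin[{"A", "N", "B"}]] - i0;   (* {1, 14, 2} *)
hist[[pos]]
(* {3, 2, 1} *)
```

Any character outside "A" to "Z" would produce an index outside 1..26, so real input should be filtered or validated before it reaches the compiled function.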




          When all letters occur in each element of sequences, all results are equal:



          lcs == lcs2 == lcs2a == lcs3



          True







          • Henrik, if some letters have 0 count in some sequences, Kuba's lcs2 will have Missing["KeyAbsent", key] instead of 0; so some additional processing is needed.
            – kglr
            Sep 19 at 21:05

















          Answer (score 6):













          You can use LetterCounts as follows:



          letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L",
             "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
          sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {10, 100}];
          lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0];
          counts = Join[{letters}, lcs];
          counts // Grid


          (Image: the counts table as displayed by Grid.)






          • I like the pretty output!
            – briennakh
            Sep 19 at 20:31










          The accepted answer was posted by Kuba♦ (answered Sep 19 at 20:23).
          • Thank you, this works as well!
            – briennakh
            Sep 19 at 20:34










          • This is also much faster than kglr's solution (see my post for timing examples).
            – Henrik Schumacher
            Sep 19 at 20:55










          • @Kuba, as kglr pointed out, there might be occurences of Missing[AbsentKey] in your result. Using Lookup[CharacterCounts /@ sequences, First@counts, 0]; is not only a bit faster but also replaces the Missing[AbsentKey] by 0.
            – Henrik Schumacher
            Sep 19 at 21:15










          • @Kuba ... and Lookup[CharacterCounts[sequences], letters, 0]; is even a further bit faster.
            – Henrik Schumacher
            Sep 19 at 21:25






          • 2




            I ended up using letterCounts = Lookup[CharacterCounts[sequences], letters, 0]; and borrowing from klgr letterCountsOutput = Join[letters, letterCounts]; letterCountsOutput // Grid
            – briennakh
            Sep 19 at 22:09

















          • Thank you, this works as well!
            – briennakh
            Sep 19 at 20:34










          • This is also much faster than kglr's solution (see my post for timing examples).
            – Henrik Schumacher
            Sep 19 at 20:55










          • @Kuba, as kglr pointed out, there might be occurences of Missing[AbsentKey] in your result. Using Lookup[CharacterCounts /@ sequences, First@counts, 0]; is not only a bit faster but also replaces the Missing[AbsentKey] by 0.
            – Henrik Schumacher
            Sep 19 at 21:15










          • @Kuba ... and Lookup[CharacterCounts[sequences], letters, 0]; is even a further bit faster.
            – Henrik Schumacher
            Sep 19 at 21:25






          • 2




            I ended up using letterCounts = Lookup[CharacterCounts[sequences], letters, 0]; and borrowing from klgr letterCountsOutput = Join[letters, letterCounts]; letterCountsOutput // Grid
            – briennakh
            Sep 19 at 22:09
















          Thank you, this works as well!
          – briennakh
          Sep 19 at 20:34




          Thank you, this works as well!
          – briennakh
          Sep 19 at 20:34












          This is also much faster than kglr's solution (see my post for timing examples).
          – Henrik Schumacher
          Sep 19 at 20:55




          This is also much faster than kglr's solution (see my post for timing examples).
          – Henrik Schumacher
          Sep 19 at 20:55












          @Kuba, as kglr pointed out, there might be occurences of Missing[AbsentKey] in your result. Using Lookup[CharacterCounts /@ sequences, First@counts, 0]; is not only a bit faster but also replaces the Missing[AbsentKey] by 0.
          – Henrik Schumacher
          Sep 19 at 21:15




          @Kuba, as kglr pointed out, there might be occurences of Missing[AbsentKey] in your result. Using Lookup[CharacterCounts /@ sequences, First@counts, 0]; is not only a bit faster but also replaces the Missing[AbsentKey] by 0.
          – Henrik Schumacher
          Sep 19 at 21:15












          @Kuba ... and Lookup[CharacterCounts[sequences], letters, 0]; is even a further bit faster.
          – Henrik Schumacher
          Sep 19 at 21:25




          @Kuba ... and Lookup[CharacterCounts[sequences], letters, 0]; is even a further bit faster.
          – Henrik Schumacher
          Sep 19 at 21:25




          2




          2




          I ended up using letterCounts = Lookup[CharacterCounts[sequences], letters, 0]; and borrowing from klgr letterCountsOutput = Join[letters, letterCounts]; letterCountsOutput // Grid
          – briennakh
          Sep 19 at 22:09





          I ended up using letterCounts = Lookup[CharacterCounts[sequences], letters, 0]; and borrowing from klgr letterCountsOutput = Join[letters, letterCounts]; letterCountsOutput // Grid
          – briennakh
          Sep 19 at 22:09











          up vote
          7
          down vote













          I can propose two things that speed up the letter counting tremendously:



          1.) Use ToCharacterCode to convert your strings to packed arrays of integers.



          2.) Use a compiled funcion for additive matrix assembly.



          Additive assembly of each row can be obtained with this little function.



          cAssembleRow = Compile[a, _Integer, 1, max, _Integer,
          Block[b,
          b = Table[0, max];
          Do[b[[Compile`GetElement[a, i]]]++, i, 1, Length[a]];
          b
          ],
          CompilationTarget -> "C",
          RuntimeAttributes -> Listable,
          Parallelization -> True,
          RuntimeOptions -> "Speed"
          ];


          Borrowing a bit of code from kglr but cranking up the amount of strings and their length:



          sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet, 1000, 1000];
          letters = "A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V";


          Here is how kglr's and Kuba's very elegant solutions perform. lcs2a is a modification of Kuba's code to cope with Missing[AbsentKey] which may occur when some of the elements of letters do not occur in any of the elements in sequences (as kglr pointed out in a comment). It is also a bit faster.



          lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0]; // RepeatedTiming // First
          lcs2 = Values[(CharacterCounts /@ sequences)[[All, letters]]]; // RepeatedTiming // First
          lcs2a = Lookup[CharacterCounts[sequences], letters, 0]; // RepeatedTiming // First



          3.59



          0.075



          0.059




          My version is a bit more clunky, but it does the job several times faster:



          i0 = ToCharacterCode["A"][[1]] - 1;
          letterpos = ToCharacterCode[StringJoin[letters]] - i0;

          lcs3 = cAssembleRow[ToCharacterCode[sequences] - i0, 26][[All,letterpos]]; // RepeatedTiming // First



          0.0094




          When all letters occur in each element of `sequences, then all results are equal:



          lcs == lcs2 == lcs2a == lcs3



          True







          share|improve this answer


















          • 4




            Henrik, if some letters have 0 count in some sequences, Kubalcs will have Missing[KeyAbsent] instead of 0; so some additional processing is needed.
            – kglr
            Sep 19 at 21:05














          up vote
          7
          down vote













          I can propose two things that speed up the letter counting tremendously:



          1.) Use ToCharacterCode to convert your strings to packed arrays of integers.



          2.) Use a compiled funcion for additive matrix assembly.



          Additive assembly of each row can be obtained with this little function.



          cAssembleRow = Compile[a, _Integer, 1, max, _Integer,
          Block[b,
          b = Table[0, max];
          Do[b[[Compile`GetElement[a, i]]]++, i, 1, Length[a]];
          b
          ],
          CompilationTarget -> "C",
          RuntimeAttributes -> Listable,
          Parallelization -> True,
          RuntimeOptions -> "Speed"
          ];


          Borrowing a bit of code from kglr but cranking up the amount of strings and their length:



          sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet, 1000, 1000];
          letters = "A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V";


          Here is how kglr's and Kuba's very elegant solutions perform. lcs2a is a modification of Kuba's code to cope with Missing[AbsentKey] which may occur when some of the elements of letters do not occur in any of the elements in sequences (as kglr pointed out in a comment). It is also a bit faster.



          lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0]; // RepeatedTiming // First
          lcs2 = Values[(CharacterCounts /@ sequences)[[All, letters]]]; // RepeatedTiming // First
          lcs2a = Lookup[CharacterCounts[sequences], letters, 0]; // RepeatedTiming // First



          3.59



          0.075



          0.059




          My version is a bit more clunky, but it does the job several times faster:



          i0 = ToCharacterCode["A"][[1]] - 1;
          letterpos = ToCharacterCode[StringJoin[letters]] - i0;

          lcs3 = cAssembleRow[ToCharacterCode[sequences] - i0, 26][[All,letterpos]]; // RepeatedTiming // First



          0.0094




          When all letters occur in each element of `sequences, then all results are equal:



          lcs == lcs2 == lcs2a == lcs3



          True

























          • 4




            Henrik, if some letters have 0 count in some sequences, Kubalcs will have Missing[KeyAbsent] instead of 0; so some additional processing is needed.
            – kglr
            Sep 19 at 21:05



















          edited Sep 19 at 21:24

























          answered Sep 19 at 20:50









          Henrik Schumacher











          up vote
          6
          down vote













          You can use LetterCounts as follows:



          letters = {"A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L",
          "K", "M", "F", "P", "S", "T", "W", "Y", "V"};
          sequences = StringJoin /@ RandomChoice[Capitalize@Alphabet[], {10, 100}];
          lcs = letters /. LetterCounts /@ sequences /. Thread[letters -> 0];
          counts = Join[{letters}, lcs];
          counts // Grid


          [Image: the resulting grid of counts]
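The two replacement steps can be traced on a single short string. This is a sketch with an assumed three-letter list and the assumed example string "AAB":

```mathematica
letters = {"A", "B", "C"};

(* first /. inserts the counts found; second /. turns leftover letters into 0 *)
letters /. LetterCounts["AAB"] /. Thread[letters -> 0]
(* {2, 1, 0} *)
```

After the first replacement, "C" is still a string because it has no count, and the rule list Thread[letters -> 0] then maps it to 0.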


























          • I like the pretty output!
            – briennakh
            Sep 19 at 20:31
























          answered Sep 19 at 20:22









          kglr


















           
