bash string manipulation speed vs. pipeline

I have a file filled with MD5 checksums and filenames. I need to process each line, so I need to know:

  • which part is the checksum
  • which part is the filename

and act accordingly. That is, I need to read the checksum into one variable and the filename into another. The filenames may contain non-ASCII characters, but I don't expect to see newlines. The file looks like this:

05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1
cd9d4291f59a43c0e3d73ff60a337bb5 ./.fseventsd/00000000000fdfec
5d1280769e741e04622cfd852f33a138 ./.fseventsd/0000000000103197
8dda3534e5bbc0be1d15db2809123c50 ./.fseventsd/000000000017c9ca
(...etc., about 100,000 lines)

Traditionally, I might do something like this:

md5sum=$(echo $line | awk '{print $1}')
filename=$(echo $line | sed 's/[^ ]* //')

But how much faster would it be if I did this:

md5sum=${line%%" "*}
filename=${line#*" "}
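
For reference, here is a minimal sketch of what the two parameter expansions do to one of the sample lines above. The variable names follow the question; the `printf` at the end is only there to show the result.

```bash
#!/usr/bin/env bash
# Sample line copied from the manifest excerpt in the question.
line='05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1'

# ${var%%pattern} removes the longest suffix matching the pattern:
# everything from the first space onward is dropped.
md5sum=${line%%" "*}

# ${var#pattern} removes the shortest prefix matching the pattern:
# everything up to and including the first space is dropped.
filename=${line#*" "}

printf 'SUM: %s\nFILE: %s\n' "$md5sum" "$filename"
```

Because `#` takes the shortest prefix, a filename that itself contains spaces should stay in one piece after the split.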








asked Jul 12 at 22:00 by Mike S











  • The correct solution to this depends on what it is you are expecting to do with the pathnames and checksums.
    – Kusalananda
    Jul 13 at 7:14










  • Related: unix.stackexchange.com/q/169716/117549
    – Jeff Schaller
    Jul 14 at 0:33


























2 Answers
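Yes. Using bash's built-in string operations avoids many system calls, in particular the fork and exec needed to start each external command. The saving is largest when the operation is repeated many times, as in a loop or a recursive function.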



Another example: use the glob * rather than $(ls).
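
A small sketch of the glob-versus-`ls` point, using a throwaway directory created just for the example (the file names are made up). The glob is expanded by the shell itself and keeps each name whole, whereas `$(ls)` starts an external process and then has its output split on whitespace.

```bash
#!/usr/bin/env bash
# Throwaway directory with one awkward file name (hypothetical names).
dir=$(mktemp -d)
touch "$dir/plain.txt" "$dir/with space.txt"

# Glob: expanded by the shell, no external process, names stay whole.
glob_count=0
for f in "$dir"/*; do
    glob_count=$((glob_count + 1))
done

# $(ls): forks ls, then word-splits its output, so "with space.txt"
# is seen as two separate words.
ls_count=0
for f in $(ls "$dir"); do
    ls_count=$((ls_count + 1))
done

echo "glob saw $glob_count names, \$(ls) saw $ls_count words"
rm -rf "$dir"
```

So the glob is not only cheaper, it is also the variant that handles unusual file names correctly.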



Bash offers a few ways to do simple string manipulation (trimming and substitution), but not much more, because it was not designed for that. Example: beyond simple glob or regex tests, it is difficult to validate a pattern in a string without an external command.
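
As a hedged illustration of where the line falls: a simple "does this string look like that" test needs no external command, since `case` (and, in bash, `[[ $s == *pat* ]]`) matches glob patterns. The function name `looks_like_md5_line` is invented for this sketch, and the check is deliberately loose.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the line starts with a hex character
# and contains a space followed by "./", checked with glob patterns only.
looks_like_md5_line() {
    case $1 in
        [0-9a-f]*" ./"*) return 0 ;;
        *)               return 1 ;;
    esac
}

line='05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1'
if looks_like_md5_line "$line"; then
    verdict=match
else
    verdict=nomatch
fi
echo "$verdict"
```

A stricter check, such as exactly 32 hex digits, is where a regular expression (bash's `[[ $s =~ ... ]]` or an external `grep`) starts to be more comfortable than globs.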



External programs (cat, sed, grep, awk, cut, sort, ...) are better optimized for their own tasks, so they tend to win when a single invocation processes the whole input.
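
One way to read this point, sketched under assumptions: an external tool pays off when one invocation handles the entire file, rather than being started once per line. The manifest here is a two-line temporary file built from the sample in the question.

```bash
#!/usr/bin/env bash
# Tiny stand-in for the real manifest (lines copied from the question).
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1
cd9d4291f59a43c0e3d73ff60a337bb5 ./.fseventsd/00000000000fdfec
EOF

# One awk process for the whole file: the checksum is field 1, and the
# filename is whatever follows the first space (it may contain spaces).
result=$(awk '{
    sum = $1
    name = substr($0, index($0, " ") + 1)
    print "SUM: " sum " FILE: " name
}' "$manifest")

printf '%s\n' "$result"
rm -f "$manifest"
```

Whether this fits depends on what has to happen per line; if the processing itself must run in the shell, the built-in expansions remain the cheaper choice.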


















answered Jul 13 at 2:46 by alux






















I tested by setting one of the variables, running this script twice:

while read line; do
    md5sum=${line%%" "*}
    #md5sum=$(echo $line | awk '{print $1}')
    echo "SUM: $md5sum FILE:_$file"
done < manifest.Stuph.180620

first with

md5sum=${line%%" "*}

and next with

md5sum=$(echo $line | awk '{print $1}')


The file manifest.Stuph.180620 is 100,939 lines long (about 14 MiB). The results:

First run (using bash's builtin string manipulation)

real 0m4.750s
user 0m4.174s
sys 0m0.550s

Second run (using the pipeline), about 138 times the wall-clock time of the first

real 10m54.255s
user 4m42.257s
sys 7m32.880s
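
To try the comparison without the original manifest, something like the following could be used. It is only a sketch: the manifest is synthetic, the line count of 200 is an arbitrary choice to keep the slow variant bearable, and the function names are invented here. It uses `read -r` and quotes `"$line"`, which the script above does not.

```bash
#!/usr/bin/env bash
# Build a small synthetic manifest: fake 32-hex-digit "checksums" plus paths.
manifest=$(mktemp)
i=0
while [ "$i" -lt 200 ]; do
    printf '%032x ./dir/file%d\n' "$i" "$i"
    i=$((i + 1))
done > "$manifest"

# Variant 1: parameter expansion only, no processes started per line.
split_builtin() {
    while read -r line; do
        printf '%s\n' "${line%%" "*}"
    done < "$1"
}

# Variant 2: one command substitution, one pipe and one awk per line.
split_pipeline() {
    while read -r line; do
        printf '%s\n' "$(echo "$line" | awk '{print $1}')"
    done < "$1"
}

# Both variants should produce identical output; prefix either call
# with `time` to compare how long they take.
out_builtin=$(split_builtin "$manifest")
out_pipeline=$(split_pipeline "$manifest")
[ "$out_builtin" = "$out_pipeline" ] && echo "outputs match"
rm -f "$manifest"
```

Running `time split_builtin "$manifest" > /dev/null` and the same for `split_pipeline` should show the same kind of gap, though the absolute numbers will differ by machine.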


Some (myself included) will say that if speed matters, you shouldn't be messing around in the shell anyway. Still, sometimes you want to be more efficient no matter what environment you're using for the job.



Note that doing this:

while read md5sum filename; do
    (...etc...)

is even more efficient than doing the variable assignment, but the gain is much smaller than the one from eliminating the command substitution/pipe/awk construct. The thing I find most interesting is the difference between bash built-in performance and using external commands. I'll be more diligent about learning and using the fancy builtin stuff!
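
A minimal sketch of the `read`-based variant, assuming the manifest format shown in the question. `IFS=' '` and `-r` are additions not present in the loop above: `-r` keeps backslashes literal, and with two variable names `read` puts everything after the first field, spaces included, into the second. The manifest below is a made-up two-line sample.

```bash
#!/usr/bin/env bash
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1
cd9d4291f59a43c0e3d73ff60a337bb5 ./dir/name with spaces
EOF

count=0
# read splits off the first field; the remainder of the line goes
# into the last variable.
while IFS=' ' read -r md5sum filename; do
    count=$((count + 1))
    last_sum=$md5sum
    last_name=$filename
    printf 'SUM: %s FILE: %s\n' "$md5sum" "$filename"
done < "$manifest"
rm -f "$manifest"
```

One caveat with this approach: `read` trims trailing spaces from the last field, so a filename that ends in a space would be altered.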















answered Jul 12 at 22:00 by Mike S, edited Jul 13 at 23:22











  • Why the cat in a process substitution? And why not read checksum and pathname with read directly?
    – Kusalananda
    Jul 13 at 7:12

  • 1) Muscle memory. Thanks, I've removed the spurious cat. Bad kitty! 2) True, that works, and maybe it's more succinct, but I was interested in the difference between the builtin and the pipe where the technique used is similar between the two styles. I often use the pipe construct with more complex problems. I was curious how much that hurt my efficiency, and whether I should apply myself more to optimizing it in my shell work in general. It seems the answer is yes, where it may help (i.e., larger datasets). This is just an example. I was surprised at the speedup!
    – Mike S
    Jul 13 at 23:31
















