bash string manipulation speed vs. pipeline

I have a file filled with md5 checksums and filenames. I need to perform some processing on each line, so I need to know:
- Which is the checksum
- Which is the filename
and act accordingly. That is, I need to slurp the checksum into a variable, then the filename. The filename may have non-ASCII characters in it, but I don't expect to see newlines. It looks like this:
    05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1
    cd9d4291f59a43c0e3d73ff60a337bb5 ./.fseventsd/00000000000fdfec
    5d1280769e741e04622cfd852f33a138 ./.fseventsd/0000000000103197
    8dda3534e5bbc0be1d15db2809123c50 ./.fseventsd/000000000017c9ca
(...etc., about 100,000 lines)
Traditionally, I might perform something like this:
    md5sum=$(echo $line | awk '{print $1}')
    filename=$(echo $line | sed 's/[^ ]* //')
But how much faster would it be if I did this instead:
    md5sum=${line%%" "*}
    filename=${line#*" "}
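To make the two approaches concrete, here is a minimal sketch that applies both to the first sample line from the question. The variable names `md5sum_pipe` and `filename_pipe` are hypothetical, introduced only to keep the results side by side, and `printf` is used in place of `echo` purely for predictable output.

```bash
line='05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1'

# Pipeline version: each assignment forks a subshell and runs an external program.
md5sum_pipe=$(printf '%s\n' "$line" | awk '{print $1}')
filename_pipe=$(printf '%s\n' "$line" | sed 's/[^ ]* //')

# Parameter expansion version: evaluated inside the shell, no new processes.
md5sum=${line%%" "*}    # strip the longest suffix that starts at a space
filename=${line#*" "}   # strip the shortest prefix that ends at a space

printf 'SUM: %s FILE: %s\n' "$md5sum" "$filename"
```

Both versions should yield the same two values for lines of this shape; the difference the question asks about is only in how much work the shell does to get them.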
bash
asked Jul 12 at 22:00
Mike S
The correct solution to this depends on what it is you are expecting to do with the pathnames and checksums.
– Kusalananda
Jul 13 at 7:14
Related: unix.stackexchange.com/q/169716/117549
– Jeff Schaller
Jul 14 at 0:33
2 Answers
Yes, using bash's built-in operations avoids many system calls, because no external process has to be started. This matters most when the operation is repeated many times, as in a loop or in recursion.
Another example: prefer the glob * over $(ls).
Bash offers a few simple string manipulations (trimming and substitution), but not much more, because it was not designed for that. For example, validating the presence of a complex pattern in a string can be difficult without an external command.
External programs (cat, sed, grep, awk, cut, sort, ...) are better optimized for their specific tasks.
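As an illustration of the glob advice, and of how far the shell can go on its own for simple pattern tests, here is a small sketch. The scratch directory and the file names in it are hypothetical, created only for this example; the sample line is taken from the question.

```bash
# Hypothetical scratch directory, created only for this illustration.
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b c.txt"

# A glob is expanded by the shell itself and keeps each name intact,
# whereas $(ls) starts a process and its output is split on whitespace.
count=0
for f in "$dir"/*; do
    count=$((count + 1))
done

# case performs glob-style pattern matching inside the shell, without grep.
line='05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1'
case $line in
    *fseventsd*) found=yes ;;
    *)           found=no ;;
esac

printf 'files: %s, pattern found: %s\n' "$count" "$found"
rm -r "$dir"
```

Simple glob-style checks like this stay inside the shell; for anything resembling real text processing, an external tool such as grep or awk is usually the better fit.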
answered Jul 13 at 2:46
alux
I tested by setting just one of the variables. Running this script twice:
    while read line; do
        md5sum=${line%%" "*}
        #md5sum=$(echo $line | awk '{print $1}')
        echo "SUM: $md5sum FILE:_$file"   # $file is never set in this test
    done < manifest.Stuph.180620
first with
    md5sum=${line%%" "*}
and then with
    md5sum=$(echo $line | awk '{print $1}')
where the file "manifest.Stuph.180620" is 100939 lines long (about 14 MiB), gave the following results:
First run (using bash's builtin string manipulation):
    real 0m4.750s
    user 0m4.174s
    sys 0m0.550s
Second run (using the pipeline):
    real 10m54.255s
    user 4m42.257s
    sys 7m32.880s
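The comparison above can be reproduced on a smaller scale with a sketch like the following. Everything here is an assumption made for illustration: the 200-line synthetic manifest, the zero-padded stand-in checksums, and the function names `builtin_version` and `pipeline_version` are not from the original test.

```bash
# Synthetic manifest: 200 hypothetical lines (the original test used 100939).
manifest=$(mktemp)
i=0
while [ "$i" -lt 200 ]; do
    printf '%032d ./file%d\n' "$i" "$i"
    i=$((i + 1))
done > "$manifest"

# Variant 1: parameter expansion, no external processes per line.
builtin_version() {
    while read -r line; do
        md5sum=${line%%" "*}
        printf '%s\n' "$md5sum"
    done < "$manifest"
}

# Variant 2: command substitution plus a pipe into awk for every line.
pipeline_version() {
    while read -r line; do
        md5sum=$(printf '%s\n' "$line" | awk '{print $1}')
        printf '%s\n' "$md5sum"
    done < "$manifest"
}

# Capture both outputs so they can be compared for equality.
out_builtin=$(builtin_version)
out_pipeline=$(pipeline_version)
rm -f "$manifest"
```

In bash, moving the `rm` to the end and prefixing each call with the `time` keyword (for example `time builtin_version > /dev/null`) gives timings comparable in kind to the ones reported. The absolute numbers will differ by machine, but the per-line cost of forking a subshell and an awk process should dominate the second variant.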
Some (myself included) will say that if speed matters, you shouldn't be messing around in the shell anyway, but sometimes you want to be more efficient no matter what environment you're using for the job.
Note that doing this:
    while read md5sum filename; do
    (...etc...)
is even more efficient than doing the variable assignment, though that gain is much smaller than the one from eliminating the command substitution/pipe/awk construct. The thing I find most interesting is the difference in performance between bash builtins and external commands. I'll be more diligent about learning and using the fancy builtin stuff!
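A minimal sketch of that last form, reading both fields directly with `read`, might look like this. The two-line temporary manifest stands in for manifest.Stuph.180620, the path containing spaces is a hypothetical example, and the `-r` option is an addition that was not part of the timed script.

```bash
# Hypothetical two-line manifest standing in for manifest.Stuph.180620.
manifest=$(mktemp)
printf '%s\n' \
    '05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1' \
    'cd9d4291f59a43c0e3d73ff60a337bb5 ./dir/name with spaces' > "$manifest"

# read splits on IFS: the first word goes to md5sum, and the rest of the
# line, internal spaces included, goes to filename. -r keeps backslashes literal.
last_sum='' last_file='' lines=0
while read -r md5sum filename; do
    printf 'SUM: %s FILE: %s\n' "$md5sum" "$filename"
    last_sum=$md5sum last_file=$filename lines=$((lines + 1))
done < "$manifest"
rm -f "$manifest"
```

One caveat with this form: with the default IFS, `read` trims leading and trailing whitespace from the line, so a filename that ends in a space would lose it.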
edited Jul 13 at 23:22
answered Jul 12 at 22:00
Mike S
Why the cat in a process substitution? And why not read checksum and pathname with read directly?
– Kusalananda
Jul 13 at 7:12
1) Muscle memory. Thanks, I've removed the spurious cat. Bad kitty! 2) True, that works, and maybe it's more succinct, but I was interested in the difference between the builtin and the pipe where the technique used is similar between the two styles. I often use the pipe construct with more complex problems. I was curious as to how much that hurt my efficiency, and whether I should apply myself more to optimizing it in my shell work in general. It seems the answer is yes, where it may help (i.e., larger datasets). This is just an example; I was surprised at the speedup!
– Mike S
Jul 13 at 23:31