bash string manipulation speed vs. pipeline

I have a file filled with md5 checksums and filenames. I need to perform some processing on each line, so I need to know:

Which is the checksum

Which is the filename

and act accordingly. That is, I need to slurp the checksum into a variable, then the filename. The filename may have non-ascii characters in it but I don't expect to see newlines. It looks like this:

05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1
cd9d4291f59a43c0e3d73ff60a337bb5 ./.fseventsd/00000000000fdfec
5d1280769e741e04622cfd852f33a138 ./.fseventsd/0000000000103197
8dda3534e5bbc0be1d15db2809123c50 ./.fseventsd/000000000017c9ca
(...etc., about 100,000 lines)

Traditionally, I might perform something like this:

md5sum=$(echo $line | awk '{print $1}')
filename=$(echo $line | sed 's/[^ ]* //')

But how much faster would it be if I did this:

md5sum=${line%%" "*}
filename=${line#*" "}

asked Jul 12 at 22:00

Mike S


The correct solution to this depends on what it is you are expecting to do with the pathnames and checksums.
– Kusalananda
Jul 13 at 7:14

Related: unix.stackexchange.com/q/169716/117549
– Jeff Schaller
Jul 14 at 0:33


2 Answers

Yes. Using bash builtins avoids many system calls, because no external process has to be started. The saving is largest when the operation is repeated many times.

Another example: prefer the glob * over $(ls).

Bash offers a few ways to do simple string manipulation (cutting and substitution), but not much more, because that is not what it is made for. For example, anything beyond simple pattern checks on a string (which [[ ... ]] and case can handle) quickly becomes awkward without an external command.

External programs are better optimized for their specific tasks (cat, sed, grep, awk, cut, sort, ...).
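As a rough illustration of the two points above, here is a minimal sketch that counts files with a glob instead of parsing `$(ls)`, and checks for a simple pattern with the `[[ ... ]]` builtin. The temporary directory and the file names are invented for the demo.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: directory and file names are made up.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.txt" "$demo_dir/with space.txt"

# A glob keeps each name intact, even when it contains spaces;
# word-splitting the output of $(ls) would break such names apart.
count=0
for path in "$demo_dir"/*; do
    [ -e "$path" ] || continue   # an unmatched glob stays literal, so skip it
    count=$((count + 1))
done
echo "files found: $count"

# Simple pattern test done by the shell itself, no grep needed.
name="with space.txt"
if [[ $name == *" "* ]]; then
    has_space=yes
else
    has_space=no
fi
echo "contains a space: $has_space"

rm -rf "$demo_dir"
```

Both checks run entirely inside the shell process, so nothing is forked per file or per string.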

answered Jul 13 at 2:46

alux


I tested by setting only one of the variables, running this script twice:

while read line; do
 md5sum=${line%%" "*}
 #md5sum=$(echo $line | awk '{print $1}')
 echo "SUM: $md5sum FILE:_$file"
done < manifest.Stuph.180620

first with

md5sum=${line%%" "*}

and next with

md5sum=$(echo $line | awk '{print $1}')

The file "manifest.Stuph.180620" is 100939 lines long (about 14 MiB). The results:

First run (using bash's builtin string manipulation)

real 0m4.750s
user 0m4.174s
sys 0m0.550s

Second run (using the pipeline)

real 10m54.255s
user 4m42.257s
sys 7m32.880s

Some (myself included) will say that if speed matters you shouldn't be messing around in the shell anyway, but sometimes you want to be more efficient no matter what environment you're using for the job.

Note that doing this:

while read md5sum filename; do
 (...etc...)

is even more efficient than doing the variable assignment, but the improvement is far smaller than the one gained by eliminating the command substitution/pipe/awk construct. The thing I find most interesting is the gap between bash builtin performance and external commands. I'll be more diligent about learning and using the fancy builtin stuff!
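A complete version of that loop might look like the sketch below. The first sample line is taken from the question; the second path (with spaces) and the `sums` and `names` arrays are invented here to show that the last variable given to `read` receives the rest of the line.

```bash
#!/usr/bin/env bash
# Sketch: let `read` split each line into checksum and path.
# The second sample path and the array names are made up for illustration.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
05c00367e8914ca1be0964821d127977 ./.fseventsd/0000000000097aa1
cd9d4291f59a43c0e3d73ff60a337bb5 ./dir with spaces/file name
EOF

sums=()
names=()
# -r keeps backslashes literal; the first word goes to md5sum and
# everything after it (inner spaces included) goes to filename.
while read -r md5sum filename; do
    sums+=("$md5sum")
    names+=("$filename")
    printf 'SUM: %s FILE: %s\n' "$md5sum" "$filename"
done < "$manifest"

rm -f "$manifest"
```

Because `read` is a builtin, this splits each line without starting any extra process.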

edited Jul 13 at 23:22

answered Jul 12 at 22:00

Mike S


Why the cat in a process substitution? And why not read checksum and pathname with read directly?
– Kusalananda
Jul 13 at 7:12

1) Muscle memory. Thanks, I've removed the spurious cat. Bad kitty! 2) True, that works, and maybe it's more succinct, but I was interested in the difference between the builtin and the pipe where the technique is similar between the two styles. I often use the pipe construct for more complex problems. I was curious how much that hurt my efficiency, and whether I should work harder at optimizing it in my shell work in general. It seems the answer is yes, where it may help (i.e., with larger datasets). This is just an example; I was surprised at the speedup!
– Mike S
Jul 13 at 23:31


