Increase the speed of a Bash script that uses grep in a while loop

I have a script that processes a large file (>500 MB) whose lines follow this scheme:
odd lines: >BLA_BLA lenght_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT # on even lines there's a DNA sequence
For each odd line, it recalculates the value after "cov." using parameters passed as arguments and replaces the old value. For each even line, it calculates the percentage of "G" and "C" in the DNA sequence.
So, the output looks like:
> BLA_BLA lenght_XX
> nucleotidic_cov XX
> DNA seq (the same as on the even lines)
> GC_CONT: XX
Here's the code (only the loop):
    K=$(($READLENGHT - $KMER + 1))
    Y=$(echo "scale=4; $K / $READLENGHT" | bc)
    while read odd; do
        echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: "
        echo "scale=4;${odd##*_} / $Y" | bc
        read even
        echo "${even##}" &&
        ACOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "A")
        GCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "G")
        CCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "C")
        TCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "T")
        TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
        GCCONT=$(($GCOUNT+$CCOUNT))
        printf "GC_CONT: "
        echo "scale=2;$GCCONT / $TOTALBASES *100" | bc
    done < "$1"
It's incredibly slow when run against a huge text file (bigger than 500 MB) on a 16-core server. Any ideas on how to speed this script up?
EDIT
As requested, the desired input and output are provided via pastebin: https://pastebin.com/FY0Z7kUW
bash shell-script grep performance
see unix.stackexchange.com/questions/169716/…
– Sundeep
Feb 27 at 9:22
Adding some sample lines (say 5-10) along with the expected output would help in suggesting an alternative solution. Also, if this is bioinformatics, see bioinformatics.stackexchange.com
– Sundeep
Feb 27 at 9:24
Updated the question. I had written it incorrectly.
– Shred
Feb 27 at 9:34
The given sample is not clear. Please do not add any characters that are not actually present in your input. Post a few lines (representing different cases) exactly as they appear in your input, post the exact output required, and then explain how the transformation is done.
– Sundeep
Feb 27 at 9:44
edited Mar 9 at 9:57
asked Feb 27 at 9:13
Shred
4 Answers
Accepted answer (score 2):
The percentage calculation can be reduced to a single operation like this:
    echo "${even##}" | awk '{x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }'
gsub substitutes a pattern and returns the number of substitutions it made, so it can be used to quickly calculate the percentage.
You could also process the odd and even lines in awk. It is not clear what you are doing with the odd lines, but your complete function can be put into a single awk script:
    awk -F '_' -v Y="$Y" '{ if (NR%2==1) {
            printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
        } else {
            x=gsub(/[AT]/,"");
            y=gsub(/[GC]/,"");
            printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
        }
    }' large_file
EDIT: Based on the OP's requirement, I changed the if block for odd lines. The gsub would remove the "cov." from the number. After passing the shell variable $Y to awk with -v, we can now divide and print in the required format.
Using a single awk script instead of multiple operations will speed things up significantly, because the whole file is handled by one process instead of several new processes per line.
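As a hedged illustration of the single-awk idea above, here is a minimal sketch that wraps one awk call in a shell function. The name gc_report is hypothetical. The sketch assumes that records are exactly two lines, that header fields are separated by "_", and that the old coverage is the last field (mirroring ${odd##*_} in the question). Unlike the snippet above, it also echoes the sequence line, since the question's desired output includes it.

```shell
#!/usr/bin/env bash
# gc_report Y FILE  (hypothetical helper name)
# Assumptions: records are exactly two lines (header, sequence); header
# fields are "_"-separated and the last field is the old coverage value.
gc_report() {
    awk -F '_' -v Y="$1" '
        NR % 2 == 1 {
            # Header: keep the first four fields, then print the rescaled coverage.
            printf "%s_%s_%s_%s\nnucleotide_cov: %.4f\n", $1, $2, $3, $4, $NF / Y
            next
        }
        {
            print                           # sequence line, unchanged
            seq = $0
            gc = gsub(/[GC]/, "", seq)      # gsub returns the number of replacements
            at = gsub(/[AT]/, "", seq)
            total = gc + at
            printf "GC_CONT: %.2f\n", (total > 0) ? 100 * gc / total : 0
        }
    ' "$2"
}
```

It would be called as gc_report "$Y" "$1" in place of the whole while loop. Working on a copy of the line (seq) keeps the printed sequence intact, and the total > 0 guard avoids a division by zero on an empty line; Y itself is assumed to be non-zero.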
In odd lines, the value after "cov." is used to calculate a new value, called "nucleotidic_cov", which is printed on a new line.
– Shred
Feb 27 at 10:52
I have updated based on your explanation
– amisax
Feb 27 at 11:07
I can't figure out where the division is made between the "cov." value and the variable called Y. Please improve it; using only awk, the script runs so fast.
– Shred
Mar 2 at 15:39
@Shred in the if block we modified $3, replacing cov. with nucleotidic_cov. $3 in the line is cov.XX
– amisax
Mar 8 at 10:48
Perhaps no division is made. The cov. value needs to be divided by $Y, as in the script provided: echo "scale=4;${odd##*_} / $Y" | bc. Please improve it and this will be the best answer, because it is what I need.
– Shred
Mar 8 at 16:12
Answer (score 3):
You've reached (to put it mildly) the limits of what can be reasonably done in the shell: you should re-write your script in something like AWK, or Perl, or Python. Using a more advanced language like those will avoid having to run multiple processes for all your text processing; you'll be able to do it using built-in functions.
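To make the cost of "multiple processes" concrete, here is a hedged bash-only sketch of the GC calculation that uses nothing but shell built-ins (parameter expansion and integer arithmetic), so no sed, grep, or bc process is started per line. The function name gc_percent is hypothetical, and the sketch relies on bash-specific ${var//pattern/} expansion. It is meant to show where the time goes, not to replace a rewrite in awk, Perl, or Python.

```shell
#!/usr/bin/env bash
# gc_percent SEQUENCE  (hypothetical helper name, bash only)
# Counts bases with parameter expansion instead of sed | grep pipelines.
gc_percent() {
    local seq=$1
    local gc_only=${seq//[!GC]/}        # delete everything except G and C
    local acgt_only=${seq//[!ACGT]/}    # delete everything except A, C, G, T
    local gc=${#gc_only} total=${#acgt_only}
    if (( total == 0 )); then
        echo "GC_CONT: 0.00"
        return
    fi
    # bash has integer arithmetic only: compute percent * 100, truncated.
    local scaled=$(( 10000 * gc / total ))
    printf 'GC_CONT: %d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}
```

Inside the original loop this would be called as gc_percent "$even". The coverage division by $Y still needs a tool with floating-point support, which is one more reason the advice above to move the whole job into a single awk, Perl, or Python process is the better long-term fix.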
I'm a beginner, as can be seen from the question, in bash as well as in Python. Could you please give me some references on how to do a job like this in Python?
– Shred
Feb 27 at 9:37
Text Processing in Python is an old book covering the topic, but you'd be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
– Stephen Kitt
Feb 27 at 9:55
This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ($(cmd)) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
– Gordon Davisson
Mar 9 at 8:24
Answer (score 1):
The number of cores matters little if your program isn't parallelized (much).
You could use wc and tr rather than sed and grep, which might speed things up a bit:
    ACOUNT=$(echo "${even##}" | tr -cd 'A' | wc -m)
But really, I think the major problem is that shell, while an easy thing to program in for quick-and-dirty jobs, is just not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language, like Perl or Python, which also have threading abilities (thereby allowing you to use all your cores).
You could do it in perl somewhat like this:
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $y = ...; # calculate your Y value here

    while(my $odd = <ARGV>) {                  # Read a line from the file(s) passed
                                               # on the command line
        chomp $odd;                            # lose the newline
        my @split = split /_/, $odd;           # split the read line on a "_" boundary
                                               # into an array
        print join("_", @split[0..3]) . "\n";  # print the first four elements of the
                                               # array, separated by "_"
        print $split[$#split] / $y . "\n";     # Treat the final element of the
                                               # @split array as a number, divide it
                                               # by $y, and output the result
        my %charcount = (                      # Initialize a hash table
            A => 0,
            G => 0,
            C => 0,
            T => 0
        );
        my $even = <ARGV>;                     # read the even line
        chomp $even;
        foreach my $char (split //, $even) {   # split the string into separate
                                               # characters, and loop over them
            $charcount{$char}++;               # Count the correct character
        }
        my $total = $charcount{A} + $charcount{G} + $charcount{C} + $charcount{T};
        my $gc = $charcount{G} + $charcount{C};
        my $perc = 100 * $gc / $total;         # percentage, as in the question
        print "GC_CONT: $perc\n";              # Do our final calculations and
                                               # output the result
    }
Note: not tested (beyond "does perl accept this code")
If you want to learn more about perl, run perldoc perlintro and get started ;-)
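On the point about cores: the answer mentions threading in Perl or Python. A different and simpler route, sketched here as an assumption rather than as the answer's method, is process-level parallelism: split the file on record boundaries and run one awk process per chunk in the background. The names gc_chunk and parallel_gc are hypothetical. The sketch assumes every record is exactly two lines and that headers are "_"-separated with the coverage as the last field.

```shell
#!/usr/bin/env bash
# Sketch: process-level parallelism (not threads) with split + background awk.
# Assumption: every record is exactly two lines, so an even chunk size never
# separates a header from its sequence.

gc_chunk() {   # hypothetical helper: one awk process per chunk
    awk -F '_' -v Y="$1" '
        NR % 2 == 1 { printf "%s_%s_%s_%s\nnucleotide_cov: %.4f\n", $1, $2, $3, $4, $NF / Y; next }
        {
            print
            seq = $0
            gc = gsub(/[GC]/, "", seq)
            at = gsub(/[AT]/, "", seq)
            printf "GC_CONT: %.2f\n", (gc + at > 0) ? 100 * gc / (gc + at) : 0
        }
    ' "$2"
}

parallel_gc() {   # parallel_gc Y FILE LINES_PER_CHUNK (must be even)
    local y=$1 file=$2 lines=$3 dir chunk
    dir=$(mktemp -d) || return 1
    split -l "$lines" "$file" "$dir/chunk."
    for chunk in "$dir"/chunk.*; do
        gc_chunk "$y" "$chunk" > "$chunk.out" &   # one background job per chunk
    done
    wait                                          # block until every job has finished
    cat "$dir"/chunk.*.out                        # suffixes sort in input order
    rm -rf "$dir"
}
```

The chunk size should be chosen so that the number of chunks is close to the number of cores, because every chunk becomes a concurrent job. The temporary directory also needs room for a second copy of the input plus the output. Whether this beats a single awk process depends on how much of the run time is I/O, so it is worth timing both.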
I had already switched away from tr, which made the work even slower than it is now.
– Shred
Feb 27 at 9:37
Found some errors in the code while running it. By the way, thanks a lot; it's still a good direction to research.
– Shred
Feb 28 at 13:32
What error? Probably should update it then :-)
– Wouter Verhelst
Mar 1 at 12:01
Here's the console log: pastebin.com/uXZW0802
– Shred
Mar 2 at 10:57
So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
– Wouter Verhelst
Mar 5 at 10:22
Answer (score 0):
You are reading a long file line by line and executing multiple commands on each iteration. The main problems you are facing are the latency of running those commands and reading very small chunks of the file at a time.
Stephen Kitt's answer is good: you want to rewrite this in a higher-level language in which you can cache file contents and run your string operations much more efficiently.
If you want to rule out the performance of the storage and file system, you can load the file from RAM using:
    # mkdir /mnt/tmpfs
    # mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
    # cp <input_file> /mnt/tmpfs
    # <script> /mnt/tmpfs/<input_file>
This should make the process faster to the extent that you are I/O-bound. But it will never be as fast as it could be if rewritten in C, Ruby, or Python.
Accepted answer (score 2): answered Feb 27 at 10:09 by amisax, edited Mar 9 at 10:25
Answer (score 3): answered Feb 27 at 9:22 by Stephen Kitt
I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â Shred
Feb 27 at 9:37
Text Processing in Python is an old book covering the topic, but youâÂÂd be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â Stephen Kitt
Feb 27 at 9:55
This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ($(cmd)) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â Gordon Davisson
Mar 9 at 8:24
add a comment |Â
I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â Shred
Feb 27 at 9:37
Text Processing in Python is an old book covering the topic, but youâÂÂd be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â Stephen Kitt
Feb 27 at 9:55
This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ($(cmd)) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â Gordon Davisson
Mar 9 at 8:24
I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â Shred
Feb 27 at 9:37
I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â Shred
Feb 27 at 9:37
Text Processing in Python is an old book covering the topic, but youâÂÂd be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â Stephen Kitt
Feb 27 at 9:55
Text Processing in Python is an old book covering the topic, but youâÂÂd be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â Stephen Kitt
Feb 27 at 9:55
This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured (
$(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.â Gordon Davisson
Mar 9 at 8:24
The number of cores matters little if your program isn't parallelized (much).
You could use wc and tr rather than sed and grep, which might speed things up a bit:
    ACOUNT=$(echo "${even##}" | tr -cd 'A' | wc -m)
But really, I think the major problem is that shell, while an easy thing to program in for quick-and-dirty jobs, is just not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language, like Perl or Python, which also have threading abilities (thereby allowing you to use all your cores).
You could do it in perl somewhat like this:
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $y = ...; # calculate your Y value here

    # Read a line from the file(s) passed on the command line
    while(my $odd = <ARGV>) {
      chomp $odd;                             # lose the newline
      # split the read line on a "_" boundary into an array
      my @split = split /_/, $odd;
      # print the first four elements of the array, separated by "_"
      print join("_", @split[0..3]) . "\n";
      # Treat the final element of the @split array as a number,
      # divide it by $y, and output the result
      print $split[$#split] / $y . "\n";
      my %charcount = (                       # Initialize a hash table
        A => 0,
        G => 0,
        C => 0,
        T => 0
      );
      my $even = <ARGV>;                      # read the even line
      chomp $even;
      # split the string into separate characters, and loop over them
      foreach my $char(split //,$even) {
        $charcount{$char}++;                  # Count the correct character
      }
      # Do our final calculations and output the result
      my $total = $charcount{A} + $charcount{G} + $charcount{C} + $charcount{T};
      my $gc = $charcount{G} + $charcount{C};
      my $perc = $gc / $total;
      print "GC_CONT: $perc\n";
    }
Note: not tested (beyond "does perl accept this code")
If you want to learn more about perl, run perldoc perlintro and get started ;-)
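For comparison, the same single-process idea can be sketched with awk called once from the shell. This is only a sketch under assumptions: the header fields are taken to be separated by "_" with the coverage as the last field (which is what the cut -d "_" and ##*_ in the question suggest), the function name gc_report is made up for illustration, and the divisor Y is passed in as a non-zero first argument rather than computed here.

```shell
# Hypothetical helper: gc_report Y FILE
# Runs one awk process over the whole file instead of several
# processes per line. Y is the divisor computed outside (must be non-zero).
gc_report() {
    awk -F_ -v y="$1" '
        NR % 2 == 1 {
            # Header line: print the first four "_"-separated fields,
            # then the recalculated coverage from the last field.
            printf "%s_%s_%s_%s\n", $1, $2, $3, $4
            printf "nucleotide_cov: %.4f\n", $NF / y
            next
        }
        {
            # Sequence line: print it, then count bases on copies of it.
            print
            all = $0
            total = gsub(/[ACGT]/, "", all)   # number of A, C, G, T
            seq = $0
            gc = gsub(/[GC]/, "", seq)        # number of G and C
            printf "GC_CONT: %.2f\n", (total > 0 ? 100 * gc / total : 0)
        }
    ' "$2"
}
```

It would be called as gc_report "$Y" "$1" in place of the while loop. gsub returns the number of replacements it made, which is used here purely as a counter.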
edited Feb 27 at 10:16
answered Feb 27 at 9:34
Wouter Verhelst
I already switched away from tr; it made the work even slower than it is now.
– Shred
Feb 27 at 9:37
I found some errors in the code while running it. Btw, thanks a lot, it's still a good starting point for some research.
– Shred
Feb 28 at 13:32
What error? Probably should update it then :-)
– Wouter Verhelst
Mar 1 at 12:01
Here's the console log. pastebin.com/uXZW0802
– Shred
Mar 2 at 10:57
So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
– Wouter Verhelst
Mar 5 at 10:22
You are reading a long file line by line and executing multiple commands on each iteration. The main problem you are facing is latency of running those calculations and reading very small chunks of the file at a time.
Stephen Kitt's answer is good: you want to rewrite this in a higher-level language in which you can cache file contents and run your string operations much more efficiently.
If you want to rule out the performance of the storage and file system, you can load the file from RAM using:
    # mkdir /mnt/tmpfs
    # mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
    # cp <input_file> /mnt/tmpfs
    # <script> /mnt/tmpfs/<input_file>
This should speed the process up to the extent that you are I/O bound. But it will never be as fast as it could be if rewritten in C, Ruby, or Python.
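Before setting up a tmpfs, a rough way to check whether the script is I/O bound at all is to compare how long it takes just to read the file with how long the script takes. The sketch below is one way to do that; the helper name elapsed_seconds and the file names in the usage comment are placeholders, and it relies on date +%s, which is available on Linux but is not guaranteed by POSIX.

```shell
# Hypothetical helper: elapsed_seconds CMD [ARGS...]
# Runs the command with its output discarded and prints how many
# whole seconds it took.
elapsed_seconds() {
    es_start=$(date +%s)
    "$@" > /dev/null
    es_end=$(date +%s)
    echo $(( es_end - es_start ))
}

# Placeholder usage (input.fa and ./script.sh are not real names):
#   elapsed_seconds cat input.fa
#   elapsed_seconds ./script.sh input.fa
```

If reading the file takes a small fraction of the script's run time, moving it to RAM is unlikely to help much. Keep in mind that Linux normally keeps recently read files in its page cache, so a second read of the same file may already come from memory.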
answered Feb 27 at 9:31
Pedro
see unix.stackexchange.com/questions/169716/…
– Sundeep
Feb 27 at 9:22
Adding some sample lines (say 5-10) along with the expected output would help in suggesting an alternate solution. Also, if this is bioinformatics, see bioinformatics.stackexchange.com
– Sundeep
Feb 27 at 9:24
Updated the question. I had written it the wrong way.
– Shred
Feb 27 at 9:34
The given sample is not clear. Please do not add any character that is not actually present in your input. Post a few lines (to represent the different cases) exactly as in your input, post the exact output required, and then try to explain how the transformation is done.
– Sundeep
Feb 27 at 9:44