Increase speed of a Bash script that uses grep in a while loop

I have a script that processes a large file (>500 MB) made up of many lines that follow this scheme:



odd lines: >BLA_BLA length_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT # on even lines there's a DNA sequence


It recalculates the value after "cov." using parameters passed as arguments, replacing the old value, and calculates the percentage of "G" and "C" in the DNA sequence on the even lines.



So, output looks like:



> BLA_BLA length_XX
> nucleotidic_cov XX
> DNA seq (the same of even lines)
> GC_CONT: XX


Here's the code (only the loop):



K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
    echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: "
    echo "scale=4;${odd##*_} / $Y" | bc
    read even
    echo "${even##}" &&
    ACOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "A")
    GCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "G")
    CCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "C")
    TCOUNT=$(echo "${even##}" | sed -e "s/./&\n /g" | grep -c "T")
    TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
    GCCONT=$(($GCOUNT+$CCOUNT))
    printf "GC_CONT: "
    echo "scale=2;$GCCONT / $TOTALBASES *100" | bc
done < "$1"


It is incredibly slow when run against a huge text file (bigger than 500 MB) on a 16-core server. Any ideas on how to speed this script up?



EDIT



As requested, the desired input and output are provided via pastebin: https://pastebin.com/FY0Z7kUW







  • see unix.stackexchange.com/questions/169716/…
    – Sundeep
    Feb 27 at 9:22










  • adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see bioinformatics.stackexchange.com
    – Sundeep
    Feb 27 at 9:24










  • Updated the question. I had written it the wrong way.
    – Shred
    Feb 27 at 9:34










  • given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done..
    – Sundeep
    Feb 27 at 9:44















Question asked Feb 27 at 9:13 by Shred, edited Mar 9 at 9:57







4 Answers



accepted










The percentage calculation can be reduced to a single operation like this:

 echo "${even}" | awk '{x=gsub(/[AT]/,""); y=gsub(/[GC]/,""); printf "GC_CONT : %.2f%%\n", (y*100)/(x+y) }'

gsub substitutes a pattern and returns the count of substitutions it has made, so it can be used to quickly calculate the percentage.
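As a small illustration of that return value, here is a sketch that counts without altering the line. The function name `gc_count` is made up for this example, and the sequence is the sample from the question. Using `&` as the replacement puts each match back, so the input is left unchanged while gsub still reports how many matches it found:

```shell
# Hypothetical helper: count the G and C bases in one sequence.
# "&" in the replacement stands for the matched text, so nothing is removed.
gc_count() {
    printf '%s\n' "$1" | awk '{ n = gsub(/[GC]/, "&"); print n }'
}

gc_count "AGCAGCAGACTCAGACTACAGAT"   # prints 11
```

Replacing with an empty string, as in the one-liner above, gives the same count but deletes the matched bases from the line.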



You could also process the odd and even lines in awk. It is not clear what you are doing with the odd lines, but your complete function can be put in a single awk script:

awk -F '_' -v Y="$Y" '{ if (NR%2==1) {
    printf "%s %s %s %s %s\nnucleotidic_cov : %.4f\n",$1,$2,$3,$4,$5, ($6 / Y)
} else {
    x=gsub(/[AT]/,"");
    y=gsub(/[GC]/,"");
    printf "GC_CONT : %.2f%%\n", (y*100)/(x+y)
}
}' large_file

EDIT: Based on the OP's requirement, I changed the if block for odd lines. The gsub would remove the "cov." from the number. After passing the shell variable $Y to awk, we can now divide and print in the required format.

Using a single awk script instead of multiple operations will significantly speed the operation up.
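For reference, here is a sketch of a single awk pass that produces all four output lines the question's loop prints, including the sequence itself. It is not the answer's exact script. It assumes headers are underscore-separated with the coverage as the last field (the sample header `>NODE_1_length_23_cov_10.0` is invented for illustration), and the name `recalc_cov` is hypothetical. The labels follow the question's script.

```shell
# Hypothetical helper, not from the original answer.
#   $1 = Y (K / READLENGHT in the question's script)
#   $2 = input file (standard input when omitted)
recalc_cov() {
    LC_ALL=C awk -F '_' -v Y="$1" '
        NR % 2 == 1 {                        # header line
            printf "%s_%s_%s_%s\n", $1, $2, $3, $4
            printf "nucleotide_cov: %.4f\n", $NF / Y
            next
        }
        {                                    # sequence line
            print                            # keep the sequence in the output
            gc    = gsub(/[GC]/, "&")        # "&" re-inserts each match
            total = gc + gsub(/[AT]/, "&")
            printf "GC_CONT: %.2f\n", (total ? 100 * gc / total : 0)
        }
    ' "${2:--}"
}

# Invented two-line sample: header with coverage 10.0, then the sequence.
printf '%s\n' '>NODE_1_length_23_cov_10.0' 'AGCAGCAGACTCAGACTACAGAT' |
    recalc_cov 0.5
```

With Y set to 0.5, the sample yields a coverage of 20.0000 and a GC content of 47.83. Note that awk's printf rounds, whereas bc with a fixed scale truncates, so the last digit can differ from the original script's output.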






  • In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line.
    – Shred
    Feb 27 at 10:52










  • I have updated based on your explanation
    – amisax
    Feb 27 at 11:07











  • I can't figure out where the "cov." value is divided by the variable called Y. Please improve it; using only awk, the script runs very fast.
    – Shred
    Mar 2 at 15:39










  • @Shred in the if block we replaced "cov." in $3 with "nucleotidic_cov". $3 in the line is cov.XX
    – amisax
    Mar 8 at 10:48










  • Perhaps no division is made. The cov. value needs to be divided by $Y, as in the script provided, in echo "scale=4;${odd##*_} / $Y" | bc. Please improve, and this will be the best answer, because it is what I need.
    – Shred
    Mar 8 at 16:12






























You’ve reached (to put it mildly) the limits of what can be reasonably done in the shell — you should re-write your script in something like AWK, or Perl, or Python. Using a more advanced language like those will avoid having to run multiple processes for all your text processing; you’ll be able to do it using built-in functions.
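To make the cost concrete: every `$(...)` and every pipeline stage in the original loop starts at least one new process per record. Even inside bash, the base counting can be done with parameter expansion alone, which starts none. The following is only a sketch (the name `gc_percent` is made up, and it requires bash rather than plain sh); a single awk, Perl, or Python process is still likely to be the better choice for a 500 MB file.

```shell
# Hypothetical helper: percentage of G and C in a sequence using only bash
# built-ins (no sed, grep, or bc). Integer arithmetic truncates the result.
gc_percent() {
    local seq=$1
    local gc=${seq//[^GC]/}        # delete everything that is not G or C
    local acgt=${seq//[^ACGT]/}    # delete everything that is not A, C, G, or T
    if (( ${#acgt} == 0 )); then
        echo 0                     # avoid dividing by zero on empty input
    else
        echo $(( 100 * ${#gc} / ${#acgt} ))
    fi
}

gc_percent "AGCAGCAGACTCAGACTACAGAT"   # 11 of 23 bases are G or C
```

The length of each filtered string, `${#gc}` and `${#acgt}`, replaces the four sed and grep pipelines, and `$(( ))` replaces the final bc call when integer precision is enough.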






  • I'm a beginner, as the question shows, in bash as well as in Python. Could you please give me some references on how to do this kind of work in Python?
    – Shred
    Feb 27 at 9:37










  • Text Processing in Python is an old book covering the topic, but you’d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
    – Stephen Kitt
    Feb 27 at 9:55











  • This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
    – Gordon Davisson
    Mar 9 at 8:24































The number of cores matters little if your program isn't parallelized (much).
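One way to put several cores to work without threads is to split the input on record boundaries and run an ordinary single-process worker on each piece at the same time. The sketch below is an assumption-laden illustration, not something from the original script: `run_in_chunks` is a made-up name, the worker must be an executable that takes a file path and writes to standard output, and `xargs -P` is a common GNU/BSD extension rather than POSIX. It requires bash.

```shell
# Hypothetical wrapper: run a worker on even-sized pieces of a file in
# parallel, then print the results in the original order.
#   $1 = input file
#   $2 = number of parallel jobs
#   $3 = lines per chunk (must be even so header/sequence pairs stay together)
#   $4 = worker command, called with the chunk path as its only argument
run_in_chunks() {
    local input=$1 jobs=$2 lines=$3 worker=$4
    local tmp
    tmp=$(mktemp -d) || return 1
    # -a 4 gives four-letter suffixes, so the chunk names sort in file order.
    split -a 4 -l "$lines" "$input" "$tmp/chunk."
    # Each worker writes its output next to the chunk it processed.
    printf '%s\0' "$tmp"/chunk.* |
        xargs -0 -n 1 -P "$jobs" sh -c '"$1" "$2" > "$2.out"' sh "$worker"
    cat "$tmp"/chunk.*.out
    rm -rf "$tmp"
}
```

A call such as `run_in_chunks big_file 16 2000000 ./worker.sh` (file and script names are placeholders) would process two million lines per chunk with up to 16 workers. If the worker is still the original loop, process start-up remains the dominant cost, so this helps most once the per-record work already runs in a single process.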



You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "${even}" | tr -cd 'A' | wc -m)


But really, I think the major problem is that shell, while an easy thing to program in for quick-and-dirty jobs, is just not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language, like Perl or Python, which also have threading abilities (thereby allowing you to use all your cores).



You could do it in perl somewhat like this:



#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...; # calculate your Y value here
while(my $odd = <ARGV>) {                  # Read a line from the file(s) passed
                                           # on the command line
    chomp $odd;                            # lose the newline
    my @split = split /_/, $odd;           # split the read line on a "_" boundary
                                           # into an array
    print join("_", @split[0..3]) . "\n";  # print the first four elements of the
                                           # array, separated by "_"
    print $split[$#split] / $y . "\n";     # Treat the final element of the
                                           # @split array as a number, divide it
                                           # by $y, and output the result
    my %charcount = (                      # Initialize a hash table
        A => 0,
        G => 0,
        C => 0,
        T => 0
    );
    my $even = <ARGV>;                     # read the even line
    chomp $even;
    foreach my $char(split //,$even) {     # split the string into separate
                                           # characters, and loop over them
        $charcount{$char}++;               # Count the correct character
    }
    my $total = $charcount{A} + $charcount{G} + $charcount{C} + $charcount{T};
    my $gc = $charcount{G} + $charcount{C};
    my $perc = $gc / $total;
    print "GC_CONT: $perc\n";              # Do our final calculations and
                                           # output the result
}



Note: not tested (beyond "does perl accept this code")



If you want to learn more about perl, run perldoc perlintro and get started ;-)






  • I already switched away from tr, which was even slower than the current version.
    – Shred
    Feb 27 at 9:37










  • I found some errors in the code while running it. By the way, thanks a lot; it is still a good starting point for some research.
    – Shred
    Feb 28 at 13:32










  • What error? Probably should update it then :-)
    – Wouter Verhelst
    Mar 1 at 12:01










  • Here's the console log. pastebin.com/uXZW0802
    – Shred
    Mar 2 at 10:57










  • So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
    – Wouter Verhelst
    Mar 5 at 10:22






























You are reading a long file line by line and executing multiple commands on each iteration. The main problem you are facing is latency of running those calculations and reading very small chunks of the file at a time.



Stephen Kitt's answer is good: you want to rewrite this in a higher-level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /mnt/tmpfs
# <script> /mnt/tmpfs/<input_file>

This should make the process faster to the extent that you are I/O constrained. But it will never be as good as it could be if rewritten in C, Ruby, or Python.






awk answer (accepted): answered Feb 27 at 10:09 by amisax, edited Mar 9 at 10:25











    answered Feb 27 at 9:22









    Stephen Kitt












    • As the question shows, I'm a beginner in both Bash and Python. Could you please give me some references on how to do a job like this in Python?
      – Shred
      Feb 27 at 9:37










    • Text Processing in Python is an old book covering the topic, but you’d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
      – Stephen Kitt
      Feb 27 at 9:55











    • This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
      – Gordon Davisson
      Mar 9 at 8:24




























    up vote
    1
    down vote













    The number of cores matters little if your program isn't parallelized (much).



    You could use wc and tr rather than sed and grep, which might speed things up a bit:



    ACOUNT=$(echo "$even" | tr -cd 'A' | wc -c)


    But really, I think the major problem is that shell, while an easy thing to program in for quick-and-dirty jobs, is just not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language, like Perl or Python, which also have threading abilities (thereby allowing you to use all your cores).



    You could do it in perl somewhat like this:



    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $y = ...; # calculate your Y value here
    # Read a line from the file(s) passed on the command line
    while (my $odd = <ARGV>) {
        chomp $odd;                            # lose the newline
        my @split = split /_/, $odd;           # split the line on "_" into an array
        # print the first four elements of the array, separated by "_"
        print join("_", @split[0..3]) . "\n";
        # Treat the final element of the @split array as a number,
        # divide it by $y, and output the result
        print $split[$#split] / $y . "\n";
        my %charcount = (                      # Initialize a hash table
            A => 0,
            G => 0,
            C => 0,
            T => 0
        );
        my $even = <ARGV>;                     # read the even line
        chomp $even;
        # split the string into separate characters, and loop over them
        foreach my $char (split //, $even) {
            $charcount{$char}++;               # Count the correct character
        }
        my $total = $charcount{A} + $charcount{G} + $charcount{C} + $charcount{T};
        my $gc = $charcount{G} + $charcount{C};
        my $perc = 100 * $gc / $total;         # GC content as a percentage
        # Do our final calculations and output the result
        print "GC_CONT: $perc\n";
    }



    Note: not tested (beyond "does perl accept this code")



    If you want to learn more about perl, run perldoc perlintro and get started ;-)
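For the point about using all cores, here is a hedged Python sketch. It uses worker processes from the standard `multiprocessing` module rather than threads, because in standard CPython builds threads do not run CPU-bound Python code in parallel. The function names, worker count, and chunk size are illustrative assumptions. The per-sequence work is small, so the overhead of passing data between processes may cancel out the gain, and it is worth measuring against a single-process version first.

```python
from multiprocessing import Pool


def gc_percent(seq):
    """Return the GC percentage of one sequence (A, C, G, T only)."""
    seq = seq.strip().upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


def gc_for_all(sequences, workers=4, chunksize=1000):
    """Compute GC percentages for many sequences using worker processes."""
    with Pool(processes=workers) as pool:
        # chunksize batches the sequences so each task carries enough work
        return pool.map(gc_percent, sequences, chunksize=chunksize)


if __name__ == "__main__":
    demo = ["AGCAGCAT", "GGCC", "ATAT"]  # made-up sample sequences
    print(gc_for_all(demo, workers=2, chunksize=1))
```

`Pool.map` returns results in input order, so they can be matched back to their header lines.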














    edited Feb 27 at 10:16

























    answered Feb 27 at 9:34









    Wouter Verhelst












    • I already switched away from tr, which was even slower than the current version.
      – Shred
      Feb 27 at 9:37










    • I found some errors in the code while running it. Thanks a lot anyway; it's still a good starting point for further research.
      – Shred
      Feb 28 at 13:32










    • What error? Probably should update it then :-)
      – Wouter Verhelst
      Mar 1 at 12:01










    • Here's the console log. pastebin.com/uXZW0802
      – Shred
      Mar 2 at 10:57










    • So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
      – Wouter Verhelst
      Mar 5 at 10:22


























    up vote
    0
    down vote













    You are reading a long file line by line and executing multiple commands on each iteration. The main problem you are facing is latency of running those calculations and reading very small chunks of the file at a time.



    Stephen Kitt's answer is good: you want to rewrite this in a higher-level language in which you can cache file contents and run your string operations much more efficiently.



    If you want to rule out the performance of the storage and file system, you can load the file from RAM using:



    # mkdir /mnt/tmpfs
    # mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
    # cp <input_file> /mnt/tmpfs
    # <script> /mnt/tmpfs/<input_file>


    This should speed up the process to the extent that you are I/O-bound. But it will never be as fast as a rewrite in C, Ruby, or Python.
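Another way to judge whether storage is the bottleneck is to time a plain sequential read of the input and compare it with the script's total runtime. The Python sketch below is only an illustration; `read_throughput` and the 1 MiB block size are assumptions. A repeat run will likely be served from the page cache, so the first run is the more telling one.

```python
import sys
import time


def read_throughput(path, block_size=1 << 20):
    """Read path sequentially; return (bytes_read, seconds_elapsed)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            total += len(block)
    return total, time.perf_counter() - start


# Hypothetical usage: python3 readtime.py FILE
if __name__ == "__main__" and len(sys.argv) == 2:
    size, seconds = read_throughput(sys.argv[1])
    print("%d bytes in %.3f s" % (size, seconds))
```

If the read takes a few seconds while the script takes hours, the time is going into per-line processing rather than I/O, and moving the file to tmpfs will not help much.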
















        answered Feb 27 at 9:31









        Pedro























             
