Increase speed of Bash script which used grep into a while loop

up vote
1
down vote

favorite

got this script to work against a file, composed by lots of line (>500Mb) with this scheme:

odd lines: >BLA_BLA lenght_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT # on even lines there's a DNA sequence

Its function is to recalc value after "cov." using parameters passed by arguments and replace the older one and calc the percent amount of "G" and "C" into the DNA seq, in even lines.

So, output looks like:

> BLA_BLA lenght_XX
> nucleotidic_cov XX
> DNA seq (the same of even lines)
> GC_CONT: XX

Here's the code (only the loop):

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
 echo -n "$odd##" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
 echo "scale=4;$odd##*_ / $Y" | bc 
 read even
 echo "$even##" &&
 ACOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "A") 
 GCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "G")
 CCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "C")
 TCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "T")
 TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
 GCCONT=$(($GCOUNT+$CCOUNT))
 printf "GC_CONT: " 
 echo "scale=2;$GCCONT / $TOTALBASES *100" | bc 
done < "$1"

It's incredibly slow when runs against huge text file (bigger than 500Mb) on a 16 core server. Any idea on how to increase speed of this script?

EDIT

As requested, desidered I/O provided via pastebin: https://pastebin.com/FY0Z7kUW

edited Mar 9 at 9:57

asked Feb 27 at 9:13

Shred

1236

1

see unix.stackexchange.com/questions/169716/â€¦
â€“Â Sundeep
Feb 27 at 9:22

adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see bioinformatics.stackexchange.com
â€“Â Sundeep
Feb 27 at 9:24

Update quest. I wrote it in a wrong way.
â€“Â Shred
Feb 27 at 9:34

given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done..
â€“Â Sundeep
Feb 27 at 9:44

add a commentÂ |Â

up vote
1
down vote

favorite

got this script to work against a file, composed by lots of line (>500Mb) with this scheme:

odd lines: >BLA_BLA lenght_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT # on even lines there's a DNA sequence

Its function is to recalc value after "cov." using parameters passed by arguments and replace the older one and calc the percent amount of "G" and "C" into the DNA seq, in even lines.

So, output looks like:

> BLA_BLA lenght_XX
> nucleotidic_cov XX
> DNA seq (the same of even lines)
> GC_CONT: XX

Here's the code (only the loop):

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
 echo -n "$odd##" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
 echo "scale=4;$odd##*_ / $Y" | bc 
 read even
 echo "$even##" &&
 ACOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "A") 
 GCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "G")
 CCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "C")
 TCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "T")
 TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
 GCCONT=$(($GCOUNT+$CCOUNT))
 printf "GC_CONT: " 
 echo "scale=2;$GCCONT / $TOTALBASES *100" | bc 
done < "$1"

It's incredibly slow when runs against huge text file (bigger than 500Mb) on a 16 core server. Any idea on how to increase speed of this script?

EDIT

As requested, desidered I/O provided via pastebin: https://pastebin.com/FY0Z7kUW

edited Mar 9 at 9:57

asked Feb 27 at 9:13

Shred

1236

1

see unix.stackexchange.com/questions/169716/â€¦
â€“Â Sundeep
Feb 27 at 9:22

adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see bioinformatics.stackexchange.com
â€“Â Sundeep
Feb 27 at 9:24

Update quest. I wrote it in a wrong way.
â€“Â Shred
Feb 27 at 9:34

given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done..
â€“Â Sundeep
Feb 27 at 9:44

add a commentÂ |Â

up vote
1
down vote

favorite

got this script to work against a file, composed by lots of line (>500Mb) with this scheme:

odd lines: >BLA_BLA lenght_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT # on even lines there's a DNA sequence

Its function is to recalc value after "cov." using parameters passed by arguments and replace the older one and calc the percent amount of "G" and "C" into the DNA seq, in even lines.

So, output looks like:

> BLA_BLA lenght_XX
> nucleotidic_cov XX
> DNA seq (the same of even lines)
> GC_CONT: XX

Here's the code (only the loop):

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
 echo -n "$odd##" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
 echo "scale=4;$odd##*_ / $Y" | bc 
 read even
 echo "$even##" &&
 ACOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "A") 
 GCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "G")
 CCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "C")
 TCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "T")
 TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
 GCCONT=$(($GCOUNT+$CCOUNT))
 printf "GC_CONT: " 
 echo "scale=2;$GCCONT / $TOTALBASES *100" | bc 
done < "$1"

It's incredibly slow when runs against huge text file (bigger than 500Mb) on a 16 core server. Any idea on how to increase speed of this script?

EDIT

As requested, desidered I/O provided via pastebin: https://pastebin.com/FY0Z7kUW

edited Mar 9 at 9:57

asked Feb 27 at 9:13

Shred

1236

got this script to work against a file, composed by lots of line (>500Mb) with this scheme:

odd lines: >BLA_BLA lenght_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT # on even lines there's a DNA sequence

Its function is to recalc value after "cov." using parameters passed by arguments and replace the older one and calc the percent amount of "G" and "C" into the DNA seq, in even lines.

So, output looks like:

> BLA_BLA lenght_XX
> nucleotidic_cov XX
> DNA seq (the same of even lines)
> GC_CONT: XX

Here's the code (only the loop):

K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
 echo -n "$odd##" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: " 
 echo "scale=4;$odd##*_ / $Y" | bc 
 read even
 echo "$even##" &&
 ACOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "A") 
 GCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "G")
 CCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "C")
 TCOUNT=$(echo "$even##" | sed -e "s/./&n /g" | grep -c "T")
 TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
 GCCONT=$(($GCOUNT+$CCOUNT))
 printf "GC_CONT: " 
 echo "scale=2;$GCCONT / $TOTALBASES *100" | bc 
done < "$1"

It's incredibly slow when runs against huge text file (bigger than 500Mb) on a 16 core server. Any idea on how to increase speed of this script?

EDIT

As requested, desidered I/O provided via pastebin: https://pastebin.com/FY0Z7kUW

edited Mar 9 at 9:57

asked Feb 27 at 9:13

Shred

1236

edited Mar 9 at 9:57

asked Feb 27 at 9:13

Shred

1236

asked Feb 27 at 9:13

Shred

1236

asked Feb 27 at 9:13

Shred

1236

1

see unix.stackexchange.com/questions/169716/â€¦
â€“Â Sundeep
Feb 27 at 9:22

adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see bioinformatics.stackexchange.com
â€“Â Sundeep
Feb 27 at 9:24

Update quest. I wrote it in a wrong way.
â€“Â Shred
Feb 27 at 9:34

given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done..
â€“Â Sundeep
Feb 27 at 9:44

add a commentÂ |Â

1

see unix.stackexchange.com/questions/169716/â€¦
â€“Â Sundeep
Feb 27 at 9:22

adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see bioinformatics.stackexchange.com
â€“Â Sundeep
Feb 27 at 9:24

Update quest. I wrote it in a wrong way.
â€“Â Shred
Feb 27 at 9:34

given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done..
â€“Â Sundeep
Feb 27 at 9:44

see unix.stackexchange.com/questions/169716/â€¦
â€“Â Sundeep
Feb 27 at 9:22

adding some sample lines(say 5-10) along with expected output would help in suggesting an alternate solution.. also, if this is bioinformatics, see bioinformatics.stackexchange.com
â€“Â Sundeep
Feb 27 at 9:24

Update quest. I wrote it in a wrong way.
â€“Â Shred
Feb 27 at 9:34

given sample is not clear.. please do not add any character not actually present in your input... post few lines (to represent different cases) exactly as in your input and post exact output required... and then try to explain how the transformation is done..
â€“Â Sundeep
Feb 27 at 9:44

add a commentÂ |Â

4 Answers
4

active

oldest

votes

up vote
2
down vote

accepted

The percentage calculation can be reduced to a single operation like this

 echo "$even##" | awk 'x=gsub(/[ACT]/,""); y=gsub(/G/,""); printf "GC_CONT : %.2f%%b", (y*100)/(x+y) '

gsub substitutes a pattern and return the count of substitutions it has made. So that can be used to quickly calculate the percentage.

You could also process the odd and even lines in awk. It is not clear what you are doing with odd lines but your complete function can be put in a single awk -

awk -F '_' -v Y="$Y" ' if(NR%2==1) 
 printf "%s %s %s %s %snnucleotidic_cov : %.4fn",$1,$2,$3,$4,$5, ($6 / Y)
 else 
 x=gsub(/[AT]/,""); 
 y=gsub(/[GC]/,""); 
 printf "GC_CONT : %.2f%%n", (y*100)/(x+y)
 
 ' large_file

EDIT : Based on OP's requirement changed the if block for odd lines. The gsub would remove the "cov." from the number. After passing the shell variable $Y to awk , we can now divide and print in the required format.

Using a single awk script instead of multiple operations will significantly speed the operation up.

edited Mar 9 at 10:25

answered Feb 27 at 10:09

amisax

1,353314

In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line.
â€“Â Shred
Feb 27 at 10:52

I have updated based on your explanation
â€“Â amisax
Feb 27 at 11:07

can't figure out where's the split, between the "cov." value and the variable called Y. Please improve, using only awk the script goes so fast.
â€“Â Shred
Mar 2 at 15:39

@Shred in if block we did modified $3 cov. with nucleotidic_cov. $3 in the line is cov.XX
â€“Â amisax
Mar 8 at 10:48

Perhaps no division is made. cov. value needs to be divided by $y , as in script provided, in echo "scale=4;$odd##*_ / $Y" | bc Please improve, and this will be the best answer, cuz is what i need to have.
â€“Â Shred
Mar 8 at 16:12

Â |Â
show 5 more comments

up vote
3
down vote

YouÃ¢Â€Â™ve reached (to put it mildly) the limits of what can be reasonably done in the shell Ã¢Â€Â” you should re-write your script in something like AWK, or Perl, or Python. Using a more advanced language like those will avoid having to run multiple processes for all your text processing; youÃ¢Â€Â™ll be able to do it using built-in functions.

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â€“Â Shred
Feb 27 at 9:37

Text Processing in Python is an old book covering the topic, but youÃ¢Â€Â™d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â€“Â Stephen Kitt
Feb 27 at 9:55

This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â€“Â Gordon Davisson
Mar 9 at 8:24

add a commentÂ |Â

up vote
1
down vote

The number of cores matters little if your program isn't parallelized (much).

You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "$even##" | tr -d [^A] | wc -m)

But really, I think the major problem is that shell, while an easy thing to program in for quick-and-dirty jobs, is just not the right tool for the job when it comes to raw processing power. I would suggest a more sophisticated programming language, like Perl or Python, which also have threading abilities (thereby allowing you to use all your cores).

You could do it in perl somewhat like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...; # calculate your Y value here
while(my $odd = <ARGV>) # Read a line from the file(s) passed
 # on the command line
 chomp $odd; # lose the newline
 my @split = split /_/, $odd; # split the read line on a "_" boundary
 # into an array
 print join("_", @split[0..3]) . "n"; # print the first four elements of the
 # array, separated by "_"
 print $split[$#split] / $y . "n"; # Treat the final element of the
 # @split array as a number, divide it
 # by $y, and output the result
 my %charcount = ( # Initialize a hash table
 A => 0,
 G => 0,
 C => 0,
 T => 0
 );
 my $even = <ARGV>; # read the even line
 chomp $even;
 foreach my $char(split //,$even) # split the string into separate
 # characters, and loop over them
 $charcount$char++; # Count the correct character
 
 my $total = $charcountA + $charcountG + $charcountC + $charcountT;
 my $gc = $charcountG + $charcountC;
 my $perc = $gc / $total;
 print "GC_CONT: $percn"; # Do our final calculations and
 # output the result

Note: not tested (beyond "does perl accept this code")

If you want to learn more about perl, run perldoc perlintro and get started ;-)

edited Feb 27 at 10:16

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

Already switched from using tr, which slows the work more than now.
â€“Â Shred
Feb 27 at 9:37

Found some error in code while running. Btw, thanks a lot, still a good way to do some research on.
â€“Â Shred
Feb 28 at 13:32

What error? Probably should update it then :-)
â€“Â Wouter Verhelst
Mar 1 at 12:01

Here's the console log. pastebin.com/uXZW0802
â€“Â Shred
Mar 2 at 10:57

So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
â€“Â Wouter Verhelst
Mar 5 at 10:22

add a commentÂ |Â

up vote
0
down vote

You are reading a long file line by line and executing multiple commands on each iteration. The main problem you are facing is latency of running those calculations and reading very small chunks of the file at a time.

Stephen Kitt's answer is good, you want to rewrite this in a higher level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /tmp/tmpfs
# <script> /tmp/tmpfs/<input_file>

This should make the process faster as much as you are I/O constrained. But it will never be as good as it could be if rewritten in C or ruby or python.

answered Feb 27 at 9:31

Pedro

59429

add a commentÂ |Â

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f426884%2fincrease-speed-of-bash-script-which-used-grep-into-a-while-loop%23new-answer', 'question_page');

);

Post as a guest

Name

4 Answers
4

active

oldest

votes

4 Answers
4

active

oldest

votes

up vote
2
down vote

accepted

The percentage calculation can be reduced to a single operation like this

 echo "$even##" | awk 'x=gsub(/[ACT]/,""); y=gsub(/G/,""); printf "GC_CONT : %.2f%%b", (y*100)/(x+y) '

gsub substitutes a pattern and return the count of substitutions it has made. So that can be used to quickly calculate the percentage.

You could also process the odd and even lines in awk. It is not clear what you are doing with odd lines but your complete function can be put in a single awk -

awk -F '_' -v Y="$Y" ' if(NR%2==1) 
 printf "%s %s %s %s %snnucleotidic_cov : %.4fn",$1,$2,$3,$4,$5, ($6 / Y)
 else 
 x=gsub(/[AT]/,""); 
 y=gsub(/[GC]/,""); 
 printf "GC_CONT : %.2f%%n", (y*100)/(x+y)
 
 ' large_file

Using a single awk script instead of multiple operations will significantly speed the operation up.

edited Mar 9 at 10:25

answered Feb 27 at 10:09

amisax

1,353314

In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line.
â€“Â Shred
Feb 27 at 10:52

I have updated based on your explanation
â€“Â amisax
Feb 27 at 11:07

can't figure out where's the split, between the "cov." value and the variable called Y. Please improve, using only awk the script goes so fast.
â€“Â Shred
Mar 2 at 15:39

@Shred in if block we did modified $3 cov. with nucleotidic_cov. $3 in the line is cov.XX
â€“Â amisax
Mar 8 at 10:48

Perhaps no division is made. cov. value needs to be divided by $y , as in script provided, in echo "scale=4;$odd##*_ / $Y" | bc Please improve, and this will be the best answer, cuz is what i need to have.
â€“Â Shred
Mar 8 at 16:12

Â |Â
show 5 more comments

up vote
2
down vote

accepted

The percentage calculation can be reduced to a single operation like this

 echo "$even##" | awk 'x=gsub(/[ACT]/,""); y=gsub(/G/,""); printf "GC_CONT : %.2f%%b", (y*100)/(x+y) '

gsub substitutes a pattern and return the count of substitutions it has made. So that can be used to quickly calculate the percentage.

You could also process the odd and even lines in awk. It is not clear what you are doing with odd lines but your complete function can be put in a single awk -

awk -F '_' -v Y="$Y" ' if(NR%2==1) 
 printf "%s %s %s %s %snnucleotidic_cov : %.4fn",$1,$2,$3,$4,$5, ($6 / Y)
 else 
 x=gsub(/[AT]/,""); 
 y=gsub(/[GC]/,""); 
 printf "GC_CONT : %.2f%%n", (y*100)/(x+y)
 
 ' large_file

Using a single awk script instead of multiple operations will significantly speed the operation up.

edited Mar 9 at 10:25

answered Feb 27 at 10:09

amisax

1,353314

In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line.
â€“Â Shred
Feb 27 at 10:52

I have updated based on your explanation
â€“Â amisax
Feb 27 at 11:07

can't figure out where's the split, between the "cov." value and the variable called Y. Please improve, using only awk the script goes so fast.
â€“Â Shred
Mar 2 at 15:39

@Shred in if block we did modified $3 cov. with nucleotidic_cov. $3 in the line is cov.XX
â€“Â amisax
Mar 8 at 10:48

Perhaps no division is made. cov. value needs to be divided by $y , as in script provided, in echo "scale=4;$odd##*_ / $Y" | bc Please improve, and this will be the best answer, cuz is what i need to have.
â€“Â Shred
Mar 8 at 16:12

Â |Â
show 5 more comments

up vote
2
down vote

accepted

The percentage calculation can be reduced to a single operation like this

 echo "$even##" | awk 'x=gsub(/[ACT]/,""); y=gsub(/G/,""); printf "GC_CONT : %.2f%%b", (y*100)/(x+y) '

gsub substitutes a pattern and return the count of substitutions it has made. So that can be used to quickly calculate the percentage.

You could also process the odd and even lines in awk. It is not clear what you are doing with odd lines but your complete function can be put in a single awk -

awk -F '_' -v Y="$Y" ' if(NR%2==1) 
 printf "%s %s %s %s %snnucleotidic_cov : %.4fn",$1,$2,$3,$4,$5, ($6 / Y)
 else 
 x=gsub(/[AT]/,""); 
 y=gsub(/[GC]/,""); 
 printf "GC_CONT : %.2f%%n", (y*100)/(x+y)
 
 ' large_file

Using a single awk script instead of multiple operations will significantly speed the operation up.

edited Mar 9 at 10:25

answered Feb 27 at 10:09

amisax

1,353314

The percentage calculation can be reduced to a single operation like this

 echo "$even##" | awk 'x=gsub(/[ACT]/,""); y=gsub(/G/,""); printf "GC_CONT : %.2f%%b", (y*100)/(x+y) '

gsub substitutes a pattern and return the count of substitutions it has made. So that can be used to quickly calculate the percentage.

You could also process the odd and even lines in awk. It is not clear what you are doing with odd lines but your complete function can be put in a single awk -

awk -F '_' -v Y="$Y" ' if(NR%2==1) 
 printf "%s %s %s %s %snnucleotidic_cov : %.4fn",$1,$2,$3,$4,$5, ($6 / Y)
 else 
 x=gsub(/[AT]/,""); 
 y=gsub(/[GC]/,""); 
 printf "GC_CONT : %.2f%%n", (y*100)/(x+y)
 
 ' large_file

Using a single awk script instead of multiple operations will significantly speed the operation up.

edited Mar 9 at 10:25

answered Feb 27 at 10:09

amisax

1,353314

edited Mar 9 at 10:25

answered Feb 27 at 10:09

amisax

1,353314

answered Feb 27 at 10:09

amisax

1,353314

answered Feb 27 at 10:09

amisax

1,353314

In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line.
â€“Â Shred
Feb 27 at 10:52

I have updated based on your explanation
â€“Â amisax
Feb 27 at 11:07

can't figure out where's the split, between the "cov." value and the variable called Y. Please improve, using only awk the script goes so fast.
â€“Â Shred
Mar 2 at 15:39

@Shred in if block we did modified $3 cov. with nucleotidic_cov. $3 in the line is cov.XX
â€“Â amisax
Mar 8 at 10:48

Perhaps no division is made. cov. value needs to be divided by $y , as in script provided, in echo "scale=4;$odd##*_ / $Y" | bc Please improve, and this will be the best answer, cuz is what i need to have.
â€“Â Shred
Mar 8 at 16:12

Â |Â
show 5 more comments

In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line.
â€“Â Shred
Feb 27 at 10:52

I have updated based on your explanation
â€“Â amisax
Feb 27 at 11:07

can't figure out where's the split, between the "cov." value and the variable called Y. Please improve, using only awk the script goes so fast.
â€“Â Shred
Mar 2 at 15:39

@Shred in if block we did modified $3 cov. with nucleotidic_cov. $3 in the line is cov.XX
â€“Â amisax
Mar 8 at 10:48

Perhaps no division is made. cov. value needs to be divided by $y , as in script provided, in echo "scale=4;$odd##*_ / $Y" | bc Please improve, and this will be the best answer, cuz is what i need to have.
â€“Â Shred
Mar 8 at 16:12

In odd lines the value after "cov." is used to calc a new value, called "nucleotidic_cov" which is printed into a new line.
â€“Â Shred
Feb 27 at 10:52

I have updated based on your explanation
â€“Â amisax
Feb 27 at 11:07

can't figure out where's the split, between the "cov." value and the variable called Y. Please improve, using only awk the script goes so fast.
â€“Â Shred
Mar 2 at 15:39

@Shred in if block we did modified $3 cov. with nucleotidic_cov. $3 in the line is cov.XX
â€“Â amisax
Mar 8 at 10:48

Perhaps no division is made. cov. value needs to be divided by $y , as in script provided, in echo "scale=4;$odd##*_ / $Y" | bc Please improve, and this will be the best answer, cuz is what i need to have.
â€“Â Shred
Mar 8 at 16:12

Â |Â
show 5 more comments

up vote
3
down vote

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â€“Â Shred
Feb 27 at 9:37

Text Processing in Python is an old book covering the topic, but youÃ¢Â€Â™d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â€“Â Stephen Kitt
Feb 27 at 9:55

This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â€“Â Gordon Davisson
Mar 9 at 8:24

add a commentÂ |Â

up vote
3
down vote

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â€“Â Shred
Feb 27 at 9:37

Text Processing in Python is an old book covering the topic, but youÃ¢Â€Â™d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â€“Â Stephen Kitt
Feb 27 at 9:55

This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â€“Â Gordon Davisson
Mar 9 at 8:24

add a commentÂ |Â

up vote
3
down vote

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

answered Feb 27 at 9:22

Stephen Kitt

141k22307367

I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â€“Â Shred
Feb 27 at 9:37

Text Processing in Python is an old book covering the topic, but youÃ¢Â€Â™d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â€“Â Stephen Kitt
Feb 27 at 9:55

This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â€“Â Gordon Davisson
Mar 9 at 8:24

add a commentÂ |Â

I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â€“Â Shred
Feb 27 at 9:37

Text Processing in Python is an old book covering the topic, but youÃ¢Â€Â™d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â€“Â Stephen Kitt
Feb 27 at 9:55

This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â€“Â Gordon Davisson
Mar 9 at 8:24

I'm a beginner, as can be seen by the quest, in bash as long as with python. Could you please give me some references on how to do a work like this with python?
â€“Â Shred
Feb 27 at 9:37

Text Processing in Python is an old book covering the topic, but youÃ¢Â€Â™d be better off learning text processing in Python 3 at this stage. Personally I really like Modern Perl.
â€“Â Stephen Kitt
Feb 27 at 9:55

This is the right answer. The slowest thing in the original script is creating subprocesses for each external and/or piped and/or captured ( $(cmd) ) command in the loop. I count 18 subprocesses per loop! You can decrease that count significantly, but you're much much better off switching to a language that can do complex operations without needing to start other programs to do the work. Nearly any other language could churn through the entire file in a single process.
â€“Â Gordon Davisson
Mar 9 at 8:24

add a commentÂ |Â

up vote
1
down vote

The number of cores matters little if your program isn't parallelized (much).

You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "$even##" | tr -d [^A] | wc -m)

You could do it in perl somewhat like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...; # calculate your Y value here
while(my $odd = <ARGV>) # Read a line from the file(s) passed
 # on the command line
 chomp $odd; # lose the newline
 my @split = split /_/, $odd; # split the read line on a "_" boundary
 # into an array
 print join("_", @split[0..3]) . "n"; # print the first four elements of the
 # array, separated by "_"
 print $split[$#split] / $y . "n"; # Treat the final element of the
 # @split array as a number, divide it
 # by $y, and output the result
 my %charcount = ( # Initialize a hash table
 A => 0,
 G => 0,
 C => 0,
 T => 0
 );
 my $even = <ARGV>; # read the even line
 chomp $even;
 foreach my $char(split //,$even) # split the string into separate
 # characters, and loop over them
 $charcount$char++; # Count the correct character
 
 my $total = $charcountA + $charcountG + $charcountC + $charcountT;
 my $gc = $charcountG + $charcountC;
 my $perc = $gc / $total;
 print "GC_CONT: $percn"; # Do our final calculations and
 # output the result

Note: not tested (beyond "does perl accept this code")

If you want to learn more about perl, run perldoc perlintro and get started ;-)

edited Feb 27 at 10:16

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

Already switched from using tr, which slows the work more than now.
â€“Â Shred
Feb 27 at 9:37

Found some error in code while running. Btw, thanks a lot, still a good way to do some research on.
â€“Â Shred
Feb 28 at 13:32

What error? Probably should update it then :-)
â€“Â Wouter Verhelst
Mar 1 at 12:01

Here's the console log. pastebin.com/uXZW0802
â€“Â Shred
Mar 2 at 10:57

So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
â€“Â Wouter Verhelst
Mar 5 at 10:22

add a commentÂ |Â

up vote
1
down vote

The number of cores matters little if your program isn't parallelized (much).

You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "$even##" | tr -d [^A] | wc -m)

You could do it in perl somewhat like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...; # calculate your Y value here
while(my $odd = <ARGV>) # Read a line from the file(s) passed
 # on the command line
 chomp $odd; # lose the newline
 my @split = split /_/, $odd; # split the read line on a "_" boundary
 # into an array
 print join("_", @split[0..3]) . "n"; # print the first four elements of the
 # array, separated by "_"
 print $split[$#split] / $y . "n"; # Treat the final element of the
 # @split array as a number, divide it
 # by $y, and output the result
 my %charcount = ( # Initialize a hash table
 A => 0,
 G => 0,
 C => 0,
 T => 0
 );
 my $even = <ARGV>; # read the even line
 chomp $even;
 foreach my $char(split //,$even) # split the string into separate
 # characters, and loop over them
 $charcount$char++; # Count the correct character
 
 my $total = $charcountA + $charcountG + $charcountC + $charcountT;
 my $gc = $charcountG + $charcountC;
 my $perc = $gc / $total;
 print "GC_CONT: $percn"; # Do our final calculations and
 # output the result

Note: not tested (beyond "does perl accept this code")

If you want to learn more about perl, run perldoc perlintro and get started ;-)

edited Feb 27 at 10:16

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

Already switched from using tr, which slows the work more than now.
â€“Â Shred
Feb 27 at 9:37

Found some error in code while running. Btw, thanks a lot, still a good way to do some research on.
â€“Â Shred
Feb 28 at 13:32

What error? Probably should update it then :-)
â€“Â Wouter Verhelst
Mar 1 at 12:01

Here's the console log. pastebin.com/uXZW0802
â€“Â Shred
Mar 2 at 10:57

So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
â€“Â Wouter Verhelst
Mar 5 at 10:22

add a commentÂ |Â

up vote
1
down vote

The number of cores matters little if your program isn't parallelized (much).

You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "$even##" | tr -d [^A] | wc -m)

You could do it in perl somewhat like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...; # calculate your Y value here
while(my $odd = <ARGV>) # Read a line from the file(s) passed
 # on the command line
 chomp $odd; # lose the newline
 my @split = split /_/, $odd; # split the read line on a "_" boundary
 # into an array
 print join("_", @split[0..3]) . "n"; # print the first four elements of the
 # array, separated by "_"
 print $split[$#split] / $y . "n"; # Treat the final element of the
 # @split array as a number, divide it
 # by $y, and output the result
 my %charcount = ( # Initialize a hash table
 A => 0,
 G => 0,
 C => 0,
 T => 0
 );
 my $even = <ARGV>; # read the even line
 chomp $even;
 foreach my $char(split //,$even) # split the string into separate
 # characters, and loop over them
 $charcount$char++; # Count the correct character
 
 my $total = $charcountA + $charcountG + $charcountC + $charcountT;
 my $gc = $charcountG + $charcountC;
 my $perc = $gc / $total;
 print "GC_CONT: $percn"; # Do our final calculations and
 # output the result

Note: not tested (beyond "does perl accept this code")

If you want to learn more about perl, run perldoc perlintro and get started ;-)

edited Feb 27 at 10:16

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

The number of cores matters little if your program isn't parallelized (much).

You could use wc and tr rather than sed and grep, which might speed things up a bit:

ACOUNT=$(echo "$even##" | tr -d [^A] | wc -m)

You could do it in perl somewhat like this:

#!/usr/bin/perl -w
use strict;
use warnings;

my $y = ...; # calculate your Y value here
while(my $odd = <ARGV>) # Read a line from the file(s) passed
 # on the command line
 chomp $odd; # lose the newline
 my @split = split /_/, $odd; # split the read line on a "_" boundary
 # into an array
 print join("_", @split[0..3]) . "n"; # print the first four elements of the
 # array, separated by "_"
 print $split[$#split] / $y . "n"; # Treat the final element of the
 # @split array as a number, divide it
 # by $y, and output the result
 my %charcount = ( # Initialize a hash table
 A => 0,
 G => 0,
 C => 0,
 T => 0
 );
 my $even = <ARGV>; # read the even line
 chomp $even;
 foreach my $char(split //,$even) # split the string into separate
 # characters, and loop over them
 $charcount$char++; # Count the correct character
 
 my $total = $charcountA + $charcountG + $charcountC + $charcountT;
 my $gc = $charcountG + $charcountC;
 my $perc = $gc / $total;
 print "GC_CONT: $percn"; # Do our final calculations and
 # output the result

Note: not tested (beyond "does perl accept this code")

If you want to learn more about perl, run perldoc perlintro and get started ;-)

edited Feb 27 at 10:16

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

edited Feb 27 at 10:16

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

answered Feb 27 at 9:34

Wouter Verhelst

7,146831

Already switched from using tr, which slows the work more than now.
â€“Â Shred
Feb 27 at 9:37

Found some error in code while running. Btw, thanks a lot, still a good way to do some research on.
â€“Â Shred
Feb 28 at 13:32

What error? Probably should update it then :-)
â€“Â Wouter Verhelst
Mar 1 at 12:01

Here's the console log. pastebin.com/uXZW0802
â€“Â Shred
Mar 2 at 10:57

So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
â€“Â Wouter Verhelst
Mar 5 at 10:22

add a commentÂ |Â

Already switched from using tr, which slows the work more than now.
â€“Â Shred
Feb 27 at 9:37

Found some error in code while running. Btw, thanks a lot, still a good way to do some research on.
â€“Â Shred
Feb 28 at 13:32

What error? Probably should update it then :-)
â€“Â Wouter Verhelst
Mar 1 at 12:01

Here's the console log. pastebin.com/uXZW0802
â€“Â Shred
Mar 2 at 10:57

So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
â€“Â Wouter Verhelst
Mar 5 at 10:22

Already switched from using tr, which slows the work more than now.
â€“Â Shred
Feb 27 at 9:37

Found some error in code while running. Btw, thanks a lot, still a good way to do some research on.
â€“Â Shred
Feb 28 at 13:32

What error? Probably should update it then :-)
â€“Â Wouter Verhelst
Mar 1 at 12:01

Here's the console log. pastebin.com/uXZW0802
â€“Â Shred
Mar 2 at 10:57

So, if I copy/paste my exact code and change the my $y = ... into my $y = 0, that error does not appear. Therefore, I guess that you made an error in the changes that you made. If you haven't found the reason for that yet, I would suggest you open a different question on that subject, since that's really different from what you originally asked.
â€“Â Wouter Verhelst
Mar 5 at 10:22

add a commentÂ |Â

up vote
0
down vote

Stephen Kitt's answer is good, you want to rewrite this in a higher level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /tmp/tmpfs
# <script> /tmp/tmpfs/<input_file>

This should make the process faster as much as you are I/O constrained. But it will never be as good as it could be if rewritten in C or ruby or python.

answered Feb 27 at 9:31

Pedro

59429

add a commentÂ |Â

up vote
0
down vote

Stephen Kitt's answer is good, you want to rewrite this in a higher level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /tmp/tmpfs
# <script> /tmp/tmpfs/<input_file>

This should make the process faster as much as you are I/O constrained. But it will never be as good as it could be if rewritten in C or ruby or python.

answered Feb 27 at 9:31

Pedro

59429

add a commentÂ |Â

up vote
0
down vote

Stephen Kitt's answer is good, you want to rewrite this in a higher level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /tmp/tmpfs
# <script> /tmp/tmpfs/<input_file>

This should make the process faster as much as you are I/O constrained. But it will never be as good as it could be if rewritten in C or ruby or python.

answered Feb 27 at 9:31

Pedro

59429

Stephen Kitt's answer is good, you want to rewrite this in a higher level language in which you can cache file contents and run your string operations much more efficiently.

If you want to rule out the performance of the storage and file system, you can load the file from RAM using:

# mkdir /mnt/tmpfs
# mount -t tmpfs -o size=1024M tmpfs /mnt/tmpfs
# cp <input_file> /tmp/tmpfs
# <script> /tmp/tmpfs/<input_file>

This should make the process faster as much as you are I/O constrained. But it will never be as good as it could be if rewritten in C or ruby or python.

answered Feb 27 at 9:31

Pedro

59429

answered Feb 27 at 9:31

Pedro

59429

answered Feb 27 at 9:31

Pedro

59429

answered Feb 27 at 9:31

Pedro

59429

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

mjhjmtu