extract fasta entries from list using while read
I have 28 files, each with ~14,000 "entries". A single entry consists of a header, denoted by >string, a newline, and then a sequence, which is a string. The sequence length varies from entry to entry.
The same entry headers appear in all 28 files, but the sequence under a given header differs from file to file.
For example, one file, CR1_ref.fasta, looks like this:
>FBgn0080937
ATGGATAAAAGGCTCAGCGATAGTCCCGGAGATTGTCGCGTAACCAGATCCAGCATGACGCCCACCCTCCGCTTGGAGCACAGTCCCCGGCGGCAACAACAGCAACAACA
>FBgn0076379
ATGCTGCGCACCCTTTTCGCCGTGCGTGGTCAGTGCCAGCAGCTGCTGAGGAGAACATTCACCCCCCATTGCAGTGGCCAACGA
>FBgn0070974
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAACTCCTGCGGGAGCTGCCGCCGCAGAAATGCTCCAGCGCCACGCTGGCCAAGAAGGTGCTGTCGCAGAGCCCGCCGGCAGCCCCGCCGCCCACACCGGCCACAATTGTGCCGCTCACTGCGGTGCCCGTCATCCAGCTGACGCCTCCGTCGCACTCCGGCGACACGCCGCAAAAGCCAGCACCTCCGGCGCCGCCGCCGCC
The overall goal is to create ~14,000 new files, where each file collects the entries for one particular ID/header from all 28 files.
To extract a single entry from a single file, I can use the following command:
sed -n '/^>FBgn0080937$/{p;n;p;}' CR1_ref.fasta
To extract this entry from all 28 files, whose names all end in ref.fasta, I can do:
for i in *ref.fasta; do sed -n '/^>FBgn0080937$/{p;n;p;}' "$i"; done > FBgn0080937.fasta
I have a separate text file, gene.txt, with 14,000 lines, each holding the header ID of one entry.
The first few lines of this file look like this:
FBgn0080937
FBgn0076379
FBgn0070974
FBgn0081668
FBgn0076576
FBgn0076572
FBgn0079684
FBgn0070907
FBgn0080226
FBgn0072746
I would like to read through this file, creating a new text file per header ID.
Below, $F holds the header ID (FBgn*) whose entries are extracted and stored in a new file. I am using the substitution command to rename the sequences based on which ref.fasta file they come from.
while read -r line
do
    F=$line
    for i in *ref.fasta
    do
        sed -n "/^>$F$/{s/FB.*/$i/;p;n;p;}" "$i" > "$line.fasta"
    done
done < "gene.txt"
Currently this script creates 14,000 files, but each file contains only a single sequence:
>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
I am expecting 28 sequences, one per *ref.fasta file, but only the entry from the last file is output.
The expected output would be:
>CR1_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACC
>FH2_ref.fasta
AGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
>MSH10_ref.fasta
CGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
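One likely explanation, offered as a sketch rather than a definitive answer: a > redirection placed inside the inner for loop truncates and rewrites the per-ID output file once for every *ref.fasta file, so only the output for the last file in glob order (Z9_ref.fasta here) survives. Redirecting the whole inner loop once per ID avoids that. The example below runs on made-up two-entry sample files in a scratch directory; the names workdir, id and f are hypothetical, the sequences are assumed to occupy a single line, and the sed commands are grouped in braces so that p;n;p only runs on the matching header.

```shell
# Scratch directory with made-up sample data (hypothetical, for illustration only)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>FBgn0080937\nATGG\n>FBgn0076379\nATGC\n' > CR1_ref.fasta
printf '>FBgn0080937\nTTGG\n>FBgn0076379\nTTGC\n' > Z9_ref.fasta
printf 'FBgn0080937\nFBgn0076379\n' > gene.txt

while IFS= read -r id; do
    for f in *ref.fasta; do
        # On the matching header only: swap the ID for the file name, print
        # the header, then read and print the next line (the sequence).
        sed -n "/^>$id\$/{s/FB.*/$f/;p;n;p;}" "$f"
    done > "$id.fasta"    # opened once per ID, so all files' entries are kept
done < gene.txt

cat FBgn0080937.fasta
```

Appending with >> inside the inner loop would also keep every entry, but then a rerun adds to whatever the output files already contain, so redirecting the loop as a whole seems the safer choice.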
text-processing sed
Can you rewrite this question in computing terms? I only have the vaguest idea what a fasta thingie is, and I certainly couldn't identify one in a data file.
– roaima, Dec 13 '17 at 21:05
Possible duplicate of How can I use variables when doing a sed?
– ilkkachu, Dec 13 '17 at 21:05
In short: variable expansion works within double quotes "", not within single quotes ''. Also, you can just do F=$line, there's no need to do the pass around the echo (unless you want the side effects of the unquoted expansion in echo $line)
– ilkkachu, Dec 13 '17 at 21:06
This is on topic here, but I would strongly recommend you post this on Bioinformatics instead. The people here are very knowledgeable about text parsing but, as you can see from the comments, most of us won't have any idea what fasta is. If you like, I can migrate it over for you. If not, please edit and explain (also mention if your sequences will always be one line or not). You should also explain what you are trying to do because it really isn't obvious and I'm a bioinformatician.
– terdon♦, Dec 13 '17 at 21:06
Also, it sounds like you're trying to reinvent fastaexplode from the exonerate suite which will do this for you.
– terdon♦, Dec 13 '17 at 21:42
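For anyone who would rather stay with standard tools, running sed once per ID per file means roughly 14,000 × 28 passes over the data. A single awk pass over all the files may be a workable alternative. This is only a sketch under assumptions: sequences occupy one line each, the sample data and the names workdir, want, keep, id and out are made up for illustration, and because the output is appended, old FBgn*.fasta files should be removed before a rerun.

```shell
# Scratch directory with made-up sample data (hypothetical, for illustration only)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>FBgn0080937\nATGG\n>FBgn0076379\nATGC\n' > CR1_ref.fasta
printf '>FBgn0080937\nTTGG\n>FBgn0076379\nTTGC\n' > Z9_ref.fasta
printf 'FBgn0080937\n' > gene.txt

awk '
    # First file (gene.txt): remember the wanted IDs.
    NR == FNR { want[$1] = 1; next }
    # Header line: strip the leading ">" and note whether this ID is wanted.
    /^>/      { id = substr($0, 2); keep = (id in want); next }
    # Sequence line of a wanted entry: append it, labelled with its source file.
    keep      { out = id ".fasta"
                printf ">%s\n%s\n", FILENAME, $0 >> out
                close(out) }    # closed each time to stay under the open-file limit
' gene.txt *ref.fasta

cat FBgn0080937.fasta
```

Closing the output file after every write is slower than keeping it open, but with ~14,000 output files it avoids hitting the per-process limit on open file descriptors.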
edited Dec 14 '17 at 14:25
asked Dec 13 '17 at 20:54
Dcastillo