extract fasta entries from list using while read

I have 28 files that each contain ~14,000 "entries". A single entry consists of a header line of the form >string, followed by a newline and then a sequence, which is also a string. Sequence length varies from entry to entry.
The same entry headers appear in all 28 files, but the sequence under a given header differs from file to file.

For example, one file, CR1_ref.fasta, looks like this:

>FBgn0080937
ATGGATAAAAGGCTCAGCGATAGTCCCGGAGATTGTCGCGTAACCAGATCCAGCATGACGCCCACCCTCCGCTTGGAGCACAGTCCCCGGCGGCAACAACAGCAACAACA
>FBgn0076379
ATGCTGCGCACCCTTTTCGCCGTGCGTGGTCAGTGCCAGCAGCTGCTGAGGAGAACATTCACCCCCCATTGCAGTGGCCAACGA
>FBgn0070974
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAACTCCTGCGGGAGCTGCCGCCGCAGAAATGCTCCAGCGCCACGCTGGCCAAGAAGGTGCTGTCGCAGAGCCCGCCGGCAGCCCCGCCGCCCACACCGGCCACAATTGTGCCGCTCACTGCGGTGCCCGTCATCCAGCTGACGCCTCCGTCGCACTCCGGCGACACGCCGCAAAAGCCAGCACCTCCGGCGCCGCCGCCGCC

The overall goal is to create ~14,000 new files, where each file collects the entries for one particular ID/header from all 28 files.

To extract a single entry from a single file, I can use the following command:

sed -n '/^>FBgn0080937$/p;n;p;' CR1_ref.fasta

To extract this entry across all 28 files, each of whose names ends in ref.fasta, I can do:

for i in *ref.fasta; do sed -n '/^>FBgn0080937$/p;n;p' $i; done > FBgn0080937.fasta
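One caveat about the sed program used in the two commands above: without braces, only the first p is tied to the address, so n;p runs on every cycle. The sketch below compares the ungrouped and grouped forms on made-up demo data (the file contents and the .out file names are assumptions for illustration, not the real data).

```shell
#!/bin/sh
# Hypothetical demo data; the sequences are invented for illustration.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '>FBgn0080937\nAAAA\n>FBgn0076379\nCCCC\n' > CR1_ref.fasta

# Ungrouped: "n;p" runs on every cycle, so every sequence line is printed,
# not just the one under the matching header.
sed -n '/^>FBgn0080937$/p;n;p' CR1_ref.fasta > ungrouped.out

# Grouped: p;n;p run only when the header matches, giving header + sequence.
sed -n '/^>FBgn0080937$/{p;n;p;}' CR1_ref.fasta > grouped.out
```

On this demo input the grouped form writes only the requested header and its sequence, while the ungrouped form also writes the sequence of the other entry.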

I have a separate text file, gene.txt, with 14,000 lines, each line corresponding to the header of one entry.
The first few lines of this file look like this:

FBgn0080937
FBgn0076379
FBgn0070974
FBgn0081668
FBgn0076576
FBgn0076572
FBgn0079684
FBgn0070907
FBgn0080226
FBgn0072746

I would like to read through this file, creating a new text file per header ID.
Below, $F holds the header (FBgn*) whose entries are being extracted and stored in a new file. I am using the substitution command to rename sequences based on which ref.fasta file they come from.

while read -r line;
do F=$line
for i in *ref.fasta
do sed -n "/^>$F$/s/FB.*/$i/;p;n;p;" $i > $line.fasta
done
done < "gene.txt"

Currently this script creates 14,000 files, but each file has only a single sequence:

>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC

I am expecting 28 sequences, one per *ref.fasta file. The sed command appears to output only the last entry.
The expected output would be:

 >CR1_ref.fasta
 ATGCAGACGCGTCCGAGCAGTGAACC
 >FH2_ref.fasta
 AGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
 >MSH10_ref.fasta
 CGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
 >Z9_ref.fasta
 ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
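A likely cause of the single-sequence output is that > $line.fasta sits inside the inner loop, so each of the 28 sed runs truncates the file and only the last one survives. A minimal sketch of one possible fix follows, assuming single-line sequences and using made-up demo data; the variable names id and f are illustrative choices. It moves the redirection to the inner loop as a whole and groups the sed commands in braces.

```shell
#!/bin/sh
# Hypothetical demo data: two small reference files and a two-line gene list.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '>FBgn0080937\nAAAA\n>FBgn0076379\nCCCC\n' > CR1_ref.fasta
printf '>FBgn0080937\nGGGG\n>FBgn0076379\nTTTT\n' > Z9_ref.fasta
printf 'FBgn0080937\nFBgn0076379\n' > gene.txt

while IFS= read -r id; do
    for f in *ref.fasta; do
        # Braces restrict s, p, n, p to the matching header line only;
        # the header is rewritten to ">" plus the source file name.
        sed -n "/^>$id\$/{s/.*/>$f/;p;n;p;}" "$f"
    done > "$id.fasta"    # opened once per ID, so no run overwrites another
done < gene.txt
```

With the demo data, FBgn0080937.fasta ends up with one renamed entry per reference file. Appending with >> inside the loop would also avoid the truncation, but stale output from earlier runs would then have to be removed first.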

edited Dec 14 '17 at 14:25

asked Dec 13 '17 at 20:54

Dcastillo

Can you rewrite this question in computing terms? I only have the vaguest idea what a fasta thingie is, and I certainly couldn't identify one in a data file.
– roaima
Dec 13 '17 at 21:05

Possible duplicate of How can I use variables when doing a sed?
– ilkkachu
Dec 13 '17 at 21:05

In short: variable expansion works within double quotes "", not within single quotes ''. Also, you can just do F=$line; there's no need to pass it through echo (unless you want the side effects of the unquoted expansion in echo $line).
– ilkkachu
Dec 13 '17 at 21:06
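The quoting point in the comment above can be checked with a small sketch. The variable names single and double are illustrative only, and the ID is one of those listed in the question.

```shell
#!/bin/sh
F=FBgn0080937   # ID taken from the question's gene list

single='/^>$F$/p'     # single quotes: $F is kept literally, no expansion
double="/^>$F\$/p"    # double quotes: $F expands; \$ keeps the regex anchor

# Print both sed programs so the difference is visible.
printf '%s\n' "$single" "$double"
```

Only the double-quoted form yields a sed address containing the actual ID, which is why the loop in the question needs double quotes around the sed program.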

This is on topic here, but I would strongly recommend you post this on Bioinformatics instead. The people here are very knowledgeable about text parsing but, as you can see from the comments, most of us won't have any idea what fasta is. If you like, I can migrate it over for you. If not, please edit and explain (also mention if your sequences will always be one line or not). You should also explain what you are trying to do because it really isn't obvious and I'm a bioinformatician.
– terdon♦
Dec 13 '17 at 21:06

Also, it sounds like you're trying to reinvent fastaexplode from the exonerate suite which will do this for you.
– terdon♦
Dec 13 '17 at 21:42
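Independently of any dedicated tool, the same regrouping can be sketched as a single awk pass, which reads each reference file once instead of once per ID. This is an illustration on made-up demo data, not a tested replacement for the loop: it writes a file for every header it encounters rather than consulting gene.txt, which is assumed here to list every header anyway.

```shell
#!/bin/sh
# Hypothetical demo data, invented for illustration.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '>FBgn0080937\nAAAA\n>FBgn0076379\nCCCC\n' > CR1_ref.fasta
printf '>FBgn0080937\nGGGG\n>FBgn0076379\nTTTT\n' > Z9_ref.fasta

awk '
    # Header line: choose the per-ID output file and write the source
    # file name as the new header.
    /^>/ { out = substr($0, 2) ".fasta"
           print ">" FILENAME >> out
           close(out)              # avoid holding ~14,000 files open
           next }
    # Sequence line(s): append to the file chosen by the last header.
    out != "" { print >> out; close(out) }
' *ref.fasta
```

Because it appends with >>, the output files from a previous run should be deleted before rerunning. Sequences that span several lines are copied as they are.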
