Extract FASTA entries from a list using while read

I have 28 files, each with ~14,000 "entries". A single entry consists of a header line of the form >string, followed on the next line by a sequence, which is also a string. Sequence length varies from entry to entry.
The same entry headers appear in all 28 files, but the sequence under a given header differs from file to file.



For example, one file, CR1_ref.fasta, looks like this:



>FBgn0080937
ATGGATAAAAGGCTCAGCGATAGTCCCGGAGATTGTCGCGTAACCAGATCCAGCATGACGCCCACCCTCCGCTTGGAGCACAGTCCCCGGCGGCAACAACAGCAACAACA
>FBgn0076379
ATGCTGCGCACCCTTTTCGCCGTGCGTGGTCAGTGCCAGCAGCTGCTGAGGAGAACATTCACCCCCCATTGCAGTGGCCAACGA
>FBgn0070974
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAACTCCTGCGGGAGCTGCCGCCGCAGAAATGCTCCAGCGCCACGCTGGCCAAGAAGGTGCTGTCGCAGAGCCCGCCGGCAGCCCCGCCGCCCACACCGGCCACAATTGTGCCGCTCACTGCGGTGCCCGTCATCCAGCTGACGCCTCCGTCGCACTCCGGCGACACGCCGCAAAAGCCAGCACCTCCGGCGCCGCCGCCGCC


The overall goal is to create ~14,000 new files, where each file holds the entries for one particular ID/header collected from all 28 files.



To extract a single entry from a single file I can use the following command:



sed -n '/^>FBgn0080937$/{p;n;p;}' CR1_ref.fasta


To extract this entry from all 28 files, each of which has a name ending in ref.fasta, I can do:



for i in *ref.fasta; do sed -n '/^>FBgn0080937$/{p;n;p;}' "$i"; done > FBgn0080937.fasta
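A note on sed expressions of this shape, offered as a minimal sketch: an address such as /^>ID$/ governs only the single command that follows it, so p, n and p have to be grouped in braces if the whole group is meant to run only on the matching header. The file demo_ref.fasta and its contents below are made up for illustration, and the sketch assumes each sequence sits on exactly one line.

```shell
# Hypothetical two-entry file, one sequence line per header (an assumption).
workdir=$(mktemp -d)
printf '>FBgn0080937\nATGGAT\n>FBgn0076379\nATGCTG\n' > "$workdir/demo_ref.fasta"

# Ungrouped: the address applies to the first p only, so n;p runs on every
# cycle and every sequence line is printed.
ungrouped=$(sed -n '/^>FBgn0076379$/p;n;p;' "$workdir/demo_ref.fasta")

# Grouped: p, n and p all run only when the header matches.
grouped=$(sed -n '/^>FBgn0076379$/{p;n;p;}' "$workdir/demo_ref.fasta")

printf 'ungrouped:\n%s\n\ngrouped:\n%s\n' "$ungrouped" "$grouped"
```

With the ungrouped form the unrelated sequence ATGGAT is printed as well; the grouped form prints only the requested header and its sequence.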


I have a separate text file, gene.txt, with 14,000 lines, each line being the header ID of one entry.
The first few lines of this file look like this:



FBgn0080937
FBgn0076379
FBgn0070974
FBgn0081668
FBgn0076576
FBgn0076572
FBgn0079684
FBgn0070907
FBgn0080226
FBgn0072746


I would like to read through this file, creating one new text file per header ID.
Below, $F holds the current header ID (FBgn*); sed extracts the entries for that header and stores them in a new file. I am using the substitution command to rename sequences based on which ref.fasta file they come from.



while read -r line;
do F=$line
for i in *ref.fasta
do sed -n "/^>$F$/s/FB.*/$i/;p;n;p;" $i > $line.fasta
done
done < "gene.txt"


Currently this script creates 14,000 files, but each file contains only a single sequence.



>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC


I am expecting 28 sequences, one per *ref.fasta file. The sed command is outputting only the last entry.
The expected output would be



>CR1_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACC
>FH2_ref.fasta
AGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
>MSH10_ref.fasta
CGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
>Z9_ref.fasta
ATGCAGACGCGTCCGAGCAGTGAACCGCAGCGCGCCAAGGAGCAAC
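One possible way to get from the current output to the expected one, offered as a sketch rather than a definitive fix: in the loop above, the redirection > $line.fasta sits inside the inner for loop, so the output file is truncated on every iteration and only the output for the last *ref.fasta file survives. Attaching the redirection to the inner loop's done collects the output of all files instead. The sketch also groups the sed commands in braces so that they run only on the matching header. The sample files and sequences are made up for illustration. It assumes one sequence line per header, IDs without characters that are special to sed, and file names without / or &.

```shell
# Made-up sample data in a scratch directory (assumption: one sequence line per header).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>FBgn0080937\nAAAA\n>FBgn0076379\nCCCC\n' > CR1_ref.fasta
printf '>FBgn0080937\nGGGG\n>FBgn0076379\nTTTT\n' > Z9_ref.fasta
printf 'FBgn0080937\nFBgn0076379\n' > gene.txt

while IFS= read -r id; do
    for f in *ref.fasta; do
        # On the matching header: replace it with the source file name,
        # print it, then fetch and print the sequence line that follows.
        sed -n "/^>$id\$/{s/.*/>$f/;p;n;p;}" "$f"
    done > "$id.fasta"   # one redirection per ID, outside the inner loop
done < gene.txt

cat FBgn0080937.fasta
```

Using >> inside the inner loop would also avoid the truncation, but then a second run appends to the files left over from the first, so the single redirection per ID seems the safer choice.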






  • 1
    Can you rewrite this question in computing terms? I only have the vaguest idea what a fasta thingie is, and I certainly couldn't identify one in a data file.
    – roaima
    Dec 13 '17 at 21:05

  • 1
    Possible duplicate of How can I use variables when doing a sed?
    – ilkkachu
    Dec 13 '17 at 21:05

  • 1
    In short: variable expansion works within double quotes "", not within single quotes ''. Also, you can just do F=$line, there's no need to do the pass around the echo (unless you want the side effects of the unquoted expansion in echo $line)
    – ilkkachu
    Dec 13 '17 at 21:06

  • 3
    This is on topic here, but I would strongly recommend you post this on Bioinformatics instead. The people here are very knowledgeable about text parsing but, as you can see from the comments, most of us won't have any idea what fasta is. If you like, I can migrate it over for you. If not, please edit and explain (also mention if your sequences will always be one line or not). You should also explain what you are trying to do because it really isn't obvious and I'm a bioinformatician.
    – terdon♦
    Dec 13 '17 at 21:06

  • 1
    Also, it sounds like you're trying to reinvent fastaexplode from the exonerate suite which will do this for you.
    – terdon♦
    Dec 13 '17 at 21:42
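As an alternative to a dedicated tool, here is a hedged sketch of a single-pass approach with awk. It reads every *ref.fasta file once instead of starting sed 28 × 14,000 = 392,000 times. It is not the loop from the question: it does not consult gene.txt but writes one output file for every header it encounters, and it assumes each sequence occupies exactly one line. The sample files and sequences are made up for illustration.

```shell
# Made-up sample data in a scratch directory (assumption: one sequence line per header).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>FBgn0080937\nAAAA\n>FBgn0076379\nCCCC\n' > CR1_ref.fasta
printf '>FBgn0080937\nGGGG\n>FBgn0076379\nTTTT\n' > Z9_ref.fasta

awk '
    /^>/ { id = substr($0, 2); next }     # remember the current header ID
    {
        out = id ".fasta"
        # Label the sequence with the file it came from, as in the expected output.
        printf ">%s\n%s\n", FILENAME, $0 >> out
        close(out)                         # avoid running out of open file descriptors
    }
' *ref.fasta

cat FBgn0076379.fasta
```

Because the output is appended, any FBgn*.fasta files left from an earlier run should be removed before running it again.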














edited Dec 14 '17 at 14:25
asked Dec 13 '17 at 20:54
Dcastillo
