Collecting specific genome data from a file and collect it in the same title
I have genome data in a file, genomes-seq.txt. The title of each sequence begins with >, followed by the genome name:
>genome.1
atcg
atcg
atcggtc
>genome.2
atct
tgcgtgctt
attttt
>genome.
sdkf
sdf;ksdf
sdlfkjdslc
edsfsfv
>genome.3
as;ldkhaskjd
asdkljdsl
asdkljasdk;l
>genome.4
ekjfhdhsa
dsfkjskajd
asdknasd
>genome.1
iruuwi
sdkljbh
sdfljnsdl
>genome.234
efijhusidh
siduhygfhuji
>genome.1
ljhdcj
sdljhsdil
fweusfhygc
I want to collect all the data for genome.1 into one file under a single header, so that it looks like this:
>genome.1
atcg
atcggtc
iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc
But every time I do it using sed I get:
>genome.1
atcg
atcg
atcggtc
>genome.1
iruuwi
sdkljbh
sdfljnsdl
>genome.1
ljhdcj
sdljhsdil
fweusfhygc
That is, the >genome.1 header appears multiple times. How can I do this correctly, so that I don't have to remove all the repetitions on a large data set?
Hi @paul, what is your sed command that you used? - Goro, 10 hours ago (2 votes)
I tried but it didn't work - paul, 9 hours ago
Show what you tried and we can help fix your errors. - glenn jackman, 9 hours ago (2 votes)
Re "...every time I do it using sed...": You ought to include the sed line in the question. Otherwise, this amounts to a work order (using this site as a script-writing service). - Peter Mortensen, 7 hours ago (1 vote)
The file format is FASTA. - Peter Mortensen, 7 hours ago
Question asked 10 hours ago by paul; edited 5 hours ago by Peter Mortensen.
3 Answers
10 votes, accepted
$ sed -nr /\>genome.1/,/^$/p file | sed '2,${/^>genome.1$/d}'
>genome.1
atcg
atcggtc
iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc
genome.1 is the keyword; change it depending on which genome you want to extract.
Hi Goro, can I be your sed friend? Where can I find such sed knowledge? - schweik, 8 hours ago
1) \> matches end-of-word, not the literal character >. 2) Please use \. to escape the literal dot in the regular expression that is supposed to match >genome.1. 3) You should also anchor it at the line start and end to avoid false matches: /^>genome\.1$/. 4) The -r flag on the first sed command is not required but harmless since you don't use any characters affected by it. 5) As a rule of thumb you should escape sed commands provided through a shell to avoid easily overlooked issues. - David Foerster, 6 hours ago
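The suggestions in the comment above could be applied to the accepted pipeline roughly as follows. This is only a sketch: the function name extract_genome1 is made up for illustration, and it assumes, as the /^$/ range end in the answer implies, that records in the file are separated by blank lines. The second sed also drops those blank lines so the result matches the output requested in the question.

```shell
#!/bin/sh
# Sketch of the accepted sed pipeline with the review comments applied.
# extract_genome1 is a hypothetical helper name; $1 is the input file.
# Assumption: records are separated by blank lines.
extract_genome1() {
    # Stage 1: print every block from an exact ">genome.1" header line
    # up to the next blank line (quoted, dot escaped, anchored).
    sed -n '/^>genome\.1$/,/^$/p' "$1" |
        # Stage 2: drop blank lines, and drop every repeated header
        # from the second output line onwards.
        sed -e '/^$/d' -e '2,${/^>genome\.1$/d;}'
}
```

Called as extract_genome1 genomes-seq.txt, it prints the header once followed by all genome.1 sequence lines. Because the pattern is anchored, a header such as >genome.11 is not picked up.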
6 votes
With perl
perl -00 -ne 'if (/^>genome.1\n/) {s/// if $. > 1; print}' file
0 votes
With Awk:
{
    if (/^>/) {
        in_section = 0;
        if ($0 == ">genome.1") {
            in_section = 1;
            if (!section_count++)
                print;
        }
    } else if (in_section) {
        print;
    }
}
Usage:
awk '{ if (/^>/) { in_section = 0; if ($0 == ">genome.1") { in_section = 1; if (!section_count++) print; } } else if (in_section) print; }' genome.txt
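Since the genome name is meant to be changed per run, the same awk logic can be wrapped so that the name is passed in rather than hard-coded. The following is a minimal sketch under that assumption; the function name extract_genome and the awk variables target and printed are invented for illustration. It compares the header as a plain string, so the dot needs no escaping, and it does not depend on blank lines between records.

```shell
#!/bin/sh
# Sketch: parametrised version of the awk approach.
# extract_genome is a hypothetical helper:
#   $1 = genome name without the leading ">", $2 = input file.
extract_genome() {
    awk -v target=">$1" '
        /^>/ {
            # A header line starts a new record; note whether it is ours.
            in_section = ($0 == target)
            # Print the wanted header only the first time it is seen.
            if (in_section && !printed++) print
            next
        }
        # Print non-empty sequence lines that belong to the wanted genome.
        in_section && NF { print }
    ' "$2"
}
```

For example, extract_genome genome.1 genomes-seq.txt > genome.1.txt writes the merged record to its own file.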
sed answer: answered 9 hours ago by Goro (edited 9 hours ago).
perl answer: answered 9 hours ago by glenn jackman.
awk answer: answered 6 hours ago by David Foerster.