Collecting specific genome data from a file under a single title
6 votes
I have genome data in a file, genomes-seq.txt. The titles of the sequences begin with >, followed by the genome name:
>genome.1
atcg
atcg
atcggtc
>genome.2
atct
tgcgtgctt
attttt
>genome.
sdkf
sdf;ksdf
sdlfkjdslc
edsfsfv
>genome.3
as;ldkhaskjd
asdkljdsl
asdkljasdk;l
>genome.4
ekjfhdhsa
dsfkjskajd
asdknasd
>genome.1
iruuwi
sdkljbh
sdfljnsdl
>genome.234
efijhusidh
siduhygfhuji
>genome.1
ljhdcj
sdljhsdil
fweusfhygc
I want to collect all the data for genome.1 under a single title, so that the output looks like this:
>genome.1
atcg
atcggtc
iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc
But every time I do it using sed I get:
>genome.1
atcg
atcg
atcggtc
>genome.1
iruuwi
sdkljbh
sdfljnsdl
>genome.1
ljhdcj
sdljhsdil
fweusfhygc
That is, multiple genome.1 titles. How can I do it correctly, so that on a large data set I don't need to remove all the repetitions?
bash
asked 10 hours ago by paul (new contributor), edited 5 hours ago by Peter Mortensen
Hi @paul, what is the sed command that you used?
– Goro, 10 hours ago
I tried but it didn't work
– paul, 9 hours ago
Show what you tried and we can help fix your errors.
– glenn jackman, 9 hours ago
Re "...every time I do it using sed...": You ought to include the sed line in the question. Otherwise, this amounts to a work order (using this site as a script-writing service).
– Peter Mortensen, 7 hours ago
The file format is FASTA.
– Peter Mortensen, 7 hours ago
3 Answers
10 votes, accepted
$ sed -nr /\>genome.1/,/^$/p file | sed '2,${/^>genome.1$/d}'
>genome.1
atcg
atcggtc
iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc
genome.1 is the keyword; change it depending on the genome you would like to extract.
answered 9 hours ago by Goro (edited 9 hours ago)
Hi Goro, can I be your sed friend? Where can I find such sed knowledge?
– schweik, 8 hours ago
1) \> matches end-of-word, not the literal character >. 2) Please use \. to escape the literal dot in the regular expression that is supposed to match >genome.1. 3) You should also anchor it at the line start and end to avoid false matches: /^>genome\.1$/. 4) The -r flag on the first sed command is not required but harmless since you don't use any characters affected by it. 5) As a rule of thumb you should escape sed commands provided through a shell to avoid easily overlooked issues.
– David Foerster, 6 hours ago
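Taken together, the five points in the comment above suggest a pipeline along the following lines. This is only a sketch, not the answerer's command: it assumes, as the `/^$/` end of the range implies, that every record in the file is followed by a blank line. The function name `collect_sed` and the extra `/^$/d` clean-up step are additions made for illustration.

```shell
# Hypothetical helper: print all genome.1 records under a single title.
# Assumption: each record in the file is terminated by a blank line.
collect_sed() {
    # $1 = input file
    # First sed: print every block from an exact ">genome.1" title up to
    # the next blank line. The dot is escaped and the pattern is anchored.
    # Second sed: from line 2 onwards drop repeated titles, then drop the
    # blank separator lines.
    sed -n '/^>genome\.1$/,/^$/p' "$1" |
        sed -e '2,${/^>genome\.1$/d;}' -e '/^$/d'
}
```

It could be called as `collect_sed genomes-seq.txt`. Both sed scripts are single-quoted so that the shell leaves `$`, `>` and the backslashes alone.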
6 votes
With perl:
perl -00 -ne 'if (/^>genome.1\n/) {s/// if $. > 1; print}' file
answered 9 hours ago by glenn jackman
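The `-00` switch makes perl read the file in paragraph mode, that is, one blank-line-separated block at a time. For comparison, a rough awk counterpart sets `RS` to the empty string. This sketch assumes the records in the file really are separated by blank lines, and the function name `collect_paragraphs` is made up for illustration.

```shell
# Hypothetical helper: paragraph-mode extraction of genome.1 with awk.
# Assumption: records in the file are separated by blank lines.
collect_paragraphs() {
    # $1 = input file
    awk 'BEGIN { RS = ""; FS = "\n" }      # one record per block, one field per line
         $1 == ">genome.1" {
             start = (seen++ ? 2 : 1)      # keep the title only for the first block
             for (i = start; i <= NF; i++) print $i
         }' "$1"
}
```

Unlike the `$. > 1` test in the perl one-liner, the `seen` counter keeps the title of the first matching block even when genome.1 is not the first record in the file.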
0 votes
With Awk:
{
    if (/^>/) {
        # A title line ends the current section
        in_section = 0;
        if ($0 == ">genome.1") {
            in_section = 1;
            # Print the title only the first time it is seen
            if (!section_count++)
                print;
        }
    } else if (in_section) {
        print;
    }
}
Usage:
awk '{ if (/^>/) { in_section = 0; if ($0 == ">genome.1") { in_section = 1; if (!section_count++) print; } } else if (in_section) print; }' genome.txt
answered 6 hours ago by David Foerster
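Since the title is compared as a plain string, the same idea is easy to wrap in a small shell function that takes the genome name as an argument. The following is a minimal sketch under that assumption; the function name `collect_genome` and the awk variable `title` are hypothetical and not part of the answer above.

```shell
# Hypothetical wrapper: collect all records for one genome under one title.
collect_genome() {
    # $1 = genome name without the leading ">", $2 = input file
    awk -v title=">$1" '
        /^>/ {
            in_section = ($0 == title)         # exact string comparison, no regex
            if (in_section && !seen++) print   # print the title only once
            next
        }
        in_section && NF                       # sequence lines, skipping blank ones
    ' "$2"
}
```

For example, `collect_genome genome.1 genomes-seq.txt > genome.1.txt` would write the merged record to its own file. This version does not depend on blank lines between records, and the exact comparison means that genome.10 or genome.234 are not picked up by mistake.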