Collecting specific genome data from a file and gathering it under the same title

up vote
6
down vote

I have genome data in a file, genomes-seq.txt. Each sequence title begins with >, followed by the genome name:

>genome.1
atcg
atcg
atcggtc

>genome.2
atct
tgcgtgctt
attttt

>genome.
sdkf
sdf;ksdf
sdlfkjdslc
edsfsfv

>genome.3
as;ldkhaskjd
asdkljdsl
asdkljasdk;l

>genome.4
ekjfhdhsa
dsfkjskajd
asdknasd


>genome.1
iruuwi
sdkljbh
sdfljnsdl

>genome.234
efijhusidh
siduhygfhuji

>genome.1
ljhdcj
sdljhsdil
fweusfhygc

I want to collect all the data for genome.1 under a single title in one file, so it looks like this:

>genome.1
atcg
atcggtc

iruuwi
sdkljbh
sdfljnsdl
ljhdcj
sdljhsdil
fweusfhygc

But every time I do it using sed I get:

>genome.1
atcg
atcg
atcggtc

>genome.1
iruuwi
sdkljbh
sdfljnsdl

>genome.1
ljhdcj
sdljhsdil
fweusfhygc

That is, the genome.1 title is repeated. How can I do this correctly, so that on a large data set I don't need to remove all the repeated titles afterwards?

edited 5 hours ago

Peter Mortensen

asked 10 hours ago

paul

2

Hi @paul, what is your sed command that you used?
– Goro
10 hours ago

I tried but it didn't work
– paul
9 hours ago

2

Show what you tried and we can help fix your errors.
– glenn jackman
9 hours ago

1

Re "...every time I do it using sed...": You ought to include the sed line in the question. Otherwise, this amounts to a work order (using this site as a script-writing service).
– Peter Mortensen
7 hours ago

The file format is FASTA.
– Peter Mortensen
7 hours ago

bash

3 Answers

up vote
10
down vote

accepted

$ sed -nr /\>genome.1/,/^$/p file | sed '2,${/^>genome.1$/d}'

>genome.1
atcg
atcg
atcggtc

iruuwi
sdkljbh
sdfljnsdl

ljhdcj
sdljhsdil
fweusfhygc

genome.1 is the search key; change it to match the genome you want to extract.
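For reference, a more defensive form of the same pipeline might look like the sketch below. It quotes both sed scripts, escapes the dot, and anchors the title pattern so that a title such as >genome.10 is not picked up by accident. The wrapper name extract_genome1 is made up for illustration, and the sketch assumes that sections are separated by blank lines, as in the sample.

```sh
# extract_genome1 FILE  (hypothetical helper name, not part of the answer)
# Assumes each section ends with a blank line or at end of file.
extract_genome1() {
    # 1st sed: print every block from a line that is exactly ">genome.1"
    #          up to and including the next blank line.
    # 2nd sed: from output line 2 onwards, delete the repeated title lines.
    sed -n '/^>genome\.1$/,/^$/p' "$1" | sed '2,${/^>genome\.1$/d;}'
}
```

Called as extract_genome1 genomes-seq.txt > genome.1.txt, it keeps the first title and the blank lines that separate the collected sections.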

edited 9 hours ago

answered 9 hours ago

Goro

Hi Goro, can I be your sed friend? Where can I find such sed knowledge?
– schweik
8 hours ago

1) \> matches end-of-word, not the literal character >. 2) Please use \. to escape the literal dot in the regular expression that is supposed to match >genome.1. 3) You should also anchor it at the line start and end to avoid false matches: /^>genome\.1$/. 4) The -r flag on the first sed command is not required but harmless since you don't use any characters affected by it. 5) As a rule of thumb you should escape sed commands provided through a shell to avoid easily overlooked issues.
– David Foerster
6 hours ago

up vote
6
down vote

With perl

perl -00 -ne 'if (/^>genome\.1\n/) {s/// if $. > 1; print}' file
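The one-liner reads the file in paragraph mode (-00), so each blank-line-separated section is one record, and it drops the title from every matching record except record 1. That relies on genome.1 being the very first section. A possible variant, sketched below, counts matches instead of checking the record number, so the first matching section keeps its title wherever it appears. The wrapper name extract_genome1_perl and the counter $seen are illustrative choices, not part of the answer.

```sh
# extract_genome1_perl FILE  (hypothetical helper name)
# Assumes sections are separated by one or more blank lines.
extract_genome1_perl() {
    # -00: paragraph mode, one section per record.
    # $seen counts matching sections; the title is stripped from the
    # second match onwards, then the record is printed.
    perl -00 -ne 'if (/^>genome\.1\n/) { s/^>genome\.1\n// if $seen++; print }' "$1"
}
```

Here the substitution pattern is spelled out rather than left empty, which avoids depending on Perl reusing the last matched pattern.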

answered 9 hours ago

glenn jackman

up vote
0
down vote

With Awk:


{
    if (/^>/) {
        in_section = 0;
        if ($0 == ">genome.1") {
            in_section = 1;
            if (!section_count++)
                print;
        }
    } else if (in_section) {
        print;
    }
}

Usage:

awk '{ if (/^>/) { in_section = 0; if ($0 == ">genome.1") { in_section = 1; if (!section_count++) print; } } else if (in_section) print; }' genomes-seq.txt
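If the genome name should not be hard-coded, the same logic could be wrapped in a small shell function that passes the title to awk as a variable. This is only a sketch: the function name extract_genome and the awk variable names header and seen are made up here, and the name is compared as a plain string, so the dot in genome.1 needs no escaping.

```sh
# extract_genome NAME FILE  (hypothetical helper, e.g. extract_genome genome.1 genomes-seq.txt)
extract_genome() {
    awk -v header=">$1" '
        /^>/ {
            # A title line: remember whether it is the section we want.
            in_section = ($0 == header)
            # Print the title only the first time it is seen.
            if (in_section && !seen++) print
            next
        }
        # Any other line is printed while inside a wanted section.
        in_section
    ' "$2"
}
```

Because the comparison is an exact string match, extract_genome genome.1 does not pick up >genome.10 or >genome.234.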

answered 6 hours ago

David Foerster
