Selectively retrieve portions of a large file if a condition is met

I have a large file with many sections like this:
Bayes Empirical Bayes (BEB) analysis (Yang, Wong & Nielsen 2005. Mol.
Biol. Evol. 22:1107-1118)
Positively selected sites (*: P>95%; **: P>99%)
(amino acids refer to 1st sequence: 33134_Pseudomonas_10M)
Pr(w>1) post mean +- SE for w
271 A 0.911 1.524 +- 0.000
369 D 0.955* 1.467 +- 0.153
492 S 0.916 1.439 +- 0.203
The grid (...)
I need a command that says something like: if after "BEB" and before "The grid" there is a "*" or "**" right after a number, print that whole row and add what is after "(amino acids refer to 1st sequence:" and before ")" in a new column. For example:
369 D 0.955* 1.467 +- 0.153 33134_Pseudomonas_10M
Note: if two or more rows in the same section have "*" or "**", I only need the added text once. For example:
369 D 0.955* 1.467 +- 0.153 33134_Pseudomonas_10M
378 R 0.987* 2.323 +- 0.254
text-processing text-formatting bioinformatics
This could do it: awk '/BEB/,/The grid/{ if (gsub(/^ *\(amino acids refer to 1st sequence: *|\)$/, "") == 2) seq = $0; else if ($3 ~ /^[0-9.]+\*+$/) { print $0, seq; seq = "" } }' the_file. But it's hard to know from your snippet. You should probably make the regexps (/BEB/, etc.) narrower.
– mosvy
Nov 21 at 23:29
Worked perfectly. Thanks @mosvy!
– Manuel
Nov 21 at 23:36
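For readers who prefer a commented version, the approach from the comment above can be sketched as a small shell function. This is only a sketch under assumptions: the function name extract_beb_sites is made up for illustration, the section is assumed to start at the full "Bayes Empirical Bayes (BEB) analysis" header and end at a line beginning with "The grid" (narrower patterns than /BEB/, as the comment suggests), the sequence name is assumed not to contain ")", and the probability is assumed to be the third whitespace-separated field.

```shell
#!/bin/sh
# extract_beb_sites is a hypothetical helper name, not part of any tool.
# It prints the starred rows of every BEB section and appends the reference
# sequence name to the first starred row of each section only.
extract_beb_sites() {
    awk '
        # Narrowed range: from the BEB header line to the "The grid" line.
        /Bayes Empirical Bayes \(BEB\) analysis/, /^[[:space:]]*The grid/ {
            if ($0 ~ /\(amino acids refer to 1st sequence:/) {
                # Reference line: keep only the text between the prefix and ")".
                seq = $0
                sub(/^.*\(amino acids refer to 1st sequence:[[:space:]]*/, "", seq)
                sub(/\).*$/, "", seq)
            } else if ($3 ~ /^[0-9.]+\*+$/) {
                # Data row whose third field is a number followed by "*" or "**".
                out = $0
                if (seq != "") out = out " " seq
                print out
                seq = ""    # append the name only once per section
            }
        }
    ' "$@"
}
```

It could then be called as extract_beb_sites the_file, redirecting the output to a new file if needed. Unlike the one-liner, this version leaves the input line untouched while extracting the name and does not add a trailing space to rows that get no name.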
edited Nov 21 at 23:13
asked Nov 21 at 23:07
Manuel