How to use sed and regular expressions to find pattern and remove last few characters?
I have a gff file in which I need to remove the -R* suffix from all lines that match the pattern Parent=gopAga1_........-R.;
The file structure is shown below for a single gene, but I need a global fix for all genes in the file.
>2446 17292 . + . ID=gopAga1_00004497-RA;Parent=gopAga1_00004497;Name=gopAga1_00004497-RA;Alias=augustus_masked-scaffold4362-processed-gene-0.0-mRNA-1;_AED=0.12;_QI=0|0.8|0.81|1|1|1|11|1368|404;_eAED=0.12;Note=Similar to PLAT: Tissue-type plasminogen activator (Pongo abelii);
>scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 6721 6834 . + . ID=gopAga1_00004497-RA:exon:4046;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 7241 7415 . + . ID=gopAga1_00004497-RA:exon:4047;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 10114 10205 . + . ID=gopAga1_00004497-RA:exon:4048;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 10478 10649 . + . ID=gopAga1_00004497-RA:exon:4049;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 11037 11122 . + . ID=gopAga1_00004497-RA:exon:4050;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 11518 11713 . + . ID=gopAga1_00004497-RA:exon:4051;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 12794 12930 . + . ID=gopAga1_00004497-RA:exon:4052;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 13006 13146 . + . ID=gopAga1_00004497-RA:exon:4053;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 14920 15047 . + . ID=gopAga1_00004497-RA:exon:4054;Parent=gopAga1_00004497-RA;
>scaffold4362 maker exon 16051 17292 . + . ID=gopAga1_00004497-RA:exon:4055;Parent=gopAga1_00004497-RA;
I am using sed to find patterns, but I am unsure how to use it to remove everything between the last digit on the line and the semicolon.
Will my current script work? The expected output is below.
sed 's/Parent=gopAga1_........-R.;$/Parent=gopAga1........;/ gop.gff
>2446 17292 . + . ID=gopAga1_00004497-RA;Parent=gopAga1_00004497;Name=gopAga1_00004497-RA;Alias=augustus_masked-scaffold4362-processed-gene-0.0-mRNA-1;_AED=0.12;_QI=0|0.8|0.81|1|1|1|11|1368|404;_eAED=0.12;Note=Similar to PLAT: Tissue-type plasminogen activator (Pongo abelii);
>scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497;
>scaffold4362 maker exon 6721 6834 . + . ID=gopAga1_00004497-RA:exon:4046;Parent=gopAga1_00004497;
>scaffold4362 maker exon 7241 7415 . + . ID=gopAga1_00004497-RA:exon:4047;Parent=gopAga1_00004497;
>scaffold4362 maker exon 10114 10205 . + . ID=gopAga1_00004497-RA:exon:4048;Parent=gopAga1_00004497;
>scaffold4362 maker exon 10478 10649 . + . ID=gopAga1_00004497-RA:exon:4049;Parent=gopAga1_00004497;
>scaffold4362 maker exon 11037 11122 . + . ID=gopAga1_00004497-RA:exon:4050;Parent=gopAga1_00004497;
>scaffold4362 maker exon 11518 11713 . + . ID=gopAga1_00004497-RA:exon:4051;Parent=gopAga1_00004497;
>scaffold4362 maker exon 12794 12930 . + . ID=gopAga1_00004497-RA:exon:4052;Parent=gopAga1_00004497;
>scaffold4362 maker exon 13006 13146 . + . ID=gopAga1_00004497-RA:exon:4053;Parent=gopAga1_00004497;
>scaffold4362 maker exon 14920 15047 . + . ID=gopAga1_00004497-RA:exon:4054;Parent=gopAga1_00004497;
>scaffold4362 maker exon 16051 17292 . + . ID=gopAga1_00004497-RA:exon:4055;Parent=gopAga1_00004497;
sed regular-expression
edited Aug 23 at 21:07
asked Aug 23 at 15:45
fargo
You should also provide the expected output.
– Rakesh Sharma
Aug 23 at 17:31
3 Answers
Maybe:
sed 's/-R[^;]*;$/;/'
Or
awk -F ';' -v OFS=';' '{sub(/-R.*/, "", $(NF-1))} 1'
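As a rough check of the two commands above, here is a minimal sketch that runs both on one exon line modelled on the question's data. The function names `strip_with_sed` and `strip_with_awk` are made up for illustration, and the sample line uses spaces where a real GFF file would have tabs.

```shell
#!/bin/sh
# Sample exon line modelled on the question's data
# (spaces stand in for the tab separators of a real GFF file).
line='scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;'

# sed variant: delete "-R" plus any run of non-";" characters,
# but only when that run sits directly before the final ";".
strip_with_sed() {
    sed 's/-R[^;]*;$/;/'
}

# awk variant: split on ";", edit the last non-empty field $(NF-1),
# then print every line (the trailing "1" is an always-true pattern).
strip_with_awk() {
    awk -F ';' -v OFS=';' '{ sub(/-R.*/, "", $(NF-1)) } 1'
}

printf '%s\n' "$line" | strip_with_sed
printf '%s\n' "$line" | strip_with_awk
```

Both calls should print the same line, ending in `Parent=gopAga1_00004497;`, with the `-RA` in the `ID=` attribute left alone because it is not directly before the final semicolon.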
answered Aug 23 at 15:51
Stéphane Chazelas
$ sed '/^>.*Parent=samp1_/s/-R.;$/;/' <file.fa
>ID=samp1_00004497:4045;Parent=samp1_00004497;
>ID=samp1_00004497:4046;Parent=samp1_00004498;
>ID=samp1_00004497:4047;Parent=samp1_00004499;
>ID=samp1_00004497:4048;Parent=samp1_00004496;
The sed command above will find all lines starting with > and containing the string Parent=samp1_, and for each such line replace the final -R.; (where the . matches a single character) with just ;. Lines not ending with anything matching -R.; would remain unaltered.
Change the dot in -R.; to [^;]* if you want to remove any number of non-; characters up to the ; at the end.
For your updated question, use Parent=gopAga1_ in place of Parent=samp1_.
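To illustrate the address-plus-substitution form described above, here is a small sketch using the `Parent=gopAga1_` prefix and the `[^;]*` variant. It assumes the lines do not actually begin with `>`, so the `^>` anchor is left out. The function name `strip_parent_suffix` and the second sample line are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: only on lines containing "Parent=gopAga1_",
# drop a trailing "-R<suffix>" that sits directly before the final ";".
strip_parent_suffix() {
    sed '/Parent=gopAga1_/s/-R[^;]*;$/;/'
}

# First line: has a Parent= attribute, so the substitution applies.
# Second line: no Parent= attribute, so the address skips it entirely.
printf '%s\n' \
    'ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;' \
    'ID=other_00000001-RA;' |
    strip_parent_suffix
```

The first line should come out ending in `Parent=gopAga1_00004497;`, while the second is printed unchanged even though it also ends in `-RA;`, because the address restricts where the `s` command runs.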
edited Aug 23 at 16:24
answered Aug 23 at 15:53
Kusalananda
The file is actually a gff3 file. I am updating the question to reflect the full format. This sed command works, but I specifically need it to also contain Parent=samp1_ as part of the pattern recognition.
– fargo
Aug 23 at 16:08
@fargo See updated answer.
– Kusalananda
Aug 23 at 16:13
This worked after I removed ^> from the script. I do have a second issue, since some lines have multiple parent IDs. For example:
ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500-RB,gopAga1_00004500-RA,gopAga1_00004500;
I would like to edit the file so that it only reports the first parent ID with everything after the -RB removed, such as ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500
I tried using sed '.*Parent=gopAga1_/s/-R.*;$/;/' but was unsuccessful. Anything wrong with the code?
– fargo
Aug 23 at 21:19
I realized that my script is editing the very last instance of the pattern. How can I invoke the first instance?
– fargo
Aug 23 at 21:39
@fargo There is no reason having ^> in the expression would have stopped the sed script from working, unless your lines do not actually start with > as the first character. I don't quite understand what you mean by "report" in your comment. Do you mean you want to delete all but the first parent? If at all possible, your question should be self-contained, and without a string of follow-up questions. Edit the question and include all relevant information and exactly what you're looking to achieve.
– Kusalananda
Aug 24 at 5:55
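For the follow-up raised in these comments (several comma-separated parents, of which only the first should survive, without its -R suffix), one possible sketch is shown here. It rests on two assumptions that the thread does not confirm: each gene ID is the prefix `gopAga1_` followed only by digits, and the closing `;` should be kept. The function name `keep_first_parent` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep "Parent=gopAga1_<digits>" and discard whatever
# follows it up to (but not including) the next ";".
# Assumes the gene ID after the prefix consists of digits only.
keep_first_parent() {
    sed 's/\(Parent=gopAga1_[0-9]*\)[^;]*/\1/'
}

# One line with three parents and one line with a single suffixed parent.
printf '%s\n' \
    'ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500-RB,gopAga1_00004500-RA,gopAga1_00004500;' \
    'ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;' |
    keep_first_parent
```

Because the pattern starts at `Parent=` rather than at `-R`, it acts on the first parent instead of the last one, and it leaves the `ID=` attribute untouched. A line whose `Parent=` value already has no suffix passes through unchanged.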
The command line
sed -re 's/([^-]+).+?;/\1;/g'
will output everything up to the first - on each line (not including the -) and then append a semicolon to the end.
Update
sed -re 's/(_.{8})-R./\1/g'
will remove the unwanted characters based on the presence of _, then 8 characters, then -R and one more character.
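One thing worth checking with the updated pattern is which attributes it touches. The sketch below compares it with a variant anchored at `Parent=gopAga1_`. The anchoring is an addition for illustration, not part of the answer above. The function names are made up, and `-E` is used in place of `-r` on the assumption that the local sed accepts it (GNU and BSD sed both do).

```shell
#!/bin/sh
line='ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;'

# Unanchored form: every "_" + 8 characters + "-R" + 1 character is rewritten,
# so the ID= attribute loses its suffix as well.
strip_everywhere() {
    sed -E 's/(_.{8})-R./\1/g'
}

# Anchored form (an assumption, not from the answer): require the literal
# "Parent=gopAga1_" in front, so only the Parent= attribute is rewritten.
strip_parent_only() {
    sed -E 's/(Parent=gopAga1_.{8})-R./\1/'
}

printf '%s\n' "$line" | strip_everywhere
# prints: ID=gopAga1_00004497:exon:4045;Parent=gopAga1_00004497;
printf '%s\n' "$line" | strip_parent_only
# prints: ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497;
```

The second form matches the expected output in the question, where `ID=` keeps its `-RA` suffix.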
edited Aug 23 at 16:28
answered Aug 23 at 15:56
loa_in_