How to use sed and regular expressions to find pattern and remove last few characters?

I have a GFF file in which I need to remove the trailing -R* suffix (for example -RA) from the Parent attribute on all lines that match the pattern Parent=gopAga1_........-R.;



The file structure is shown below for a single gene, but I need a global fix for all genes in the file.



>2446 17292 . + . ID=gopAga1_00004497-RA;Parent=gopAga1_00004497;Name=gopAga1_00004497-RA;Alias=augustus_masked-scaffold4362-processed-gene-0.0-mRNA-1;_AED=0.12;_QI=0|0.8|0.81|1|1|1|11|1368|404;_eAED=0.12;Note=Similar to PLAT: Tissue-type plasminogen activator (Pongo abelii);

>scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 6721 6834 . + . ID=gopAga1_00004497-RA:exon:4046;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 7241 7415 . + . ID=gopAga1_00004497-RA:exon:4047;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 10114 10205 . + . ID=gopAga1_00004497-RA:exon:4048;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 10478 10649 . + . ID=gopAga1_00004497-RA:exon:4049;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 11037 11122 . + . ID=gopAga1_00004497-RA:exon:4050;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 11518 11713 . + . ID=gopAga1_00004497-RA:exon:4051;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 12794 12930 . + . ID=gopAga1_00004497-RA:exon:4052;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 13006 13146 . + . ID=gopAga1_00004497-RA:exon:4053;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 14920 15047 . + . ID=gopAga1_00004497-RA:exon:4054;Parent=gopAga1_00004497-RA;

>scaffold4362 maker exon 16051 17292 . + . ID=gopAga1_00004497-RA:exon:4055;Parent=gopAga1_00004497-RA;


I am using sed to find patterns, but I am unsure how to use it to remove everything between the last digit on the line and the final semicolon.



Will my current script work? The expected output is below.



sed 's/Parent=gopAga1_........-R.;$/Parent=gopAga1........;/ gop.gff



>2446 17292 . + . ID=gopAga1_00004497-RA;Parent=gopAga1_00004497;Name=gopAga1_00004497-RA;Alias=augustus_masked-scaffold4362-processed-gene-0.0-mRNA-1;_AED=0.12;_QI=0|0.8|0.81|1|1|1|11|1368|404;_eAED=0.12;Note=Similar to PLAT: Tissue-type plasminogen activator (Pongo abelii);

>scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497;

>scaffold4362 maker exon 6721 6834 . + . ID=gopAga1_00004497-RA:exon:4046;Parent=gopAga1_00004497;

>scaffold4362 maker exon 7241 7415 . + . ID=gopAga1_00004497-RA:exon:4047;Parent=gopAga1_00004497;

>scaffold4362 maker exon 10114 10205 . + . ID=gopAga1_00004497-RA:exon:4048;Parent=gopAga1_00004497;

>scaffold4362 maker exon 10478 10649 . + . ID=gopAga1_00004497-RA:exon:4049;Parent=gopAga1_00004497;

>scaffold4362 maker exon 11037 11122 . + . ID=gopAga1_00004497-RA:exon:4050;Parent=gopAga1_00004497;

>scaffold4362 maker exon 11518 11713 . + . ID=gopAga1_00004497-RA:exon:4051;Parent=gopAga1_00004497;

>scaffold4362 maker exon 12794 12930 . + . ID=gopAga1_00004497-RA:exon:4052;Parent=gopAga1_00004497;

>scaffold4362 maker exon 13006 13146 . + . ID=gopAga1_00004497-RA:exon:4053;Parent=gopAga1_00004497;

>scaffold4362 maker exon 14920 15047 . + . ID=gopAga1_00004497-RA:exon:4054;Parent=gopAga1_00004497;

>scaffold4362 maker exon 16051 17292 . + . ID=gopAga1_00004497-RA:exon:4055;Parent=gopAga1_00004497;











  • You should also provide the expected output.
    – Rakesh Sharma
    Aug 23 at 17:31














sed regular-expression






asked Aug 23 at 15:45 by fargo, edited Aug 23 at 21:07

3 Answers
Maybe:

sed 's/-R[^;]*;$/;/'

Or

awk -F ';' -v OFS=';' '{sub(/-R.*/, "", $(NF-1))}; 1'
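
A minimal sketch of how the sed expression above behaves, using two lines adapted from the question (the first one is shortened, and the function name strip_parent_suffix is only illustrative). Because [^;]* cannot cross a semicolon, only a -R suffix that sits directly before the final ; should be removed, so the ID and Name attributes are left alone:

```shell
# Illustrative helper (name is hypothetical): drop "-R<something>" when it
# sits directly before the last ";" of the line.
strip_parent_suffix() {
    sed 's/-R[^;]*;$/;/'
}

# Sample lines adapted from the question; the first one is shortened.
printf '%s\n' \
    'ID=gopAga1_00004497-RA;Parent=gopAga1_00004497;Name=gopAga1_00004497-RA;_AED=0.12;' \
    'scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;' |
    strip_parent_suffix
```

To run it over the whole file, something like strip_parent_suffix < gop.gff > gop.fixed.gff would do (the output file name is an assumption).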





    $ sed '/^>.*Parent=samp1_/s/-R.;$/;/' <file.fa
    >ID=samp1_00004497:4045;Parent=samp1_00004497;
    >ID=samp1_00004497:4046;Parent=samp1_00004498;
    >ID=samp1_00004497:4047;Parent=samp1_00004499;
    >ID=samp1_00004497:4048;Parent=samp1_00004496;


    The sed command above will find all lines starting with > and containing the string Parent=samp1_, and for each such line replace the final -R.; (where the . matches a single character) with just ;. Lines not ending with anything matching -R.; would remain unaltered.



    Change the dot in -R.; to [^;]* if you want to remove any number of non-; characters up to the ; at the end.



    For your updated question, use Parent=gopAga1_ in place of Parent=samp1_.
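
As a hedged illustration of the address form described above, here it is adapted to the gopAga1_ prefix, with [^;]* in place of the dot and without the ^> anchor (on the assumption that the real GFF lines do not begin with >). The function name and the second input line, which uses a made-up other_ prefix, are hypothetical and only show that non-matching lines pass through unchanged:

```shell
# Only lines containing "Parent=gopAga1_" are edited; on those lines the
# "-R<something>" directly before the final ";" is removed.
strip_gop_parent_suffix() {
    sed '/Parent=gopAga1_/s/-R[^;]*;$/;/'
}

printf '%s\n' \
    'scaffold4362 maker exon 6721 6834 . + . ID=gopAga1_00004497-RA:exon:4046;Parent=gopAga1_00004497-RA;' \
    'ID=other_00000001-RA:exon:1;Parent=other_00000001-RA;' |
    strip_gop_parent_suffix
```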






    • The file is actually a gff3 file. I am updating the question to reflect the full format. This sed command works, but I specifically need it to also match Parent=samp1_ as part of the pattern.
      – fargo
      Aug 23 at 16:08











    • @fargo See updated answer.
      – Kusalananda
      Aug 23 at 16:13










    • This worked after I removed ^> from the script. I do have a second issue, since some lines have multiple parent IDs. For example: ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500-RB,gopAga1_00004500-RA,gopAga1_00004500; I would like to edit the file so that it only reports the first parent ID, with everything after the -RB removed, such as ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500 I tried using sed '.*Parent=gopAga1_/s/-R.*;$/;/' but was unsuccessful. Is anything wrong with the code?
      – fargo
      Aug 23 at 21:19










    • I realized that my script is editing the very last instance of the pattern. How can I invoke the first instance?
      – fargo
      Aug 23 at 21:39










    • @fargo There is no reason having ^> in the expression would have stopped the sed script from working, unless your lines do not actually start with > as the first character. I don't quite understand what you mean by "report" in your comment. Do you mean you want to delete all but the first parent? If at all possible, your question should be self-contained, without a string of follow-up questions. Edit the question and include all relevant information and exactly what you're looking to achieve.
      – Kusalananda
      Aug 24 at 5:55
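
For the multiple-parent case raised in the comments, one possible sketch is to capture the first parent ID and discard everything else up to the final semicolon. This assumes that gene IDs consist of gopAga1_ followed by digits only, that the Parent attribute is the last one on the line, and that the trailing ; should be kept; the function name is hypothetical:

```shell
# Keep "Parent=gopAga1_<digits>" and drop the rest of the attribute
# (suffix and any further comma-separated parents) before the final ";".
keep_first_parent() {
    sed 's/\(Parent=gopAga1_[0-9]*\)[^;]*;$/\1;/'
}

printf '%s\n' \
    'ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500-RB,gopAga1_00004500-RA,gopAga1_00004500;' |
    keep_first_parent
```

Single-parent lines such as Parent=gopAga1_00004497-RA; go through the same expression and lose only the -RA suffix.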

















    The command line

    sed -re 's/([^-]+).+?;/\1;/g'

    will output everything up to the first - on each line (not including the -) and then append a semicolon.

    Update

    sed -re 's/(_.{8})-R./\1/g'

    will remove the unwanted characters based on the presence of _, then 8 characters, then -R and one more character.
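
The same capture-group idea can be used to repair the attempt from the question. In a sed replacement a dot is a literal dot, so the eight matched characters have to be captured and re-emitted with \1. The sketch below also anchors on Parent= and on the end of the line, so the -RA inside the ID attribute should be left untouched. It assumes a sed that accepts -E (GNU and BSD sed do); the function name is hypothetical:

```shell
# Match "Parent=gopAga1_" + 8 characters + "-R" + one character + ";" at the
# end of the line, and put back everything except the "-R?" part.
strip_suffix_after_8() {
    sed -E 's/(Parent=gopAga1_.{8})-R.;$/\1;/'
}

printf '%s\n' \
    'scaffold4362 maker exon 7241 7415 . + . ID=gopAga1_00004497-RA:exon:4047;Parent=gopAga1_00004497-RA;' |
    strip_suffix_after_8
```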






First answer (sed 's/-R[^;]*;$/;/' or awk): answered Aug 23 at 15:51 by Stéphane Chazelas

Second answer (sed with the /^>.*Parent=samp1_/ address): answered Aug 23 at 15:53 by Kusalananda, edited Aug 23 at 16:24

              107k14209329




              107k14209329











              • The file is actually a gff3 file. I am updating the question to reflect the full format. This sed command works I specifically need it to also contain Parent=samp1_ as part of the pattern recognition.
                – fargo
                Aug 23 at 16:08











              • @fargo See updated answer.
                – Kusalananda
                Aug 23 at 16:13










              • This worked after I removed ^> from the script. I do have a second issue since some lines have multiple parent IDs. For example. ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500-RB,gopAga1_00004500-RA,gopAga1_00004500; I would like to edit the file so that it only reports the first parent ID with everthing after the -RB removed, such as ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500 I tried using sed '.*Parent=gopAga1_/s/-R.*;$/;/' but was unsuccesful. Anything wrong with the code?
                – fargo
                Aug 23 at 21:19










              • I realized that my script is editing the very last instance of the pattern. How can I invoke the first instance?
                – fargo
                Aug 23 at 21:39










              • @fargo There is no reason having ^> in the expression would have stopped the sed script from working, unless your lines does not actually start with > as the first character. I don't quite understand what you mean by "report" in your comment. Do you mean you want to delete all but the first parent? If at all possible, your question should be self-contained, and without a string of follow-up questions. Edit the question and include all relevant information and exactly what you're looking to achieve.
                – Kusalananda
                Aug 24 at 5:55
















              • The file is actually a gff3 file. I am updating the question to reflect the full format. This sed command works I specifically need it to also contain Parent=samp1_ as part of the pattern recognition.
                – fargo
                Aug 23 at 16:08











              • @fargo See updated answer.
                – Kusalananda
                Aug 23 at 16:13










              • This worked after I removed ^> from the script. I do have a second issue since some lines have multiple parent IDs. For example. ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500-RB,gopAga1_00004500-RA,gopAga1_00004500; I would like to edit the file so that it only reports the first parent ID with everthing after the -RB removed, such as ID=gopAga1_00004500-RB:exon:4085;Parent=gopAga1_00004500 I tried using sed '.*Parent=gopAga1_/s/-R.*;$/;/' but was unsuccesful. Anything wrong with the code?
                – fargo
                Aug 23 at 21:19










              • I realized that my script is editing the very last instance of the pattern. How can I invoke the first instance?
                – fargo
                Aug 23 at 21:39










              • @fargo There is no reason having ^> in the expression would have stopped the sed script from working, unless your lines does not actually start with > as the first character. I don't quite understand what you mean by "report" in your comment. Do you mean you want to delete all but the first parent? If at all possible, your question should be self-contained, and without a string of follow-up questions. Edit the question and include all relevant information and exactly what you're looking to achieve.
                – Kusalananda
                Aug 24 at 5:55






































              up vote
              1
              down vote













              The command line



              sed -re 's/([^-]+).+?;/\1;/g'


              will output everything on each line up to, but not including, the first -, and then append a semicolon.
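To see what that does on a full GFF line, here is a minimal sketch run against one exon line from the question. The function name truncate_at_hyphen is made up for illustration, and -E is used as the portable spelling of GNU sed's -r. POSIX extended regular expressions have no lazy quantifiers, so the sketch leaves out the ? after .+ to make the greedy matching explicit.

```shell
#!/bin/sh
# Hypothetical helper: keep everything before the first "-" and append ";".
# The greedy .+ swallows the rest of the line up to the last ";".
truncate_at_hyphen() {
  sed -E 's/([^-]+).+;/\1;/'
}

# Sample exon line taken from the question (columns separated by spaces here).
sample='scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;'

truncated=$(printf '%s\n' "$sample" | truncate_at_hyphen)
printf '%s\n' "$truncated"
```

On this line the first hyphen sits in the ID= attribute, so the Parent= attribute is dropped along with the suffix. That is probably not what the question wants for complete GFF records.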



              Update



              sed -re 's/(_.{8})-R./\1/g'


              will remove the unwanted characters by matching _, then any 8 characters, then -R and one more character.
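Because an underscore followed by eight characters and -R. also occurs in the ID= and Name= attributes, a pattern anchored on Parent= may be closer to what the question asks for. The following is only a sketch under assumptions not confirmed by the asker: IDs are always gopAga1_ followed by eight digits, and the suffix is -R plus one capital letter. The function name strip_parent_suffix is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove the -R? suffix only inside Parent= attributes.
# -E selects extended regular expressions (same meaning as -r in GNU sed).
strip_parent_suffix() {
  sed -E 's/(Parent=gopAga1_[0-9]{8})-R[A-Z]/\1/'
}

# Sample exon line taken from the question (columns separated by spaces here).
sample='scaffold4362 maker exon 2446 2545 . + . ID=gopAga1_00004497-RA:exon:4045;Parent=gopAga1_00004497-RA;'

result=$(printf '%s\n' "$sample" | strip_parent_suffix)
printf '%s\n' "$result"
# The ID= attribute keeps its -RA suffix; only Parent= is shortened.
```

To process a whole file, the same expression can be given to sed directly with the file name as an argument. Lines without a matching Parent= attribute pass through unchanged.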














                  edited Aug 23 at 16:28

























                  answered Aug 23 at 15:56









                  loa_in_




























                       
