extracting information from a column [closed]

I have a file which looks like this:



chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";


I want to extract the gene_id and gene_name values along with the first 8 columns (the file is tab-separated). I have written a Perl script that does this, but I am looking for a one-liner in awk, sed, etc.



PS: The file is tab-separated and has 9 columns. The 9th column contains values that are separated by spaces.



My output should look like this:



chr1 HAVANA exon 12613 12721 . + . ENSG00000223972.5 DDX11L1
chr1 HAVANA exon 13221 14409 . + . ENSG00000223972.5 DDX11L1






text-processing awk sed bioinformatics






edited Sep 10 at 22:45
Jeff Schaller

asked Sep 10 at 22:11
user3138373




closed as unclear what you're asking by DarkHeart, RalfFriedl, Shadur, Kiwy, αғsнιη Sep 12 at 15:57


Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it’s hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.






  • Why can't you just print columns 1-8, 10 and 16 with awk?
    – don_crissti
    Sep 10 at 22:18










  • Meaning using both space and tab as the field separator?
    – user3138373
    Sep 10 at 22:20










  • I updated the post. Sorry for the confusion
    – user3138373
    Sep 10 at 22:23










  • OK, so if none of those keys and values in your file contains spaces then you can do as I said above, using the default field separator.
    – don_crissti
    Sep 10 at 22:25
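
A quick way to check the field numbers mentioned in the comments above is to list each field of a line next to its index. This is only a sketch: show_fields is a hypothetical helper name, and it relies on awk's default whitespace splitting, so the numbering shifts if any key or value contains a space.

```shell
# Hypothetical helper: list the whitespace-separated fields of the first
# input line together with their awk field numbers.
show_fields() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i }' "$@"
}
```

Run as `show_fields file`. On the sample data, field 10 is the quoted gene_id value and field 16 the quoted gene_name value, each still carrying its trailing semicolon.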


















3 Answers
accepted

awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename > output

Without quotes and semicolons:

awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename | sed -e 's/;//g; s/"//g' > output

More accurate, using only awk:

awk '{ gsub(/[";]/, "", $10); gsub(/[";]/, "", $16); print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename > output






edited Sep 11 at 0:52
answered Sep 11 at 0:43
elig

Perl one-liner. It can be golfed a bit shorter, but I think this is pretty clear.

perl -F'\t' -lane '
    if (($id, $name) = / \b gene_id \s+ " ([^"]+) .+ \b gene_name \s+ " ([^"]+)/x) {
        print join "\t", @F[0..7], $id, $name;
    }
' file

A little more "clever":

perl -F'\t' -E '$,="\t"; say @F[0..7], $g{id}, $g{name} if %g = /\bgene_(id|name)\s+"([^"]+)/g' file
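
For readers without Perl, the same match-by-key idea can be sketched in POSIX awk. The names extract_by_key and attr are made up for illustration, and the sketch assumes that the values of interest are double-quoted and non-empty, as in the sample data.

```shell
# Hypothetical helper: print columns 1-8 plus the gene_id and gene_name
# values, located by key name rather than by position.
extract_by_key() {
    awk -F '\t' -v OFS='\t' '
        # Return the double-quoted value that follows key in attrs, or "" if absent.
        function attr(attrs, key,    s) {
            s = "; " attrs                      # so every key is preceded by "; "
            if (!match(s, "; *" key " +\"[^\"]*\""))
                return ""
            s = substr(s, RSTART, RLENGTH)      # keep only the matched key and value
            sub(/^[^"]*"/, "", s)               # drop text up to the opening quote
            sub(/"$/, "", s)                    # drop the closing quote
            return s
        }
        {
            id = attr($9, "gene_id")
            name = attr($9, "gene_name")
            # Skip lines that lack either value.
            if (id != "" && name != "")
                print $1, $2, $3, $4, $5, $6, $7, $8, id, name
        }
    ' "$@"
}
```

Run as `extract_by_key file > output`. Because each key is looked up separately, the order of gene_id and gene_name within the 9th column does not matter here.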






edited Sep 10 at 23:12
answered Sep 10 at 22:56
glenn jackman

The following awk script assumes that the 9th column could have data in any order.

The code splits the column on ; followed by an optional space. It then iterates over the resulting elements and splits each of them on spaces into a key-value pair. If the key (the thing to the left of the space) is either of the two strings gene_id or gene_name, the value for this key is remembered. The parsing of the 9th column ends when both strings have been found, after which the column is rewritten and the modified line is printed.

The code also discards any input line that does not contain both gene_id and gene_name.



BEGIN {
    FS = OFS = "\t"
}

{
    n = split($9, a, "; ?")

    found = 0
    for (i = 1; i <= n; ++i) {
        if (split(a[i], b, " ") == 2) {
            if (b[1] == "gene_id") {
                gene_id = b[2]
                ++found
            } else if (b[1] == "gene_name") {
                gene_name = b[2]
                ++found
            }
        }
        if (found == 2) break
    }

    if (found == 2) {
        $9 = gene_id " " gene_name
        print
    }
}

Testing on the data provided:

$ awk -f script.awk <file
chr1 HAVANA exon 12613 12721 . + . "ENSG00000223972.5" "DDX11L1"
chr1 HAVANA exon 13221 14409 . + . "ENSG00000223972.5" "DDX11L1"


To remove the double quotes from the values, change

if (found == 2) {
    $9 = gene_id " " gene_name
    print
}

into

if (found == 2) {
    gsub("\"", "", gene_id)
    gsub("\"", "", gene_name)
    $9 = gene_id " " gene_name
    print
}

which removes all double quotes from the gene name and ID, or,

if (found == 2) {
    gene_id = substr(gene_id, 2, length(gene_id) - 2)
    gene_name = substr(gene_name, 2, length(gene_name) - 2)
    $9 = gene_id " " gene_name
    print
}

which removes the first and last characters from the two values.
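
Putting the pieces of this answer together, the script with the gsub() variant can be wrapped in a shell function. This is a sketch rather than the answer's exact code: the name extract_gene_fields is hypothetical, and joining the two values with OFS (a tab) instead of a space is an assumption made to match the tab-separated output requested in the question.

```shell
# Hypothetical wrapper around the script above, with the quote-stripping
# variant applied.
extract_gene_fields() {
    awk '
        BEGIN { FS = OFS = "\t" }
        {
            # Split column 9 into "key value" items on ";" plus optional space.
            n = split($9, a, "; ?")
            found = 0
            for (i = 1; i <= n; ++i) {
                if (split(a[i], b, " ") == 2) {
                    if (b[1] == "gene_id") { gene_id = b[2]; ++found }
                    else if (b[1] == "gene_name") { gene_name = b[2]; ++found }
                }
                if (found == 2) break
            }
            # Print only lines that carry both keys.
            if (found == 2) {
                gsub(/"/, "", gene_id)
                gsub(/"/, "", gene_name)
                $9 = gene_id OFS gene_name   # assumption: tab between the two values
                print
            }
        }
    ' "$@"
}
```

Run as `extract_gene_fields file > output`. Lines that lack either key produce no output, as in the original script.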







edited Sep 12 at 15:34
answered Sep 12 at 13:52
Kusalananda











