extracting information from a column [closed]
I have a file which looks like this:
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
I want to extract the gene_id and gene_name values along with the first 8 columns. I have written a Perl script that does this, but I am looking for a one-liner in awk, sed, etc.
PS. The file is tab-separated and has 9 columns. The 9th column contains values that are separated by spaces.
My output should look like this:
chr1 HAVANA exon 12613 12721 . + . ENSG00000223972.5 DDX11L1
chr1 HAVANA exon 13221 14409 . + . ENSG00000223972.5 DDX11L1
text-processing awk sed bioinformatics
asked Sep 10 at 22:11 by user3138373
edited Sep 10 at 22:45 by Jeff Schaller
closed as unclear what you're asking by DarkHeart, RalfFriedl, Shadur, Kiwy, ñÃÂsýù÷ Sep 12 at 15:57
Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
Why can't you just print columns 1-8, 10 and 16 with awk? – don_crissti Sep 10 at 22:18
Meaning using both space and tab as field separators? – user3138373 Sep 10 at 22:20
I updated the post. Sorry for the confusion. – user3138373 Sep 10 at 22:23
OK, so if none of those keys and values in your file contains spaces then you can do as I said above, using the default field separator. – don_crissti Sep 10 at 22:25
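The difference between the two separators discussed in these comments can be seen by counting fields both ways. The following is only a sketch: the helper name `count_fields` and the shortened sample line are made up for illustration and are not part of the original post.

```shell
# Hypothetical helper: for each input line, print how many fields awk sees
# when splitting on tabs only versus with the default field separator.
count_fields() {
    awk '{
        whitespace_fields = NF            # default splitting: runs of blanks/tabs
        tab_fields = split($0, t, "\t")   # explicit split on tab only
        print "tab: " tab_fields ", default: " whitespace_fields
    }' "$@"
}

# Sample line shortened from the question (first 8 columns are tab-separated).
printf 'chr1\tHAVANA\texon\t12613\t12721\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1";\n' |
    count_fields
```

For this shortened line the sketch prints `tab: 9, default: 12`: with the default separator every key and every value in the 9th column becomes a field of its own, which is what makes addressing them by number possible.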
3 Answers
1 vote, accepted
Using awk's default whitespace splitting, the gene_id value is field 10 and the gene_name value is field 16:
awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename > output
Without quotes and semicolons:
awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename | sed -e 's/;//g; s/"//g' > output
The same using only awk:
awk '{ gsub(/[";]/, "", $10); gsub(/[";]/, "", $16); print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename > output
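If you are unsure whether the values really sit in fields 10 and 16 in your own file, a quick check is to print the field number of each key. This is only a sketch: the helper name `show_key_fields` and the shortened sample line are assumptions made for illustration.

```shell
# Hypothetical helper: for each input line, report the whitespace-separated
# field numbers at which the gene_id and gene_name keys appear.
show_key_fields() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "gene_id" || $i == "gene_name")
                printf "%s is field %d, its value is field %d\n", $i, i, i + 1
    }' "$@"
}

# Sample line shortened from the question (first 8 columns are tab-separated).
printf 'chr1\tHAVANA\texon\t12613\t12721\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";\n' |
    show_key_fields
```

On the sample line this reports the gene_id value in field 10 and the gene_name value in field 16. If any line of a real file reports other numbers, the positional one-liners above will pick up the wrong fields for that line.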
answered Sep 11 at 0:43 by elig (edited Sep 11 at 0:52)
1 vote
Perl one-liner. It can be golfed a bit shorter, but I think this is pretty clear.
perl -F'\t' -lane '
    if (($id, $name) = / \b gene_id \s+ " ([^"]+) .+ \b gene_name \s+ " ([^"]+)/x) {
        print join "\t", @F[0..7], $id, $name;
    }
' file
A little more "clever":
perl -F'\t' -E '$,="\t"; say @F[0..7], $g{id}, $g{name} if %g = /\bgene_(id|name)\s+"([^"]+)/g' file
answered Sep 10 at 22:56 by glenn jackman (edited Sep 10 at 23:12)
0 votes
The following awk script assumes that the 9th column could have data in any order.
The code will split the column on ; followed by an optional space. It will then iterate over the resulting elements and split these on spaces into a key-value pair. If the key (the thing to the left of the space) is either of the two strings gene_id or gene_name, the value for this key is remembered. The parsing of the 9th column ends when we have found our two strings, after which the column is rewritten and the modified line is printed.
The code also discards any input that does not contain both gene_id and gene_name.
BEGIN { FS = OFS = "\t" }
{
    n = split($9, a, "; ?")
    found = 0
    for (i = 1; i <= n; ++i) {
        if (split(a[i], b, " ") == 2) {
            if (b[1] == "gene_id") {
                gene_id = b[2]
                ++found
            } else if (b[1] == "gene_name") {
                gene_name = b[2]
                ++found
            }
        }
        if (found == 2) break
    }
    if (found == 2) {
        $9 = gene_id " " gene_name
        print
    }
}
Testing on the data provided:
$ awk -f script.awk <file
chr1 HAVANA exon 12613 12721 . + . "ENSG00000223972.5" "DDX11L1"
chr1 HAVANA exon 13221 14409 . + . "ENSG00000223972.5" "DDX11L1"
To remove the double quotes from the values, change
    if (found == 2) {
        $9 = gene_id " " gene_name
        print
    }
into
    if (found == 2) {
        gsub("\"", "", gene_id)
        gsub("\"", "", gene_name)
        $9 = gene_id " " gene_name
        print
    }
which removes all double quotes from the gene name and ID, or,
    if (found == 2) {
        gene_id = substr(gene_id, 2, length(gene_id) - 2)
        gene_name = substr(gene_name, 2, length(gene_name) - 2)
        $9 = gene_id " " gene_name
        print
    }
which removes the first and last characters from the two values.
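As a quick way to try the script without creating script.awk, the sketch below wraps it (with the gsub() quote-stripping variant) in a shell function and feeds it a made-up line in which gene_name comes before gene_id. The function name `extract_gene_cols` and the sample line are assumptions for illustration; they are not part of the original data.

```shell
# Hypothetical wrapper around the awk script above, quote-stripping variant.
extract_gene_cols() {
    awk '
    BEGIN { FS = OFS = "\t" }
    {
        n = split($9, a, "; ?")       # one element per "key value" pair
        found = 0
        for (i = 1; i <= n; ++i) {
            if (split(a[i], b, " ") == 2) {
                if (b[1] == "gene_id")        { gene_id = b[2];   ++found }
                else if (b[1] == "gene_name") { gene_name = b[2]; ++found }
            }
            if (found == 2) break
        }
        if (found == 2) {             # lines missing either key are dropped
            gsub("\"", "", gene_id)
            gsub("\"", "", gene_name)
            $9 = gene_id " " gene_name
            print
        }
    }' "$@"
}

# Made-up line: the 9th column lists gene_name first and gene_id last.
printf 'chr1\tHAVANA\texon\t12613\t12721\t.\t+\t.\tgene_name "DDX11L1"; level 2; gene_id "ENSG00000223972.5";\n' |
    extract_gene_cols
```

The output still has the ID before the name in the last column (`ENSG00000223972.5 DDX11L1`), because the values are looked up by key rather than by position.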
answered Sep 12 at 13:52 by Kusalananda (edited Sep 12 at 15:34)