Extracting information from a column [closed]

I have a file which looks like this:

chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";

I want to extract the gene_id and gene_name values along with the first 8 columns. I have written a Perl script that does this, but I am looking for a one-liner in awk, sed, etc.

PS. The file is tab separated and has 9 columns. The 9th column holds key-value pairs that are separated by spaces.

My output should look like this:

chr1 HAVANA exon 12613 12721 . + . ENSG00000223972.5 DDX11L1
chr1 HAVANA exon 13221 14409 . + . ENSG00000223972.5 DDX11L1

asked Sep 10 at 22:11 by user3138373; edited Sep 10 at 22:45 by Jeff Schaller

closed as unclear what you're asking by DarkHeart, RalfFriedl, Shadur, Kiwy, αғsнιη Sep 12 at 15:57

Why can't you just print columns 1-8, 10 and 16 with awk?
– don_crissti, Sep 10 at 22:18

Meaning, use both space and tab as field separators?
– user3138373, Sep 10 at 22:20

I updated the post. Sorry for the confusion.
– user3138373, Sep 10 at 22:23

OK, so if none of those keys and values in your file contain spaces, then you can do as I said above, using the default field separator.
– don_crissti, Sep 10 at 22:25
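
To see why fields 10 and 16 are the ones to print, it can help to let awk number the fields of one line. The following is only a minimal sketch: the record is a shortened, hypothetical version of the sample data (just the first four attributes), and the helper names `sample_line` and `show_fields` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical shortened record: 8 tab-separated columns, then the attribute column.
sample_line() {
    printf 'chr1\tHAVANA\texon\t12613\t12721\t.\t+\t.\t'
    printf 'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; '
    printf 'gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";\n'
}

# With the default field separator, awk splits on runs of spaces and tabs alike,
# so every key and every value in the 9th column becomes a field of its own.
show_fields() {
    awk '{ for (i = 9; i <= NF; i++) print i ": " $i }'
}

sample_line | show_fields
```

Field 10 is the quoted gene_id value and field 16 the quoted gene_name value, each still carrying its quotes and trailing semicolon. This only holds while the attributes keep this order and no value contains a space.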

text-processing awk sed bioinformatics

3 Answers

accepted

awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename > output

Without quotes and semicolons:

awk '{ print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename | sed -e 's/;//g; s/"//g' > output

The same using only awk, stripping the quotes and semicolons from just the two extracted fields:

awk '{ gsub(/[";]/, "", $10); gsub(/[";]/, "", $16); print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $10 "\t" $16 }' filename > output

answered Sep 11 at 0:43 by elig (edited Sep 11 at 0:52)

Perl one-liner. It can be golfed a bit shorter, but I think this is pretty clear.

perl -F'\t' -lane '
    if (($id, $name) = / \b gene_id \s+ " ([^"]+) .+ \b gene_name \s+ " ([^"]+)/x) {
        print join "\t", @F[0..7], $id, $name;
    }
' file

A little more "clever":

perl -F'\t' -anE '$,="\t"; say @F[0..7], $g{id}, $g{name} if %g = /\bgene_(id|name)\s+"([^"]+)/g' file

answered Sep 10 at 22:56 by glenn jackman (edited Sep 10 at 23:12)

The following awk script assumes that the 9th column could have data in any order.

The code splits the column on ; followed by an optional space. It then iterates over the resulting elements and splits each of them on spaces into a key-value pair. If the key (the part to the left of the space) is either gene_id or gene_name, the value for that key is remembered. Parsing of the 9th column ends as soon as both keys have been found, after which the column is rewritten and the modified line is printed.

The code also discards any input that does not contain both gene_id and gene_name.

BEGIN {
    FS = OFS = "\t"
}

{
    n = split($9, a, "; ?")

    found = 0
    for (i = 1; i <= n; ++i) {
        if (split(a[i], b, " ") == 2) {
            if (b[1] == "gene_id") {
                gene_id = b[2]
                ++found
            } else if (b[1] == "gene_name") {
                gene_name = b[2]
                ++found
            }
        }

        if (found == 2) break
    }

    if (found == 2) {
        $9 = gene_id " " gene_name
        print
    }
}

Testing on the data provided:

$ awk -f script.awk <file
chr1 HAVANA exon 12613 12721 . + . "ENSG00000223972.5" "DDX11L1"
chr1 HAVANA exon 13221 14409 . + . "ENSG00000223972.5" "DDX11L1"

To remove the double quotes from the values, change

if (found == 2) {
    $9 = gene_id " " gene_name
    print
}

into

if (found == 2) {
    gsub("\"", "", gene_id)
    gsub("\"", "", gene_name)
    $9 = gene_id " " gene_name
    print
}

which removes all double quotes from the gene name and ID, or,

if (found == 2) {
    gene_id = substr(gene_id, 2, length(gene_id) - 2)
    gene_name = substr(gene_name, 2, length(gene_name) - 2)
    $9 = gene_id " " gene_name
    print
}

which removes the first and last characters from the two values.
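
The "any order" claim can be checked with a record whose attributes are shuffled. What follows is a rough sketch rather than the script from this answer: it reuses the same split-and-scan idea inline, strips the quotes as it goes, and joins the two values with a tab (an assumption about the desired output, since the script above uses a space). The function name `extract_gene_fields` and the test record are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: the same split-and-scan idea, written inline.
extract_gene_fields() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        gene_id = gene_name = ""
        n = split($9, a, "; ?")
        for (i = 1; i <= n; ++i) {
            if (split(a[i], b, " ") == 2) {
                gsub("\"", "", b[2])    # strip the surrounding quotes
                if (b[1] == "gene_id") gene_id = b[2]
                else if (b[1] == "gene_name") gene_name = b[2]
            }
        }
        # Records lacking either key are discarded.
        if (gene_id != "" && gene_name != "") {
            $9 = gene_id OFS gene_name
            print
        }
    }'
}

# Made-up record with gene_name listed before gene_id.
printf 'chr1\tHAVANA\texon\t12613\t12721\t.\t+\t.\tgene_name "DDX11L1"; level 2; gene_id "ENSG00000223972.5";\n' |
    extract_gene_fields
```

The positional approach (printing fields 10 and 16) would pick the wrong values for such a record, whereas the key lookup still finds both.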

answered Sep 12 at 13:52 by Kusalananda (edited Sep 12 at 15:34)
