Extracting names from file_a using information from 2 columns in file

Extracting names from file_a using information from 2 columns in file_b

Building on: Extract names from File_B having overlapping intervals with File_A

I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065)

There are a few problems with the code I have now.

(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350) is brought up the list. I would like it to work the way it is set up.

(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.

(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.

file_a.tsv

MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1

file_b.tsv

MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350

code:

while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv

output.tsv

mt
si
zgc
bnip4

Desired_output

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

edited Jan 21 at 6:13

asked Jan 21 at 1:08

Age87

1607

You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

– Niko Gambt
Jan 21 at 5:13

add a comment |

Building on: Extract names from File_B having overlapping intervals with File_A

There are a few problems with the code I have now.

(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.

(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.

file_a.tsv

MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1

file_b.tsv

MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350

code:

while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv

output.tsv

mt
si
zgc
bnip4

Desired_output

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

edited Jan 21 at 6:13

asked Jan 21 at 1:08

Age87

1607

You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

– Niko Gambt
Jan 21 at 5:13

add a comment |

Building on: Extract names from File_B having overlapping intervals with File_A

There are a few problems with the code I have now.

(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.

(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.

file_a.tsv

MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1

file_b.tsv

MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350

code:

while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv

output.tsv

mt
si
zgc
bnip4

Desired_output

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

edited Jan 21 at 6:13

asked Jan 21 at 1:08

Age87

1607

Building on: Extract names from File_B having overlapping intervals with File_A

There are a few problems with the code I have now.

(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.

(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.

file_a.tsv

MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1

file_b.tsv

MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350

code:

while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv

output.tsv

mt
si
zgc
bnip4

Desired_output

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

bash shell-script awk sed

edited Jan 21 at 6:13

asked Jan 21 at 1:08

Age87

1607

edited Jan 21 at 6:13

asked Jan 21 at 1:08

Age87

1607

edited Jan 21 at 6:13

asked Jan 21 at 1:08

Age87

1607

asked Jan 21 at 1:08

Age87

1607

asked Jan 21 at 1:08

Age87

1607

You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

– Niko Gambt
Jan 21 at 5:13

add a comment |

You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

– Niko Gambt
Jan 21 at 5:13

You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

– Niko Gambt
Jan 21 at 5:13

add a comment |

2 Answers
2

active

oldest

votes

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) 
 # you can specify the "proximity" as required
 return val > lower - proximity && val < upper + proximity


BEGIN 
 OFS="t"


$1 == id && within_range(pos, $4, $5, 100) 
 name = gensub(/.*Name=([^t]*).*/, "\1", 1)
 if (name ~ /[^[:space:]]+/)
 print id, pos, name

Then run

while read -r id pos
do
 awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.

edited Jan 21 at 7:42

answered Jan 21 at 3:21

Niko Gambt

1836

I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50

I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19

Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40

Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40

add a comment |

Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.

awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv

edited Jan 21 at 6:16

Age87

1607

answered Jan 21 at 1:18

RavinderSingh13

1336

@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35

add a comment |

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f495683%2fextracting-names-from-file-a-using-information-from-2-columns-in-file-b%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) 
 # you can specify the "proximity" as required
 return val > lower - proximity && val < upper + proximity


BEGIN 
 OFS="t"


$1 == id && within_range(pos, $4, $5, 100) 
 name = gensub(/.*Name=([^t]*).*/, "\1", 1)
 if (name ~ /[^[:space:]]+/)
 print id, pos, name

Then run

while read -r id pos
do
 awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.

edited Jan 21 at 7:42

answered Jan 21 at 3:21

Niko Gambt

1836

I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50

I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19

Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40

Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40

add a comment |

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) 
 # you can specify the "proximity" as required
 return val > lower - proximity && val < upper + proximity


BEGIN 
 OFS="t"


$1 == id && within_range(pos, $4, $5, 100) 
 name = gensub(/.*Name=([^t]*).*/, "\1", 1)
 if (name ~ /[^[:space:]]+/)
 print id, pos, name

Then run

while read -r id pos
do
 awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.

edited Jan 21 at 7:42

answered Jan 21 at 3:21

Niko Gambt

1836

I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50

I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19

Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40

Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40

add a comment |

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) 
 # you can specify the "proximity" as required
 return val > lower - proximity && val < upper + proximity


BEGIN 
 OFS="t"


$1 == id && within_range(pos, $4, $5, 100) 
 name = gensub(/.*Name=([^t]*).*/, "\1", 1)
 if (name ~ /[^[:space:]]+/)
 print id, pos, name

Then run

while read -r id pos
do
 awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.

edited Jan 21 at 7:42

answered Jan 21 at 3:21

Niko Gambt

1836

# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity) 
 # you can specify the "proximity" as required
 return val > lower - proximity && val < upper + proximity


BEGIN 
 OFS="t"


$1 == id && within_range(pos, $4, $5, 100) 
 name = gensub(/.*Name=([^t]*).*/, "\1", 1)
 if (name ~ /[^[:space:]]+/)
 print id, pos, name

Then run

while read -r id pos
do
 awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv

Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:

MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4

For the ID MT, the gene hit should be mt-nd2 not mt-nd1.

I still recommend using Python for data processing.

edited Jan 21 at 7:42

answered Jan 21 at 3:21

Niko Gambt

1836

edited Jan 21 at 7:42

answered Jan 21 at 3:21

Niko Gambt

1836

answered Jan 21 at 3:21

Niko Gambt

1836

answered Jan 21 at 3:21

Niko Gambt

1836

I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50

I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19

Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40

Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40

add a comment |

I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50

I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19

Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40

Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40

I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50

I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19

Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40

Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40

add a comment |

Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.

awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv

edited Jan 21 at 6:16

Age87

1607

answered Jan 21 at 1:18

RavinderSingh13

1336

@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35

add a comment |

Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.

awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv

edited Jan 21 at 6:16

Age87

1607

answered Jan 21 at 1:18

RavinderSingh13

1336

@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35

add a comment |

Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.

awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv

edited Jan 21 at 6:16

Age87

1607

answered Jan 21 at 1:18

RavinderSingh13

1336

Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.

awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv

edited Jan 21 at 6:16

Age87

1607

answered Jan 21 at 1:18

RavinderSingh13

1336

edited Jan 21 at 6:16

Age87

1607

edited Jan 21 at 6:16

Age87

1607

edited Jan 21 at 6:16

Age87

1607

answered Jan 21 at 1:18

RavinderSingh13

1336

answered Jan 21 at 1:18

RavinderSingh13

1336

answered Jan 21 at 1:18

RavinderSingh13

1336

@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35

add a comment |

@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35

@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Unix & Linux Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

搜尋此網誌

mjhjmtu