Extracting names from file_a using information from 2 columns in file_b

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP












1















Building on: Extract names from File_B having overlapping intervals with File_A



I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065)



There are a few problems with the code I have now.



(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350) is brought up the list. I would like it to work the way it is set up.



(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.



(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.



file_a.tsv



MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1


file_b.tsv



MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350


code:



while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv


output.tsv



mt
si
zgc
bnip4


Desired_output



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4









share|improve this question
























  • You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

    – Niko Gambt
    Jan 21 at 5:13















1















Building on: Extract names from File_B having overlapping intervals with File_A



I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065)



There are a few problems with the code I have now.



(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350) is brought up the list. I would like it to work the way it is set up.



(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.



(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.



file_a.tsv



MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1


file_b.tsv



MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350


code:



while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv


output.tsv



mt
si
zgc
bnip4


Desired_output



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4









share|improve this question
























  • You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

    – Niko Gambt
    Jan 21 at 5:13













1












1








1








Building on: Extract names from File_B having overlapping intervals with File_A



I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065)



There are a few problems with the code I have now.



(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350) is brought up the list. I would like it to work the way it is set up.



(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.



(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.



file_a.tsv



MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1


file_b.tsv



MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350


code:



while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv


output.tsv



mt
si
zgc
bnip4


Desired_output



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4









share|improve this question
















Building on: Extract names from File_B having overlapping intervals with File_A



I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065)



There are a few problems with the code I have now.



(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350) is brought up the list. I would like it to work the way it is set up.



(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.



(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.



file_a.tsv



MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1


file_b.tsv



MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350


code:



while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv


output.tsv



mt
si
zgc
bnip4


Desired_output



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4






bash shell-script awk sed






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 21 at 6:13







Age87

















asked Jan 21 at 1:08









Age87Age87

1607




1607












  • You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

    – Niko Gambt
    Jan 21 at 5:13

















  • You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

    – Niko Gambt
    Jan 21 at 5:13
















You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

– Niko Gambt
Jan 21 at 5:13





You need to insert t between fields in your output.tsv file. Otherwise, it's not a tsv file.

– Niko Gambt
Jan 21 at 5:13










2 Answers
2






active

oldest

votes


















1














# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity


BEGIN
OFS="t"


$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name



Then run



while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv


Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4


For the ID MT, the gene hit should be mt-nd2 not mt-nd1.



I still recommend using Python for data processing.






share|improve this answer

























  • I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

    – Age87
    Jan 21 at 5:50












  • I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

    – Niko Gambt
    Jan 21 at 7:19












  • Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

    – Niko Gambt
    Jan 21 at 7:40












  • Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

    – Age87
    Jan 21 at 20:40


















1














Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.



awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv





share|improve this answer

























  • @Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

    – RavinderSingh13
    Jan 21 at 4:35










Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f495683%2fextracting-names-from-file-a-using-information-from-2-columns-in-file-b%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity


BEGIN
OFS="t"


$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name



Then run



while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv


Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4


For the ID MT, the gene hit should be mt-nd2 not mt-nd1.



I still recommend using Python for data processing.






share|improve this answer

























  • I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

    – Age87
    Jan 21 at 5:50












  • I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

    – Niko Gambt
    Jan 21 at 7:19












  • Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

    – Niko Gambt
    Jan 21 at 7:40












  • Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

    – Age87
    Jan 21 at 20:40















1














# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity


BEGIN
OFS="t"


$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name



Then run



while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv


Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4


For the ID MT, the gene hit should be mt-nd2 not mt-nd1.



I still recommend using Python for data processing.






share|improve this answer

























  • I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

    – Age87
    Jan 21 at 5:50












  • I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

    – Niko Gambt
    Jan 21 at 7:19












  • Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

    – Niko Gambt
    Jan 21 at 7:40












  • Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

    – Age87
    Jan 21 at 20:40













1












1








1







# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity


BEGIN
OFS="t"


$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name



Then run



while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv


Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4


For the ID MT, the gene hit should be mt-nd2 not mt-nd1.



I still recommend using Python for data processing.






share|improve this answer















# save this as script.awk or whatevernameyouwant.awk

function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity


BEGIN
OFS="t"


$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name



Then run



while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv


Please make sure that the fields in your .tsv files are separated by tabs before processing them. My output:



MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4


For the ID MT, the gene hit should be mt-nd2 not mt-nd1.



I still recommend using Python for data processing.







share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 21 at 7:42

























answered Jan 21 at 3:21









Niko GambtNiko Gambt

1836




1836












  • I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

    – Age87
    Jan 21 at 5:50












  • I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

    – Niko Gambt
    Jan 21 at 7:19












  • Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

    – Niko Gambt
    Jan 21 at 7:40












  • Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

    – Age87
    Jan 21 at 20:40

















  • I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

    – Age87
    Jan 21 at 5:50












  • I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

    – Niko Gambt
    Jan 21 at 7:19












  • Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

    – Niko Gambt
    Jan 21 at 7:40












  • Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

    – Age87
    Jan 21 at 20:40
















I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50






I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again

– Age87
Jan 21 at 5:50














I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19






I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The $1 ~ id check worked for me.

– Niko Gambt
Jan 21 at 7:19














Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40






Try changing $1 ~ id to $1 == id. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id. I've edited my code.

– Niko Gambt
Jan 21 at 7:40














Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40





Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)

– Age87
Jan 21 at 20:40













1














Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.



awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv





share|improve this answer

























  • @Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

    – RavinderSingh13
    Jan 21 at 4:35















1














Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.



awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv





share|improve this answer

























  • @Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

    – RavinderSingh13
    Jan 21 at 4:35













1












1








1







Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.



awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv





share|improve this answer















Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.



awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv






share|improve this answer














share|improve this answer



share|improve this answer








edited Jan 21 at 6:16









Age87

1607




1607










answered Jan 21 at 1:18









RavinderSingh13RavinderSingh13

1336




1336












  • @Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

    – RavinderSingh13
    Jan 21 at 4:35

















  • @Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

    – RavinderSingh13
    Jan 21 at 4:35
















@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35





@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.

– RavinderSingh13
Jan 21 at 4:35

















draft saved

draft discarded
















































Thanks for contributing an answer to Unix & Linux Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f495683%2fextracting-names-from-file-a-using-information-from-2-columns-in-file-b%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown






Popular posts from this blog

How to check contact read email or not when send email to Individual?

How many registers does an x86_64 CPU actually have?

Nur Jahan