Extracting names from file_a using information from 2 columns in file_b
Clash Royale CLAN TAG#URR8PPP
Building on: Extract names from File_B having overlapping intervals with File_A
I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065
)
There are a few problems with the code I have now.
(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350
) is brought up the list. I would like it to work the way it is set up.
(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.
(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.
file_a.tsv
MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1
file_b.tsv
MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350
code:
while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv
output.tsv
mt
si
zgc
bnip4
Desired_output
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
bash shell-script awk sed
add a comment |
Building on: Extract names from File_B having overlapping intervals with File_A
I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065
)
There are a few problems with the code I have now.
(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350
) is brought up the list. I would like it to work the way it is set up.
(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.
(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.
file_a.tsv
MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1
file_b.tsv
MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350
code:
while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv
output.tsv
mt
si
zgc
bnip4
Desired_output
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
bash shell-script awk sed
You need to insertt
between fields in youroutput.tsv
file. Otherwise, it's not atsv
file.
– Niko Gambt
Jan 21 at 5:13
add a comment |
Building on: Extract names from File_B having overlapping intervals with File_A
I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065
)
There are a few problems with the code I have now.
(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350
) is brought up the list. I would like it to work the way it is set up.
(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.
(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.
file_a.tsv
MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1
file_b.tsv
MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350
code:
while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv
output.tsv
mt
si
zgc
bnip4
Desired_output
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
bash shell-script awk sed
Building on: Extract names from File_B having overlapping intervals with File_A
I want to extract gene names (usually column 10, what's after "Name=") by matching the first column of file_b to file_a, and extracting the gene names if the second column of file_b lies within the gene interval, delineated by columns 4 and 5 of file_b. The first columns must match, such that I only get one gene per row (file_b) but I could in theory have multiple adjacent rows (column_b) match the same gene (e.g. if the second row in file_b was MT 4065
)
There are a few problems with the code I have now.
(1)the way it is set up below, the last row of file_b is lost from the output, although this changes if this row (groupVII 17978350
) is brought up the list. I would like it to work the way it is set up.
(2)The names are truncated if they have special characters (eg colons and hyphens). I would like to have the entire name after the equal sign.
(3) I would like to match the entry/row of file_b to the gene hits in the output, such that the first two columns is the entry and the third column is the gene hit.
file_a.tsv
MT insdc gene 2851 3825 . + . ID=gene:ENSGACG00000020925 Name=mt-nd1 biotype=protein_coding description=NADH dehydrogenase 1%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-7] gene_id=ENSGACG00000020925 logic_name=mt_genbank_import version=1
MT insdc gene 4036 5082 . + . ID=gene:ENSGACG00000020929 Name=mt-nd2 biotype=protein_coding description=NADH dehydrogenase 2%2C mitochondrial [Source:ZFIN%3BAcc:ZDB-GENE-011205-8] gene_id=ENSGACG00000020929 logic_name=mt_genbank_import version=1
groupIII ensembl gene 7332324 7334769 . - . ID=gene:ENSGACG00000015265 Name=si:dkeyp-68b7.10 biotype=protein_coding description=si:dkeyp-68b7.10 [Source:ZFIN%3BAcc:ZDB-GENE-070912-667] gene_id=ENSGACG00000015265 logic_name=ensembl version=1
groupIV ensembl gene 1368026 1374881 . + . ID=gene:ENSGACG00000016447 Name=hnrnpa0b biotype=protein_coding description=heterogeneous nuclear ribonucleoprotein A0b [Source:ZFIN%3BAcc:ZDB-GENE-030131-6154] gene_id=ENSGACG00000016447 logic_name=ensembl version=1
groupIV ensembl gene 5347339 5349041 . - . ID=gene:ENSGACG00000017010 Name=zgc:153018 biotype=protein_coding description=zgc:153018 [Source:ZFIN%3BAcc:ZDB-GENE-060929-752] gene_id=ENSGACG00000017010 logic_name=ensembl version=1
groupV ensembl gene 120615 125489 . + . ID=gene:ENSGACG00000002103 Name=zdhhc6 biotype=protein_coding description=zinc finger%2C DHHC-type containing 6 [Source:ZFIN%3BAcc:ZDB-GENE-030131-3189] gene_id=ENSGACG00000002103 logic_name=ensembl version=1
groupVI ensembl gene 11230354 11232784 . + . ID=gene:ENSGACG00000009527 Name=bnip4 biotype=protein_coding description=BCL2 interacting protein 4 [Source:ZFIN%3BAcc:ZDB-GENE-051113-212] gene_id=ENSGACG00000009527 logic_name=ensembl version=1
groupVII ensembl gene 2271611 2277214 . + . ID=gene:ENSGACG00000019012 Name=sf3b2 biotype=protein_coding description=splicing factor 3b%2C subunit 2 [Source:ZFIN%3BAcc:ZDB-GENE-070928-1] gene_id=ENSGACG00000019012 logic_name=ensembl version=2
groupVII ensembl gene 15815857 15824549 . + . ID=gene:ENSGACG00000020296 Name=mpp1 biotype=protein_coding description=membrane protein%2C palmitoylated 1 [Source:ZFIN%3BAcc:ZDB-GENE-031113-4] gene_id=ENSGACG00000020296 logic_name=ensembl version=1
groupVII ensembl gene 17978322 17982388 . + . ID=gene:ENSGACG00000020399 Name=si:ch211-284e13.4 biotype=protein_coding description=si:ch211-284e13.4 [Source:ZFIN%3BAcc:ZDB-GENE-060526-161] gene_id=ENSGACG00000020399 logic_name=ensembl version=1
file_b.tsv
MT 4050
groupIII 7332350
groupIV 5347350
groupVI 11230375
groupVII 17978350
code:
while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 if (gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1) !~ /s/) print gensub(/.*Name=([A-Za-z0-9]*).*/, "\1", 1); ' <file_a.tsv; done < file_b.tsv > output.tsv
output.tsv
mt
si
zgc
bnip4
Desired_output
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
bash shell-script awk sed
bash shell-script awk sed
edited Jan 21 at 6:13
Age87
asked Jan 21 at 1:08
Age87Age87
1607
1607
You need to insertt
between fields in youroutput.tsv
file. Otherwise, it's not atsv
file.
– Niko Gambt
Jan 21 at 5:13
add a comment |
You need to insertt
between fields in youroutput.tsv
file. Otherwise, it's not atsv
file.
– Niko Gambt
Jan 21 at 5:13
You need to insert
t
between fields in your output.tsv
file. Otherwise, it's not a tsv
file.– Niko Gambt
Jan 21 at 5:13
You need to insert
t
between fields in your output.tsv
file. Otherwise, it's not a tsv
file.– Niko Gambt
Jan 21 at 5:13
add a comment |
2 Answers
2
active
oldest
votes
# save this as script.awk or whatevernameyouwant.awk
function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity
BEGIN
OFS="t"
$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name
Then run
while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv
Please make sure that the fields in your .tsv
files are separated by tabs before processing them. My output:
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
For the ID MT
, the gene hit should be mt-nd2
not mt-nd1
.
I still recommend using Python for data processing.
I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again
– Age87
Jan 21 at 5:50
I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The$1 ~ id
check worked for me.
– Niko Gambt
Jan 21 at 7:19
Try changing$1 ~ id
to$1 == id
. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used$1 == id
. I've edited my code.
– Niko Gambt
Jan 21 at 7:40
Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)
– Age87
Jan 21 at 20:40
add a comment |
Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.
awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv
@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.
– RavinderSingh13
Jan 21 at 4:35
add a comment |
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f495683%2fextracting-names-from-file-a-using-information-from-2-columns-in-file-b%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
# save this as script.awk or whatevernameyouwant.awk
function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity
BEGIN
OFS="t"
$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name
Then run
while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv
Please make sure that the fields in your .tsv
files are separated by tabs before processing them. My output:
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
For the ID MT
, the gene hit should be mt-nd2
not mt-nd1
.
I still recommend using Python for data processing.
I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again
– Age87
Jan 21 at 5:50
I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The$1 ~ id
check worked for me.
– Niko Gambt
Jan 21 at 7:19
Try changing$1 ~ id
to$1 == id
. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used$1 == id
. I've edited my code.
– Niko Gambt
Jan 21 at 7:40
Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)
– Age87
Jan 21 at 20:40
add a comment |
# save this as script.awk or whatevernameyouwant.awk
function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity
BEGIN
OFS="t"
$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name
Then run
while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv
Please make sure that the fields in your .tsv
files are separated by tabs before processing them. My output:
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
For the ID MT
, the gene hit should be mt-nd2
not mt-nd1
.
I still recommend using Python for data processing.
I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again
– Age87
Jan 21 at 5:50
I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The$1 ~ id
check worked for me.
– Niko Gambt
Jan 21 at 7:19
Try changing$1 ~ id
to$1 == id
. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used$1 == id
. I've edited my code.
– Niko Gambt
Jan 21 at 7:40
Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)
– Age87
Jan 21 at 20:40
add a comment |
# save this as script.awk or whatevernameyouwant.awk
function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity
BEGIN
OFS="t"
$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name
Then run
while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv
Please make sure that the fields in your .tsv
files are separated by tabs before processing them. My output:
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
For the ID MT
, the gene hit should be mt-nd2
not mt-nd1
.
I still recommend using Python for data processing.
# save this as script.awk or whatevernameyouwant.awk
function within_range(val, lower, upper, proximity)
# you can specify the "proximity" as required
return val > lower - proximity && val < upper + proximity
BEGIN
OFS="t"
$1 == id && within_range(pos, $4, $5, 100)
name = gensub(/.*Name=([^t]*).*/, "\1", 1)
if (name ~ /[^[:space:]]+/)
print id, pos, name
Then run
while read -r id pos
do
awk -v id=$id -v pos=$pos -f script.awk file_a.tsv
done < file_b.tsv > output.tsv
Please make sure that the fields in your .tsv
files are separated by tabs before processing them. My output:
MT 4050 mt-nd2
groupIII 7332350 si:dkeyp-68b7.10
groupIV 5347350 zgc:153018
groupVI 11230375 bnip4
groupVII 17978350 si:ch211-284e13.4
For the ID MT
, the gene hit should be mt-nd2
not mt-nd1
.
I still recommend using Python for data processing.
edited Jan 21 at 7:42
answered Jan 21 at 3:21
Niko GambtNiko Gambt
1836
1836
I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again
– Age87
Jan 21 at 5:50
I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The$1 ~ id
check worked for me.
– Niko Gambt
Jan 21 at 7:19
Try changing$1 ~ id
to$1 == id
. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used$1 == id
. I've edited my code.
– Niko Gambt
Jan 21 at 7:40
Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)
– Age87
Jan 21 at 20:40
add a comment |
I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again
– Age87
Jan 21 at 5:50
I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The$1 ~ id
check worked for me.
– Niko Gambt
Jan 21 at 7:19
Try changing$1 ~ id
to$1 == id
. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used$1 == id
. I've edited my code.
– Niko Gambt
Jan 21 at 7:40
Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)
– Age87
Jan 21 at 20:40
I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again
– Age87
Jan 21 at 5:50
I noticed a bigger problem. The code is not requiring that the first column of file_b match the first column of file_a, so I am getting spurious hits when the second column of file_b falls within an interval, regardless if the first columns match. For example, I may get two gene names for MT 4050, if 4050 falls within a different interval, say of groupIII. Anyway around this? I uploaded the larger files here: dropbox.com/sh/eh7jnmdevi3kmeb/AABPfR3SaciRJapZhdiVXOtha?dl=0 thanks again
– Age87
Jan 21 at 5:50
I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The
$1 ~ id
check worked for me.– Niko Gambt
Jan 21 at 7:19
I've just tested it by adding a spurious copy of the same MT row in file_a.tsv with the same interval from 4036 to 5082, but with the id changed to something else (I changed it to FOO). My code didn't match that spurious row. The
$1 ~ id
check worked for me.– Niko Gambt
Jan 21 at 7:19
Try changing
$1 ~ id
to $1 == id
. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id
. I've edited my code.– Niko Gambt
Jan 21 at 7:40
Try changing
$1 ~ id
to $1 == id
. See if that fixes your problem. Yeah, that could explain why you got false matches. I should have used $1 == id
. I've edited my code.– Niko Gambt
Jan 21 at 7:40
Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)
– Age87
Jan 21 at 20:40
Perfect! You were right,changing it to $1 == id worked. Thank you. And learning some Python is definitely on my list ;)
– Age87
Jan 21 at 20:40
add a comment |
Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.
awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv
@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.
– RavinderSingh13
Jan 21 at 4:35
add a comment |
Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.
awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv
@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.
– RavinderSingh13
Jan 21 at 4:35
add a comment |
Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.
awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv
Your expected shown output doesn't look to me consistent(2 lines-->1st and 3rd), if that is a typo then could you please try following.
awk 'FNR==NRa[$1]=$2;next ($1 in a) && (a[$1]>=$4 && a[$1]<=$5)sub("Name=","",$10);print $1,a[$1],$10' b.tsv a.tsv > output.tsv
edited Jan 21 at 6:16
Age87
1607
1607
answered Jan 21 at 1:18
RavinderSingh13RavinderSingh13
1336
1336
@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.
– RavinderSingh13
Jan 21 at 4:35
add a comment |
@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.
– RavinderSingh13
Jan 21 at 4:35
@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.
– RavinderSingh13
Jan 21 at 4:35
@Age87, cool, you could EDIT my answer so that other people may learn from it and try to select any answer as correct one out of all.
– RavinderSingh13
Jan 21 at 4:35
add a comment |
Thanks for contributing an answer to Unix & Linux Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f495683%2fextracting-names-from-file-a-using-information-from-2-columns-in-file-b%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You need to insert
t
between fields in youroutput.tsv
file. Otherwise, it's not atsv
file.– Niko Gambt
Jan 21 at 5:13