Extract names from File_B having overlapping intervals with File_A
Two space-delimited files:
File_A
MT 50000
groupI 7850000
groupI 7950000
groupI 9050000
groupI 21750000
groupII 8750000
groupII 10550000
groupII 16150000
groupII 20850000
groupIII 14750000
groupIII 15250000
groupIII 15450000
groupIII 15550000
groupIII 15650000
groupIV 7850000
The first column is the group ID and the second column is the midpoint of a 100,000-unit interval within that group. For example, the first row corresponds to the interval 1-100000 in group MT, the second row to the interval 7800000-7900000 in groupI, and so on.
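To make the interval definition concrete, here is a minimal sketch of the midpoint-to-interval conversion. The function name to_windows is hypothetical, and the boundaries are an assumption: each window is taken as midpoint - 50000 to midpoint + 50000, with the start clamped to 1 so that the MT row comes out as 1-100000.

```shell
# Hypothetical helper: reads "group midpoint" lines on stdin and
# prints "group start end" for each one.
# Assumption: a window spans midpoint-50000 .. midpoint+50000,
# and a start below 1 is clamped to 1.
to_windows() {
    awk -v half=50000 '{
        start = $2 - half
        if (start < 1) start = 1
        print $1, start, $2 + half
    }'
}

# Example: the first two rows of File_A
printf '%s\n' 'MT 50000' 'groupI 7850000' | to_windows
```

For the two rows shown, this prints MT 1 100000 and groupI 7800000 7900000, matching the intervals described above.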
File_B
MT 2851 3825 Name=mt-nd1
MT 4036 5082 Name=mt-nd2
MT 5465 7015 Name=mt-co1
MT 7173 7863 Name=mt-co2
MT 8097 8780 Name=mt-atp6
groupI 18791 22890 Name=FGF12
groupI 36880 38991 Name=MB21D2
groupI 65279 68049 Name=cldn15lb
groupI 77722 105198 Name=col4a4
groupI 117583 141390 Name=col4a3
groupI 150455 155401 Name=sst1.1
groupI 9050030 9058000 Name=bco2b
groupI 1076088 1085084 Name=SORL1
groupI 1175505 1181937 Name=abcg4b
groupI 1184288 1184688 Name=lyrm9
groupI 1185206 1186192 Name=ift20
Column 1 of File_B is the group/chromosome name where a gene is located; columns 2 and 3 are the start and end of the gene's interval, respectively. Column 4 is the gene name.
I want to extract only the gene names from the 4th column of File_B whose intervals fall within one of the 100,000-unit intervals of File_A.
Output_file
mt-nd1
mt-nd2
mt-co1
mt-co2
mt-atp6
bco2b
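For reference, the selection rule that produces this output can be sketched as a single awk pass over both files. This is only an illustration under stated assumptions, not a tested solution for the real data: the function name genes_in_windows is hypothetical, windows are taken as midpoint - 50000 to midpoint + 50000, a gene counts only if it lies entirely inside one window of the same group, and the 4th column is assumed to always start with Name=.

```shell
# Hypothetical helper: genes_in_windows FILE_A FILE_B
# Prints the names of File_B genes that lie entirely inside
# a File_A window of the same group.
genes_in_windows() {
    awk -v half=50000 '
        NR == FNR {                    # first file (File_A): store windows per group
            n[$1]++
            lo[$1, n[$1]] = $2 - half
            hi[$1, n[$1]] = $2 + half
            next
        }
        {                              # second file (File_B): test each gene
            for (i = 1; i <= n[$1]; i++)
                if ($2 >= lo[$1, i] && $3 <= hi[$1, i]) {
                    sub(/^Name=/, "", $4)   # strip the "Name=" prefix
                    print $4
                    break                   # report each gene once
                }
        }
    ' "$1" "$2"
}

# Usage (file names as in the question):
# genes_in_windows File_A.txt File_B.txt > Output_file.txt
```

If partial overlap should count as well, the test would become $2 <= hi[$1, i] && $3 >= lo[$1, i] instead.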
I was using the following code for a different, although similar, procedure (File_B had more columns and the second column for File_A was a point not an interval).
while read -r id pos; do
  # gensub() requires GNU awk; print the gene= value unless it contains an "s"
  awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 {
    name = gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1)
    if (name !~ /s/) print name
  }' < File_B.txt
done < File_A.txt > Output_file.txt
Any help from any wiz is appreciated!
text-processing awk
asked 1 min ago
Age87