Extract names from File_B having overlapping intervals with File_A

Two space-delimited files:



File_A



MT 50000
groupI 7850000
groupI 7950000
groupI 9050000
groupI 21750000
groupII 8750000
groupII 10550000
groupII 16150000
groupII 20850000
groupIII 14750000
groupIII 15250000
groupIII 15450000
groupIII 15550000
groupIII 15650000
groupIV 7850000


The first column is the group ID and the second column is the mid-point of a 100,000-unit interval within that group. For example, the first row corresponds to the interval 1-100000 in group MT, the second row to the interval 7800000-7900000 in groupI, and so on.
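
To make the mid-point arithmetic concrete, here is a minimal sketch that prints the start and end of each window. It assumes the window runs from mid-point - 50000 + 1 to mid-point + 50000, which reproduces 1-100000 for the first row; for the second row it gives 7800001-7900000, one unit off the lower bound quoted above. The function name midpoints_to_intervals and the variable half_width are made up for illustration.

```shell
# Print "group start end" for every row of a File_A-style file.
# half_width is an illustrative name; 50000 is half of the 100,000-unit window.
midpoints_to_intervals() {
    awk -v half_width=50000 '
        NF >= 2 {
            start = $2 - half_width + 1   # assumed lower bound: 50000 -> 1
            end   = $2 + half_width       # assumed upper bound: 50000 -> 100000
            print $1, start, end
        }
    ' "$1"
}
```

Called as midpoints_to_intervals File_A.txt, it would print one line per row, for example "MT 1 100000" for the first row.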



File_B



MT 2851 3825 Name=mt-nd1
MT 4036 5082 Name=mt-nd2
MT 5465 7015 Name=mt-co1
MT 7173 7863 Name=mt-co2
MT 8097 8780 Name=mt-atp6
groupI 18791 22890 Name=FGF12
groupI 36880 38991 Name=MB21D2
groupI 65279 68049 Name=cldn15lb
groupI 77722 105198 Name=col4a4
groupI 117583 141390 Name=col4a3
groupI 150455 155401 Name=sst1.1
groupI 9050030 9058000 Name=bco2b
groupI 1076088 1085084 Name=SORL1
groupI 1175505 1181937 Name=abcg4b
groupI 1184288 1184688 Name=lyrm9
groupI 1185206 1186192 Name=ift20


Column 1 of File_B is the group/chromosome name where a gene is located; columns 2 and 3 are the gene's start and end positions, and column 4 is the gene name.
I want to extract only the gene names from the 4th column of File_B whose intervals fall within one of the 100,000-unit intervals of File_A.



Output_file



mt-nd1
mt-nd2
mt-co1
mt-co2
mt-atp6
bco2b


I was using the following code for a different, although similar, procedure (File_B had more columns and the second column for File_A was a point not an interval).



while read -r id pos; do awk -v id="$id" -v pos="$pos" '$1 == id && pos > $4 && pos < $5 { if (gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1) !~ /s/) print gensub(/.*gene=([A-Za-z0-9]*).*/, "\\1", 1); }' <File_B.txt; done < File_A.txt > Output_file.txt


Any help from any wiz is appreciated!
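
One possible approach, as a rough sketch rather than a definitive answer: read File_A once, store the window of each row per group, then test every gene in File_B against the windows of its own group. The sketch assumes each window runs from mid-point - 50000 + 1 to mid-point + 50000, and it treats any overlap as a match, as in the title. The function name overlapping_names and the variable half_width are made up for illustration.

```shell
# overlapping_names FILE_A FILE_B
# Prints the gene names (4th column of FILE_B without the "Name=" prefix)
# whose interval overlaps a window from FILE_A in the same group.
overlapping_names() {
    awk -v half_width=50000 '
        NF == 0 { next }                      # skip blank lines in either file
        # First file (File_A): store every window under its group ID.
        NR == FNR {
            n = ++count[$1]
            win_start[$1, n] = $2 - half_width + 1
            win_end[$1, n]   = $2 + half_width
            next
        }
        # Second file (File_B): compare the gene with the windows of its group.
        {
            for (i = 1; i <= count[$1]; i++) {
                # Overlap test. For strict containment, use instead:
                #   $2 >= win_start[$1, i] && $3 <= win_end[$1, i]
                if ($2 <= win_end[$1, i] && $3 >= win_start[$1, i]) {
                    name = $4
                    sub(/^Name=/, "", name)
                    print name
                    break                     # report each gene only once
                }
            }
        }
    ' "$1" "$2"
}
```

It could be run as overlapping_names File_A.txt File_B.txt > Output_file.txt. Unlike the while-read loop above, this starts awk once and reads each file a single time, and it uses only POSIX awk features (no gensub).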

text-processing awk

asked 1 min ago by Age87