compare and print the values in two arrays using awk

A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG


I need an awk script that splits the values of columns 4 and 5 into single characters and compares them pairwise. When the characters at a position differ, the string from the first column should be printed, followed by an underscore and the corresponding value from column 2 (first character) or column 3 (second character). If both nucleotides differ, two lines of output should be produced.
In addition, the differing values from columns 4 and 5 should be printed next to each ID.



awk '{ split($4, a1, ""); split($5, a2, ""); for (i in a1) if (a1[i] != a2[i]) print $1 "_" $(i+1) }' input > out


This does the first part.



The output I need looks like this:



A01_11814111 G A
A01_11485519 G T






  • Are you sure the expected output is what the given input will produce? Why don't A01_11667971 C A and the other differing pairs appear in the output?
    – αғsнιη
    Oct 30 '17 at 20:27

  • Looks like all you have to do is print $1 "_" $(i+1), a1[i], a2[i]
    – glenn jackman
    Oct 30 '17 at 20:33

  • And what if both nucleotides are equal?
    – RomanPerekhrest
    Oct 30 '17 at 20:48

  • If both nucleotides are equal, nothing should be printed. There should not be a case where both nucleotides are the same.
    – Gavin
    Oct 30 '17 at 20:58

  • @Gavin, please elaborate on how the condition "two lines of output are to be produced" should be output.
    – RomanPerekhrest
    Oct 30 '17 at 21:06
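
A minimal sketch of glenn jackman's suggestion applied to the script from the question. It assumes an awk that splits a string into single characters when the separator is the empty string (gawk and mawk do; POSIX leaves this unspecified). The function name `diff_pairs` is hypothetical, and a counting loop is used instead of `for (i in a1)` because awk does not guarantee the iteration order of `for ... in`.

```shell
#!/bin/sh
# diff_pairs FILE: hypothetical helper name, used only for this sketch.
# Prints "<col1>_<position> <col4 char> <col5 char>" for every character
# position at which columns 4 and 5 differ.
diff_pairs() {
    awk '{
        # Splitting on "" yields one character per element
        # (assumed behaviour; true for gawk and mawk).
        n = split($4, a1, "")
        split($5, a2, "")
        for (i = 1; i <= n; i++)
            if (a1[i] != a2[i])
                # Character i is reported with the position in column i+1.
                print $1 "_" $(i + 1), a1[i], a2[i]
    }' "$1"
}

# Example call, using the file names from the question:
# diff_pairs input > out
```

Because each position is tested on its own, a record in which both nucleotides differ produces two lines, and a record with identical values produces none.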















Question asked Oct 30 '17 at 20:10 by Gavin; edited Oct 30 '17 at 20:20 by Jeff Schaller











2 Answers



Accepted answer


Contents of tmp.txt



A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG


Contents of tmp.awk




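{
    # First nucleotides differ: report the position from column 2.
    if (substr($4,1,1) != substr($5,1,1))
        print $1 "_" $2 " " substr($4,1,1) " " substr($5,1,1);

    # Second nucleotides differ: report the position from column 3.
    if (substr($4,2,1) != substr($5,2,1))
        print $1 "_" $3 " " substr($4,2,1) " " substr($5,2,1);
}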




Sample output



[user@server ~]$ awk -f tmp.awk tmp.txt
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C


Bonus. In bash



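#!/bin/bash
# Split each line into positional parameters, then compare
# columns 4 and 5 one character at a time.
while read -r line
do
    set -- $line
    if [ "${4:0:1}" != "${5:0:1}" ]
    then printf '%s_%s %s %s\n' "$1" "$2" "${4:0:1}" "${5:0:1}"
    fi
    if [ "${4:1:1}" != "${5:1:1}" ]
    then printf '%s_%s %s %s\n' "$1" "$3" "${4:1:1}" "${5:1:1}"
    fi
done < tmp.txt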


Sample output



A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C








    Second answer


    awk solution:



    awk '{
        split($4$5, arr, "");
        if (arr[1] == arr[3])
            print $1 "_" $3, arr[2], arr[4];
        else
            print $1 "_" $2, arr[1], arr[3];
    }' input.txt


    sed solution:



    sed -r '
    s@(\w*) *(\w*) *(\w*) *(\w)(\w) *\4(\w)$@\1_\3 \5 \6@
    s@(\w*) *(\w*) *(\w*) *(\w)(\w) *(\w)\5$@\1_\2 \4 \6@
    ' input.txt


    Output (both the same)



    A01_11814111 G A
    A01_11485519 G T
    A01_11667971 C A
    A01_11876079 C G
    A01_11613258 A G
    A01_11876079 C G
    A01_11616463 A G
    A01_11875368 G A
    A01_11667971 C A
    A01_11564411 A G
    A01_11477215 T C
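
The awk command above prints exactly one line per record, and each sed substitution only matches when one of the two positions is equal, so neither produces the two lines the question asks for when both nucleotides differ. A sketch of a variant that tests each position independently follows; the function name `both_positions` and the sample record `A01 100 200 GA TC` are hypothetical and not taken from the question's data, and splitting on an empty separator is assumed to yield single characters (gawk and mawk behave this way).

```shell
#!/bin/sh
# both_positions [FILE...]: hypothetical name, used only for this sketch.
# Reads the named files, or standard input when none are given.
both_positions() {
    awk '{
        # arr[1..2] come from column 4, arr[3..4] from column 5.
        split($4 $5, arr, "")
        # Two independent tests instead of if/else, so both can fire.
        if (arr[1] != arr[3]) print $1 "_" $2, arr[1], arr[3]
        if (arr[2] != arr[4]) print $1 "_" $3, arr[2], arr[4]
    }' "$@"
}

# Hypothetical record in which both nucleotides differ:
# printf 'A01 100 200 GA TC\n' | both_positions
```

For the data shown in the question, where only one position differs in each row, this variant should give the same output as the commands above.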





Accepted answer (awk script using substr, plus the bash version): answered Oct 30 '17 at 21:03 by Zachary Brady






















Second answer (awk and sed solutions): answered Oct 31 '17 at 0:55 by MiniMax



























                       
