compare and print the values in two arrays using awk

A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG
I need an awk script that splits the values of columns 4 and 5 into single characters and compares them pairwise. When a character differs between the two arrays, the string from the first column should be printed, followed by an underscore and the corresponding value from column 2 (first character) or column 3 (second character). If both nucleotides differ, two lines of output should be produced.
The differing values from columns 4 and 5 should also be printed next to each ID.
awk '{ split($4, a1, ""); split($5, a2, ""); for (i in a1) if (a1[i] != a2[i]) print $1 "_" $(i+1) }' input > out
This does the first part.
Output needed:
A01_11814111 G A
A01_11485519 G T
awk bioinformatics
edited Oct 30 '17 at 20:20 by Jeff Schaller
asked Oct 30 '17 at 20:10 by Gavin
Are you sure the expected output is what the given input will produce? Why do A01_11667971 C A and some other pairs not appear in the output? Those differ as well.
– Ã±ÃÂsýù÷, Oct 30 '17 at 20:27
Looks like all you have to do is print $1 "_" $(i+1), a1[i], a2[i]
– glenn jackman, Oct 30 '17 at 20:33
And what if both nucleotides are equal?
– RomanPerekhrest, Oct 30 '17 at 20:48
If both nucleotides are equal, the line should not be printed. There should not be a case where both nucleotides are the same.
– Gavin, Oct 30 '17 at 20:58
@Gavin, please elaborate on how the condition "two lines of output are to be produced" should be output.
– RomanPerekhrest, Oct 30 '17 at 21:06
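For reference, glenn jackman's suggestion can be dropped into the question's loop as sketched below. This is only an illustration: the function name diff_sites is made up here, and splitting on an empty separator is not specified by POSIX (it works in gawk, mawk and the one-true-awk, which is assumed). A numeric loop replaces for (i in a1) because the traversal order of for-in is unspecified.

```shell
# Sketch: the question's loop plus the extra print arguments from the comment.
# diff_sites is a hypothetical name; it reads records on standard input.
diff_sites() {
    awk '{
        n = split($4, a1, "")      # assumption: "" splits into single characters
        split($5, a2, "")
        for (i = 1; i <= n; i++)   # numeric loop keeps output in column order
            if (a1[i] != a2[i])
                print $1 "_" $(i+1), a1[i], a2[i]
    }'
}

# Two records taken from the question's sample data.
printf 'A01 11814111 11814112 GA AA\nA01 11477215 11477235 TG CG\n' | diff_sites
```

On these two records the sketch prints A01_11814111 G A and A01_11477215 T C, which matches the format requested in the question.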
2 Answers
Accepted answer:
Contents of tmp.txt
A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG
Contents of tmp.awk
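{
    # first nucleotide differs: report it with the position from column 2
    if (substr($4,1,1) != substr($5,1,1))
        print $1 "_" $2 " " substr($4,1,1) " " substr($5,1,1);
    # second nucleotide differs: report it with the position from column 3
    if (substr($4,2,1) != substr($5,2,1))
        print $1 "_" $3 " " substr($4,2,1) " " substr($5,2,1);
}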
Sample output
[user@server ~]$ awk -f tmp.awk tmp.txt
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
Bonus. In bash
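#!/bin/bash
while read -r line
do
    # split the line into positional parameters $1..$5
    set -- $line
    if [ "${4:0:1}" != "${5:0:1}" ]
    then printf '%s_%s %s %s\n' "$1" "$2" "${4:0:1}" "${5:0:1}"
    fi
    if [ "${4:1:1}" != "${5:1:1}" ]
    then printf '%s_%s %s %s\n' "$1" "$3" "${4:1:1}" "${5:1:1}"
    fi
done < tmp.txt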
Sample output
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
answered Oct 30 '17 at 21:03 by Zachary Brady
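The two substr() tests above could also be folded into a loop, which avoids relying on split() with an empty separator. The following is a rough sketch rather than part of the answer: diff_positions is a hypothetical name, and the mapping from character position i to column i+1 only makes sense for the two-character fields in the sample data.

```shell
# Sketch only: diff_positions is a hypothetical name, not from the answer.
diff_positions() {
    awk '{
        n = length($4)                 # assumption: columns 4 and 5 have equal length
        for (i = 1; i <= n; i++) {
            x = substr($4, i, 1)
            y = substr($5, i, 1)
            if (x != y)
                print $1 "_" $(i + 1), x, y   # position i maps to column i+1
        }
    }'
}

# Two records taken from tmp.txt.
diff_positions <<'EOF'
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
EOF
```

For these records the loop prints A01_11613258 A G and A01_11876079 C G, the same lines the answer's script produces for them.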
awk solution:
awk '{
    # arr[1], arr[2] come from column 4; arr[3], arr[4] from column 5
    split($4$5, arr, "");
    if (arr[1] == arr[3])
        print $1 "_" $3, arr[2], arr[4];
    else
        print $1 "_" $2, arr[1], arr[3];
}' input.txt
sed solution:
sed -r '
s@(\w*) *(\w*) *(\w*) *(\w)(\w) *\4(\w)$@\1_\3 \5 \6@
s@(\w*) *(\w*) *(\w*) *(\w)(\w) *(\w)\5$@\1_\2 \4 \6@
' input.txt
Output (both the same)
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
answered Oct 31 '17 at 0:55 by MiniMax
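One point worth checking against the question's "two lines of output" requirement: an if/else runs exactly one branch per record, so a record in which both nucleotides differ yields a single line. The sketch below compares that form with two independent tests. The record X01 100 200 GA AT and the function names one_branch and two_tests are invented for illustration; no such record appears in the sample data.

```shell
# Hypothetical record in which both positions differ (not from the question's data).
rec='X01 100 200 GA AT'

# if/else form, as in the awk solution above: one branch runs per record.
one_branch() {
    awk '{
        split($4$5, arr, "")
        if (arr[1] == arr[3])
            print $1 "_" $3, arr[2], arr[4]
        else
            print $1 "_" $2, arr[1], arr[3]
    }'
}

# Two independent tests: each differing position gets its own line.
two_tests() {
    awk '{
        if (substr($4,1,1) != substr($5,1,1))
            print $1 "_" $2, substr($4,1,1), substr($5,1,1)
        if (substr($4,2,1) != substr($5,2,1))
            print $1 "_" $3, substr($4,2,1), substr($5,2,1)
    }'
}

printf '%s\n' "$rec" | one_branch
printf '%s\n' "$rec" | two_tests
```

Here one_branch reports only X01_100 G A, while two_tests reports both X01_100 G A and X01_200 A T. On the sample input every record differs in exactly one position, which is why both styles produce the same output there.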