compare and print the values in two arrays using awk

A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG

I need an awk script that splits the values of columns 4 and 5 into characters and compares them pairwise. When a character differs between the two arrays, the string from the first column should be printed, followed by an underscore and the corresponding value from column 2 or 3. If both nucleotides differ, two lines of output should be produced.
Also, the differing characters from columns 4 and 5 should be printed next to each ID.

awk '{ split($4, a1, ""); split($5, a2, ""); for (i in a1) if (a1[i] != a2[i]) print $1 "_" $(i+1) }' input > out

This does the first part.

Output needed:

A01_11814111 G A

A01_11485519 G T

edited Oct 30 '17 at 20:20

Jeff Schaller

32.1k849109

asked Oct 30 '17 at 20:10

Gavin

233

Are you sure the expected output matches the given input? Why don't A01_11667971 C A and the other differing pairs appear in the output?
– αғsнιη
Oct 30 '17 at 20:27

Looks like all you have to do is print $1 "_" $(i+1), a1[i], a2[i]
– glenn jackman
Oct 30 '17 at 20:33
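
A minimal sketch of that suggestion, wrapped in a hypothetical shell function named `diff_sites` (the name is an assumption, not from the thread). It uses a counted loop with `substr` rather than `split($4, a1, "")` and `for (i in a1)`, because splitting on an empty separator is not specified by POSIX and the traversal order of `for (i in ...)` is not guaranteed. It assumes two-character genotypes, so that `$(i+1)` resolves to column 2 or column 3.

```shell
# Hypothetical helper: print ID_position plus the two differing characters.
# Assumes column 2 pairs with the first character and column 3 with the second.
diff_sites() {
    awk '{
        n = length($4)
        for (i = 1; i <= n; i++) {
            c1 = substr($4, i, 1)   # i-th character of column 4
            c2 = substr($5, i, 1)   # i-th character of column 5
            if (c1 != c2)
                print $1 "_" $(i + 1), c1, c2
        }
    }' "$1"
}
```

It could be called as `diff_sites input > out`.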

And what if both nucleotides are equal?
– RomanPerekhrest
Oct 30 '17 at 20:48

If both nucleotides are equal, nothing should be printed. There should be no case where both nucleotides are the same.
– Gavin
Oct 30 '17 at 20:58

@Gavin, please elaborate on how the condition "two lines of output are to be produced" should appear in the output.
– RomanPerekhrest
Oct 30 '17 at 21:06

2 Answers
accepted

Contents of tmp.txt

A01 11814111 11814112 GA AA
A01 11485477 11485519 AG AT
A01 11667935 11667971 TC TA
A01 11876070 11876079 TC TG
A01 11613258 11613277 AC GC
A01 11876079 11876107 CA GA
A01 11616453 11616463 TA TG
A01 11875367 11875368 GG GA
A01 11667971 11667993 CA AA
A01 11564406 11564411 TA TG
A01 11477215 11477235 TG CG

Contents of tmp.awk


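{
 # Column 2 belongs to the first character of columns 4 and 5.
 if (substr($4,1,1) != substr($5,1,1))
  print $1 "_" $2 " " substr($4,1,1) " " substr($5,1,1);

 # Column 3 belongs to the second character.
 if (substr($4,2,1) != substr($5,2,1))
  print $1 "_" $3 " " substr($4,2,1) " " substr($5,2,1);
}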

Sample output

[user@server ~]$ awk -f tmp.awk tmp.txt
A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
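
None of the sample records differs in both positions, so the "two lines of output" case is not exercised above. A quick sketch to check it, using the same two `substr` comparisons inside a hypothetical function `print_diffs` that reads from standard input; the test record below is made up for illustration and is not part of the question's data.

```shell
# Hypothetical wrapper around the two substr comparisons; reads records on stdin.
print_diffs() {
    awk '{
        if (substr($4, 1, 1) != substr($5, 1, 1))
            print $1 "_" $2 " " substr($4, 1, 1) " " substr($5, 1, 1)
        if (substr($4, 2, 1) != substr($5, 2, 1))
            print $1 "_" $3 " " substr($4, 2, 1) " " substr($5, 2, 1)
    }'
}

# Made-up record in which both characters differ.
printf 'A01 100 200 GA TC\n' | print_diffs
# prints:
# A01_100 G T
# A01_200 A C
```

Because the two `if` statements are independent, a record can produce zero, one, or two lines.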

Bonus. In bash

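#!/bin/bash
# Split each record into positional parameters, then compare columns 4 and 5 character by character.
while read -r line
do
 set -- $line
 if [ "${4:0:1}" != "${5:0:1}" ]
 then printf '%s_%s %s %s\n' "$1" "$2" "${4:0:1}" "${5:0:1}"
 fi
 if [ "${4:1:1}" != "${5:1:1}" ]
 then printf '%s_%s %s %s\n' "$1" "$3" "${4:1:1}" "${5:1:1}"
 fi
done < tmp.txt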

Sample output

A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C

answered Oct 30 '17 at 21:03

Zachary Brady

3,386831

awk solution:

awk '{
 split($4$5, arr, "");
 if(arr[1] == arr[3])
  print $1 "_" $3, arr[2], arr[4];
 else
  print $1 "_" $2, arr[1], arr[3];
}' input.txt

sed solution:

sed -r '

 s@(\w*) *(\w*) *(\w*) *(\w)(\w) *\4(\w)$@\1_\3 \5 \6@
 s@(\w*) *(\w*) *(\w*) *(\w)(\w) *(\w)\5$@\1_\2 \4 \6@

' input.txt

Output (both the same)

A01_11814111 G A
A01_11485519 G T
A01_11667971 C A
A01_11876079 C G
A01_11613258 A G
A01_11876079 C G
A01_11616463 A G
A01_11875368 G A
A01_11667971 C A
A01_11564411 A G
A01_11477215 T C
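
Both commands appear to rely on every record differing in exactly one position, which holds for the sample data. A hedged sketch for checking that assumption on other input before using them: the hypothetical function `report_unexpected` (a name chosen here for illustration) lists every record whose columns 4 and 5 differ in zero positions or in more than one, prefixed with the number of differences.

```shell
# Hypothetical check: list records that do not differ in exactly one position.
report_unexpected() {
    awk '{
        d = 0
        # Compare up to the longer of the two strings.
        n = (length($4) > length($5)) ? length($4) : length($5)
        for (i = 1; i <= n; i++)
            if (substr($4, i, 1) != substr($5, i, 1))
                d++
        if (d != 1)
            print d, $0
    }' "$1"
}
```

Running `report_unexpected input.txt` should print nothing when the one-difference assumption holds.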

answered Oct 31 '17 at 0:55

MiniMax

2,706719
