filter list of partial duplicates by condition(s)
I have a list of partial duplicate records. Each unique record is identified by its first 5 fields; however, each record has more than one "feature" associated with it, defined by the contents of the subsequent 4 fields. The first field of each record holds an "identifier", but an identifier can have more than one record associated with it. Example as follows:
A 1 122114 A T ABCD c.123A>T 41 K/Y
A 1 122114 A T EFGH c.456-7890T>A . .
B 7 56715 G C IJKL c.321+9876C>A . .
B 7 56715 G C MNOP c.543G>C 181 Q/L
B 7 56715 G C PONM c.-7324G>C . .
C 12 9844 T C QRST c.8392-68723T>C . .
C 12 3338745 T C UVWX c.599A>G 200 P/*
C 21 71120 C G YZAB c.35C>G 12 D
C 21 71120 C G CDEF c.-2345G>C . .
D 1 122114 A T ABCD c.123A>T 41 K/Y
D 1 122114 A T EFGH c.456-7890T>A . .
E 8 5094 A AT GHIJ c.678_679insT 226-227 .
E 8 5094 A AT KLMN c.-2356_-2357insT . .
I wish to filter the file down to one line for each "record", using a hierarchy of conditions to filter the "features", for example:
- Field 9 contains "/", or else
- Field 9 contains [A-Z], or else
- Field 8 contains [digit], or else
- Field 7 contains "[range from -50 to +50][A,C,T or G]"
Once a "record" has a line that meets one of these conditions, I do not want any further lines for it (to avoid getting more than one line per "record").
I've tried using awk to create an array using the first 5 fields and running a for loop but I'm making a bit of a hash of it (excuse the pun):
awk -F"\t" '{a[$1$2$3$4$5]=$0; for (i in a) if ($9~"/") print a[i]; else if ($9~/[A-Z]/) print a[i]; else if ($8~/[0-9]/) print a[i]}' file
This ends up printing duplicates multiple times. Is there a way to do this in awk?
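One way to read the hierarchy above is as a ranking: every line gets the number of the first condition it satisfies, and the lowest-ranked line per record is kept. The following is only a sketch under assumptions, not a definitive answer. The names `filter_records` and `rank` are made up for illustration, the input is assumed to be tab-separated, and the fourth condition is interpreted as "a `+` or `-` sign, then a number of at most 50, immediately followed by A, C, G or T". A record with no matching line keeps its first line, and ties keep the earlier line; both are choices of this sketch, not requirements stated in the question.

```shell
filter_records() {
  awk -F'\t' '
    # Rank a line by the first condition it satisfies (1 = best, 5 = none).
    function rank(f7, f8, f9,    s, num) {
      if (f9 ~ /\//)    return 1
      if (f9 ~ /[A-Z]/) return 2
      if (f8 ~ /[0-9]/) return 3
      # Assumed reading of condition 4: signed offset of at most 50, then a base.
      s = f7
      while (match(s, /[+-][0-9]+[ACGT]/)) {
        num = substr(s, RSTART + 1, RLENGTH - 2) + 0
        if (num <= 50) return 4
        s = substr(s, RSTART + RLENGTH)
      }
      return 5
    }
    {
      key = $1 FS $2 FS $3 FS $4 FS $5          # the five fields that define a record
      r = rank($7, $8, $9)
      if (!(key in best)) order[++count] = key  # remember first-seen order
      if (!(key in best) || r < best[key]) { best[key] = r; line[key] = $0 }
    }
    END { for (i = 1; i <= count; i++) print line[order[i]] }
  ' "$@"
}
```

It can be called as `filter_records file`, or fed from a pipe. Because every record is stored once and printed only in the `END` block, each record yields exactly one line, in the order the records first appear in the input.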
text-processing awk
could you explain more on [range from -50 to +50][A,C,T or G]? some have only one value, some multiple.. and ACTG matched immediately after number or anywhere in the field?
- Sundeep, Mar 16 at 10:46
post the expected result
- RomanPerekhrest, Mar 16 at 11:03
edited Mar 16 at 10:29 by Jeff Schaller
asked Mar 16 at 10:26 by Pete C
1 Answer
Perl one-liner here:
perl -F'\t' -lane '$r{$F[0].$F[1].$F[2].$F[3].$F[4]}=$_ if $F[8]=~/\// or $F[8]=~/[A-Z]/ or $F[7]=~/\d/ or $F[6]=~/\b(\d\d)[ACTG]/ and $1<=50; END{print $r{$_} for (keys %r)}' file
Comments:
- A perl solution was offered assuming it is available on your system. If needed, it should be easy to rewrite in awk, given that the logic and syntax are very similar.
- Conditions are based on your specifications and your awk snippet. As already pointed out in the comments, at least one of them does not seem to match your input file sample.
- The last matching line found for each key is printed.
- The records are printed in no particular order (hash key order).
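As a rough illustration of the awk rewrite mentioned above, the sketch below mirrors the one-liner's logic: any line that passes at least one test is stored under its five-field key, and a later matching line overwrites an earlier one. The function names `last_match_per_record` and `near_base` are hypothetical, tab-separated input is assumed, and the last test is taken to mean "exactly two digits, not preceded by a letter, digit or underscore, followed by A, C, G or T, with a value of at most 50". Unlike the one-liner, it prints records in first-seen order.

```shell
last_match_per_record() {
  awk -F'\t' '
    # Assumed equivalent of the Perl test /\b(\d\d)[ACTG]/ and $1 <= 50.
    function near_base(f7,    d) {
      if (match(f7, /^[0-9][0-9][ACGT]/))
        d = substr(f7, 1, 2)
      else if (match(f7, /[^A-Za-z0-9_][0-9][0-9][ACGT]/))
        d = substr(f7, RSTART + 1, 2)
      else
        return 0
      return (d + 0) <= 50
    }
    # Same four tests as the one-liner, joined with "or".
    $9 ~ /\// || $9 ~ /[A-Z]/ || $8 ~ /[0-9]/ || near_base($7) {
      key = $1 FS $2 FS $3 FS $4 FS $5
      if (!(key in kept)) order[++count] = key   # remember first-seen order
      kept[key] = $0                             # a later match overwrites an earlier one
    }
    END { for (i = 1; i <= count; i++) print kept[order[i]] }
  ' "$@"
}
```

Usage is `last_match_per_record file`. Note that, as with the one-liner, a record whose lines pass none of the tests is not printed at all, and the conditions are not applied as a strict hierarchy: whichever matching line comes last in the input wins.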
answered Apr 18 at 14:33 by simlev