filter list of partial duplicates by condition(s)

I have a list of partially duplicated records. Each unique record is identified by its first 5 fields; however, each record has more than one "feature" associated with it, defined by the contents of the subsequent 4 fields. There is an "identifier" in the first field of each record, but an identifier can have more than one record associated with it. Example as follows:

A 1 122114 A T ABCD c.123A>T 41 K/Y
A 1 122114 A T EFGH c.456-7890T>A . .
B 7 56715 G C IJKL c.321+9876C>A . .
B 7 56715 G C MNOP c.543G>C 181 Q/L
B 7 56715 G C PONM c.-7324G>C . .
C 12 9844 T C QRST c.8392-68723T>C . .
C 12 3338745 T C UVWX c.599A>G 200 P/*
C 21 71120 C G YZAB c.35C>G 12 D
C 21 71120 C G CDEF c.-2345G>C . .
D 1 122114 A T ABCD c.123A>T 41 K/Y
D 1 122114 A T EFGH c.456-7890T>A . .
E 8 5094 A AT GHIJ c.678_679insT 226-227 .
E 8 5094 A AT KLMN c.-2356_-2357insT . .

I wish to filter the file down to one line for each "record", using a hierarchy of conditions to filter the "features", for example:

Field 9 contains "/", or else

Field 9 contains [A-Z], or else

Field 8 contains [digit], or else

Field 7 contains "[range from -50 to +50][A,C,T or G]"

Once a line for a "record" meets one of these conditions, I do not want any further lines for that "record" (to avoid getting more than one line per "record").

I've tried using awk to create an array using the first 5 fields and running a for loop but I'm making a bit of a hash of it (excuse the pun):

awk -F"\t" '{a[$1$2$3$4$5]=$0; for (i in a) if ($9~"/") print a[i]; else if ($9~/[A-Z]/) print a[i]; else if ($8~/[0-9]/) print a[i]}' file

This ends up printing duplicates multiple times. Is there a way to do this in awk?

edited Mar 16 at 10:29

Jeff Schaller

asked Mar 16 at 10:26

Pete C

could you explain more on [range from -50 to +50][A,C,T or G]? some have only one value, some multiple.. and ACTG matched immediately after number or anywhere in the field?
- Sundeep
Mar 16 at 10:46

post the expected result
- RomanPerekhrest
Mar 16 at 11:03

1 Answer

Perl one-liner here:

perl -F'\t' -lane '$r{$F[0].$F[1].$F[2].$F[3].$F[4]}=$_ if $F[8]=~/\// or $F[8]=~/[A-Z]/ or $F[7]=~/\d/ or $F[6]=~/\b(\d\d)[ACTG]/ and $1<=50; END{print $r{$_} for (keys %r)}' file

Comments:

I offer a Perl solution on the assumption that Perl is available on your system. If needed, it should be easy to rewrite in awk, given that the logic and syntax are very similar.

The conditions are based on your specification and your awk snippet. As already pointed out in the comments, at least one of them does not seem to match your sample input.

When several lines with the same key satisfy a condition, the last one found is printed; the conditions are not ranked against each other.

The records are printed in no particular order, because they come out in hash key order.
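For comparison, here is a minimal awk sketch of the ranked selection described in the question. Treat it as an illustration under assumptions rather than a definitive solution: the input is assumed to be tab-separated, lines that meet none of the four conditions are dropped, and the fourth condition is read as "a signed offset of at most 50 immediately followed by A, C, G or T", which the question leaves open. The function name pick_features and all variable names are made up for this example.

```shell
# Hypothetical helper: keep the best-ranked line per record (fields 1-5).
pick_features() {
  awk -F'\t' '
    # Rank the current line: 1 is the most preferred, 0 means no condition met.
    function rank(   off) {
      if (index($9, "/"))  return 1          # field 9 contains "/"
      if ($9 ~ /[A-Z]/)    return 2          # field 9 contains a capital letter
      if ($8 ~ /[0-9]/)    return 3          # field 8 contains a digit
      # Assumed reading of condition 4: first +N or -N directly before a base.
      if (match($7, /[+-][0-9]+[ACGT]/)) {
        off = substr($7, RSTART + 1, RLENGTH - 2) + 0   # digits only, sign dropped
        if (off <= 50) return 4
      }
      return 0
    }
    {
      r = rank()
      if (r == 0) next                        # assumption: unmatched lines are dropped
      key = $1 FS $2 FS $3 FS $4 FS $5        # FS-joined key avoids collisions
      if (!(key in best)) order[++n] = key    # remember first-seen order
      if (!(key in best) || r < best[key]) { best[key] = r; line[key] = $0 }
    }
    END { for (i = 1; i <= n; i++) print line[order[i]] }
  ' "$@"
}
```

It would be called as `pick_features file`. Unlike the one-liner above, this keeps the best-ranked line per key (the first one on ties) and prints records in the order their first qualifying line appears. Joining the key fields with FS rather than concatenating them avoids treating, say, fields "1" and "12" as the same key as "11" and "2".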

answered Apr 18 at 14:33

simlev
