filter list of partial duplicates by condition(s)

I have a list of partially duplicated records. Each unique record is identified by its first 5 fields; however, each record has more than one "feature" associated with it, defined by the contents of the subsequent 4 fields. There is an "identifier" in the first field of each record, but an identifier can have more than one record associated with it. Example as follows:



A 1 122114 A T ABCD c.123A>T 41 K/Y
A 1 122114 A T EFGH c.456-7890T>A . .
B 7 56715 G C IJKL c.321+9876C>A . .
B 7 56715 G C MNOP c.543G>C 181 Q/L
B 7 56715 G C PONM c.-7324G>C . .
C 12 9844 T C QRST c.8392-68723T>C . .
C 12 3338745 T C UVWX c.599A>G 200 P/*
C 21 71120 C G YZAB c.35C>G 12 D
C 21 71120 C G CDEF c.-2345G>C . .
D 1 122114 A T ABCD c.123A>T 41 K/Y
D 1 122114 A T EFGH c.456-7890T>A . .
E 8 5094 A AT GHIJ c.678_679insT 226-227 .
E 8 5094 A AT KLMN c.-2356_-2357insT . .


I wish to filter the file down to one line for each "record", using a hierarchy of conditions to filter the "features", for example:



  • Field 9 contains "/", or else

  • Field 9 contains [A-Z], or else

  • Field 8 contains [digit], or else

  • Field 7 contains "[range from -50 to +50][A,C,T or G]"

Once a "record" meets one of these conditions, I do not want it tested any further (to avoid getting more than one line per "record").



I've tried using awk to create an array using the first 5 fields and running a for loop but I'm making a bit of a hash of it (excuse the pun):



awk -F"\t" '{a[$1$2$3$4$5]=$0;for (i in a) if ($9~"/") print a[i]; else if ($9~/[A-Z]/) print a[i]; else if ($8~/[0-9]/) print a[i]}' file


This ends up printing duplicates multiple times. Is there a way to do this in awk?







  • Could you explain more about "[range from -50 to +50][A,C,T or G]"? Some fields have only one value, some have several. And must A, C, T or G come immediately after the number, or may it appear anywhere in the field?
    – Sundeep, Mar 16 at 10:46

  • Please post the expected result.
    – RomanPerekhrest, Mar 16 at 11:03

asked Mar 16 at 10:26 by Pete C
edited Mar 16 at 10:29 by Jeff Schaller

1 Answer

Perl one-liner here:



perl -F'\t' -lane '$r{$F[0].$F[1].$F[2].$F[3].$F[4]}=$_ if $F[8]=~/\// or $F[8]=~/[A-Z]/ or $F[7]=~/\d/ or $F[6]=~/\b(\d\d)[ACTG]/ and $1<=50; END{print $r{$_} for (keys %r)}' file


Comments:



A Perl solution is offered on the assumption that Perl is available on your system. If needed, it should be easy to rewrite in awk, given that the logic and syntax are very similar.
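
As a rough illustration of such a rewrite, here is a minimal awk sketch wrapped in a shell function. The function name `last_match_per_key` and the helper `near` are hypothetical, and the fourth test copies the one-liner's reading of the rule (a standalone two-digit number of at most 50, directly followed by A, C, G or T), which may not be what the question intends. Like the Perl version, it keeps the last matching line for each 5-field key and prints the keys in no particular order.

```shell
# Hypothetical helper: awk port of the Perl one-liner.
# Reads tab-separated input from the named files, or from stdin.
last_match_per_key() {
    awk -F'\t' '
        # True if s holds a standalone two-digit number <= 50 directly
        # followed by A, C, G or T. The padding space lets the pattern
        # require a non-word character before the digits at any position.
        function near(s,   p) {
            p = " " s
            if (!match(p, /[^A-Za-z0-9_][0-9][0-9][ACGT]/)) return 0
            return ((substr(p, RSTART + 1, 2) + 0) <= 50)
        }
        # Store the line under its 5-field key; later matches overwrite.
        index($9, "/") || $9 ~ /[A-Z]/ || $8 ~ /[0-9]/ || near($7) {
            r[$1 FS $2 FS $3 FS $4 FS $5] = $0
        }
        END { for (k in r) print r[k] }
    ' "$@"
}
```

It would be called as `last_match_per_key file`. Joining the key fields with `FS` rather than concatenating them directly avoids accidental collisions such as `1` + `12` versus `11` + `2`.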



The conditions are based on your specification and your awk snippet. As already pointed out in the comments, at least one of them does not seem to match your sample input.



For each key, the last line that satisfies any of the conditions is the one printed.



The records are printed in no particular order, because Perl does not guarantee the order of hash keys.
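
If the hierarchy and the input order both matter, the following awk sketch is one possible variation rather than a definitive solution. It ranks every line (1 is best), keeps the best-ranked line per record, and prints records in the order they were first matched. The function name `best_per_record` is hypothetical. The fourth rule is an assumed reading of the ambiguous "[range from -50 to +50]" condition: the first signed offset in field 7 that is directly followed by A, C, G or T must be at most 50. Lines matching no rule are dropped, and on a tie the first line seen wins.

```shell
# Hypothetical helper: one line per record, chosen by rule priority.
# Reads tab-separated input from the named files, or from stdin.
best_per_record() {
    awk -F'\t' '
        # Rank the current line: 1 is best, 0 means no rule matched.
        function rank(   off) {
            if (index($9, "/")) return 1
            if ($9 ~ /[A-Z]/)   return 2
            if ($8 ~ /[0-9]/)   return 3
            # Assumed rule 4: signed offset of at most 50, then a base.
            if (match($7, /[-+][0-9]+[ACGT]/)) {
                off = substr($7, RSTART + 1, RLENGTH - 2) + 0
                if (off <= 50) return 4
            }
            return 0
        }
        {
            r = rank()
            if (r == 0) next
            key = $1 FS $2 FS $3 FS $4 FS $5
            if (!(key in best)) order[++n] = key   # remember first-seen order
            if (!(key in best) || r < best[key]) { best[key] = r; line[key] = $0 }
        }
        END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$@"
}
```

It would be called as `best_per_record file`. Storing the rank alongside the line is what prevents a weaker match that appears later from replacing a stronger one.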






answered Apr 18 at 14:33 by simlev