Extracting a single line containing the highest value in a given column from a text file from each consecutively numbered sub-group/family

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP












1















Within my text file, I would like to take the line containing the highest value present in column 3, from each consecutively numbered family (i.e. family_1, family_2 etc.) from column 2 and input these data into a new text file.



Input data:



TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1


Output data:



TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1


I'm not sure whether to use grep or awk, and how to combine these into a single function.










share|improve this question



















  • 2





    Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

    – Sparhawk
    Jan 4 at 4:10











  • Thanks! Apologies, have now replaced it!

    – Alex
    Jan 4 at 4:16











  • No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

    – Sparhawk
    Jan 4 at 4:17











  • Done! Cheers for the help (clearly new to this!!)

    – Alex
    Jan 4 at 4:21











  • grep is no good for this; awk is probably the perfect tool.  If you search this site, you’ll find hundreds of questions very much like this.  We encourage you to do that; find a working solution and try to adapt it to your problem.  If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

    – G-Man
    Jan 4 at 4:25















1















Within my text file, I would like to take the line containing the highest value present in column 3, from each consecutively numbered family (i.e. family_1, family_2 etc.) from column 2 and input these data into a new text file.



Input data:



TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1


Output data:



TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1


I'm not sure whether to use grep or awk, and how to combine these into a single function.










share|improve this question



















  • 2





    Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

    – Sparhawk
    Jan 4 at 4:10











  • Thanks! Apologies, have now replaced it!

    – Alex
    Jan 4 at 4:16











  • No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

    – Sparhawk
    Jan 4 at 4:17











  • Done! Cheers for the help (clearly new to this!!)

    – Alex
    Jan 4 at 4:21











  • grep is no good for this; awk is probably the perfect tool.  If you search this site, you’ll find hundreds of questions very much like this.  We encourage you to do that; find a working solution and try to adapt it to your problem.  If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

    – G-Man
    Jan 4 at 4:25













1












1








1








Within my text file, I would like to take the line containing the highest value present in column 3, from each consecutively numbered family (i.e. family_1, family_2 etc.) from column 2 and input these data into a new text file.



Input data:



TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1


Output data:



TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1


I'm not sure whether to use grep or awk, and how to combine these into a single function.










share|improve this question
















Within my text file, I would like to take the line containing the highest value present in column 3, from each consecutively numbered family (i.e. family_1, family_2 etc.) from column 2 and input these data into a new text file.



Input data:



TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1


Output data:



TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1


I'm not sure whether to use grep or awk, and how to combine these into a single function.







shell-script shell awk grep






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Jan 4 at 5:10









P_Yadav

1,6603924




1,6603924










asked Jan 4 at 4:06









AlexAlex

84




84







  • 2





    Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

    – Sparhawk
    Jan 4 at 4:10











  • Thanks! Apologies, have now replaced it!

    – Alex
    Jan 4 at 4:16











  • No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

    – Sparhawk
    Jan 4 at 4:17











  • Done! Cheers for the help (clearly new to this!!)

    – Alex
    Jan 4 at 4:21











  • grep is no good for this; awk is probably the perfect tool.  If you search this site, you’ll find hundreds of questions very much like this.  We encourage you to do that; find a working solution and try to adapt it to your problem.  If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

    – G-Man
    Jan 4 at 4:25












  • 2





    Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

    – Sparhawk
    Jan 4 at 4:10











  • Thanks! Apologies, have now replaced it!

    – Alex
    Jan 4 at 4:16











  • No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

    – Sparhawk
    Jan 4 at 4:17











  • Done! Cheers for the help (clearly new to this!!)

    – Alex
    Jan 4 at 4:21











  • grep is no good for this; awk is probably the perfect tool.  If you search this site, you’ll find hundreds of questions very much like this.  We encourage you to do that; find a working solution and try to adapt it to your problem.  If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

    – G-Man
    Jan 4 at 4:25







2




2





Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

– Sparhawk
Jan 4 at 4:10





Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

– Sparhawk
Jan 4 at 4:10













Thanks! Apologies, have now replaced it!

– Alex
Jan 4 at 4:16





Thanks! Apologies, have now replaced it!

– Alex
Jan 4 at 4:16













No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

– Sparhawk
Jan 4 at 4:17





No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

– Sparhawk
Jan 4 at 4:17













Done! Cheers for the help (clearly new to this!!)

– Alex
Jan 4 at 4:21





Done! Cheers for the help (clearly new to this!!)

– Alex
Jan 4 at 4:21













grep is no good for this; awk is probably the perfect tool.  If you search this site, you’ll find hundreds of questions very much like this.  We encourage you to do that; find a working solution and try to adapt it to your problem.  If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

– G-Man
Jan 4 at 4:25





grep is no good for this; awk is probably the perfect tool.  If you search this site, you’ll find hundreds of questions very much like this.  We encourage you to do that; find a working solution and try to adapt it to your problem.  If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

– G-Man
Jan 4 at 4:25










3 Answers
3






active

oldest

votes


















1














With GNU datamash (and a little help from cut):



$ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1





share|improve this answer























  • Thanks steeldriver! Datamash worked a treat.

    – Alex
    Jan 7 at 0:21


















0














I think datamash is probably the best tool, but here is a sort-unique alternative:



<infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '





share|improve this answer






























    0














    Below is a cleaner way of getting the desired output than my previous answer. It does require sort to be used twice but it's a lot better than having sort, grep, and tail being used four times.



    sort -k3r numbers | awk '!seen[$2]++' | sort -k2


    Output:



    TTGSCA family_1 18.123083 681 36349 1
    CTTRAG family_2 17.844843 685 37001 1
    WGCCAA. family_3 19.99668 747 38506 1
    SCACTT family_4 19.759317 687 34686 1





    share|improve this answer






















      Your Answer








      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "106"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f492379%2fextracting-a-single-line-containing-the-highest-value-in-a-given-column-from-a-t%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      1














      With GNU datamash (and a little help from cut):



      $ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
      TTGSCA family_1 18.123083 681 36349 1
      CTTRAG family_2 17.844843 685 37001 1
      WGCCAA. family_3 19.99668 747 38506 1
      SCACTT family_4 19.759317 687 34686 1





      share|improve this answer























      • Thanks steeldriver! Datamash worked a treat.

        – Alex
        Jan 7 at 0:21















      1














      With GNU datamash (and a little help from cut):



      $ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
      TTGSCA family_1 18.123083 681 36349 1
      CTTRAG family_2 17.844843 685 37001 1
      WGCCAA. family_3 19.99668 747 38506 1
      SCACTT family_4 19.759317 687 34686 1





      share|improve this answer























      • Thanks steeldriver! Datamash worked a treat.

        – Alex
        Jan 7 at 0:21













      1












      1








      1







      With GNU datamash (and a little help from cut):



      $ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
      TTGSCA family_1 18.123083 681 36349 1
      CTTRAG family_2 17.844843 685 37001 1
      WGCCAA. family_3 19.99668 747 38506 1
      SCACTT family_4 19.759317 687 34686 1





      share|improve this answer













      With GNU datamash (and a little help from cut):



      $ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
      TTGSCA family_1 18.123083 681 36349 1
      CTTRAG family_2 17.844843 685 37001 1
      WGCCAA. family_3 19.99668 747 38506 1
      SCACTT family_4 19.759317 687 34686 1






      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Jan 4 at 4:38









      steeldriversteeldriver

      35.5k35286




      35.5k35286












      • Thanks steeldriver! Datamash worked a treat.

        – Alex
        Jan 7 at 0:21

















      • Thanks steeldriver! Datamash worked a treat.

        – Alex
        Jan 7 at 0:21
















      Thanks steeldriver! Datamash worked a treat.

      – Alex
      Jan 7 at 0:21





      Thanks steeldriver! Datamash worked a treat.

      – Alex
      Jan 7 at 0:21













      0














      I think datamash is probably the best tool, but here is a sort-unique alternative:



      <infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '





      share|improve this answer



























        0














        I think datamash is probably the best tool, but here is a sort-unique alternative:



        <infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '





        share|improve this answer

























          0












          0








          0







          I think datamash is probably the best tool, but here is a sort-unique alternative:



          <infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '





          share|improve this answer













          I think datamash is probably the best tool, but here is a sort-unique alternative:



          <infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Jan 4 at 9:37









          ThorThor

          11.7k13359




          11.7k13359





















              0














              Below is a cleaner way of getting the desired output than my previous answer. It does require sort to be used twice but it's a lot better than having sort, grep, and tail being used four times.



              sort -k3r numbers | awk '!seen[$2]++' | sort -k2


              Output:



              TTGSCA family_1 18.123083 681 36349 1
              CTTRAG family_2 17.844843 685 37001 1
              WGCCAA. family_3 19.99668 747 38506 1
              SCACTT family_4 19.759317 687 34686 1





              share|improve this answer



























                0














                Below is a cleaner way of getting the desired output than my previous answer. It does require sort to be used twice but it's a lot better than having sort, grep, and tail being used four times.



                sort -k3r numbers | awk '!seen[$2]++' | sort -k2


                Output:



                TTGSCA family_1 18.123083 681 36349 1
                CTTRAG family_2 17.844843 685 37001 1
                WGCCAA. family_3 19.99668 747 38506 1
                SCACTT family_4 19.759317 687 34686 1





                share|improve this answer

























                  0












                  0








                  0







                  Below is a cleaner way of getting the desired output than my previous answer. It does require sort to be used twice but it's a lot better than having sort, grep, and tail being used four times.



                  sort -k3r numbers | awk '!seen[$2]++' | sort -k2


                  Output:



                  TTGSCA family_1 18.123083 681 36349 1
                  CTTRAG family_2 17.844843 685 37001 1
                  WGCCAA. family_3 19.99668 747 38506 1
                  SCACTT family_4 19.759317 687 34686 1





                  share|improve this answer













                  Below is a cleaner way of getting the desired output than my previous answer. It does require sort to be used twice but it's a lot better than having sort, grep, and tail being used four times.



                  sort -k3r numbers | awk '!seen[$2]++' | sort -k2


                  Output:



                  TTGSCA family_1 18.123083 681 36349 1
                  CTTRAG family_2 17.844843 685 37001 1
                  WGCCAA. family_3 19.99668 747 38506 1
                  SCACTT family_4 19.759317 687 34686 1






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Jan 4 at 19:41









                  Nasir RileyNasir Riley

                  2,441249




                  2,441249



























                      draft saved

                      draft discarded
















































                      Thanks for contributing an answer to Unix & Linux Stack Exchange!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid


                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.

                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f492379%2fextracting-a-single-line-containing-the-highest-value-in-a-given-column-from-a-t%23new-answer', 'question_page');

                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown






                      Popular posts from this blog

                      How to check contact read email or not when send email to Individual?

                      How many registers does an x86_64 CPU actually have?

                      Nur Jahan