Extracting a single line containing the highest value in a given column from a text file from each consecutively numbered sub-group/family

Within my text file, I would like to take the line containing the highest value present in column 3, from each consecutively numbered family (i.e. family_1, family_2 etc.) from column 2 and input these data into a new text file.

Input data:

TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1

Output data:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

I'm not sure whether to use grep or awk, and how to combine these into a single function.

edited Jan 4 at 5:10

P_Yadav

1,6603924

asked Jan 4 at 4:06

Alex

2

Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

– Sparhawk
Jan 4 at 4:10

Thanks! Apologies, have now replaced it!

– Alex
Jan 4 at 4:16

No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

– Sparhawk
Jan 4 at 4:17

Done! Cheers for the help (clearly new to this!!)

– Alex
Jan 4 at 4:21

grep is no good for this; awk is probably the perfect tool. If you search this site, you’ll find hundreds of questions very much like this. We encourage you to do that; find a working solution and try to adapt it to your problem. If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

– G-Man
Jan 4 at 4:25

add a comment |

Input data:

TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1

Output data:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

I'm not sure whether to use grep or awk, and how to combine these into a single function.

edited Jan 4 at 5:10

P_Yadav

1,6603924

asked Jan 4 at 4:06

Alex

2

Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

– Sparhawk
Jan 4 at 4:10

Thanks! Apologies, have now replaced it!

– Alex
Jan 4 at 4:16

No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

– Sparhawk
Jan 4 at 4:17

Done! Cheers for the help (clearly new to this!!)

– Alex
Jan 4 at 4:21

grep is no good for this; awk is probably the perfect tool. If you search this site, you’ll find hundreds of questions very much like this. We encourage you to do that; find a working solution and try to adapt it to your problem. If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

– G-Man
Jan 4 at 4:25

add a comment |

Input data:

TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1

Output data:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

I'm not sure whether to use grep or awk, and how to combine these into a single function.

edited Jan 4 at 5:10

P_Yadav

1,6603924

asked Jan 4 at 4:06

Alex

Input data:

TTGSCA family_1 18.123083 681 36349 1
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
CTYAAG family_2 16.95983 657 36170 1
.GCCAAR family_3 19.436863 698 35844 1
WGCCAA. family_3 19.99668 747 38506 1
.GCCAAS family_3 17.037859 599 31922 1
WGCCAA. family_3 19.99668 747 38506 1
CCACTK family_4 17.200712 776 44550 1
CCACTY family_4 18.86465 727 38616 1
MCACTT family_4 18.0871 737 40399 1
MCACTT family_4 18.0871 737 40399 1
YCACTT family_4 19.369513 804 43376 -1
CCAYTT family_4 16.193245 752 44296 1
CCAYTT family_4 16.193245 752 44296 1
SCACTT family_4 19.759317 687 34686 1

Output data:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

I'm not sure whether to use grep or awk, and how to combine these into a single function.

shell-script shell awk grep

edited Jan 4 at 5:10

P_Yadav

1,6603924

asked Jan 4 at 4:06

Alex

edited Jan 4 at 5:10

P_Yadav

1,6603924

asked Jan 4 at 4:06

Alex

edited Jan 4 at 5:10

P_Yadav

1,6603924

edited Jan 4 at 5:10

P_Yadav

1,6603924

edited Jan 4 at 5:10

P_Yadav

1,6603924

asked Jan 4 at 4:06

Alex

asked Jan 4 at 4:06

Alex

asked Jan 4 at 4:06

Alex

2

Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

– Sparhawk
Jan 4 at 4:10

Thanks! Apologies, have now replaced it!

– Alex
Jan 4 at 4:16

No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

– Sparhawk
Jan 4 at 4:17

Done! Cheers for the help (clearly new to this!!)

– Alex
Jan 4 at 4:21

grep is no good for this; awk is probably the perfect tool. If you search this site, you’ll find hundreds of questions very much like this. We encourage you to do that; find a working solution and try to adapt it to your problem. If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

– G-Man
Jan 4 at 4:25

add a comment |

2

Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

– Sparhawk
Jan 4 at 4:10

Thanks! Apologies, have now replaced it!

– Alex
Jan 4 at 4:16

No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

– Sparhawk
Jan 4 at 4:17

Done! Cheers for the help (clearly new to this!!)

– Alex
Jan 4 at 4:21

grep is no good for this; awk is probably the perfect tool. If you search this site, you’ll find hundreds of questions very much like this. We encourage you to do that; find a working solution and try to adapt it to your problem. If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

– G-Man
Jan 4 at 4:25

Please don't post images of text, as they are hard to read, will not allow us to use the data in our tests, use more bandwidth, and are no good for screen readers. Please edit your question and replace it with raw text. Welcome to U/L!

– Sparhawk
Jan 4 at 4:10

Thanks! Apologies, have now replaced it!

– Alex
Jan 4 at 4:16

No worries, thank you for the fix. I'm fairly sure I know what you want, but could you also please provide your expected output for your sample. Thanks!

– Sparhawk
Jan 4 at 4:17

Done! Cheers for the help (clearly new to this!!)

– Alex
Jan 4 at 4:21

grep is no good for this; awk is probably the perfect tool. If you search this site, you’ll find hundreds of questions very much like this. We encourage you to do that; find a working solution and try to adapt it to your problem. If you get stuck; edit your question to show what progress you made and what trouble you’re having. … … P.S. Since your data has “ties”, you should say how you want them broken.

– G-Man
Jan 4 at 4:25

add a comment |

3 Answers
3

active

oldest

votes

With GNU datamash (and a little help from cut):

$ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 4:38

steeldriver

35.5k35286

Thanks steeldriver! Datamash worked a treat.

– Alex
Jan 7 at 0:21

add a comment |

I think datamash is probably the best tool, but here is a sort-unique alternative:

<infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '

answered Jan 4 at 9:37

Thor

11.7k13359

add a comment |

Below is a cleaner way of getting the desired output than my previous answer. It does require sort to be used twice but it's a lot better than having sort, grep, and tail being used four times.

sort -k3r numbers | awk '!seen[$2]++' | sort -k2

Output:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 19:41

Nasir Riley

2,441249

add a comment |

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f492379%2fextracting-a-single-line-containing-the-highest-value-in-a-given-column-from-a-t%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

With GNU datamash (and a little help from cut):

$ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 4:38

steeldriver

35.5k35286

Thanks steeldriver! Datamash worked a treat.

– Alex
Jan 7 at 0:21

add a comment |

With GNU datamash (and a little help from cut):

$ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 4:38

steeldriver

35.5k35286

Thanks steeldriver! Datamash worked a treat.

– Alex
Jan 7 at 0:21

add a comment |

With GNU datamash (and a little help from cut):

$ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 4:38

steeldriver

35.5k35286

With GNU datamash (and a little help from cut):

$ datamash -Wf groupby 2 max 3 < file.txt | cut -f1-6
TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 4:38

steeldriver

35.5k35286

answered Jan 4 at 4:38

steeldriver

35.5k35286

answered Jan 4 at 4:38

steeldriver

35.5k35286

answered Jan 4 at 4:38

steeldriver

35.5k35286

Thanks steeldriver! Datamash worked a treat.

– Alex
Jan 7 at 0:21

add a comment |

Thanks steeldriver! Datamash worked a treat.

– Alex
Jan 7 at 0:21

Thanks steeldriver! Datamash worked a treat.

– Alex
Jan 7 at 0:21

add a comment |

I think datamash is probably the best tool, but here is a sort-unique alternative:

<infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '

answered Jan 4 at 9:37

Thor

11.7k13359

add a comment |

I think datamash is probably the best tool, but here is a sort-unique alternative:

<infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '

answered Jan 4 at 9:37

Thor

11.7k13359

add a comment |

I think datamash is probably the best tool, but here is a sort-unique alternative:

<infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '

answered Jan 4 at 9:37

Thor

11.7k13359

I think datamash is probably the best tool, but here is a sort-unique alternative:

<infile sort -k2,2V -k3,3n | awk 'NR==1 || $2!=p; p=$2 '

answered Jan 4 at 9:37

Thor

11.7k13359

answered Jan 4 at 9:37

Thor

11.7k13359

answered Jan 4 at 9:37

Thor

11.7k13359

answered Jan 4 at 9:37

Thor

11.7k13359

add a comment |

sort -k3r numbers | awk '!seen[$2]++' | sort -k2

Output:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 19:41

Nasir Riley

2,441249

add a comment |

sort -k3r numbers | awk '!seen[$2]++' | sort -k2

Output:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 19:41

Nasir Riley

2,441249

add a comment |

sort -k3r numbers | awk '!seen[$2]++' | sort -k2

Output:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 19:41

Nasir Riley

2,441249

sort -k3r numbers | awk '!seen[$2]++' | sort -k2

Output:

TTGSCA family_1 18.123083 681 36349 1
CTTRAG family_2 17.844843 685 37001 1
WGCCAA. family_3 19.99668 747 38506 1
SCACTT family_4 19.759317 687 34686 1

answered Jan 4 at 19:41

Nasir Riley

2,441249

answered Jan 4 at 19:41

Nasir Riley

2,441249

answered Jan 4 at 19:41

Nasir Riley

2,441249

answered Jan 4 at 19:41

Nasir Riley

2,441249

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Unix & Linux Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

搜尋此網誌

mjhjmtu