To get average of columns for all the entries whose first column entries are duplicate

I have a file which looks like this:

30901 foo1 bar1 89
30902 foo2 bar2 51
30903 foo3 bar3 101
30903 foox bary 12
30903 fooz bara 23
30903 foob barc 62
30904 fooe barf 59
...
...

I want my output to have output like this:

49.5 being the average of fourth column, for all four duplicate entries for id: 30903 in column 1.

All I could do was to get number of unique entries using
cat <logfile> | awk 'print $1' | uniq -c

to get the count of duplicate entries based on first column (id)

edited Jan 30 at 15:41

Archemar

20.2k93772

asked Jan 30 at 15:17

Mr_S

1

Closely related: Awk to average column based on key in 2nd column

– steeldriver
Jan 30 at 15:25

add a comment |

I have a file which looks like this:

30901 foo1 bar1 89
30902 foo2 bar2 51
30903 foo3 bar3 101
30903 foox bary 12
30903 fooz bara 23
30903 foob barc 62
30904 fooe barf 59
...
...

I want my output to have output like this:

49.5 being the average of fourth column, for all four duplicate entries for id: 30903 in column 1.

All I could do was to get number of unique entries using
cat <logfile> | awk 'print $1' | uniq -c

to get the count of duplicate entries based on first column (id)

edited Jan 30 at 15:41

Archemar

20.2k93772

asked Jan 30 at 15:17

Mr_S

1

Closely related: Awk to average column based on key in 2nd column

– steeldriver
Jan 30 at 15:25

add a comment |

I have a file which looks like this:

30901 foo1 bar1 89
30902 foo2 bar2 51
30903 foo3 bar3 101
30903 foox bary 12
30903 fooz bara 23
30903 foob barc 62
30904 fooe barf 59
...
...

I want my output to have output like this:

49.5 being the average of fourth column, for all four duplicate entries for id: 30903 in column 1.

All I could do was to get number of unique entries using
cat <logfile> | awk 'print $1' | uniq -c

to get the count of duplicate entries based on first column (id)

edited Jan 30 at 15:41

Archemar

20.2k93772

asked Jan 30 at 15:17

Mr_S

I have a file which looks like this:

30901 foo1 bar1 89
30902 foo2 bar2 51
30903 foo3 bar3 101
30903 foox bary 12
30903 fooz bara 23
30903 foob barc 62
30904 fooe barf 59
...
...

I want my output to have output like this:

49.5 being the average of fourth column, for all four duplicate entries for id: 30903 in column 1.

All I could do was to get number of unique entries using
cat <logfile> | awk 'print $1' | uniq -c

to get the count of duplicate entries based on first column (id)

text-processing awk

edited Jan 30 at 15:41

Archemar

20.2k93772

asked Jan 30 at 15:17

Mr_S

edited Jan 30 at 15:41

Archemar

20.2k93772

asked Jan 30 at 15:17

Mr_S

edited Jan 30 at 15:41

Archemar

20.2k93772

edited Jan 30 at 15:41

Archemar

20.2k93772

edited Jan 30 at 15:41

Archemar

20.2k93772

asked Jan 30 at 15:17

Mr_S

asked Jan 30 at 15:17

Mr_S

asked Jan 30 at 15:17

Mr_S

1

Closely related: Awk to average column based on key in 2nd column

– steeldriver
Jan 30 at 15:25

add a comment |

1

Closely related: Awk to average column based on key in 2nd column

– steeldriver
Jan 30 at 15:25

Closely related: Awk to average column based on key in 2nd column

– steeldriver
Jan 30 at 15:25

add a comment |

2 Answers
2

active

oldest

votes

try

 awk 'c[$1]++ ; t[$1]+=$4 
 END for (cc in c ) if ( c[cc]>1 ) printf "%s : %.1fn",cc,t[cc]/c[cc];' file.log

which give

30903 : 49.5

to get your output, remove if (c[cc]>1) part.

the script basically count c[$1]++ and add t[$1]+=$4 value.

c[x] is for count(x), that is $1 : first field

t[x] is for total(x), that is sum of total value for 4'th field

at the end, we loop over all values, computing and printing average.

edited Jan 30 at 20:18

answered Jan 30 at 15:40

Archemar

20.2k93772

More descriptive variable names would be helpful.

– glenn jackman
Jan 30 at 18:31

Thank you it works fine. This is exactly what I wanted.

– Mr_S
Feb 1 at 7:06

@glennjackman Descriptive names in a throwaway twp-liner in Awk? What for. It's a bit ironic that the loop dummy cc has a longer name than the c and t globals, though.

– Kaz
Feb 2 at 19:56

add a comment |

With Miller (http://johnkerl.org/miller/doc) you can run

mlr --nidx stats1 -a mean -f 4 -g 1 inputFile

and have

30901 89.000000
30902 51.000000
30903 49.500000
30904 59.000000

-f 4 to set the Value-field names on which to compute statistics (the fourth field);

-g 1 to set group-by-field names (the first field).

answered Jan 31 at 9:07

aborruso

22829

Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora.

– Mr_S
Feb 1 at 7:41

@Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) github.com/johnkerl/miller/releases/tag/5.4.0

– aborruso
Feb 1 at 8:21

add a comment |

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f497716%2fto-get-average-of-columns-for-all-the-entries-whose-first-column-entries-are-dup%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

try

 awk 'c[$1]++ ; t[$1]+=$4 
 END for (cc in c ) if ( c[cc]>1 ) printf "%s : %.1fn",cc,t[cc]/c[cc];' file.log

which give

30903 : 49.5

to get your output, remove if (c[cc]>1) part.

the script basically count c[$1]++ and add t[$1]+=$4 value.

c[x] is for count(x), that is $1 : first field

t[x] is for total(x), that is sum of total value for 4'th field

at the end, we loop over all values, computing and printing average.

edited Jan 30 at 20:18

answered Jan 30 at 15:40

Archemar

20.2k93772

More descriptive variable names would be helpful.

– glenn jackman
Jan 30 at 18:31

Thank you it works fine. This is exactly what I wanted.

– Mr_S
Feb 1 at 7:06

@glennjackman Descriptive names in a throwaway twp-liner in Awk? What for. It's a bit ironic that the loop dummy cc has a longer name than the c and t globals, though.

– Kaz
Feb 2 at 19:56

add a comment |

try

 awk 'c[$1]++ ; t[$1]+=$4 
 END for (cc in c ) if ( c[cc]>1 ) printf "%s : %.1fn",cc,t[cc]/c[cc];' file.log

which give

30903 : 49.5

to get your output, remove if (c[cc]>1) part.

the script basically count c[$1]++ and add t[$1]+=$4 value.

c[x] is for count(x), that is $1 : first field

t[x] is for total(x), that is sum of total value for 4'th field

at the end, we loop over all values, computing and printing average.

edited Jan 30 at 20:18

answered Jan 30 at 15:40

Archemar

20.2k93772

More descriptive variable names would be helpful.

– glenn jackman
Jan 30 at 18:31

Thank you it works fine. This is exactly what I wanted.

– Mr_S
Feb 1 at 7:06

@glennjackman Descriptive names in a throwaway twp-liner in Awk? What for. It's a bit ironic that the loop dummy cc has a longer name than the c and t globals, though.

– Kaz
Feb 2 at 19:56

add a comment |

try

 awk 'c[$1]++ ; t[$1]+=$4 
 END for (cc in c ) if ( c[cc]>1 ) printf "%s : %.1fn",cc,t[cc]/c[cc];' file.log

which give

30903 : 49.5

to get your output, remove if (c[cc]>1) part.

the script basically count c[$1]++ and add t[$1]+=$4 value.

c[x] is for count(x), that is $1 : first field

t[x] is for total(x), that is sum of total value for 4'th field

at the end, we loop over all values, computing and printing average.

edited Jan 30 at 20:18

answered Jan 30 at 15:40

Archemar

20.2k93772

try

 awk 'c[$1]++ ; t[$1]+=$4 
 END for (cc in c ) if ( c[cc]>1 ) printf "%s : %.1fn",cc,t[cc]/c[cc];' file.log

which give

30903 : 49.5

to get your output, remove if (c[cc]>1) part.

the script basically count c[$1]++ and add t[$1]+=$4 value.

c[x] is for count(x), that is $1 : first field

t[x] is for total(x), that is sum of total value for 4'th field

at the end, we loop over all values, computing and printing average.

edited Jan 30 at 20:18

answered Jan 30 at 15:40

Archemar

20.2k93772

edited Jan 30 at 20:18

answered Jan 30 at 15:40

Archemar

20.2k93772

answered Jan 30 at 15:40

Archemar

20.2k93772

answered Jan 30 at 15:40

Archemar

20.2k93772

More descriptive variable names would be helpful.

– glenn jackman
Jan 30 at 18:31

Thank you it works fine. This is exactly what I wanted.

– Mr_S
Feb 1 at 7:06

@glennjackman Descriptive names in a throwaway twp-liner in Awk? What for. It's a bit ironic that the loop dummy cc has a longer name than the c and t globals, though.

– Kaz
Feb 2 at 19:56

add a comment |

More descriptive variable names would be helpful.

– glenn jackman
Jan 30 at 18:31

Thank you it works fine. This is exactly what I wanted.

– Mr_S
Feb 1 at 7:06

@glennjackman Descriptive names in a throwaway twp-liner in Awk? What for. It's a bit ironic that the loop dummy cc has a longer name than the c and t globals, though.

– Kaz
Feb 2 at 19:56

More descriptive variable names would be helpful.

– glenn jackman
Jan 30 at 18:31

Thank you it works fine. This is exactly what I wanted.

– Mr_S
Feb 1 at 7:06

@glennjackman Descriptive names in a throwaway twp-liner in Awk? What for. It's a bit ironic that the loop dummy cc has a longer name than the c and t globals, though.

– Kaz
Feb 2 at 19:56

add a comment |

With Miller (http://johnkerl.org/miller/doc) you can run

mlr --nidx stats1 -a mean -f 4 -g 1 inputFile

and have

30901 89.000000
30902 51.000000
30903 49.500000
30904 59.000000

-f 4 to set the Value-field names on which to compute statistics (the fourth field);

-g 1 to set group-by-field names (the first field).

answered Jan 31 at 9:07

aborruso

22829

Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora.

– Mr_S
Feb 1 at 7:41

@Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) github.com/johnkerl/miller/releases/tag/5.4.0

– aborruso
Feb 1 at 8:21

add a comment |

With Miller (http://johnkerl.org/miller/doc) you can run

mlr --nidx stats1 -a mean -f 4 -g 1 inputFile

and have

30901 89.000000
30902 51.000000
30903 49.500000
30904 59.000000

-f 4 to set the Value-field names on which to compute statistics (the fourth field);

-g 1 to set group-by-field names (the first field).

answered Jan 31 at 9:07

aborruso

22829

Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora.

– Mr_S
Feb 1 at 7:41

@Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) github.com/johnkerl/miller/releases/tag/5.4.0

– aborruso
Feb 1 at 8:21

add a comment |

With Miller (http://johnkerl.org/miller/doc) you can run

mlr --nidx stats1 -a mean -f 4 -g 1 inputFile

and have

30901 89.000000
30902 51.000000
30903 49.500000
30904 59.000000

-f 4 to set the Value-field names on which to compute statistics (the fourth field);

-g 1 to set group-by-field names (the first field).

answered Jan 31 at 9:07

aborruso

22829

With Miller (http://johnkerl.org/miller/doc) you can run

mlr --nidx stats1 -a mean -f 4 -g 1 inputFile

and have

30901 89.000000
30902 51.000000
30903 49.500000
30904 59.000000

-f 4 to set the Value-field names on which to compute statistics (the fourth field);

-g 1 to set group-by-field names (the first field).

answered Jan 31 at 9:07

aborruso

22829

answered Jan 31 at 9:07

aborruso

22829

answered Jan 31 at 9:07

aborruso

22829

answered Jan 31 at 9:07

aborruso

22829

Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora.

– Mr_S
Feb 1 at 7:41

@Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) github.com/johnkerl/miller/releases/tag/5.4.0

– aborruso
Feb 1 at 8:21

add a comment |

Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora.

– Mr_S
Feb 1 at 7:41

@Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) github.com/johnkerl/miller/releases/tag/5.4.0

– aborruso
Feb 1 at 8:21

Thanks for the alternate solution. I cannot confirm it because I am using Fedora OS and I couldn't find Miller for fedora.

– Mr_S
Feb 1 at 7:41

@Mr_S here you have the generic linux binary (mlr.linux.x86_64 ) github.com/johnkerl/miller/releases/tag/5.4.0

– aborruso
Feb 1 at 8:21

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Unix & Linux Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

搜尋此網誌

mjhjmtu