awk script to rearrange similar rows
I want to rearrange about 5 million rows (with 300 columns) into groups.
The data looks like the following: various experiments (column 2) were conducted at different locations (column headers in the top row, column 4 onwards) in different years (column 1) using different instruments (column 3). The numbers in the matrix (row 2 onwards, column 4 onwards) indicate how many instances of each experiment were successful.
What I want is to rearrange the rows.
Input
345 346 347 348 349 350 351 352
2014 Exp1 IBM 24 45 22
2014 Exp2 LEN 23 32 34
2014 Exp3 LEN 2 34 34
2014 Exp4 IBM 34 44 43
2014 Exp5 IBM 2 45 51 45
2014 Exp6 IBM 34 23 54
2014 Exp7 IBM 23 23 24
2014 Exp8 IBM 34 45 56
2014 Exp9 LEN 24 45 45
2014 Exp10 LEN 43 45 32
2015 Exp11 IBM 34 55 33 34
2015 Exp12 IBM 1 33 4 5
2015 Exp13 IBM 43 55 34 43
2015 Exp14 IBM 45 32 43 4
2015 Exp15 IBM 23 4 5
2015 Exp16 IBM 32 34 43
2015 Exp17 IBM 32 34 46
2015 Exp18 LEN 32 54 67
2015 Exp19 SCL 56 6 4 45 56
2015 Exp20 LEN 67 56 76
2015 Exp21 LEN 45 56 65
2015 Exp22 SCL 45 55 54
2015 Exp23 SCL 4 55 45
What I would like are the rows rearranged into groups such that:
1) they are within the same year
2) they use the same instrument
and each group has at least 3 locations in common, each of which has at least 20 successful experiments.
Requested Output
345 346 347 348 349 350 351 352
1 2014 Exp1 IBM 24 45 22
1 2014 Exp4 IBM 34 44 43
1 2014 Exp7 IBM 23 23 24
2 2014 Exp2 LEN 23 32 34
2 2014 Exp9 LEN 24 45 45
2 2014 Exp10 LEN 43 45 32
3 2014 Exp5 IBM 2 45 51 45
3 2014 Exp6 IBM 34 23 54
3 2014 Exp8 IBM 34 45 56
4 2015 Exp11 IBM 34 55 33 34
4 2015 Exp13 IBM 43 55 34 43
4 2015 Exp14 IBM 45 32 43 4
5 2015 Exp16 IBM 32 34 43
5 2015 Exp17 IBM 32 34 46
6 2015 Exp18 LEN 32 54 67
6 2015 Exp20 LEN 67 56 76
6 2015 Exp21 LEN 45 56 65
7 2015 Exp19 SCL 56 6 4 45 56
7 2015 Exp22 SCL 45 55 54
2014 Exp3 LEN 2 34 34
2015 Exp12 IBM 1 33 4 5
2015 Exp15 IBM 23 4 5
2015 Exp23 SCL 4 55 45
Here is what I tried.
awk ' NR>1 for (i=4;i<=NF;i++) if ($i!="") arr1[$1,$2,$3]=$i ; next
$1,$2,$3 in arr1 {
for (j=1;j<length(arr1);j++))
{if (arr1[j] > 20)
group++;
END
for (j in n)
print group, arr1[j]
' input input
Tags: bash, awk, perl
As far as I can see, to group the output correctly would require some form of clustering. This is non-trivial, and there would be several "correct" solutions in terms of how the rows were ordered.
– Kusalananda♦
Mar 9 at 18:16
edited Mar 9 at 12:24 by Rui F Ribeiro
asked Mar 30 '15 at 0:45 by Sheetal Kaul
1 Answer
A few random hints, depending on the actual data formatting and other issues...
How are the data fields separated? (The first three spacings give the impression that there's a TAB character in between, while the last columns seem space-separated.) Be aware that with the default field separator the column information is lost for columns 4-N, so the logic of your code is seriously flawed.
If you have no TAB separators but only blanks, you can use GNU awk's FIELDWIDTHS feature to access the data (including the missing "blank" data, as you seem to be trying to achieve).
If you have TABs for the first three separators and blanks for the rest, you should explicitly define FS="\t", so that you can directly work on fields 1-3 and have the spacing intact in the final data (which you can address as a whole as field 4); that will make it easy to find "blank data".
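A minimal illustration of that layout, using a hypothetical row (tabs delimit the first three fields, the numbers after the third tab are space-separated):

```shell
# With FS="\t", everything after the third tab arrives as one field
# ($4), with its internal spacing left intact.
printf '2014\tExp1\tIBM\t24  45  22\n' |
awk -F'\t' '{ print "instrument=" $3, "data=[" $4 "]" }'
# → instrument=IBM data=[24  45  22]
```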
It may further ease processing if you create subsets of your data on the fly, operate on those, and concatenate the individual subsets afterwards. To separate the data into files depending on, say, year and instrument you can write:
awk '{ print > ("set_" $1 "_" $3) }' input
and it will create files named, e.g., set_2015_LEN or set_2014_IBM, containing the respective entries.
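One practical caveat with that split: awk keeps every output file open, so with many distinct year/instrument pairs you can hit the per-process open-file limit. A variant (sketch only) that closes each file after writing avoids this:

```shell
# NR > 1 skips the header row; ">>" appends, because close() would
# otherwise make a plain ">" truncate the file on every reopen.
awk 'NR > 1 { f = "set_" $1 "_" $3; print >> f; close(f) }' input
```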
The final task, to identify "matching numeric column sets", depends on the previously mentioned topics; if, for example, the final eight data columns can be addressed as one fixed-length entity, it might suffice to use the sort utility with an appropriately defined key specification (see sort's option -k).
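For illustration, a simple key specification on this data (not yet the full matching, just ordering by year and instrument while keeping the header first):

```shell
# -k1,1 sorts on field 1 (year), -k3,3 breaks ties on field 3
# (instrument); the header line is kept out of the sort.
{ head -n 1 input; tail -n +2 input | sort -k1,1 -k3,3; } > sorted
```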
(BTW: for a compound index test, instead of $1,$2,$3 in arr1 you have to write ($1,$2,$3) in arr1.)
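A quick demonstration of that compound index test:

```shell
# The parentheses are required: (a, b, c) in arr tests for the
# SUBSEP-joined key; without them awk parses the commas differently.
awk 'BEGIN {
    arr["2014", "Exp1", "IBM"] = 1
    if (("2014", "Exp1", "IBM") in arr) print "found"
}'
# → found
```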
Thank you for your suggestions. The data is tab-delimited; I couldn't seem to line it up in this post, so I introduced some manual spaces. The actual data has 300 columns. What I am thinking is replacing the >20 values by 1 and the others by blank, so that I can treat it as a fixed-length entity. I checked out the sort -k option, but I couldn't follow how that would apply here, especially for so many columns; would you give me a small example? Thanks again for your suggestion on splitting the data set.
– Sheetal Kaul
Mar 30 '15 at 2:34
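That signature idea from the comment could be sketched like this (assuming tab-separated input; the threshold comparison, here >= 20 per the question's "at least 20", is easily adjusted):

```shell
# Emit year, instrument, and a fixed-length 0/1 mask recording which
# locations reached the threshold; rows with identical year,
# instrument, and mask can then be grouped, e.g. with sort.
awk -F'\t' 'NR > 1 {
    sig = ""
    for (i = 4; i <= NF; i++)
        sig = sig (($i != "" && $i + 0 >= 20) ? "1" : "0")
    print $1, $3, sig
}' input
```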
answered Mar 30 '15 at 1:24 by Janis