awk script to rearrange similar rows
I want to rearrange about 5 million rows (with 300 columns) into groups.
The data looks like the following: various experiments (column 2) were conducted at different locations (column headers in the top row, column 4 onwards) in different years (column 1) using different instruments (column 3). The numbers in the matrix (row 2 onwards, column 4 onwards) indicate how many instances of each experiment were successful.
What I want is to rearrange the rows.
Input
345 346 347 348 349 350 351 352
2014 Exp1 IBM 24 45 22
2014 Exp2 LEN 23 32 34
2014 Exp3 LEN 2 34 34
2014 Exp4 IBM 34 44 43
2014 Exp5 IBM 2 45 51 45
2014 Exp6 IBM 34 23 54
2014 Exp7 IBM 23 23 24
2014 Exp8 IBM 34 45 56
2014 Exp9 LEN 24 45 45
2014 Exp10 LEN 43 45 32
2015 Exp11 IBM 34 55 33 34
2015 Exp12 IBM 1 33 4 5
2015 Exp13 IBM 43 55 34 43
2015 Exp14 IBM 45 32 43 4
2015 Exp15 IBM 23 4 5
2015 Exp16 IBM 32 34 43
2015 Exp17 IBM 32 34 46
2015 Exp18 LEN 32 54 67
2015 Exp19 SCL 56 6 4 45 56
2015 Exp20 LEN 67 56 76
2015 Exp21 LEN 45 56 65
2015 Exp22 SCL 45 55 54
2015 Exp23 SCL 4 55 45
What I would like are the rows rearranged into groups such that:
1) they are within the same year
2) they use the same instrument
and each group has at least 3 locations in common, each of which has at least 20 successful experiments.
Requested Output
345 346 347 348 349 350 351 352
1 2014 Exp1 IBM 24 45 22
1 2014 Exp4 IBM 34 44 43
1 2014 Exp7 IBM 23 23 24
2 2014 Exp2 LEN 23 32 34
2 2014 Exp9 LEN 24 45 45
2 2014 Exp10 LEN 43 45 32
3 2014 Exp5 IBM 2 45 51 45
3 2014 Exp6 IBM 34 23 54
3 2014 Exp8 IBM 34 45 56
4 2015 Exp11 IBM 34 55 33 34
4 2015 Exp13 IBM 43 55 34 43
4 2015 Exp14 IBM 45 32 43 4
5 2015 Exp16 IBM 32 34 43
5 2015 Exp17 IBM 32 34 46
6 2015 Exp18 LEN 32 54 67
6 2015 Exp20 LEN 67 56 76
6 2015 Exp21 LEN 45 56 65
7 2015 Exp19 SCL 56 6 4 45 56
7 2015 Exp22 SCL 45 55 54
2014 Exp3 LEN 2 34 34
2015 Exp12 IBM 1 33 4 5
2015 Exp15 IBM 23 4 5
2015 Exp23 SCL 4 55 45
Here is what I tried.
awk ' NR>1 for (i=4;i<=NF;i++) if ($i!="") arr1[$1,$2,$3]=$i ; next
$1,$2,$3 in arr1 {
for (j=1;j<length(arr1);j++))
{if (arr1[j] > 20)
group++;
END
for (j in n)
print group, arr1[j]
' input input
Tags: bash, awk, perl
As far as I can see, to group the output correctly would require some form of clustering. This is non-trivial, and there would be several "correct" solutions in terms of how the rows were ordered.
– Kusalananda♦
Mar 9 at 18:16
edited Mar 9 at 12:24 by Rui F Ribeiro
asked Mar 30 '15 at 0:45 by Sheetal Kaul
1 Answer
A few random hints, depending on the actual data formatting and other issues...
How are the data fields separated? (The first three spacings give the impression that there's a TAB character in between, while the last columns seem space-separated.) Be aware that with the default field separator the column information is lost for columns 4-N, so the logic of your code is seriously flawed.
If you have no TAB separators but only blanks, you can use GNU awk's FIELDWIDTHS feature to access the data (including the missing "blank" data, as you seem to be trying to achieve).
If you have TABs for the first three separators and blanks for the rest, you should explicitly define FS="\t", so that you can directly work on fields 1-3 and have the spacing intact in the final data (which you can address as a whole as field 4); that will make it easy to find "blank data".
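A minimal illustration of that layout, using a hypothetical row (tabs delimit the first three fields, the numbers after the third tab are space-separated):

```shell
# With FS="\t", everything after the third tab arrives as one field
# ($4), with its internal spacing left intact.
printf '2014\tExp1\tIBM\t24  45  22\n' |
awk -F'\t' '{ print "instrument=" $3, "data=[" $4 "]" }'
# → instrument=IBM data=[24  45  22]
```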
It may further ease processing if you create subsets of your data on the fly, operate on those, and concatenate the individual subsets afterwards. To separate the data into files depending on, say, year and instrument you can write:
awk '{ print > ("set_" $1 "_" $3) }' input
and it will create files named, e.g., set_2015_LEN or set_2014_IBM, containing the respective entries.
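One practical caveat with that split: awk keeps every output file open, so with many distinct year/instrument pairs you can hit the per-process open-file limit. A variant (sketch only) that closes each file after writing avoids this:

```shell
# NR > 1 skips the header row; ">>" appends, because close() would
# otherwise make a plain ">" truncate the file on every reopen.
awk 'NR > 1 { f = "set_" $1 "_" $3; print >> f; close(f) }' input
```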
The final task, to identify "matching numeric column sets", depends on the previously mentioned topics; if, for example, the final eight data columns can be addressed as one fixed-length entity, it might suffice to use the sort utility with an appropriately defined key specification (see sort's option -k).
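For illustration, a simple key specification on this data (not yet the full matching, just ordering by year and instrument while keeping the header first):

```shell
# -k1,1 sorts on field 1 (year), -k3,3 breaks ties on field 3
# (instrument); the header line is kept out of the sort.
{ head -n 1 input; tail -n +2 input | sort -k1,1 -k3,3; } > sorted
```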
(BTW: for a compound index test, instead of $1,$2,$3 in arr1 you have to write ($1,$2,$3) in arr1.)
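A quick demonstration of that compound index test:

```shell
# The parentheses are required: (a, b, c) in arr tests for the
# SUBSEP-joined key; without them awk parses the commas differently.
awk 'BEGIN {
    arr["2014", "Exp1", "IBM"] = 1
    if (("2014", "Exp1", "IBM") in arr) print "found"
}'
# → found
```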
Thank you for your suggestions. The data is tab-delimited; I couldn't seem to line it up in this post, so I introduced some manual spaces. The actual data has 300 columns. What I am thinking is replacing the >20 values by 1 and the others by blank, so that I can treat it as a fixed-length entity. I checked out the sort -k option, but I couldn't follow how that would apply here, especially for so many columns; would you give me a small example? Thanks again for your suggestion on splitting the data set.
– Sheetal Kaul
Mar 30 '15 at 2:34
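That signature idea from the comment could be sketched like this (assuming tab-separated input; the threshold comparison, here >= 20 per the question's "at least 20", is easily adjusted):

```shell
# Emit year, instrument, and a fixed-length 0/1 mask recording which
# locations reached the threshold; rows with identical year,
# instrument, and mask can then be grouped, e.g. with sort.
awk -F'\t' 'NR > 1 {
    sig = ""
    for (i = 4; i <= NF; i++)
        sig = sig (($i != "" && $i + 0 >= 20) ? "1" : "0")
    print $1, $3, sig
}' input
```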
answered Mar 30 '15 at 1:24 by Janis