Is sort -k1,2 equivalent to sort -k1,1 -k2,2?
I'm experimenting with GNU sort and LC_COLLATE="en_US.UTF-8". I have a file called 'test':
1,0 1
10 2
1,0 3
10 4
With sort -k1,2 test, as well as with a plain sort test, the order doesn't change:
$ sort -k1,2 test
1,0 1
10 2
1,0 3
10 4
So sort seems to think that '1,0' is equal to '10', probably due to some quirk of LC_COLLATE (skipping punctuation?).
Now, when I use sort -k1,1 -k2,2, it gives me a different order:
$ sort -k1,1 -k2,2 test
10 2
10 4
1,0 1
1,0 3
and suddenly sort doesn't think that '10' is the same as '1,0' anymore.
What happened? Why isn't sort -k1,1 -k2,2 equivalent to sort -k1,2 in this case? Should it really be equivalent? Or have I misinterpreted the man page? (I tried versions 8.22 and 8.29 of coreutils; both have this behavior.)
sort locale gnu
edited Jul 19 at 16:23
asked Jul 19 at 16:11
lutyj
1 Answer
-k1,2 means “sort all lines, comparing the contents of all fields from 1 to 2 simultaneously”; so “1,0 1” is compared with “10 2” etc.
-k1,1 -k2,2 means “sort all lines, comparing the contents of field 1, and when two lines have the same content in field 1, comparing the contents of field 2”; so “1,0” is compared with “10”, then “2” with “4” etc.
What happens then, in both cases, boils down to collation, in particular weighting. Digits typically have a higher weight than punctuation and spacing. When comparing “1,0 1” and “10 2”, the difference due to the comma is ignored because the digits already differ (the final 1 versus 2). When comparing “1,0” and “10”, the digits are identical and the only difference is the comma, so it’s no longer ignored. See ISO 14651 for details.
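As a rough illustration of the explanation above, the sketch below deletes punctuation and blanks from each key with tr, to show what is left for the digits-only comparison. This is only an approximation of the idea, assuming the locale really does set punctuation and spacing aside on the first pass; it is not how sort or the C library actually computes collation weights, and the variable names are made up for the example.

```shell
# Keys compared by -k1,2: the whole "field 1 to field 2" span.
# With punctuation and blanks removed, the remaining digits already differ.
full_keys=$(printf '%s\n' '1,0 1' '10 2' | tr -d '[:punct:][:blank:]')

# Keys compared by -k1,1: field 1 only.
# With punctuation removed, the remaining digits are identical,
# so the comma is the only thing left to tell the keys apart.
first_fields=$(printf '%s\n' '1,0' '10' | tr -d '[:punct:][:blank:]')

printf 'whole-span keys:\n%s\n' "$full_keys"
printf 'field-1 keys:\n%s\n' "$first_fields"
```

The first pair reduces to 101 and 102, which differ, while the second pair reduces to 10 and 10, which is why the comma only comes into play in the second case.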
You can set LC_COLLATE=C to get collation based only on character values, with no weights. Your examples both result in
1,0 1
1,0 3
10 2
10 4
when the “C” locale is used.
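A minimal way to check this yourself, assuming a POSIX shell with mktemp available: the sketch recreates the sample file in a temporary directory (the directory and variable names are just for illustration) and runs both key specifications with LC_ALL=C set for the sort command only. LC_ALL is used rather than LC_COLLATE because it overrides LC_COLLATE if both happen to be set in the environment.

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' '1,0 1' '10 2' '1,0 3' '10 4' > "$tmpdir/test"

# Setting LC_ALL on the command line affects only that one sort invocation.
combined=$(LC_ALL=C sort -k1,2 "$tmpdir/test")
separate=$(LC_ALL=C sort -k1,1 -k2,2 "$tmpdir/test")

printf '%s\n' "$combined"

# With byte-wise comparison both key specifications give the same order.
if [ "$combined" = "$separate" ]; then
    echo "same order"
else
    echo "different order"
fi

rm -r "$tmpdir"
```

Because the variable is set per command, this does not depend on how the rest of the environment is configured.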
edited Jul 19 at 17:00
answered Jul 19 at 16:27
Stephen Kitt
Thank you for the explanation. Those are some elaborate collation rules! Good to know. I actually use LC_COLLATE=C locally, but when running jobs on a Hadoop cluster, I don't have the same control over remote environments. But that's for a different question :)
– lutyj
Jul 19 at 16:52