Is sort -k1,2 equivalent to sort -k1,1 -k2,2?

I'm experimenting with GNU sort and LC_COLLATE="en_US.UTF-8". I have a file called 'test':
1,0 1
10 2
1,0 3
10 4
With sort -k1,2 as well as with simple sort test, the order doesn't change:
$ sort -k1,2 test
1,0 1
10 2
1,0 3
10 4
So sort thinks that '1,0' is equal to '10', probably due to some quirk of LC_COLLATE (skipping punctuation?).
Now, when I use sort -k1,1 -k2,2, it gives me a different order:
$ sort -k1,1 -k2,2 test
10 2
10 4
1,0 1
1,0 3
and suddenly sort doesn't think that '10' is the same as '1,0' anymore.
What happened? Why isn't sort -k1,1 -k2,2 equivalent to sort -k1,2 in this case? Should it really be equivalent? Or have I misinterpreted the man page? (I tried versions 8.22 and 8.29 of coreutils; both have this behavior.)
sort locale gnu
edited Jul 19 at 16:23
asked Jul 19 at 16:11
lutyj
1 Answer
-k1,2 means “sort all lines, comparing the contents of all fields from 1 to 2 simultaneously”; so “1,0 1” is compared with “10 2” etc.
-k1,1 -k2,2 means “sort all lines, comparing the contents of field 1, and when two lines have the same content in field 1, comparing the contents of field 2”; so “1,0” is compared with “10”, then “2” with “4” etc.
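To see that the two key specifications select different comparison strings regardless of locale, here is a small sketch. The sample lines (a:2 and a!:1) and the colon separator are made up for illustration and do not come from the question. It assumes GNU sort, which treats a key such as -k1,2 as one string running from the start of field 1 to the end of field 2, separator included, and it forces byte-order comparison with LC_ALL=C.

```shell
#!/bin/sh
# Hypothetical sample data (not from the question): two colon-separated lines.
data='a:2
a!:1'

# One key spanning fields 1-2: "a:2" and "a!:1" are compared as whole strings,
# separator included. In byte order '!' (0x21) sorts before ':' (0x3A).
span=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k1,2)

# Two keys: field 1 first ("a" vs "a!"), field 2 only if field 1 ties.
# "a" is a prefix of "a!", so it sorts first.
split=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k1,1 -k2,2)

printf 'sort -k1,2:\n%s\n\n' "$span"
printf 'sort -k1,1 -k2,2:\n%s\n' "$split"
```

Under these assumptions the first command should list a!:1 before a:2 and the second the other way round, so the two forms are not interchangeable in general, even without locale weighting.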
What happens then, in both cases, boils down to collation, in particular weighting. Digits typically have a higher weight than punctuation and spacing. When comparing “1,0 1” and “10 2”, the difference due to the comma is ignored because the digits are different. When comparing “1,0” and “10”, the only difference is the comma, so it’s no longer ignored. See ISO 14651 for details.
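One way to observe this weighting is to sort the first field on its own and then the full lines, under the locale from the question. This is only a sketch: it assumes en_US.UTF-8 is installed (locale -a lists what is available), and the result depends on the C library's collation tables, so no output is shown here. If the locale is missing, sort may quietly fall back to byte order.

```shell
#!/bin/sh
# Locale taken from the question; it must be installed for the result to be meaningful.
loc=en_US.UTF-8

# Field 1 in isolation: the comma is the only difference between the two strings.
keys_only=$(printf '%s\n' '1,0' '10' | LC_ALL=$loc sort)

# Whole lines: the digits that follow the first field already differ.
whole=$(printf '%s\n' '1,0 1' '10 2' | LC_ALL=$loc sort)

printf 'field 1 only:\n%s\n\n' "$keys_only"
printf 'whole lines:\n%s\n' "$whole"
```

Comparing the two listings shows whether the comma decided the order in each case.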
You can set LC_COLLATE=C to get collation based only on character values, with no weights. Your examples both result in
1,0 1
1,0 3
10 2
10 4
when the “C” locale is used.
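As a quick check of the byte-order result, here is a sketch that recreates the question's file and runs both commands under the C locale. It sets LC_ALL rather than LC_COLLATE, because LC_ALL takes precedence over the individual LC_* variables; prefixing the variable to a single command limits the change to that command. The scratch directory is an illustrative detail, not part of the original.

```shell
#!/bin/sh
# Recreate the question's input in a scratch directory (illustrative choice).
dir=$(mktemp -d)
printf '%s\n' '1,0 1' '10 2' '1,0 3' '10 4' > "$dir/test"

# LC_ALL=C applies to this one command only and overrides LC_COLLATE.
span=$(LC_ALL=C sort -k1,2 "$dir/test")
split=$(LC_ALL=C sort -k1,1 -k2,2 "$dir/test")

printf '%s\n' "$span"

# Report whether both key specifications agree in byte order.
if [ "$span" = "$split" ]; then
    echo "same order"
else
    echo "different order"
fi

rm -r "$dir"
```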
Thank you for the explanation. That's some elaborate collation rules! Good to know. I actually use LC_COLLATE=C locally, but when running jobs on a hadoop cluster, I don't have the same control over remote environments. But that's for a different question :)
– lutyj
Jul 19 at 16:52
edited Jul 19 at 17:00
answered Jul 19 at 16:27
Stephen Kitt