What would break if the C locale was UTF-8 instead of ASCII?

The C locale is defined to use the ASCII charset and POSIX does not provide a way to use a charset without changing the locale as well.

What would happen if the encoding of C were switched to UTF-8 instead?

The positive side would be that UTF-8 would become the default charset for any process, even system daemons. Obviously, some applications would break because they assume that C uses 7-bit ASCII. But do these applications really exist? A lot of code written today is locale- and charset-aware to a certain extent; I would be surprised to see code that can only deal with 7-bit clean input and cannot be easily adapted to accept a UTF-8-enabled C.

asked Mar 12 '13 at 16:31 by gioele

This thread from 2009 discusses the need for a UTF-8-based C locale, but does not address the problem of breaking POSIX.
– gioele
Mar 12 '13 at 16:34

FWIW, OpenBSD has a C.UTF-8 locale, as well as POSIX.UTF-8.
– Kusalananda
Aug 13 at 11:31


2 Answers

accepted

The C locale is not the default locale. It is a locale that is guaranteed not to cause any “surprising” behavior. A number of commands have output of a guaranteed form (e.g. ps or df headers, date format) in the C or POSIX locale. For encodings (LC_CTYPE), it is guaranteed that [:alpha:] only contains the ASCII letters, and so on. If the C locale were modified, this would cause many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.

If you want all programs on your system to use UTF-8, set the default locale to UTF-8. All programs that manipulate a single encoding, that is. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).
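To make the [:alpha:] guarantee concrete, here is a minimal sketch in C. The helper name is_alpha_in_c_locale is made up for this illustration; setlocale() and isalpha() are the standard library calls. It selects the "C" locale for LC_CTYPE and classifies a single byte. Note that it changes the process-wide locale as a side effect, which a real program may not want.

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>

/* Hypothetical helper (not from the answer): classify one byte under the
 * "C" locale. Returns false if the locale could not be selected.
 * Side effect: LC_CTYPE for the whole process is set to "C". */
bool is_alpha_in_c_locale(unsigned char byte)
{
    if (setlocale(LC_CTYPE, "C") == NULL)
        return false;
    /* In the "C" locale, isalpha() is true only for A-Z and a-z. */
    return isalpha(byte) != 0;
}
```

With this helper, 'A' and 'z' are alphabetic, while the byte 0xC4 (the first byte of "Ä" in UTF-8, and "Ä" itself in ISO 8859-1) is not. Code that relies on this behavior is the kind that could misbehave if the C locale changed.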

edited Aug 13 at 11:26 by Isaac
answered Mar 12 '13 at 22:14 by Gilles

You are a bit confused, I think. The "C locale" is a locale like any other, which, as you point out, is conventionally a synonym for 7-bit ASCII.

It's built into the C library, I suppose so that the library has some kind of fallback -- there can't be no locale.
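A small sketch of that fallback, for illustration only: the helper names below are invented, and the only library calls used are setlocale() and strcmp(). ISO C specifies that a program starts as if setlocale(LC_ALL, "C") had been called, and it stays in that locale until it explicitly asks for another one.

```c
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if the current LC_CTYPE locale is "C" or "POSIX".
 * Passing NULL queries the locale without changing it. */
bool ctype_locale_is_c(void)
{
    const char *name = setlocale(LC_CTYPE, NULL);
    return name != NULL &&
           (strcmp(name, "C") == 0 || strcmp(name, "POSIX") == 0);
}

/* Hypothetical helper: switch to whatever locale the environment names
 * (LANG, LC_ALL, ...). Returns false if that locale is not available,
 * in which case the current locale is left unchanged. */
bool adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}
```

Called first thing in main(), ctype_locale_is_c() should report true regardless of the user's LANG setting; only after adopt_environment_locale() succeeds does the environment's locale take effect.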

However, this does not have anything to do with how programs built from C code deal with input. The locale is used to translate input that is handed to an executable; if the system locale is UTF-8, then UTF-8 is what the program gets, regardless of whether its source was written in C or something else. So:

I would be surprised to see code that can only deal with 7-bit clean
input and cannot be easily adapted to accept a UTF-8-enabled C

Does not really make sense. A minimal piece of standard C source that reads from standard input receives a stream of bytes from the system. If the system uses UTF-8 and it produced the stream from some HID hardware, then that stream may contain UTF-8 encoded characters. If it came from somewhere else (e.g., a network or a file), it might contain anything, which is what makes the assumption of a UTF-8 standard useful.

The fact that the C locale is a much more restricted char set than the UTF-8 locale is unrelated. It's just called "the C locale", but in fact it has no more or less to do with composing C code than any other.

You can, in fact, hardcode UTF-8 characters into C strings in the source. Presuming the system is UTF-8, those strings will look correct when used by the resulting executable.
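As a rough illustration that such strings are just bytes to the C library, here is a sketch with an invented helper, utf8_code_points. It assumes well-formed UTF-8 and does no validation; it counts code points by skipping continuation bytes (those of the form 10xxxxxx).

```c
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string that is
 * assumed to be well-formed UTF-8. Every byte that is not a continuation
 * byte (bit pattern 10xxxxxx) starts a new code point. No validation. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For "Ä" written as the explicit bytes "\xC3\x84", strlen() reports 2 while utf8_code_points() reports 1. Neither result depends on the current locale, because no locale-aware function is involved.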

The "Roger Leigh" link you posted in a comment I believe refers to using an expanded set (UTF-8) as the C locale in a C library destined for an embedded environment, so that no other locale has to be loaded for the system to deal with UTF-8.

So the answer to the question, "What would break if the C locale was UTF-8 instead of ASCII?" is, I would guess, nothing, but outside of an embedded environment, etc. there is not much of a need to do this. But very likely it will become the norm at some point for libraries such as GNU C (it might as well be, I think).

edited Mar 12 '13 at 16:54
answered Mar 12 '13 at 16:48 by goldilocks

The behavior of various syscalls is influenced by the charset of the locale, for example «isupper() will not recognize an A-umlaut (Ä) as an uppercase letter in the default C locale.» (from man7.org/linux/man-pages/man3/isprint.3.html). isprint() is another syscall that is likewise influenced by the fact that C is defined as ASCII-only.
– gioele
Mar 12 '13 at 19:15

Yes, (in theory) those are influenced by the locale, but that locale is usually UTF-8; it is not necessarily 'C'. In GNU, they're broken in this regard, however: gnu.org/software/gnulib/manual/html_node/isupper.html Keep in mind that 100% of the fundamentals of a unix system are coded in C, so the idea that "C doesn't handle UTF-8" is, well, just plain incorrect and obviously so. If a program written in C could not deal with UTF-8, there wouldn't be any UTF-8 on the system. Period.
– goldilocks
Mar 12 '13 at 19:45

Qv. also the POSIX isupper() page pubs.opengroup.org/onlinepubs/9699919799/functions/isupper.html "in the current locale of the process", not "the C locale". This is also in the ISO standard, which refers to "in the C locale" and "in the current locale", usually in the form "if the current locale is the C locale", etc. Keep in mind, again, if you are on linux, GNU C's implementation of some of the ctype functions is broken.
– goldilocks
Mar 12 '13 at 19:54

@gioele These are library functions, not syscalls. Syscalls are calls to the kernel and are not affected by locales: locales exist purely at user level.
– Gilles
Mar 12 '13 at 21:38

@goldilocks It's not quite true that "100% of the fundamentals of a unix system are coded in C". At some level, you pretty much have to have a bit of assembler, or possibly assembly-like C. Examples might include the boot loader loader (no typo), the actual process of task switching, and a few other similarly low-level features. On top of that, though, I agree, C (or higher-level languages) are likely used throughout the code base.
– Michael Kjörling
Mar 13 '13 at 10:21
