What would break if the C locale was UTF-8 instead of ASCII?
Votes: 6
The C locale is defined to use the ASCII charset and POSIX does not provide a way to use a charset without changing the locale as well.
What would happen if the encoding of C were switched to UTF-8 instead?
The positive side would be that UTF-8 would become the default charset for any process, even system daemons. Obviously, some applications would break because they assume that C uses 7-bit ASCII. But do such applications really exist? A lot of existing code is already locale- and charset-aware to some extent; I would be surprised to see code that can only deal with 7-bit clean input and cannot be easily adapted to accept a UTF-8-enabled C.
character-encoding locale posix unicode compatibility
asked Mar 12 '13 at 16:31 by gioele
This thread from 2009 discusses the need for a UTF-8-based C locale, but does not address the problem of breaking POSIX.
– gioele, Mar 12 '13 at 16:34 (1 vote)
FWIW, OpenBSD has a C.UTF-8 locale, as well as POSIX.UTF-8.
– Kusalananda, Aug 13 at 11:31
2 Answers
Votes: 8 (accepted)
The C locale is not the default locale. It is a locale that is guaranteed not to cause any "surprising" behavior. A number of commands have output of a guaranteed form (e.g. ps or df headers, date format) in the C or POSIX locale. For encodings (LC_CTYPE), it is guaranteed that [:alpha:] only contains the ASCII letters, and so on. If the C locale were modified, this would cause many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.
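To make that last failure mode concrete, here is a minimal Python sketch. It is not taken from any real utility, and the names `read_as_bytes`, `read_as_utf8` and `binary_blob` are hypothetical. It only contrasts byte-transparent handling with strict UTF-8 decoding, using Python's codecs as a stand-in for a UTF-8-aware C library:

```python
def read_as_bytes(data: bytes) -> int:
    """Byte-oriented handling: every byte value is acceptable."""
    return len(data)


def read_as_utf8(data: bytes) -> int:
    """Strict UTF-8 handling: malformed sequences raise UnicodeDecodeError."""
    return len(data.decode("utf-8"))


# The bytes 0xFF and 0xFE can never appear in well-formed UTF-8.
binary_blob = b"header\xff\xfepayload"

print(read_as_bytes(binary_blob))      # accepted as opaque bytes
try:
    read_as_utf8(binary_blob)
except UnicodeDecodeError as err:
    print("rejected:", err.reason)     # strict decoding refuses the input
```

A tool that today passes such data through untouched in the C locale could start failing in this way if the C locale implied strict UTF-8 validation.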
If you want all programs on your system to use UTF-8, set the default locale to a UTF-8 locale. More precisely, this affects all programs that manipulate a single encoding. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).
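As a rough sketch of how a program picks up that default: POSIX resolves each locale category from the environment, checking LC_ALL first, then the category's own variable (for example LC_CTYPE), then LANG. A C program only adopts the result after calling setlocale(LC_ALL, ""). The helper name `effective_locale` and the locale name `en_US.UTF-8` below are purely illustrative, and falling back to "C" when nothing is set is an assumption, since POSIX leaves that case implementation-defined.

```python
def effective_locale(env: dict, category: str = "LC_CTYPE") -> str:
    """Resolve one locale category in the POSIX order:
    LC_ALL, then the category's own variable, then LANG.
    Empty values are treated as unset."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:
            return value
    # Assumption: POSIX leaves this case implementation-defined;
    # "C" is used here as a plausible fallback.
    return "C"


print(effective_locale({"LANG": "en_US.UTF-8"}))
print(effective_locale({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))
print(effective_locale({}))
```

The second call shows why scripts that need the guaranteed output formats can set LC_ALL=C for a single command without touching the system default.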
edited Aug 13 at 11:26 by Isaac
answered Mar 12 '13 at 22:14 by Gilles
Votes: 5
You are a bit confused, I think. The "C locale" is a locale like any other, which, as you point out, is conventionally a synonym for 7-bit ASCII.
It's built into the C library, I suppose so that the library has some kind of fallback: there can't be no locale.
However, this has nothing to do with how programs built from C code deal with input. The locale is used to translate input that is handed to an executable; if the system locale is UTF-8, then UTF-8 is what the program gets, regardless of whether its source was written in C or something else. So:
"I would be surprised to see code that can only deal with 7-bit clean input and cannot be easily adapted to accept a UTF-8-enabled C"
does not really make sense. A minimal piece of standard C source that reads from standard input receives a stream of bytes from the system. If the system uses UTF-8 and it produced the stream from some HID hardware, then that stream may contain UTF-8 encoded characters. If it came from somewhere else (e.g., a network or a file), it might contain anything, which is what makes the assumption of a UTF-8 standard useful.
The fact that the C locale is a much more restricted character set than the UTF-8 locale is unrelated. It's just called "the C locale"; in fact it has no more or less to do with composing C code than any other locale.
You can, in fact, hardcode UTF-8 characters into C strings in the source. Presuming the system is UTF-8, those strings will look correct when used by the resulting executable.
The "Roger Leigh" link you posted in a comment, I believe, refers to using an expanded set (UTF-8) as the C locale in a C library destined for an embedded environment, so that no other locale has to be loaded for the system to deal with UTF-8.
So the answer to the question, "What would break if the C locale was UTF-8 instead of ASCII?" is, I would guess, nothing; but outside of an embedded environment and the like, there is not much need to do this. Very likely it will become the norm at some point for libraries such as GNU C (it might as well be, I think).
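The "stream of bytes" point can be sketched in a few lines of Python. This is an illustration only, not code from any actual tool, and `copy_bytes`, `message` and the tiny chunk size are hypothetical choices. A copier that never interprets its input passes UTF-8 through intact, even when a chunk boundary falls in the middle of a multi-byte character:

```python
import io


def copy_bytes(src, dst, chunk_size=4):
    """Copy a binary stream without interpreting its contents."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)


# "naive cafe" with an i-diaeresis and an e-acute: 10 characters, 12 bytes.
message = "na\u00efve caf\u00e9"
src = io.BytesIO(message.encode("utf-8"))
dst = io.BytesIO()

copy_bytes(src, dst)
print(dst.getvalue().decode("utf-8") == message)
```

Programs of this kind are indifferent to the locale's charset; only code that classifies or transforms characters is affected by it.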
edited Mar 12 '13 at 16:54
answered Mar 12 '13 at 16:48 by goldilocks
The behavior of various syscalls is influenced by the charset of the locale, for example «isupper() will not recognize an A-umlaut (Ä) as an uppercase letter in the default C locale.» (from man7.org/linux/man-pages/man3/isprint.3.html). isprint() is another syscall that is influenced as well by the fact that C is defined as ASCII-only.
– gioele, Mar 12 '13 at 19:15
Yes, (in theory) those are influenced by the locale, but that locale is usually UTF-8; it is not necessarily 'C'. In GNU, they're broken in this regard, however: gnu.org/software/gnulib/manual/html_node/isupper.html Keep in mind that 100% of the fundamentals of a unix system are coded in C, so the idea that "C doesn't handle UTF-8" is, well, just plain incorrect and obviously so. If a program written in C could not deal with UTF-8, there wouldn't be any UTF-8 on the system. Period.
– goldilocks, Mar 12 '13 at 19:45
Qv. also the POSIX isupper() page pubs.opengroup.org/onlinepubs/9699919799/functions/isupper.html "in the current locale of the process", not "the C locale". This is also in the ISO standard, which refers to "in the C locale" and "in the current locale", usually in the form "if the current locale is the C locale", etc. Keep in mind, again, if you are on Linux, GNU C's implementation of some of the ctype functions is broken.
– goldilocks, Mar 12 '13 at 19:54
@gioele These are library functions, not syscalls. Syscalls are calls to the kernel and are not affected by locales: locales exist purely at user level.
– Gilles, Mar 12 '13 at 21:38 (2 votes)
@goldilocks It's not quite true that "100% of the fundamentals of a unix system are coded in C". At some level, you pretty much have to have a bit of assembler, or possibly assembly-like C. Examples might include the boot loader loader (no typo), the actual process of task switching, and a few other similarly low-level features. On top of that, though, I agree, C (or higher-level languages) are likely used throughout the code base.
– Michael Kjörling, Mar 13 '13 at 10:21
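The isupper() behavior discussed in these comments can be approximated with a small Python sketch. It does not call the C library; `c_locale_isupper` is a hypothetical stand-in that applies the ASCII-only rule quoted from the man page, so treat it as an illustration of the rule rather than of any particular libc.

```python
def c_locale_isupper(byte: int) -> bool:
    """ASCII-only rule: just the byte values of 'A'..'Z' count as uppercase."""
    return 0x41 <= byte <= 0x5A


a_umlaut = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

# Whether the letter arrives as one Latin-1 byte (0xC4) or as two UTF-8
# bytes (0xC3 0x84), no byte falls inside the ASCII uppercase range.
latin1_bytes = a_umlaut.encode("latin-1")
utf8_bytes = a_umlaut.encode("utf-8")

print([c_locale_isupper(b) for b in latin1_bytes])
print([c_locale_isupper(b) for b in utf8_bytes])
print(a_umlaut.isupper())  # Python's str.isupper() uses Unicode properties
```

Note that isupper() works on one char at a time, so even in a UTF-8 locale it cannot classify a two-byte character; that is the job of the wide-character functions such as iswupper().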