What would break if the C locale was UTF-8 instead of ASCII?

Score: 6, favorites: 1

The C locale is defined to use the ASCII charset, and POSIX does not provide a way to select a different charset without changing the locale as well.

What would happen if the encoding of the C locale were switched to UTF-8 instead?

The upside would be that UTF-8 would become the default charset for every process, even system daemons. Obviously, some applications would break because they assume that the C locale uses 7-bit ASCII. But do such applications really exist? A lot of existing code is already locale- and charset-aware to some extent. I would be surprised to see code that can only deal with 7-bit clean input and cannot be easily adapted to accept a UTF-8-enabled C.

Tags: character-encoding, locale, posix, unicode, compatibility

asked Mar 12 '13 at 16:31 by gioele

  • This thread from 2009 discusses the need for a UTF-8-based C locale, but does not address the problem of breaking POSIX. (1 upvote)
    – gioele
    Mar 12 '13 at 16:34

  • FWIW, OpenBSD has a C.UTF-8 locale, as well as POSIX.UTF-8.
    – Kusalananda
    Aug 13 at 11:31

2 Answers

Accepted answer (score 8)

The C locale is not the default locale. It is a locale that is guaranteed not to cause any “surprising” behavior. A number of commands have output of a guaranteed form (e.g. ps or df headers, date format) in the C or POSIX locale. For encodings (LC_CTYPE), it is guaranteed that [:alpha:] only contains the ASCII letters, and so on. If the C locale were modified, this would cause many applications to misbehave. For example, they might reject input that is invalid UTF-8 instead of treating it as binary data.
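
To make the last point concrete, here is a minimal sketch of the kind of strict check a UTF-8-aware tool might apply to its input. The function name is_valid_utf8 is hypothetical; this is a hand-written validator for illustration, not something the C library or any particular utility is known to use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a libc function): returns true if the bytes
 * buf[0..len) form well-formed UTF-8, false otherwise. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;       /* continuation bytes expected after b */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80) {
            extra = 0; cp = b; min = 0;
        } else if ((b & 0xE0) == 0xC0) {
            extra = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            extra = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            extra = 3; cp = b & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation byte or invalid lead byte */
        }

        if (extra > len - i - 1)
            return false;   /* sequence is cut off by the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return false;   /* overlong encoding */
        if (cp > 0x10FFFF)
            return false;   /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false;   /* UTF-16 surrogate, not valid in UTF-8 */

        i += extra + 1;
    }
    return true;
}
```

A Latin-1 file containing the single byte 0xC4 (Ä), or any binary blob containing 0xFF, fails this check. A tool that treats its input as plain bytes never has to make that decision at all.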



If you want all programs on your system to use UTF-8, set the default locale to a UTF-8 locale (en_US.UTF-8, for example). All programs that manipulate a single encoding, that is. Some programs only manipulate byte streams and don't care about encodings. Some programs manipulate multiple encodings and don't care about the locale (for example, a web server or web client sets or reads the encoding for each connection in a header).

answered Mar 12 '13 at 22:14 by Gilles; edited Aug 13 at 11:26 by Isaac


Answer (score 5)

You are a bit confused, I think. The "C locale" is a locale like any other; as you point out, it conventionally implies 7-bit ASCII.

It's built into the C library, I suppose so that the library always has some kind of fallback: a process cannot run with no locale at all.

However, this has nothing to do with how programs built from C code deal with input. The locale is used to interpret input that is handed to an executable; if the system locale is UTF-8, then UTF-8 is what the program gets, regardless of whether its source was written in C or something else. So:

    "I would be surprised to see code that can only deal with 7-bit clean
    input and cannot be easily adapted to accept a UTF-8-enabled C"

does not really make sense. A minimal piece of standard C source that reads from standard input receives a stream of bytes from the system. If the system uses UTF-8 and produced the stream from some HID hardware (a keyboard, say), then that stream may contain UTF-8-encoded characters. If it came from somewhere else (e.g., a network or a file), it might contain anything, which is what makes the assumption of a UTF-8 standard useful.
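
As a rough illustration of "a stream of bytes", the sketch below inspects a buffer without calling any locale function. The name count_non_ascii_bytes is made up for this example, and the buffer is assumed to have been filled by something like fread() on standard input.

```c
#include <stddef.h>

/* Hypothetical helper: counts the bytes in buf[0..len) that have the
 * high bit set, i.e. bytes that cannot be 7-bit ASCII.
 * No locale function is involved; the data is just bytes. */
size_t count_non_ascii_bytes(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            n++;
    }
    return n;
}
```

The result is the same whatever LC_CTYPE is set to. The locale only starts to matter once the program hands those bytes to locale-dependent library functions.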



The fact that the C locale has a much more restricted character set than a UTF-8 locale is unrelated. It is just called "the C locale"; in fact, it has no more and no less to do with writing C code than any other locale.



You can, in fact, hardcode UTF-8 characters into C strings in the source. Presuming the system is UTF-8, those strings will look correct when used by the resulting executable.
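
A small sketch of what that looks like in practice. The letter Ä is written with explicit byte escapes so that the example does not depend on the encoding of the source file; both names (a_umlaut, utf8_code_points) are invented for the illustration, and the counting function assumes its input is already well-formed UTF-8.

```c
#include <stddef.h>

/* "Ä" (U+00C4) spelled out as its two UTF-8 bytes. */
const char a_umlaut[] = "\xC3\x84";

/* Hypothetical helper: counts code points in a NUL-terminated string
 * that is assumed to be well-formed UTF-8, by counting every byte that
 * is not a continuation byte (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Here strlen(a_umlaut) is 2 while utf8_code_points(a_umlaut) is 1: the C string type stores bytes, and whether they display as one character is up to the terminal or whatever else consumes the output.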



I believe the "Roger Leigh" link you posted in a comment refers to using an expanded set (UTF-8) as the C locale in a C library destined for an embedded environment, so that no other locale has to be loaded for the system to deal with UTF-8.

So the answer to the question, "What would break if the C locale was UTF-8 instead of ASCII?" is, I would guess, nothing, but outside of an embedded environment or similar there is not much need to do this. Very likely, though, it will become the norm at some point for libraries such as GNU C (it might as well be, I think).

answered Mar 12 '13 at 16:48 by goldilocks; edited Mar 12 '13 at 16:54

            • The behavior of various syscalls is influenced by the charset of the locale, for example «isupper() will not recognize an A-umlaut (Ä) as an uppercase letter in the default C locale.» (from man7.org/linux/man-pages/man3/isprint.3.html). isprint() is another syscall that is influenced as well by the fact that C is defined as ASCII-only.
              – gioele
              Mar 12 '13 at 19:15











            • Yes, (in theory) those are influenced by the locale, but that locale is usually UTF-8, it is not necessarily 'C'. In GNU, they're broken in this regard, however: gnu.org/software/gnulib/manual/html_node/isupper.html Keep in mind that 100% of the fundamentals of a unix system are coded in C, so the idea that "C doesn't handle UTF-8" is well, just plain incorrect and obviously so. If a program written in C could not deal with UTF-8, there wouldn't be any UTF-8 on the system. Period.
              – goldilocks
              Mar 12 '13 at 19:45











            • Qv. also the POSIX isupper() page pubs.opengroup.org/onlinepubs/9699919799/functions/isupper.html "in the current locale of the process", not "the C locale". This is also in the ISO standard, which refers to "in the C locale" and "in the current locale", usually in the form "if the current locale is the C locale", etc. Keep in mind, again, if you are on linux, GNU C's implementation of some of the ctype functions is broken.
              – goldilocks
              Mar 12 '13 at 19:54







            • 2




              @gioele These are library functions, not syscalls. Syscalls are calls to the kernel and are not affected by locales: locales exist purely at user level.
              – Gilles
              Mar 12 '13 at 21:38










            • @goldilocks It's not quite true that "100% of the fundamentals of a unix system are coded in C". At some level, you pretty much have to have a bit of assembler, or possibly assembly-like C. Examples might include the boot loader loader (no typo), the actual process of task switching, and a few other similarly low-level features. On top of that, though, I agree, C (or higher-level languages) are likely used throughout the code base.
              – Michael Kjörling
              Mar 13 '13 at 10:21
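
The isupper() point raised in the comments above can be tried out with a small C sketch. The helper names upper_in_c_locale and count_non_ascii_bytes are made up for this illustration and do not come from the thread or from any library. The first helper switches LC_CTYPE to "C" and asks isupper() about a single byte; the second counts bytes with the high bit set, which shows that Ä (U+00C4) is one byte (0xC4) in ISO 8859-1 but two bytes (0xC3 0x84) in UTF-8, so a byte-wise isupper() never sees it as one character in UTF-8 input.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if isupper() classifies byte `c` as
 * uppercase while LC_CTYPE is set to the "C" locale, else 0.
 * The previous locale is not restored; this is a single-threaded sketch. */
int upper_in_c_locale(unsigned char c)
{
    if (setlocale(LC_CTYPE, "C") == NULL)
        return 0; /* should not happen: "C" is always available */
    return isupper(c) != 0;
}

/* Hypothetical helper: counts the bytes of `s` that lie outside 7-bit
 * ASCII. In UTF-8, every byte of a multibyte sequence has the high bit set. */
size_t count_non_ascii_bytes(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((unsigned char)*s >= 0x80)
            ++n;
    }
    return n;
}
```

Called from a main(), upper_in_c_locale('A') gives 1, while upper_in_c_locale(0xC4) gives 0, because the C standard limits isupper() in the "C" locale to the letters A to Z. For UTF-8 text, classification would more plausibly go through the wide-character interfaces (mbrtowc() followed by iswupper()) under a UTF-8 locale, assuming such a locale is installed on the system.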

































             
