Which terminal encodings are default on Linux, and which are most common?

I need to decide whether a complicated commercial program I work on should assume a particular terminal encoding on Linux, or instead read it from the terminal (and if so, how).



It's pretty easy to guess which system and terminal encodings are most common on Windows. We can assume that most users configure these through the Control Panel, and that their terminal encoding, which is usually non-Unicode, can be predicted from the standard configuration for that language/country. (For instance, on a US English machine it will be OEM-437, while on a Russian machine it will be OEM-866.)



But it's not clear to me how most users configure their system and terminal encodings on Linux. The savvy ones who often need to use non-ASCII characters probably use a UTF-8 encoding. But what proportion of Linux users fall into that category?



Nor is it clear which method most users use to configure their locale: changing the LANG environment variable, or something else.



A related question is how Linux configures these by default. My own Linux machine at work (actually a virtual Debian 5 machine that runs via VMware Player on my Windows machine) is set up to use a US-ASCII terminal encoding. However, I'm not sure whether that was configured by administrators at my workplace or whether it is the out-of-the-box setting.



Please understand that I'm not looking for answers to "Which encoding do you personally use?" but rather some means by which I could figure out the distribution of encodings that Linux users are likely to be using.
      character-encoding
asked Feb 2 '14 at 23:30 by Alan
          3 Answers
I would use a heuristic similar to the one you use for Windows users, but based on the LANG environment variable. For example, on my system:



          $ echo $LANG
          en_US.UTF-8


Here, the value says that I am using the English language (US locale) with UTF-8 encoding for filenames and file contents.



As a general rule, Linux users who use UTF-8 will have "UTF-8" at the end of their LANG environment variable.
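
A minimal sketch of how a program could apply this heuristic (my illustration, not part of the original answer). It assumes a POSIX system; nl_langinfo(CODESET) reports the encoding of the current locale, and inspecting LANG is kept as the fallback described above:

/* Hedged sketch: detect whether the user's locale implies UTF-8. */
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int locale_is_utf8(void)
{
    setlocale(LC_ALL, "");                       /* adopt the user's locale settings */
    const char *codeset = nl_langinfo(CODESET);  /* e.g. "UTF-8" or "ANSI_X3.4-1968" */
    if (codeset && strcmp(codeset, "UTF-8") == 0)
        return 1;

    /* Fallback heuristic from the answer: look at $LANG directly. */
    const char *lang = getenv("LANG");
    return lang && (strstr(lang, "UTF-8") || strstr(lang, "utf8"));
}

int main(void)
{
    printf("UTF-8 locale: %s\n", locale_is_utf8() ? "yes" : "no");
    return 0;
}

nl_langinfo(CODESET) is generally more reliable than parsing LANG, because LC_ALL or LC_CTYPE may override LANG.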






answered Feb 3 '14 at 0:19 by samiam
Modern Linux installations (for at least some five years, probably longer) use UTF-8. That is handled by setting the environment variables LC_CTYPE, LANG, and LANGUAGE. See for example the discussions here or here (Unicode-centered).
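
As an illustration (mine, not from the answer), a sketch of the usual POSIX precedence among these variables when deciding which character encoding is in effect; LANGUAGE only affects message translation, not the codeset:

/* Hedged sketch of POSIX locale precedence for the character encoding:
   LC_ALL overrides LC_CTYPE, which overrides LANG. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (int i = 0; i < 3; i++) {
        const char *value = getenv(vars[i]);
        if (value && *value) {
            printf("character encoding is taken from %s=%s\n", vars[i], value);
            return 0;
        }
    }
    puts("no locale variables set; the C/POSIX locale (ASCII) applies");
    return 0;
}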






answered Feb 3 '14 at 0:20 by vonbrand
              For reasonably modern Linux/Unix systems, you shouldn't need to worry about terminal encoding. Just use getwchar or fgetws to read from stdin (or the terminal). [Note 1]



              As man getwchar says, in the Notes section:




              It is reasonable to expect that getwchar() will actually read a multibyte sequence from standard input and then convert it to a wide character.




              There is a similar note in man fgetws.



With Linux, it is also reasonable to expect the encoding of wchar_t to be Unicode, regardless of locale. The C99 standard allows the implementation to define the macro __STDC_ISO_10646__ to indicate that wchar_t values correspond to Unicode code points [Note 2], so you can insert a compile-time check for this expectation, which should succeed on modern Linux installs with standard toolchains. It's likely to succeed on modern Unix systems as well, although there is no guarantee.




              Notes:



              [1] You do need to initialize the locale by calling setlocale(LC_ALL, ""); once at the beginning of program execution. See man setlocale.



              [2] The value of __STDC_ISO_10646__ is a date (in format yyyymmL) corresponding to the date of the applicable version of the Unicode standard. The precise wording from the standard (draft) is:




              The following macro names are conditionally defined by the implementation:



              __STDC_ISO_10646__ An integer constant of the form yyyymmL (for example,
              199712L). If this symbol is defined, then every character in the Unicode
              required set, when stored in an object of type wchar_t, has the same
              value as the short identifier of that character. The Unicode required set
              consists of all the characters that are defined by ISO/IEC 10646, along with
              all amendments and technical corrigenda, as of the specified year and
              month. If some other encoding is used, the macro shall not be defined and
              the actual encoding used is implementation-defined.
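
Putting the pieces together, a minimal sketch (my addition, not the answerer's code) of the compile-time check and wide-character input described above, using only standard C functions:

/* Hedged sketch: fail the build if wchar_t is not documented to hold
   Unicode code points, then read wide characters from stdin. */
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#ifndef __STDC_ISO_10646__
#error "wchar_t values are not guaranteed to be Unicode code points here"
#endif

int main(void)
{
    setlocale(LC_ALL, "");            /* Note 1: adopt the user's locale */

    wint_t wc;
    while ((wc = getwchar()) != WEOF) {
        /* wc is the Unicode code point of the character just read */
        wprintf(L"U+%04X\n", (unsigned int)wc);
    }
    return 0;
}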







answered Feb 3 '14 at 1:52 by rici
• Trivia: the date in the macro actually corresponds to the version of the ISO 10646 standard; 199712L corresponds to an incompatible change in which Korean Hangul was moved from one block to another (the "Korean mess" alluded to in the UTF-8 RFC).
                – ninjalj
                Jun 12 '16 at 10:34