Which terminal encodings are default on Linux, and which are most common?
I need to make a decision regarding whether a complicated commercial program that I work on should assume a particular terminal encoding for Linux, or instead read it from the terminal (and if so, how).
It's pretty easy to guess which system and terminal encodings are most common on Windows. We can assume that most users configure these through the Control Panel, and that, for instance, their terminal encoding, which is usually non-Unicode, can be easily predicted from the standard configuration for that language/country. (For instance, on a US English machine, it will be OEM-437, while on a Russian machine, it will be OEM-866.)
But it's not clear to me how most users configure their system and terminal encodings on Linux. The savvy ones who often need to use non-ASCII characters probably use a UTF-8 encoding. But what proportion of Linux users fall into that category?
Nor is it clear which method most users use to configure their locale: changing the LANG environment variable, or something else.
A related question is how Linux configures these by default. My own Linux machine at work (actually a virtual Debian 5 machine that runs via VMware Player on my Windows machine) is set up to use a US-ASCII terminal encoding. However, I'm not sure whether that was configured by administrators at my workplace or whether it is the out-of-the-box setting.
Please understand that I'm not looking for answers to "Which encoding do you personally use?" but rather some means by which I could figure out the distribution of encodings that Linux users are likely to be using.
character-encoding
asked Feb 2 '14 at 23:30
Alan
3 Answers
I would use a heuristic similar to the one you are using for Windows users, but based on the LANG environment variable. For example, on my system:
$ echo $LANG
en_US.UTF-8
Here, the locale string says I am using American English, with UTF-8 as the encoding for file names and file contents.
As a general rule, Linux users running under UTF-8 will have "UTF-8" at the end of their LANG environment variable.
answered Feb 3 '14 at 0:19 by samiam
Modern Linux installations (for at least the last five years, probably longer) default to UTF-8. The locale is selected by setting the environment variables LC_CTYPE, LANG, and LANGUAGE. See for example the discussions here or here (Unicode-centered).
answered Feb 3 '14 at 0:20 by vonbrand
For reasonably modern Linux/Unix systems, you shouldn't need to worry about the terminal encoding. Just use getwchar or fgetws to read from stdin (or the terminal). [Note 1]
As man getwchar says, in its Notes section:
It is reasonable to expect that getwchar() will actually read a multibyte sequence from standard input and then convert it to a wide character.
There is a similar note in man fgetws.
On Linux, it is also reasonable to expect the encoding of wchar_t to be Unicode, regardless of locale. The C99 standard allows the implementation to define the macro __STDC_ISO_10646__ to indicate that wchar_t values correspond to Unicode code points [Note 2], so you can insert a compile-time check for this expectation, which should succeed on modern Linux installations with standard toolchains. It's likely to succeed on modern Unix systems as well, although there is no guarantee.
Notes:
[1] You do need to initialize the locale by calling setlocale(LC_ALL, ""); once at the beginning of program execution. See man setlocale.
[2] The value of __STDC_ISO_10646__ is a date (in the format yyyymmL) corresponding to the date of the applicable version of the Unicode standard. The precise wording from the standard (draft) is:
The following macro names are conditionally defined by the implementation:
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
answered Feb 3 '14 at 1:52 by rici
Trivia: the date of the macro actually corresponds to the ISO/IEC 10646 standard; 199712L corresponds to an incompatible change, where Korean hangul was moved from one block to another (the "Korean mess" alluded to in the UTF-8 RFC). – ninjalj Jun 12 '16 at 10:34