Which terminal encodings are default on Linux, and which are most common?
I need to make a decision regarding whether a complicated commercial program that I work on should assume a particular terminal encoding for Linux, or instead read it from the terminal (and if so, how).
It's pretty easy to guess which system and terminal encodings are most common on Windows. We can assume that most users configure these through the Control Panel, and that, for instance, their terminal encoding, which is usually non-Unicode, can be easily predicted from the standard configuration for that language/country. (For instance, on a US English machine, it will be OEM-437, while on a Russian machine, it will be OEM-866.)
But it's not clear to me how most users configure their system and terminal encodings on Linux. The savvy ones who often need to use non-ASCII characters probably use a UTF-8 encoding. But what proportion of Linux users fall into that category?
Nor is it clear which method most users use to configure their locale: changing the LANG environment variable, or something else.
A related question is how Linux configures these by default. My own Linux machine at work (actually a virtual Debian 5 machine that runs via VMware Player on my Windows machine) is set up to use a US-ASCII terminal encoding. However, I'm not sure whether that was configured by administrators at my workplace or whether it is the out-of-the-box setting.
Please understand that I'm not looking for answers to "Which encoding do you personally use?" but rather some means by which I could figure out the distribution of encodings that Linux users are likely to be using.
character-encoding
asked Feb 2 '14 at 23:30
Alan
3 Answers
I would use a heuristic similar to the one you are using for Windows users, but based on the LANG environment variable. For example, on my system:
$ echo $LANG
en_US.UTF-8
Here, the locale string says I am using American English, with UTF-8 as the encoding for file names and file contents.
As a general rule, Linux users running under UTF-8 will have "UTF-8" at the end of their LANG environment variable.
answered Feb 3 '14 at 0:19 by samiam
Modern Linux installations (for at least the last five years, probably longer) default to UTF-8. The locale is selected by setting the environment variables LC_CTYPE, LANG, and LANGUAGE. See for example the discussions here or here (Unicode-centered).
answered Feb 3 '14 at 0:20 by vonbrand
For reasonably modern Linux/Unix systems, you shouldn't need to worry about the terminal encoding. Just use getwchar or fgetws to read from stdin (or the terminal). [Note 1]
As man getwchar says, in its Notes section:
It is reasonable to expect that getwchar() will actually read a multibyte sequence from standard input and then convert it to a wide character.
There is a similar note in man fgetws.
On Linux, it is also reasonable to expect the encoding of wchar_t to be Unicode, regardless of locale. The C99 standard allows the implementation to define the macro __STDC_ISO_10646__ to indicate that wchar_t values correspond to Unicode code points [Note 2], so you can insert a compile-time check for this expectation, which should succeed on modern Linux installations with standard toolchains. It's likely to succeed on modern Unix systems as well, although there is no guarantee.
Notes:
[1] You do need to initialize the locale by calling setlocale(LC_ALL, ""); once at the beginning of program execution. See man setlocale.
[2] The value of __STDC_ISO_10646__ is a date (in the format yyyymmL) corresponding to the date of the applicable version of the Unicode standard. The precise wording from the standard (draft) is:
The following macro names are conditionally defined by the implementation:
__STDC_ISO_10646__ An integer constant of the form yyyymmL (for example, 199712L). If this symbol is defined, then every character in the Unicode required set, when stored in an object of type wchar_t, has the same value as the short identifier of that character. The Unicode required set consists of all the characters that are defined by ISO/IEC 10646, along with all amendments and technical corrigenda, as of the specified year and month. If some other encoding is used, the macro shall not be defined and the actual encoding used is implementation-defined.
answered Feb 3 '14 at 1:52 by rici
Trivia: the date of the macro actually corresponds to the ISO/IEC 10646 standard; 199712L corresponds to an incompatible change, where Korean hangul was moved from one block to another (the "Korean mess" alluded to in the UTF-8 RFC). – ninjalj Jun 12 '16 at 10:34