How to determine the character encoding that a terminal uses in a C/C++ program?

I've noticed that SyncTERM uses a different character encoding than the default macOS terminal emulator, and the two are incompatible. For example, say you want to print a block character in a format string. In SyncTERM, which uses the IBM Extended ASCII character encoding, you would use an octal escape sequence like \261. In Terminal.app (and probably iTerm2 as well), this just prints a question mark. Since these terminals use UTF-8, you need to use the \uxxxx escape sequence instead.

So let's say you want to print a certain non-ASCII character in a format string, and you want it to work in all terminal emulators, regardless of character set. I'm guessing you would use an entry in the terminfo database, but I'm not really familiar with terminfo. I need some pointers here.

asked Nov 12 '16 at 18:00

user628544


3 Answers
accepted

Short:

terminfo won't help here

there is no reliable way to determine what encoding a terminal actually uses

starting from Unicode literals is the way to go, provided that you know what encoding you want to use on the terminal

the user has to know which locale is appropriate and which encodings the terminal can handle

the C standard has functions for converting "wide" characters, which you will have available on any Unix-like platform (see for example setlocale, wcrtomb and wcsrtombs)
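As a rough illustration of the last two points, here is a minimal sketch that starts from a Unicode code point and lets wcrtomb produce whatever bytes the current locale calls for. The function names encode_for_locale and print_block are made up for this example, and the sketch assumes that wchar_t holds Unicode code points (true on glibc, not guaranteed everywhere). The caller is expected to have run setlocale(LC_ALL, "") first.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>   /* for the setlocale() call the caller makes */
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert one wide character to the multibyte
 * encoding of the current LC_CTYPE locale.
 * Returns the number of bytes stored in `out` (NUL-terminated), or -1
 * if the character cannot be represented in that encoding. */
int encode_for_locale(wchar_t wc, char out[MB_LEN_MAX + 1])
{
    mbstate_t state;
    size_t n;

    assert(out != NULL);
    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    n = wcrtomb(out, wc, &state);
    if (n == (size_t)-1)
        return -1;                     /* not representable (errno == EILSEQ) */
    out[n] = '\0';
    return (int)n;
}

/* Hypothetical helper: print U+2592 MEDIUM SHADE (the character that
 * octal \261 selects in code page 437), falling back to '#' when the
 * locale's encoding has no such character.
 * Assumption: wchar_t values are Unicode code points. */
int print_block(FILE *fp)
{
    char buf[MB_LEN_MAX + 1];

    if (encode_for_locale((wchar_t)0x2592, buf) < 0)
        return fputs("#", fp);
    return fputs(buf, fp);
}
```

A program would call setlocale(LC_ALL, "") once at startup and then print_block(stdout). Note that this only follows the locale; it cannot tell whether the terminal really decodes those bytes the same way.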

answered Nov 12 '16 at 19:21

Thomas Dickey


Initialize the locale of your app with setlocale(LC_ALL, "") and then call nl_langinfo(CODESET). This gives you the value resolved from the LANG, LC_CTYPE and LC_ALL environment variables.

This does not tell you how the terminal emulator actually behaves, but it is what pretty much every application relies on. If it gives an incorrect result, then your system is misconfigured and almost all other apps will also work incorrectly in your terminal emulator. As an app developer, it's not your job to detect and fix a broken setup; you can safely assume it's set up correctly. As a sysadmin, distribution developer, or user hacking around on your system, it's your job to make sure the locale variables and the terminal emulator's actual behavior match.
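A minimal sketch of that approach follows. The function names codeset_from_environment and codeset_is_utf8 are invented for this example; setlocale, nl_langinfo and strcasecmp are the standard POSIX calls. The exact codeset strings differ between systems (for instance "UTF-8" versus "utf8"), so the comparison here is deliberately loose.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Hypothetical helper: adopt the locale named by the environment
 * (LC_ALL, LC_CTYPE, LANG) and return its codeset name, e.g. "UTF-8". */
const char *codeset_from_environment(void)
{
    /* If the requested locale is not installed, setlocale() returns NULL
     * and the program stays in the "C" locale; nl_langinfo() then
     * reports the codeset of the "C" locale. */
    (void)setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: case-insensitive check for the common spellings
 * of UTF-8 in codeset names. */
int codeset_is_utf8(const char *codeset)
{
    assert(codeset != NULL);
    return strcasecmp(codeset, "UTF-8") == 0 ||
           strcasecmp(codeset, "UTF8") == 0;
}
```

A caller could branch on codeset_is_utf8(codeset_from_environment()) to decide between UTF-8 output and a legacy byte. As the answer says, this reflects the locale configuration, not a measurement of the terminal.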

answered Nov 13 '16 at 12:34

egmont


If the terminal emulator is well-designed and configured appropriately, it will ensure that the environment variable LC_CTYPE is set to a value that is consistent with its encoding. Unfortunately, in practice, checking LC_CTYPE is not always reliable: it may be unset or wrong. (Other environment variables may also convey the locale settings; see "What should I set my locale to and what are the implications of doing so?" for details.)

If you have some idea of which character encodings are likely, you may be able to determine the encoding via heuristics. Display a byte string that has a different width in different encodings, and find out by how much it makes the cursor move. This won't help you in all cases, for example it can't distinguish between single-byte encodings. But if for you the only two likely possibilities are UTF-8 and one legacy encoding, that works well. In my shell startup, I set LC_CTYPE in this way, using a script widthof which I posted in "Get the display width of a string of characters". widthof -1 displays a 4-byte string which represents 2 characters in UTF-8, and in which only 3 bytes are printable latin-N characters. Thus a width of 2 means UTF-8 (or some other multibyte encoding, which is not likely for me), a width of 3 means latin-N (with no way to know N), and 4 means some single-byte encoding with printable characters in the range 128–159.

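# widthof is the script mentioned above; locale_search is a helper that is
# not shown here (presumably it picks an installed locale with one of the given suffixes)
widthof -1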
case $? in
 0) export LC_CTYPE=C;; # 7-bit charset
 2) locale_search .utf8 .UTF-8;; # utf8
 3) locale_search .iso88591 .ISO8859-1 .latin1 '';; # 8-bit with nonprintable 128-159, we assume latin1
 4) locale_search .iso88591 .ISO8859-1 .latin1 '';; # some full 8-bit charset, we assume latin1
 *) export LC_CTYPE=C;; # weird charset
esac
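For a C program, the same heuristic could be sketched roughly as below: print a probe string and ask the terminal where the cursor is before and after, using the cursor position report (ESC [ 6 n, answered with ESC [ row ; col R) that most terminal emulators implement. This is not the widthof script; the function names are invented for this example, and the probe string mentioned afterwards is an assumption chosen to match the description above, not necessarily the bytes widthof -1 sends.

```c
#include <assert.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

/* Parse a cursor position report of the form ESC [ row ; col R.
 * Returns 0 on success, -1 if the reply does not have that shape. */
int parse_cursor_report(const char *reply, int *row, int *col)
{
    char end = 0;

    assert(reply != NULL && row != NULL && col != NULL);
    if (sscanf(reply, "\033[%d;%d%c", row, col, &end) != 3 || end != 'R')
        return -1;
    return 0;
}

/* Ask the terminal on fd for the cursor column (1-based), -1 on failure. */
static int cursor_column(int fd)
{
    char reply[32];
    size_t len = 0;
    int row, col;

    if (write(fd, "\033[6n", 4) != 4)
        return -1;
    while (len < sizeof reply - 1) {
        if (read(fd, reply + len, 1) != 1)
            return -1;              /* read error or timeout */
        if (reply[len++] == 'R')
            break;                  /* end of the report */
    }
    reply[len] = '\0';
    return parse_cursor_report(reply, &row, &col) == 0 ? col : -1;
}

/* Write the probe bytes to the terminal on fd (opened read/write, e.g.
 * /dev/tty) and return how many columns the cursor advanced, or -1 if
 * fd is not a terminal or the terminal did not answer.
 * The probe stays on screen, and the result is wrong if the output
 * wraps at the right margin; a real program would handle both. */
int probe_width(int fd, const char *probe, size_t probe_len)
{
    struct termios saved, raw;
    int before = -1, after = -1;

    if (!isatty(fd) || tcgetattr(fd, &saved) != 0)
        return -1;
    raw = saved;
    raw.c_lflag &= ~(tcflag_t)(ICANON | ECHO); /* unbuffered, no echo */
    raw.c_cc[VMIN] = 0;
    raw.c_cc[VTIME] = 5;                       /* 0.5 s read timeout */
    if (tcsetattr(fd, TCSANOW, &raw) != 0)
        return -1;

    before = cursor_column(fd);
    if (before > 0 && write(fd, probe, probe_len) == (ssize_t)probe_len)
        after = cursor_column(fd);

    tcsetattr(fd, TCSANOW, &saved);            /* restore terminal settings */
    return (before > 0 && after > 0) ? after - before : -1;
}
```

One probe that fits the description above is the four bytes "\xC3\xA9\xC3\x89": two characters in UTF-8, three printable characters plus one C1 control in Latin-1, and four printable characters in encodings that use the 128–159 range. A caller would pass it as probe_width(fd, "\xC3\xA9\xC3\x89", 4) and interpret the result as 2, 3 or 4 in the same way as the case statement.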

answered May 28 at 20:47

Gilles
