How to determine the character encoding that a terminal uses in a C/C++ program?
I've noticed that SyncTERM uses a different character encoding than the default macOS terminal emulator, and the two are incompatible. For example, say you want to print a block character in a format string. In SyncTERM, which uses the IBM extended ASCII character encoding, you would use an octal escape sequence like \261. In Terminal.app (and probably iTerm2 as well), this just prints a question mark. Since these terminals use UTF-8, you need to use a \uxxxx escape sequence instead.
So let's say you want to print a certain non-ASCII character in a format string, and you want it to work in all terminal emulators, regardless of character set. I'm guessing you would use an entry in the terminfo database, but I'm not really familiar with terminfo. I need some pointers here.
character-encoding unicode terminal-emulator ascii terminfo
asked Nov 12 '16 at 18:00
user628544
3 Answers
Short:
- terminfo won't help here.
- There is no reliable way to determine what encoding a terminal actually uses.
- Starting from Unicode literals is the way to go, provided that you know what encoding you want to use on the terminal.
- The user has to know what locale is appropriate and what encoding the terminal can handle.
- The C standard has functions for converting "wide" characters, which you will have available on any Unix-like platform (see for example setlocale, wcrtomb and wcsrtombs).
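To illustrate the last point, here is a minimal sketch of converting a wide string to whatever multibyte encoding the current locale uses. The helper name encode_for_locale is made up for this example; only setlocale and wcsrtombs come from the standard library. It assumes the locale has already been initialized and that it matches what the terminal really does, which the program cannot verify.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to the multibyte encoding of
 * the current LC_CTYPE locale. Returns the number of bytes written (not
 * counting the terminating NUL), or (size_t)-1 if the buffer has no room
 * or a character cannot be represented in that encoding. */
size_t encode_for_locale(const wchar_t *text, char *out, size_t out_size)
{
    mbstate_t state;
    const wchar_t *src = text;
    size_t n;

    if (out_size == 0)
        return (size_t)-1;

    memset(&state, 0, sizeof state);          /* start in the initial shift state */
    n = wcsrtombs(out, &src, out_size - 1, &state);
    if (n == (size_t)-1)
        return n;                             /* unrepresentable character */

    out[n] = '\0';                            /* terminate even if output was truncated */
    return n;
}
```

A caller would run setlocale(LC_ALL, "") once at startup, then pass something like L"\u2592" (the shaded block) to encode_for_locale and write the resulting bytes with fputs. In a UTF-8 locale that yields a three-byte sequence; in a locale that cannot represent the character, the function reports failure and the program can fall back to an ASCII substitute.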
answered Nov 12 '16 at 19:21
Thomas Dickey
Initialize the locale of your app with setlocale(LC_ALL, "") and then call nl_langinfo(CODESET). This gives you the codeset resolved from the LANG, LC_CTYPE and LC_ALL environment variables.
This does not tell you how the terminal emulator actually behaves, but it is what pretty much every application relies on. If it gives an incorrect result, then your system is misconfigured and almost all other apps will also work incorrectly in your terminal emulator. As an app developer, it's not your job to detect and fix a broken setup; you can safely assume it's set up correctly. As a sysadmin, distribution developer, or user hacking around on your system, it's your job to make sure the locale variables and the terminal emulator's actual behavior match.
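A rough sketch of that approach follows. The function names (init_locale_from_env, locale_is_utf8, block_char_for_locale) and the "#" fallback are assumptions made for this example; setlocale and nl_langinfo are the real POSIX calls. The comparison accepts both "UTF-8" and "utf8" because the exact spelling of the codeset name can vary between systems.

```c
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Adopt the locale described by the environment (LC_ALL, LC_CTYPE, LANG).
 * Returns nonzero on success. */
int init_locale_from_env(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Nonzero if the current locale's codeset is UTF-8. */
int locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);
    return strcasecmp(codeset, "UTF-8") == 0 || strcasecmp(codeset, "utf8") == 0;
}

/* Pick the bytes to print for a shaded block: U+2592 encoded as UTF-8 when
 * the locale says UTF-8, otherwise a plain ASCII stand-in (an arbitrary
 * choice for this sketch). */
const char *block_char_for_locale(void)
{
    return locale_is_utf8() ? "\xE2\x96\x92" : "#";
}
```

Call init_locale_from_env() once at startup and then print block_char_for_locale() wherever the block is needed. Supporting a specific legacy codeset would mean adding another branch keyed on the string that nl_langinfo(CODESET) returns for it.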
answered Nov 13 '16 at 12:34
egmont
If the terminal emulator is well-designed and configured appropriately, it will ensure that the environment variable LC_CTYPE is set to a value that is consistent with its encoding. Unfortunately, in practice, checking LC_CTYPE is not always reliable: it may be unset or wrong. (Other environment variables may convey the locale settings; see "What should I set my locale to and what are the implications of doing so?" for details.)
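For reference, the variables that can determine the character type settings are consulted in a fixed order under POSIX: LC_ALL first, then LC_CTYPE, then LANG, with empty values treated as unset. A small sketch of that lookup in C, where ctype_locale_from_env is a name invented for this example:

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the locale name that governs LC_CTYPE according to the POSIX
 * precedence: LC_ALL, then LC_CTYPE, then LANG. Empty values count as
 * unset. Returns NULL if none is set, in which case the system falls back
 * to an implementation-defined default (commonly the "C" locale). */
const char *ctype_locale_from_env(void)
{
    static const char *const names[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        const char *value = getenv(names[i]);
        if (value != NULL && value[0] != '\0')
            return value;
    }
    return NULL;
}
```

This only reports what the environment claims. It is the same information setlocale(LC_ALL, "") acts on, so it inherits the same weakness: the value may not match what the terminal really does.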
If you have some idea of which character encodings are likely, you may be able to determine the encoding via heuristics. Display a byte string that has a different width in different encodings, and find out how far it moves the cursor. This won't help in all cases; for example, it can't distinguish between single-byte encodings. But if the only two likely possibilities are UTF-8 and one legacy encoding, it works well. In my shell startup, I set LC_CTYPE in this way, using a script widthof which I posted in "Get the display width of a string of characters". widthof -1 displays a 4-byte string which represents 2 characters in UTF-8, and in which only 3 bytes are printable latin-N characters. Thus a width of 2 means UTF-8 (or some other multibyte encoding, which is not likely for me), a width of 3 means latin-N (with no way to know N), and 4 means some single-byte encoding with printable characters in the range 128–159.
widthof -1
case $? in
0) export LC_CTYPE=C;; # 7-bit charset
2) locale_search .utf8 .UTF-8;; # utf8
3) locale_search .iso88591 .ISO8859-1 .latin1 '';; # 8-bit with nonprintable 128-159, we assume latin1
4) locale_search .iso88591 .ISO8859-1 .latin1 '';; # some full 8-bit charset, we assume latin1
*) export LC_CTYPE=C;; # weird charset
esac
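The same idea could be carried over to a C program. The sketch below covers only the parts that do not need terminal I/O: parsing the cursor position report that a terminal sends in reply to the ESC [ 6 n request (the reply has the form ESC [ row ; col R), and mapping a measured width to a guess using the same table as the shell snippet. Both function names are made up for this example. Writing the probe bytes, switching the terminal to raw mode and reading the reply are left out, and the width is assumed to be the column after printing minus the column before.

```c
#include <stdio.h>

/* Parse a cursor position report of the form ESC [ row ; col R.
 * Returns 1 and stores the 1-based row and column on success, 0 otherwise. */
int parse_cursor_report(const char *reply, int *row, int *col)
{
    int r, c, end = 0;

    if (sscanf(reply, "\033[%d;%d%n", &r, &c, &end) != 2 || reply[end] != 'R')
        return 0;
    *row = r;
    *col = c;
    return 1;
}

/* Map the number of columns the 4-byte probe string advanced the cursor to
 * an encoding guess, following the case statement above. */
const char *guess_encoding_from_width(int width)
{
    switch (width) {
    case 0:  return "ASCII";    /* 7-bit charset: nothing was displayed */
    case 2:  return "UTF-8";    /* 4 bytes shown as 2 characters */
    case 3:  return "latin-N";  /* one byte fell in nonprintable 128-159 */
    case 4:  return "8-bit";    /* every byte printable */
    default: return "unknown";  /* weird charset */
    }
}
```

As with the shell version, this is a heuristic: it only separates the handful of cases listed and says nothing about which single-byte encoding is in use.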
answered May 28 at 20:47
Gilles