How to determine the character encoding that a terminal uses in a C/C++ program?

I've noticed that SyncTERM uses a different character encoding than the default macOS terminal emulator, and the two are incompatible. For example, say you want to print a block character in a format string. In SyncTERM, which uses the IBM Extended ASCII character encoding, you would use an octal escape sequence like \261. In Terminal.app (and probably iTerm2 as well), this just prints a question mark: these terminals use UTF-8, so you need to use the \uxxxx escape sequence instead.



So let's say you want to print a certain non-ASCII character in a format string, and you want it to work in all terminal emulators, regardless of character set. I'm guessing you would use an entry in the terminfo database, but I'm not really familiar with terminfo. I need some pointers here.
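
To make the mismatch concrete, here is a minimal sketch in C of one glyph in both encodings. It assumes the block character meant above is U+2592 (medium shade), which is what byte 0261 (0xB1) displays in IBM code page 437. The names `shade_bytes` and `use_utf8` are hypothetical, not part of any library.

```c
#include <stddef.h>

/* U+2592 MEDIUM SHADE, written out for the two encodings discussed above.
 * Assumption: "IBM Extended ASCII" means code page 437. */
static const char SHADE_CP437[] = "\261";          /* one byte: 0xB1 */
static const char SHADE_UTF8[]  = "\xE2\x96\x92";  /* three bytes */

/* Return the byte string that draws the shade block.
 * use_utf8 is a hypothetical flag the caller has to supply somehow. */
const char *shade_bytes(int use_utf8)
{
    return use_utf8 ? SHADE_UTF8 : SHADE_CP437;
}
```

Writing `shade_bytes(1)` to a UTF-8 terminal or `shade_bytes(0)` to a CP437 terminal should draw the block; the open problem is how the program learns which of the two applies.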
















asked Nov 12 '16 at 18:00 by user628544




















3 Answers






In short:

• terminfo won't help here

• there is no reliable way to determine what encoding a terminal actually uses

• starting from Unicode literals is the way to go, provided that you know what encoding you want to use on the terminal

• the user has to know what locale is appropriate and what encoding the terminal can do

• the C standard has functions for converting "wide" characters, which you will have available on any Unix-like platform (see for example setlocale, wcrtomb and wcsrtombs)
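
The conversion mentioned in the last point could look roughly like the sketch below. It assumes that `wchar_t` holds Unicode code points on the platform (true for glibc, not guaranteed by the C standard) and that the caller has already run `setlocale(LC_ALL, "")`. The helper names `encode_for_locale` and `print_shade` are made up for illustration.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Convert one wide character to the multibyte encoding of the current
 * locale. out must have room for MB_LEN_MAX + 1 bytes.
 * Returns the number of bytes written, or 0 if the character cannot be
 * represented in that encoding. */
size_t encode_for_locale(wchar_t wc, char *out)
{
    mbstate_t state;
    memset(&state, 0, sizeof state);      /* start in the initial shift state */

    size_t n = wcrtomb(out, wc, &state);
    if (n == (size_t)-1) {                /* not representable in this locale */
        out[0] = '\0';
        return 0;
    }
    out[n] = '\0';
    return n;
}

/* Print U+2592, or an ASCII stand-in when the locale cannot express it.
 * Assumption: wchar_t values are Unicode code points. */
void print_shade(void)
{
    char buf[MB_LEN_MAX + 1];
    if (encode_for_locale((wchar_t)0x2592, buf) > 0)
        fputs(buf, stdout);
    else
        fputs("#", stdout);
}
```

In a UTF-8 locale `print_shade()` should emit the three-byte UTF-8 form; in a locale that lacks the character it falls back to `#`. Whether the terminal really decodes those bytes the same way remains the user's responsibility.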







answered Nov 12 '16 at 19:21 by Thomas Dickey (accepted answer)






















Initialize the locale of your app with setlocale(LC_ALL, "") and then call nl_langinfo(CODESET). This gives you the value resolved from the LANG, LC_CTYPE and LC_ALL environment variables.

This does not tell you how the terminal emulator actually behaves, but it is what pretty much every application relies on. If it gives an incorrect result, then your system is misconfigured and almost all other apps will also work incorrectly in your terminal emulator. As an app developer, it's not your job to detect and fix a broken setup; you can safely assume it's set up correctly for you. As a sysadmin, distribution developer, or user hacking around on your system, it's your job to make sure the locale variables and the terminal emulator's actual behavior match.
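
A minimal sketch of those two calls in C, assuming a POSIX system that provides `<langinfo.h>`. The helper names `environment_codeset` and `codeset_is_utf8` are hypothetical, and the exact spelling of the returned codeset name varies between systems.

```c
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Adopt the locale requested by the environment and return the name of
 * its character encoding, for example "UTF-8" or "ISO-8859-1". */
const char *environment_codeset(void)
{
    setlocale(LC_ALL, "");        /* on failure the "C" locale stays active */
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: nonzero if the codeset name denotes UTF-8.
 * Both spellings are accepted because systems differ. */
int codeset_is_utf8(const char *name)
{
    return strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0;
}
```

A program could branch on `codeset_is_utf8(environment_codeset())` once at startup to choose between UTF-8 output and a legacy byte sequence.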


















answered Nov 13 '16 at 12:34 by egmont




















If the terminal emulator is well designed and configured appropriately, it will ensure that the environment variable LC_CTYPE is set to a value consistent with its encoding. Unfortunately, in practice, checking LC_CTYPE is not always reliable: it may be unset or wrong. (Other environment variables may also convey the locale settings; see "What should I set my locale to and what are the implications of doing so?" for details.)

If you have some idea of which character encodings are likely, you may be able to determine the encoding heuristically. Display a byte string that has a different width in different encodings, and find out how far it moves the cursor. This won't help in all cases; for example, it can't distinguish between single-byte encodings. But if the only two likely possibilities are UTF-8 and one legacy encoding, it works well.

In my shell startup, I set LC_CTYPE in this way, using a script, widthof, which I posted in "Get the display width of a string of characters". widthof -1 displays a 4-byte string which represents 2 characters in UTF-8, and in which only 3 bytes are printable latin-N characters. Thus a width of 2 means UTF-8 (or some other multibyte encoding, which is not likely for me), a width of 3 means latin-N (with no way to know N), and 4 means some single-byte encoding with printable characters in the range 128–159.

    # locale_search is a helper function that is not shown in this answer
    widthof -1
    case $? in
      0) export LC_CTYPE=C;; # 7-bit charset
      2) locale_search .utf8 .UTF-8;; # utf8
      3) locale_search .iso88591 .ISO8859-1 .latin1 '';; # 8-bit with nonprintable 128-159, we assume latin1
      4) locale_search .iso88591 .ISO8859-1 .latin1 '';; # some full 8-bit charset, we assume latin1
      *) export LC_CTYPE=C;; # weird charset
    esac
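
For a C program, part of the same heuristic might be sketched as follows. One common way to learn the cursor column is the ANSI "device status report" request (ESC [ 6 n), to which many terminals reply with ESC [ row ; col R. Whether widthof works exactly like this is not shown here, so treat that as an assumption. The sketch covers only parsing the reply and mapping the measured width to a guess, mirroring the case statement above. Putting the terminal into raw mode, sending the probe bytes and reading the reply are left out, and both function names are hypothetical.

```c
#include <stdio.h>

/* Parse a cursor position report of the form ESC [ row ; col R.
 * Returns 1 on success and stores the 1-based row and column. */
int parse_cursor_report(const char *reply, int *row, int *col)
{
    char terminator;
    if (sscanf(reply, "\033[%d;%d%c", row, col, &terminator) != 3)
        return 0;
    return terminator == 'R';
}

/* Map the measured width of the 4-byte probe string to a codeset guess.
 * Widths 3 and 4 are both treated as latin1, as in the shell snippet. */
const char *guess_charset(int width)
{
    switch (width) {
    case 2:  return "UTF-8";
    case 3:  return "ISO-8859-1";  /* latin-N, N unknown; latin1 assumed */
    case 4:  return "ISO-8859-1";  /* full 8-bit charset; latin1 assumed */
    default: return "US-ASCII";    /* width 0 or anything unexpected */
    }
}
```

The width passed to `guess_charset` would be the difference between the columns reported before and after writing the probe string.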

















answered May 28 at 20:47 by Gilles






















                                         
