Convert Unicode surrogate pair to literal string

Question (score 15)

I am trying to read a high Unicode character from one string into another. For brevity, I will simplify my code as shown below:



public static void UnicodeTest()
{
    var highUnicodeChar = "𝐀"; // Not the standard A

    var result1 = highUnicodeChar; // this works
    var result2 = highUnicodeChar[0].ToString(); // returns \ud835
}



When I assign highUnicodeChar to result1 directly, it retains its literal value of 𝐀. When I try to access it by index, it returns \ud835. As I understand it, this is a surrogate pair of UTF-16 characters used to represent a UTF-32 character. I am pretty sure this problem has to do with trying to implicitly convert a char to a string.



In the end, I want result2 to yield the same value as result1. How can I do this?










      c# .net unicode unicode-escapes






      asked Oct 1 at 3:42 by hargle

          2 Answers
          Answer (accepted, score 21)

          In Unicode, you have code points. These range from U+0000 to U+10FFFF, so any code point fits in 21 bits. Your character 𝐀, "Mathematical Bold Capital A", has a code point of U+1D400.



          In Unicode encodings, you have code units. These are the natural unit of the encoding: 8-bit for UTF-8, 16-bit for UTF-16, and so on. One or more code units encode a single code point.
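
To make the code point versus code unit distinction concrete, here is a minimal sketch that counts how many code units the same single code point needs in each encoding. The class and method names (CodeUnitCounts, Utf8Units and so on) are made up for illustration; only the System.Text.Encoding calls are real .NET APIs.

```csharp
using System;
using System.Text;

public static class CodeUnitCounts
{
    // Number of UTF-8 code units (one byte each) needed for the text.
    public static int Utf8Units(string text) => Encoding.UTF8.GetByteCount(text);

    // Number of UTF-16 code units; a .NET string stores these directly.
    public static int Utf16Units(string text) => text.Length;

    // Number of UTF-32 code units (four bytes each; GetByteCount excludes any byte order mark).
    public static int Utf32Units(string text) => Encoding.UTF32.GetByteCount(text) / 4;

    public static void Main()
    {
        string boldA = "\U0001D400"; // U+1D400, MATHEMATICAL BOLD CAPITAL A
        Console.WriteLine($"UTF-8: {Utf8Units(boldA)}, UTF-16: {Utf16Units(boldA)}, UTF-32: {Utf32Units(boldA)}");
        // prints "UTF-8: 4, UTF-16: 2, UTF-32: 1"
    }
}
```

One code point, three different code unit counts: that is the whole reason string.Length and "number of characters" can disagree.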



          In UTF-16, two code units that together encode a single code point are called a surrogate pair. Surrogate pairs are used to encode any code point that does not fit in 16 bits, i.e. U+10000 and up.



          This gets a little tricky in .NET, because a .NET char (System.Char) represents a single UTF-16 code unit, and a .NET string is a sequence of such code units.



          So your code point 𝐀 (U+1D400) can't fit in 16 bits and needs a surrogate pair, meaning your string has two code units in it:



          var highUnicodeChar = "𝐀";
          char a = highUnicodeChar[0]; // code unit 0xD835
          char b = highUnicodeChar[1]; // code unit 0xDC00


          This means that when you index into the string like that, you get only one half of the surrogate pair.
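
For anyone curious how the two halves map back to U+1D400, here is a rough sketch of the UTF-16 arithmetic, checked against the built-in char.ConvertToUtf32. The SurrogateMath class and its Combine method are hypothetical names used only for this illustration, and Combine assumes it is given a valid high/low surrogate pair.

```csharp
using System;

public static class SurrogateMath
{
    // Combine a high and a low surrogate into a code point by hand.
    // Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
    public static int Combine(char high, char low) =>
        0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

    public static void Main()
    {
        string s = "\U0001D400"; // stored as the two code units 0xD835 and 0xDC00
        int manual = Combine(s[0], s[1]);
        int builtIn = char.ConvertToUtf32(s, 0);
        Console.WriteLine($"U+{manual:X} U+{builtIn:X}");       // prints "U+1D400 U+1D400"
        Console.WriteLine(char.ConvertFromUtf32(builtIn) == s); // prints "True"
    }
}
```

In real code you would normally just call char.ConvertToUtf32 and char.ConvertFromUtf32 rather than doing the bit twiddling yourself.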



          You can use IsSurrogatePair to test for a surrogate pair. For instance:



          string GetFullCodePointAtIndex(string s, int idx) =>
          s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);
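
As a usage sketch, the helper above can be used to walk a whole string one code point at a time. SplitIntoCodePoints and the CodePointWalker class are names I made up for this example; note that an unpaired surrogate simply comes back as its own one-unit element, which may or may not be what you want.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointWalker
{
    // Same helper as shown above.
    public static string GetFullCodePointAtIndex(string s, int idx) =>
        s.Substring(idx, char.IsSurrogatePair(s, idx) ? 2 : 1);

    // Hypothetical helper: splits a string into one substring per code point.
    public static List<string> SplitIntoCodePoints(string s)
    {
        var parts = new List<string>();
        for (int i = 0; i < s.Length; )
        {
            string codePoint = GetFullCodePointAtIndex(s, i);
            parts.Add(codePoint);
            i += codePoint.Length; // advance by one or two code units
        }
        return parts;
    }

    public static void Main()
    {
        List<string> parts = SplitIntoCodePoints("x\U0001D400y");
        Console.WriteLine(parts.Count);     // prints "3"
        Console.WriteLine(parts[1].Length); // prints "2"
    }
}
```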


          It is important to note that the rabbit hole of variable-length encoding in Unicode doesn't end at the code point. A grapheme cluster is the "visible thing" that most people, when asked, would ultimately call a "character". A grapheme cluster is made from one or more code points: a base character and zero or more combining characters. An example of a combining character is a combining diaeresis (umlaut, U+0308), or any of the various other decorations/modifiers you might want to add. See this answer for a horrifying example of what combining characters can do.



          To test for a combining character, you can use GetUnicodeCategory (on char or CharUnicodeInfo) and check for UnicodeCategory.EnclosingMark, NonSpacingMark, or SpacingCombiningMark.
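
A minimal sketch of that category check, assuming you only care about the three mark categories. IsCombiningMark and the CombiningMarks class are illustrative names, not framework APIs; CharUnicodeInfo.GetUnicodeCategory(string, int) is the real call doing the work.

```csharp
using System;
using System.Globalization;

public static class CombiningMarks
{
    // True when the code point starting at index falls in one of the three Unicode mark categories.
    public static bool IsCombiningMark(string s, int index)
    {
        UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, index);
        return category == UnicodeCategory.NonSpacingMark
            || category == UnicodeCategory.SpacingCombiningMark
            || category == UnicodeCategory.EnclosingMark;
    }

    public static void Main()
    {
        string s = "u\u0308"; // 'u' followed by U+0308 COMBINING DIAERESIS
        Console.WriteLine(IsCombiningMark(s, 0)); // prints "False"
        Console.WriteLine(IsCombiningMark(s, 1)); // prints "True"
    }
}
```

This is only a building block. Proper grapheme cluster segmentation has more rules than "base plus marks", so for real text it is probably safer to lean on a library routine than to hand-roll the grouping.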






          • Perfect! This solution is exactly what I was looking for, and great explanation as well.
            – hargle
            Oct 1 at 6:09

          • An example of "zero or more combining characters" can be seen on stackoverflow.com/questions/1732348/… and there are tools to generate this, like lingojam.com/GlitchTextGenerator
            – Ismael Miguel
            Oct 1 at 12:37

          • “code points are 21 bits long” – though it's true that 21 bits is the amount of data with which you can represent any code point, it's not very meaningful in practice to say that this is the “length of a code point”. I don't think such a representation is used anywhere important; for direct access of code points you'd actually use UTF-32 or perhaps store the code points in 64 or even 128 bits for reasons of memory uniformity. Also: umlauts are usually not implemented as combining characters, since most combinations already have a single code point allocated to them.
            – leftaroundabout
            Oct 1 at 15:30

          • @leftaroundabout when identifying the distinction between Unicode and one of its encodings, I find the concept of "21 bits" to be a good way to break people loose of the sorta-but-not-really correct "UTF32=codepoint" idea.
            – Cory Nelson
            Oct 1 at 16:41

















          Answer (score 8)

          It appears that you want to extract the first "atomic" character from the user point of view (i.e. the first Unicode grapheme cluster) from the highUnicodeChar string, where an "atomic" character includes both halves of a surrogate pair.



          You can use StringInfo.GetTextElementEnumerator() to do just this, breaking a string down into atomic chunks then taking the first.



          First, define the following extension method:



          using System.Collections.Generic;
          using System.Globalization;

          public static class TextExtensions
          {
              public static IEnumerable<string> TextElements(this string s)
              {
                  // StringInfo.GetTextElementEnumerator returns a non-generic TextElementEnumerator
                  // (a .NET 1.1-era API) that doesn't implement IEnumerable<string>, so adapt it here.
                  if (s == null)
                      yield break;
                  var enumerator = StringInfo.GetTextElementEnumerator(s);
                  while (enumerator.MoveNext())
                      yield return enumerator.GetTextElement();
              }
          }




          Now, you can do:



          var result2 = highUnicodeChar.TextElements().FirstOrDefault() ?? "";


          Note that StringInfo.GetTextElementEnumerator() will also group Unicode combining characters, so that the first grapheme cluster of the string Ĥ=T̂+V̂ will be Ĥ not H.
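
If you only ever need the first element, a shorter sketch is possible with StringInfo.GetNextTextElement, which returns the text element starting at a given index. The FirstTextElement class and its First method are names invented for this example, and returning "" for null or empty input is my own assumption about the desired behavior.

```csharp
using System;
using System.Globalization;

public static class FirstTextElement
{
    // Returns the first text element of s, or "" for a null or empty string.
    public static string First(string s) =>
        string.IsNullOrEmpty(s) ? "" : StringInfo.GetNextTextElement(s, 0);

    public static void Main()
    {
        string surrogate = "\U0001D400bc"; // U+1D400 followed by "bc"
        string combining = "H\u0302=T";    // H + U+0302 COMBINING CIRCUMFLEX ACCENT, then "=T"

        Console.WriteLine(First(surrogate).Length);       // prints "2"
        Console.WriteLine(First(combining) == "H\u0302"); // prints "True"
        Console.WriteLine(new StringInfo(combining).LengthInTextElements); // prints "3"
    }
}
```

Both cases come back whole: the surrogate pair as two code units, and the base letter together with its combining accent.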



          Sample fiddle here.






          • +1 this is the only reasonable approach. Unfortunately the .NET API (same as most other languages) does not exactly encourage it. There should be a linter for .NET that flags every random access of chars inside a string as an error.
            – Konrad Rudolph
            Oct 1 at 13:44











          Accepted answer by Cory Nelson: answered Oct 1 at 4:08, edited Oct 1 at 16:58

          21.9k24581




          21.9k24581











          • Perfect! This solution is exactly what I was looking for, and great explanation as well.
            – hargle
            Oct 1 at 6:09










          • An example of "zero or more combining characters" can be seen on stackoverflow.com/questions/1732348/… and there are tools to generate this, like lingojam.com/GlitchTextGenerator
            – Ismael Miguel
            Oct 1 at 12:37






          • 1




            “code points are 21 bits long” – though it's true that 21 bits is the amount of data with which you can represent any code point, it's not really practically very meaningful to say that this is the “length of a code point”. I don't think such a representation is used anywhere important; for direct access of codepoints you'd actually use UTF-32 or perhaps store the code points in 64 or even 128 bits for reasons of memory uniformity. — Also: umlauts are usually not implemented as combining characters, since most combinations already have a single code point allocated to them.
            – leftaroundabout
            Oct 1 at 15:30











          • @leftaroundabout when identifying the distinction between Unicode and one of its encoding, I find the concept of "21 bits" to be a good way to break people loose of the sorta-but-not-really correct "UTF32=codepoint" idea.
            – Cory Nelson
            Oct 1 at 16:41
















          • Perfect! This solution is exactly what I was looking for, and great explanation as well.
            – hargle
            Oct 1 at 6:09










          • An example of "zero or more combining characters" can be seen on stackoverflow.com/questions/1732348/… and there are tools to generate this, like lingojam.com/GlitchTextGenerator
            – Ismael Miguel
            Oct 1 at 12:37






          • 1




            “code points are 21 bits long” – though it's true that 21 bits is the amount of data with which you can represent any code point, it's not really practically very meaningful to say that this is the “length of a code point”. I don't think such a representation is used anywhere important; for direct access of codepoints you'd actually use UTF-32 or perhaps store the code points in 64 or even 128 bits for reasons of memory uniformity. — Also: umlauts are usually not implemented as combining characters, since most combinations already have a single code point allocated to them.
            – leftaroundabout
            Oct 1 at 15:30











          • @leftaroundabout when identifying the distinction between Unicode and one of its encoding, I find the concept of "21 bits" to be a good way to break people loose of the sorta-but-not-really correct "UTF32=codepoint" idea.
            – Cory Nelson
            Oct 1 at 16:41















          Perfect! This solution is exactly what I was looking for, and great explanation as well.
          – hargle
          Oct 1 at 6:09




          Perfect! This solution is exactly what I was looking for, and great explanation as well.
          – hargle
          Oct 1 at 6:09












          An example of "zero or more combining characters" can be seen on stackoverflow.com/questions/1732348/… and there are tools to generate this, like lingojam.com/GlitchTextGenerator
          – Ismael Miguel
          Oct 1 at 12:37




          An example of "zero or more combining characters" can be seen on stackoverflow.com/questions/1732348/… and there are tools to generate this, like lingojam.com/GlitchTextGenerator
          – Ismael Miguel
          Oct 1 at 12:37




          1




          1




          “code points are 21 bits long” – though it's true that 21 bits is the amount of data with which you can represent any code point, it's not really practically very meaningful to say that this is the “length of a code point”. I don't think such a representation is used anywhere important; for direct access of codepoints you'd actually use UTF-32 or perhaps store the code points in 64 or even 128 bits for reasons of memory uniformity. — Also: umlauts are usually not implemented as combining characters, since most combinations already have a single code point allocated to them.
          – leftaroundabout
          Oct 1 at 15:30





          “code points are 21 bits long” – though it's true that 21 bits is the amount of data with which you can represent any code point, it's not really practically very meaningful to say that this is the “length of a code point”. I don't think such a representation is used anywhere important; for direct access of codepoints you'd actually use UTF-32 or perhaps store the code points in 64 or even 128 bits for reasons of memory uniformity. — Also: umlauts are usually not implemented as combining characters, since most combinations already have a single code point allocated to them.
          – leftaroundabout
          Oct 1 at 15:30













          @leftaroundabout when identifying the distinction between Unicode and one of its encoding, I find the concept of "21 bits" to be a good way to break people loose of the sorta-but-not-really correct "UTF32=codepoint" idea.
          – Cory Nelson
          Oct 1 at 16:41




          @leftaroundabout when identifying the distinction between Unicode and one of its encoding, I find the concept of "21 bits" to be a good way to break people loose of the sorta-but-not-really correct "UTF32=codepoint" idea.
          – Cory Nelson
          Oct 1 at 16:41












          up vote
          8
          down vote













          It appears that you want to extract the first "atomic" character from the user point of view (i.e. the first Unicode grapheme cluster) from the highUnicodeChar string, where an "atomic" character includes both halves of a surrogate pair.



          You can use StringInfo.GetTextElementEnumerator() to do just this, breaking a string down into atomic chunks then taking the first.



          First, define the following extension method:



          public static class TextExtensions

          public static IEnumerable<string> TextElements(this string s)

          // StringInfo.GetTextElementEnumerator is a .Net 1.1 class that doesn't implement IEnumerable<string>, so convert
          if (s == null)
          yield break;
          var enumerator = StringInfo.GetTextElementEnumerator(s);
          while (enumerator.MoveNext())
          yield return enumerator.GetTextElement();




          Now, you can do:



          var result2 = highUnicodeChar.TextElements().FirstOrDefault() ?? "";


          Note that StringInfo.GetTextElementEnumerator() will also group Unicode combining characters, so that the first grapheme cluster of the string Ĥ=T̂+V̂ will be Ĥ not H.



          Sample fiddle here.






          share|improve this answer






















          • +1 this is the only reasonable approach. Unfortunately the .NET API (same as most other languages) does not exactly encourage it. There should be a linter for .NET that flags very random access of chars inside a string as an error.
            – Konrad Rudolph
            Oct 1 at 13:44


















































          edited Oct 1 at 10:08

























          answered Oct 1 at 4:08









          dbc

          50.8k reputation, 7 gold, 63 silver, 108 bronze badges






































































































