Why is it not possible to search through text file contents encoded in UTF-16?

4















I understand that tools such as catfish and gnome-search-utils can search inside file contents that are UTF-8 encoded. To search for words or numbers within UTF-16 encoded text files, one would have to convert them to UTF-8 via iconv first.
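As a rough illustration of that convert-first approach, here is a minimal Python sketch that decodes a UTF-16 file and then searches the decoded lines. The helper name `grep_utf16` and the sample file are made up for this example, and it assumes the file starts with a BOM.

```python
import tempfile
from pathlib import Path


def grep_utf16(path, needle):
    """Return the lines of a UTF-16 text file that contain `needle`."""
    # With a BOM present, the "utf-16" codec uses it to pick the byte order
    # and strips it from the decoded text.
    text = Path(path).read_text(encoding="utf-16")
    return [line for line in text.splitlines() if needle in line]


# Build a small UTF-16 sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("first line\ninvoice 4711\nlast line\n", encoding="utf-16")
    matches = grep_utf16(sample, "4711")

print(matches)  # ['invoice 4711']
```

The decoding step plays the role of iconv here; the search itself then works on ordinary text.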



If the file is known, text editors like gedit or mousepad have no trouble with UTF-16.



Why is there no search tool (GUI or command-line) with any of the Linux distributions that can handle UTF-16 encoded txt files?



I'm on Xubuntu.










  • 6





    ripgrep 0.5.0 supports UTF-16, but (rant) it is a terrible encoding that should never be used, as 1) a UTF-16 string cannot be a C string if it contains any ASCII characters, 2) it is just as much a variable-width encoding as UTF-8, and 3) many tools choke on the BOM, but it is necessary to disambiguate endianness.

    – Fox
    May 9 '17 at 15:52






  • 2





    See also utf8everywhere.com

    – tripleee
    May 9 '17 at 18:40











  • @Fox: thanks. ripgrep seems powerful.

    – Enteneller
    May 9 '17 at 21:19











  • @Fox -- you would no more encode a user string in UTF-16 in C than you would encode it in UTF-8. C only handles ASCII, and you need library functions to convert strings to (or from) UTF-8 or UTF-16. However, I tend to agree UTF-16 is icky, especially since it's often UCS-2 in disguise (no BOM, only supports up to Unicode-2), and especially when talking about Windows files (log files and reg files, for example, may not have BOMs).

    – Astara
    Aug 25 '17 at 2:20






  • 1





    @Astara My statement about C strings was a quick summary of: if a character is in the subset of Unicode that overlaps with ASCII, its encoding in UTF-16 (or UCS-2) contains a null byte. The only character containing a null byte in UTF-8 is NUL itself. This means that you can use functions from the standard C library to read, write, copy, etc. UTF-8 strings, but not UTF-16. You won't get proper change-case support, of course, but the basics are free. In any case, this appears to be a digression from a digression.

    – Fox
    Aug 25 '17 at 2:38















search unicode text






edited May 9 '17 at 21:53
Gilles

asked May 9 '17 at 15:33
Enteneller







2 Answers


















6














UTF-16 (or UCS-2) is highly unfriendly to the null-terminated strings used by the C standard library and the POSIX ABI. For example, command-line arguments are terminated by NULs (bytes with value zero), and any UTF-16 character with numerical value < 256 contains a zero byte, so any string of the usual English letters would be impossible to represent in UTF-16 in a command-line argument.
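To make the zero-byte problem concrete, here is a small Python sketch. It only mimics what a NUL-terminated reader would see; the helper `as_c_string` is a made-up name for illustration, and little-endian UTF-16 without a BOM is assumed.

```python
# Encode the same ASCII text both ways and look for zero bytes.
text = "grep"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no BOM (an assumption)

print(utf8)   # b'grep'
print(utf16)  # b'g\x00r\x00e\x00p\x00'


def as_c_string(raw: bytes) -> bytes:
    """Return the bytes a NUL-terminated reader would see (hypothetical helper)."""
    # Everything from the first zero byte onward is invisible to such a reader.
    return raw.split(b"\x00", 1)[0]


print(as_c_string(utf8))   # b'grep'
print(as_c_string(utf16))  # b'g'
```

The UTF-8 bytes pass through unchanged, while the UTF-16 bytes are cut off after the first letter.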



That in turn means that either the utilities would need to take input in some other format (say UTF-8) and convert to UTF-16; or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs to other utilities somewhat difficult.



Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I'd just guess that few care to use UTF-16 enough to be motivated to create tools for that.
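The first option above (accept the pattern in another encoding and convert it to UTF-16 before searching) might be sketched in Python as follows. This is only an illustration, not how any existing tool works: `find_in_utf16le` is a hypothetical name, and the data is assumed to be little-endian UTF-16 without a BOM.

```python
def find_in_utf16le(data: bytes, pattern: str) -> int:
    """Return the index (in 16-bit code units) of the first match, or -1.

    `data` is assumed to be UTF-16LE without a BOM; `pattern` arrives as an
    ordinary string and is converted to UTF-16LE before the byte search.
    """
    needle = pattern.encode("utf-16-le")
    start = 0
    while True:
        pos = data.find(needle, start)
        if pos == -1:
            return -1
        if pos % 2 == 0:
            # Accept only matches aligned on a code-unit boundary.
            return pos // 2
        # A hit at an odd offset straddles two characters; keep looking.
        start = pos + 1


haystack = "disk quota exceeded".encode("utf-16-le")
print(find_in_utf16le(haystack, "quota"))  # 5
print(find_in_utf16le(haystack, "inode"))  # -1
```

The alignment check matters because a raw byte search can otherwise match the second byte of one character followed by the first byte of the next.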






answered May 9 '17 at 17:56
ilkkachu























  • The null terminator in UTF-16 is two null bytes in a row, which encodes a NUL character in UTF-16. If your command line handles UTF-16, then the ASCII (or Unicode) letter 'A' would be internally represented by 0x41 0x00 (on Windows x86, the lower byte always comes first, often called 'LSB' as opposed to 'MSB'). The thing in C is that UTF-16 is an encoding, BELOW what the language uses. C uses user strings which are automatically converted to the platform's native encoding. So a C program printing "hello world\n" works on all C-supporting platforms.

    – Astara
    Aug 25 '17 at 2:15












  • @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.

    – ilkkachu
    Aug 25 '17 at 15:53











  • We aren't talking about '8-bit' interfaces between tools -- we are talking about character interfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that are UTF-16 encoded. The included version of 'find.exe' in /windows/system32 does that.

    – Astara
    Aug 26 '17 at 0:32











  • @Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool.

    – ilkkachu
    Aug 26 '17 at 17:55











  • There are no read/write "system" calls on NT. On Win, there are 'read/write' library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8 to 16-bit when talking to the system.

    – Astara
    Aug 27 '17 at 15:44



















1














Install the ripgrep utility, which supports UTF-16.



For example:



rg pattern filename



ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
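Automatic UTF-16 detection of this kind is typically based on the byte order mark (BOM) at the start of the file. The following Python sketch shows the general idea only; it is not ripgrep's actual implementation, and the function name `decode_with_bom` and the UTF-8 fallback are assumptions made for the example.

```python
import codecs


def decode_with_bom(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode `raw` as UTF-16 if it starts with a UTF-16 BOM, else use `fallback`."""
    # Note: a UTF-32LE BOM (FF FE 00 00) also begins with the UTF-16LE BOM;
    # this sketch ignores UTF-32 and would misread such input.
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: nothing to detect, so fall back to the default encoding.
    return raw.decode(fallback)


big_endian = codecs.BOM_UTF16_BE + "needle in a haystack".encode("utf-16-be")
little_endian = codecs.BOM_UTF16_LE + "needle in a haystack".encode("utf-16-le")
plain = "needle in a haystack".encode("utf-8")

for blob in (big_endian, little_endian, plain):
    print("needle" in decode_with_bom(blob))  # True each time
```

Files without a BOM cannot be recognised this way, which is why an explicit encoding option is still needed for them.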







share|improve this answer






















    Your Answer








    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "106"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f363946%2fwhy-is-it-not-possible-to-search-through-text-file-contents-encoded-in-utf-16%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    6














    UTF-16 (or UCS-2) is highly unfriendly for the null-terminated strings used by the C standard library and the POSIX ABI. For example, command line arguments are terminated by NULs (bytes with value zero), and any UTF-16 character with numerical value < 256 contains a zero byte, so any strings of the usual English letters would be impossible to represent in UTF-16 on a command line argument.



    That in turn means that either the utilities would need to take input in some other format (say UTF-8) and convert to UTF-16; or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs to other utilities somewhat difficult.



    Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I'd just guess that few care to use UTF-16 enough to be motivated to create tools for that.






    share|improve this answer























    • The null termination code in UTF-16 is two null bytes in a row -- which encodes a null byte for UTF-16. If your command line handles UTF-16, then ascii (or unicode) letter 'A' would be internally represented by 0x41 x00 (on windows x86, lower byte is always 1st, often called 'LSB' (vs. MSB). The thing in 'C', is that UTF-16 is an encoding, BELOW what the language uses. 'C' uses user strings which are automatically converted to the platform's native encoding. So a 'C' prog printing "hello worldn" works on all C-supporting platforms.

      – Astara
      Aug 25 '17 at 2:15












    • @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.

      – ilkkachu
      Aug 25 '17 at 15:53











    • We aren't talking '8-bit' interfaces between tools -- we are talking character interterfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that was UTF-16 encoded. The included version of 'find.exe' in /windows/system32, does that.

      – Astara
      Aug 26 '17 at 0:32











    • @Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool.

      – ilkkachu
      Aug 26 '17 at 17:55











    • There are no read/write "system" calls on NT. On Win, there are 'read/write' library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8 to 16-bit when talking to the system.

      – Astara
      Aug 27 '17 at 15:44
















    6














    UTF-16 (or UCS-2) is highly unfriendly for the null-terminated strings used by the C standard library and the POSIX ABI. For example, command line arguments are terminated by NULs (bytes with value zero), and any UTF-16 character with numerical value < 256 contains a zero byte, so any strings of the usual English letters would be impossible to represent in UTF-16 on a command line argument.



    That in turn means that either the utilities would need to take input in some other format (say UTF-8) and convert to UTF-16; or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs to other utilities somewhat difficult.



    Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I'd just guess that few care to use UTF-16 enough to be motivated to create tools for that.






    share|improve this answer























    • The null termination code in UTF-16 is two null bytes in a row -- which encodes a null byte for UTF-16. If your command line handles UTF-16, then ascii (or unicode) letter 'A' would be internally represented by 0x41 x00 (on windows x86, lower byte is always 1st, often called 'LSB' (vs. MSB). The thing in 'C', is that UTF-16 is an encoding, BELOW what the language uses. 'C' uses user strings which are automatically converted to the platform's native encoding. So a 'C' prog printing "hello worldn" works on all C-supporting platforms.

      – Astara
      Aug 25 '17 at 2:15












    • @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.

      – ilkkachu
      Aug 25 '17 at 15:53











    • We aren't talking '8-bit' interfaces between tools -- we are talking character interterfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that was UTF-16 encoded. The included version of 'find.exe' in /windows/system32, does that.

      – Astara
      Aug 26 '17 at 0:32











    • @Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool.

      – ilkkachu
      Aug 26 '17 at 17:55











    • There are no read/write "system" calls on NT. On Win, there are 'read/write' library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8 to 16-bit when talking to the system.

      – Astara
      Aug 27 '17 at 15:44














    6












    6








    6







    UTF-16 (or UCS-2) is highly unfriendly for the null-terminated strings used by the C standard library and the POSIX ABI. For example, command line arguments are terminated by NULs (bytes with value zero), and any UTF-16 character with numerical value < 256 contains a zero byte, so any strings of the usual English letters would be impossible to represent in UTF-16 on a command line argument.



    That in turn means that either the utilities would need to take input in some other format (say UTF-8) and convert to UTF-16; or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs to other utilities somewhat difficult.



    Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I'd just guess that few care to use UTF-16 enough to be motivated to create tools for that.






    share|improve this answer













    UTF-16 (or UCS-2) is highly unfriendly for the null-terminated strings used by the C standard library and the POSIX ABI. For example, command line arguments are terminated by NULs (bytes with value zero), and any UTF-16 character with numerical value < 256 contains a zero byte, so any strings of the usual English letters would be impossible to represent in UTF-16 on a command line argument.



    That in turn means that either the utilities would need to take input in some other format (say UTF-8) and convert to UTF-16; or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs to other utilities somewhat difficult.



    Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I'd just guess that few care to use UTF-16 enough to be motivated to create tools for that.







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered May 9 '17 at 17:56









    ilkkachuilkkachu

    57.8k888163




    57.8k888163












    • The null termination code in UTF-16 is two null bytes in a row -- which encodes a null byte for UTF-16. If your command line handles UTF-16, then ascii (or unicode) letter 'A' would be internally represented by 0x41 x00 (on windows x86, lower byte is always 1st, often called 'LSB' (vs. MSB). The thing in 'C', is that UTF-16 is an encoding, BELOW what the language uses. 'C' uses user strings which are automatically converted to the platform's native encoding. So a 'C' prog printing "hello worldn" works on all C-supporting platforms.

      – Astara
      Aug 25 '17 at 2:15












    • @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.

      – ilkkachu
      Aug 25 '17 at 15:53











    • We aren't talking '8-bit' interfaces between tools -- we are talking character interterfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that was UTF-16 encoded. The included version of 'find.exe' in /windows/system32, does that.

      – Astara
      Aug 26 '17 at 0:32











    • @Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool.

      – ilkkachu
      Aug 26 '17 at 17:55











    • There are no read/write "system" calls on NT. On Win, there are 'read/write' library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8 to 16-bit when talking to the system.

      – Astara
      Aug 27 '17 at 15:44


















    • The null termination code in UTF-16 is two null bytes in a row -- which encodes a null byte for UTF-16. If your command line handles UTF-16, then ascii (or unicode) letter 'A' would be internally represented by 0x41 x00 (on windows x86, lower byte is always 1st, often called 'LSB' (vs. MSB). The thing in 'C', is that UTF-16 is an encoding, BELOW what the language uses. 'C' uses user strings which are automatically converted to the platform's native encoding. So a 'C' prog printing "hello worldn" works on all C-supporting platforms.

      – Astara
      Aug 25 '17 at 2:15












    • @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.

      – ilkkachu
      Aug 25 '17 at 15:53











    • We aren't talking '8-bit' interfaces between tools -- we are talking character interterfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that was UTF-16 encoded. The included version of 'find.exe' in /windows/system32, does that.

      – Astara
      Aug 26 '17 at 0:32











    • @Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool.

      – ilkkachu
      Aug 26 '17 at 17:55











    • There are no read/write "system" calls on NT. On Win, there are 'read/write' library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8 to 16-bit when talking to the system.

      – Astara
      Aug 27 '17 at 15:44

















    The null termination code in UTF-16 is two null bytes in a row -- which encodes a null byte for UTF-16. If your command line handles UTF-16, then ascii (or unicode) letter 'A' would be internally represented by 0x41 x00 (on windows x86, lower byte is always 1st, often called 'LSB' (vs. MSB). The thing in 'C', is that UTF-16 is an encoding, BELOW what the language uses. 'C' uses user strings which are automatically converted to the platform's native encoding. So a 'C' prog printing "hello worldn" works on all C-supporting platforms.

    – Astara
    Aug 25 '17 at 2:15






    The null termination code in UTF-16 is two null bytes in a row -- which encodes a null byte for UTF-16. If your command line handles UTF-16, then ascii (or unicode) letter 'A' would be internally represented by 0x41 x00 (on windows x86, lower byte is always 1st, often called 'LSB' (vs. MSB). The thing in 'C', is that UTF-16 is an encoding, BELOW what the language uses. 'C' uses user strings which are automatically converted to the platform's native encoding. So a 'C' prog printing "hello worldn" works on all C-supporting platforms.

    – Astara
    Aug 25 '17 at 2:15














    @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.

    – ilkkachu
    Aug 25 '17 at 15:53





    @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16.

    – ilkkachu
    Aug 25 '17 at 15:53













    We aren't talking '8-bit' interfaces between tools -- we are talking character interterfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that was UTF-16 encoded. The included version of 'find.exe' in /windows/system32, does that.

    – Astara
    Aug 26 '17 at 0:32





    @Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool.

    – ilkkachu
    Aug 26 '17 at 17:55





    There are no read/write "system" calls on NT. On Win, there are 'read/write' library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8 to 16-bit when talking to the system.

    – Astara
    Aug 27 '17 at 15:44






    1














    Install the ripgrep utility, which supports UTF-16.



    For example:



    rg pattern filename



    ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
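
Where ripgrep is not available, the general idea can be sketched in Python: look for a byte order mark, decode, then search. This is not how ripgrep is implemented. `sniff_encoding` and `search_file` are hypothetical helpers, and falling back to UTF-8 when no BOM is present is an assumption of this sketch.

```python
import codecs
import re

def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from a byte order mark; fall back to `default`."""
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec consumes the BOM and picks the byte order
    return default

def search_file(path: str, pattern: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs whose text matches the regex `pattern`."""
    with open(path, "rb") as f:
        raw = f.read()
    # Undecodable bytes are replaced rather than raising, as an assumption.
    text = raw.decode(sniff_encoding(raw), errors="replace")
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if regex.search(line)]
```

Called as `search_file("notes.txt", "pattern")`, it returns the matching lines with their line numbers. A UTF-16 file without a BOM would be treated as UTF-8 here, so it would still need its encoding named explicitly.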







        answered Jan 17 at 14:22









        kenorb



























