UTF8 Character Makes File Inaccessible

If I run:



scp me@example.com:/home/me/cömmön_file.jpg /home/me/


from my remote server I get:




scp: /home/me/cömmön_file.jpg: No such file or directory




If I replace the UTF-8 characters with a wildcard, though, it works:



scp me@example.com:/home/me/c?mm?n_file.jpg /home/me/


and/or



scp me@example.com:/home/me/c*mm*n_file.jpg /home/me/


The behavior also reproduces if I use the AWS CLI on my remote machine.



Other commands that use the explicit name work as I'd expect on my remote machine, e.g.



ls -lha /home/me/cömmön_file.jpg



-rw-r--r--. 1 me me 1.1M Jan 15 21:58 /home/me/cömmön_file.jpg




I can rename the file as well with mv.



Is the problem with transmitting the file, or something underlying in my machine hosting the file?



The UTF-8 character causing the current issue is U+0308 (https://www.compart.com/en/unicode/U+0308), but I suspect other characters would reproduce the issue too. If I try to rename the file from the 'o' + U+0308 spelling to U+00F6 (https://www.compart.com/en/unicode/U+00F6), my machine tells me the files are the same.




mv: ‘/home/me/cömmön_file.jpg’ and ‘/home/me/cömmön_file.jpg’ are the same file




The server hosting the file is:



NAME="CentOS Linux"
VERSION="7 (Core)"


and its locale is:



LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=


The server requesting the file is:



NAME="Amazon Linux"
VERSION="2"


and its locale is:



LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=









  • What's the locale on both instances? (Type locale.) This type of problem can happen when both sides don't use the same locale (preferably one ending in .UTF-8).

    – DrYak
    Jan 16 at 17:54












  • @DrYak I've updated the question with that info, both look the same to me.

    – user3783243
    Jan 16 at 18:00











  • Damn, same locale... Does using rsync instead of scp also barf on that file?

    – DrYak
    Jan 16 at 18:28






  • Nope, your Mac gave away that you're a Mac user, and that's where the whole thing stems from. The precomposed 'ö' (U+00F6) and 'o' followed by the combining U+0308 just aren't the same Unicode code points, and that's a distinction Linux preserves (just like it's also case sensitive).

    – DrYak
    Jan 16 at 19:25
















linux filesystems unicode character-encoding






asked Jan 16 at 16:52 by user3783243, edited Jan 16 at 17:59












1 Answer
Quick solution:
Don't type accented letters on your keyboard. Use tab completion instead (and set up your SSH key so that tab completion also works for over-the-network scp, rsync, etc.), or fall back to wildcards, because what you are experiencing is the normal, intended behaviour.
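The wildcard fallback can also be generated rather than typed by hand. The following is only a sketch: `ascii_safe_glob` is a hypothetical helper name, and it assumes the name on the server is stored in composed form (which is what the working `c?mm?n` pattern in the question suggests). It does not escape characters that are already special in globs, such as `*` or `[`.

```python
import fnmatch
import unicodedata


def ascii_safe_glob(name):
    """Hypothetical helper: replace every non-ASCII character with '?'."""
    # Compose first, so that 'o' + U+0308 collapses into the single
    # character U+00F6 and is covered by exactly one '?'.
    composed = unicodedata.normalize("NFC", name)
    return "".join(ch if ch.isascii() else "?" for ch in composed)


# Explicit escapes, so the spelling cannot change when this is copy-pasted.
typed = "co\u0308mmo\u0308n_file.jpg"      # decomposed, as typed on the Mac
on_server = "c\u00f6mm\u00f6n_file.jpg"    # composed (assumed to be what is stored)

pattern = ascii_safe_glob(typed)
print(pattern)                                   # the glob to use for the remote path
print(fnmatch.fnmatchcase(on_server, pattern))   # does it match the stored name?
```

If the stored name were decomposed instead, a single `?` would not cover the two code points, and `*` would be the safer wildcard.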




It doesn't work because you did not type the same filename.



Seems crazy? That's Unicode for you.



Even crazier: I can use my magical remote mind-reading psychic powers to tell you that you have an Apple Mac.



More seriously: that's the crucial piece of information you forgot to give when asking your question, but accidentally leaked when typing the question itself.




Copy-pasting the full scp command from the question above (destination included) into hexdump gives:



# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a


Pay close attention to how the letter 'ö' is encoded: 6f cc 88. That is a literal ASCII 'o' (6f) followed by a separate combining code point (cc 88 in UTF-8). (In fact, on my terminal it doesn't even display as 'ö' but as a plain 'o'.)



Whereas when I (a Linux user) type:



echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b


Again, look closely at the 'ö': c3 b6, the UTF-8 encoding of an entirely different code point, with no literal ASCII 'o' in front of it.
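The same comparison can be reproduced without depending on what a keyboard or clipboard produces. This is a minimal sketch that spells both variants with explicit escapes; the variable names are only illustrative, and `bytes.hex` with a separator assumes Python 3.8 or later.

```python
# Two spellings of the same visible file name, written with explicit escapes
# so that no editor or clipboard can silently change them.
composed = "c\u00f6mm\u00f6n_file.jpg"      # U+00F6, one code point per 'ö'
decomposed = "co\u0308mmo\u0308n_file.jpg"  # 'o' followed by U+0308

composed_hex = composed.encode("utf-8").hex(" ")
decomposed_hex = decomposed.encode("utf-8").hex(" ")

print(composed_hex)            # contains c3 b6 for each 'ö'
print(decomposed_hex)          # contains 6f cc 88 for each 'ö'
print(composed == decomposed)  # the two strings do not compare equal
```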




Ultra-short explanation: Unicode normalization (composition vs. decomposition).
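As a small illustration of what normalization does, Python's standard `unicodedata` module can convert between the composed form (NFC) and the decomposed form (NFD). The variable names below are just for this sketch.

```python
import unicodedata

typed = "co\u0308mmo\u0308n_file.jpg"    # decomposed: 'o' + U+0308
on_disk = "c\u00f6mm\u00f6n_file.jpg"    # composed: U+00F6

# NFC composes 'o' + U+0308 into U+00F6; NFD does the reverse.
nfc = unicodedata.normalize("NFC", typed)
nfd = unicodedata.normalize("NFD", on_disk)

print(typed == on_disk)  # raw strings differ
print(nfc == on_disk)    # equal once the typed name is composed
print(nfd == typed)      # equal once the stored name is decomposed
```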




Longer explanation:



In Unicode, there are multiple ways to encode something that looks like 'ö'.



  • The first way is a composed (precomposed) character: there is a code point that is literally 'ö', inherited from Latin-1 (ISO/IEC 8859-1:1998): Unicode code point U+00F6 (encoded as c3 b6 in UTF-8).

  • The second way is a decomposed sequence: you first output the ASCII 'o' and then append a special code point that means 'please put an umlaut on the preceding letter': Unicode code point U+0308 (encoded as cc 88 in UTF-8).

it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂



hum.



The rest of the planet uses composed characters whenever possible (because it's more compact, and because it uses the range of Unicode that is compatible with Latin-1, which simplifies backward compatibility) and only resorts to combining characters for things that don't have their own code point (mostly less common languages).



Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).



Typing the key that looks like 'ö' simply doesn't generate the same byte sequence on every computer.



Then another thing comes into play: most Unix systems use file systems (like Linux's ext4) that are case sensitive AND sensitive to the exact Unicode encoding (where UTF-8 is supported). They treat a file name as a sequence of bytes and so preserve whether the text was composed or not. Thus they distinguish between the UTF-8 byte sequences 6f cc 88 and c3 b6 even though both render as 'ö' (the same way they distinguish between 'A' and 'a' even though it's the same Latin letter).
So the 'ö' produced by your keyboard and the 'ö' in the file name on the server are not the same.
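To make the consequence concrete, here is a rough sketch of looking a file up by its normalized name instead of its raw bytes. `find_by_normalized_name` is a hypothetical helper, not part of any library, and the demo runs in a temporary directory. On a byte-preserving file system such as ext4 the exact lookup is expected to miss; on file systems that normalize names themselves it may succeed.

```python
import os
import tempfile
import unicodedata


def find_by_normalized_name(directory, wanted):
    """Hypothetical helper: list entries equal to `wanted` after NFC normalization."""
    target = unicodedata.normalize("NFC", wanted)
    return [name for name in os.listdir(directory)
            if unicodedata.normalize("NFC", name) == target]


with tempfile.TemporaryDirectory() as tmp:
    stored = "c\u00f6mm\u00f6n_file.jpg"   # composed name, as on the server
    open(os.path.join(tmp, stored), "w").close()

    typed = "co\u0308mmo\u0308n_file.jpg"  # decomposed name, as typed
    exact_hit = os.path.exists(os.path.join(tmp, typed))
    matches = find_by_normalized_name(tmp, typed)

print(exact_hit)  # whether the byte-exact lookup found the file
print(matches)    # entries found by comparing normalized names
```

The name returned in `matches` is the one actually stored on disk, so it can be passed on to other tools unchanged.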



It happens that Stack Exchange just stores whatever Unicode encoding you throw at it as-is, which leads to legendary answers like the HTML regex parser one. (That's how your Mac betrayed itself: through the specific byte sequence it recorded for 'ö'.)






  • Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as %C3%B6 so Terminal was giving the wrong character.

    – user3783243
    Jan 16 at 19:32











  • It has been a pet peeve of mine that I had sysops who insisted on using localized Unicode/UTF-8 characters in filenames, DNS and DHCP configurations... Best to avoid them altogether.

    – Rui F Ribeiro
    Jan 16 at 19:32
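The %C3%B6 mentioned in the comments above is simply the percent-encoded form of the composed bytes c3 b6. As a hedged illustration (this uses Python's standard `urllib.parse.quote`, not the AWS CLI itself), the two spellings percent-encode differently:

```python
from urllib.parse import quote

# Composed spelling: each 'ö' is U+00F6, i.e. bytes c3 b6.
composed_q = quote("c\u00f6mm\u00f6n_file.jpg")

# Decomposed spelling: each 'ö' is 'o' + U+0308, i.e. bytes 6f cc 88.
decomposed_q = quote("co\u0308mmo\u0308n_file.jpg")

print(composed_q)    # c%C3%B6mm%C3%B6n_file.jpg
print(decomposed_q)  # co%CC%88mmo%CC%88n_file.jpg
```

Seeing `o%CC%88` instead of `%C3%B6` in a request log is a quick hint that a name was typed or pasted in decomposed form.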











Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f494883%2futf8-character-makes-file-inaccessible%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









3














Quick solution:
do not use accented letters on your keyboard, use tab-complete instead (and have your SSH key setup so that tab-complete also works with over the network scp, rsync, etc.) or fall back to wild cards, because what you experience is the normal intended behaviour.




It doesn't work, because you did not type the same filename.



Seems crazy ? That's UTF-8 to you.



Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.



More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.




While copy-pasting the answer above:



# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a


Please pay close attention to how the letter 'ö' is coded : 6f cc 88. A litteral 'o' followed by an extra UTF-8 codepoint. (in fact, on my terminal it doesn't even display as 'ö' but as 'o')



When when I (=Linux user) type:



echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b


Again look closely at the 'ö' symbol : c3 b6, an entirely different UTF-8 code point and no extra litteral ASCII.




Ultra short explanation : UTF-8 normalization (composition vs decomposition).




Longer explanation :



in Unicode, there are multiple way to code for something that looks like 'ö'.



  • first way is composed characters : there's a code point that's litteraly 'ö' inherited from Latin-1 (ISO/IEC 8859-1:1998) code points, Unicode codepoint U+00f6 (coded as c3 b6 in UTF-8)

  • second way is decomposed characters : you first output the ASCII o, and then append a special code point that means 'Please combine an umlaut to the preceding letter', Unicode codepoint U+0308 (coded as cc 88 in UTF-8)

it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂



hum.



The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resort to combining characters for thing that don't have their own code point (mostly less frequent languages).



Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).



Typing the keyboard letter that looks like 'ö' simply doesn't generate the same binary sequence depending on which computer you type the key.



Then comes into play another thing : most Unix tend to use file systems (like Linux' EXT4) which are sensitive to case AND sensitive to Unicode coding (where UTF-8 is supported). They try to preserve whether the text was composed or not. Thus they make a distinction between the UTF-8 binary sequence 6f cc 88 and c3 b6 even if they code for the same end result 'ö'. (the same way the make a distinction between 'A' and 'a' even if its the same latin letter).
So your 'ö' produced by your keyboard and the 'ö' on the server are not the same.



It happens that stack exchange just store whatever Unicode coding you throw at it as-is, leading to mythical answers as the HTML RegEx parser ones. (Thus your Mac betrayed itself by the specific byte sequence that recorded 'ö').






share|improve this answer

























  • Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as %C3%B6 so Terminal was giving the wrong character.

    – user3783243
    Jan 16 at 19:32











  • It was my pet peeve that I had sysops that insisted in using localized unicode/UTF-8 chars in filenames, DNS and DHCP configurations...Best avoiding them altogether.

    – Rui F Ribeiro
    Jan 16 at 19:32
















3














Quick solution:
do not use accented letters on your keyboard, use tab-complete instead (and have your SSH key setup so that tab-complete also works with over the network scp, rsync, etc.) or fall back to wild cards, because what you experience is the normal intended behaviour.




It doesn't work, because you did not type the same filename.



Seems crazy ? That's UTF-8 to you.



Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.



More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.




While copy-pasting the answer above:



# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a


Please pay close attention to how the letter 'ö' is coded : 6f cc 88. A litteral 'o' followed by an extra UTF-8 codepoint. (in fact, on my terminal it doesn't even display as 'ö' but as 'o')



When when I (=Linux user) type:



echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b


Again look closely at the 'ö' symbol : c3 b6, an entirely different UTF-8 code point and no extra litteral ASCII.




Ultra short explanation : UTF-8 normalization (composition vs decomposition).




Longer explanation :



in Unicode, there are multiple way to code for something that looks like 'ö'.



  • first way is composed characters : there's a code point that's litteraly 'ö' inherited from Latin-1 (ISO/IEC 8859-1:1998) code points, Unicode codepoint U+00f6 (coded as c3 b6 in UTF-8)

  • second way is decomposed characters : you first output the ASCII o, and then append a special code point that means 'Please combine an umlaut to the preceding letter', Unicode codepoint U+0308 (coded as cc 88 in UTF-8)

it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂



hum.



The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resort to combining characters for thing that don't have their own code point (mostly less frequent languages).



Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).



Typing the keyboard letter that looks like 'ö' simply doesn't generate the same binary sequence depending on which computer you type the key.



Then comes into play another thing : most Unix tend to use file systems (like Linux' EXT4) which are sensitive to case AND sensitive to Unicode coding (where UTF-8 is supported). They try to preserve whether the text was composed or not. Thus they make a distinction between the UTF-8 binary sequence 6f cc 88 and c3 b6 even if they code for the same end result 'ö'. (the same way the make a distinction between 'A' and 'a' even if its the same latin letter).
So your 'ö' produced by your keyboard and the 'ö' on the server are not the same.



It happens that stack exchange just store whatever Unicode coding you throw at it as-is, leading to mythical answers as the HTML RegEx parser ones. (Thus your Mac betrayed itself by the specific byte sequence that recorded 'ö').






share|improve this answer

























  • Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as %C3%B6 so Terminal was giving the wrong character.

    – user3783243
    Jan 16 at 19:32











  • It was my pet peeve that I had sysops that insisted in using localized unicode/UTF-8 chars in filenames, DNS and DHCP configurations...Best avoiding them altogether.

    – Rui F Ribeiro
    Jan 16 at 19:32














3












3








3







Quick solution:
do not use accented letters on your keyboard, use tab-complete instead (and have your SSH key setup so that tab-complete also works with over the network scp, rsync, etc.) or fall back to wild cards, because what you experience is the normal intended behaviour.




It doesn't work, because you did not type the same filename.



Seems crazy ? That's UTF-8 to you.



Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.



More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.




While copy-pasting the answer above:



# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a


Please pay close attention to how the letter 'ö' is coded : 6f cc 88. A litteral 'o' followed by an extra UTF-8 codepoint. (in fact, on my terminal it doesn't even display as 'ö' but as 'o')



When when I (=Linux user) type:



echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b


Again look closely at the 'ö' symbol : c3 b6, an entirely different UTF-8 code point and no extra litteral ASCII.




Ultra short explanation : UTF-8 normalization (composition vs decomposition).




Longer explanation :



in Unicode, there are multiple way to code for something that looks like 'ö'.



  • first way is composed characters : there's a code point that's litteraly 'ö' inherited from Latin-1 (ISO/IEC 8859-1:1998) code points, Unicode codepoint U+00f6 (coded as c3 b6 in UTF-8)

  • second way is decomposed characters : you first output the ASCII o, and then append a special code point that means 'Please combine an umlaut to the preceding letter', Unicode codepoint U+0308 (coded as cc 88 in UTF-8)

it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂



hum.



The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resort to combining characters for thing that don't have their own code point (mostly less frequent languages).



Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).



So typing the key that looks like 'ö' does not generate the same byte sequence on every computer; it depends on which machine you are typing on.



Then another thing comes into play: most Unix systems tend to use file systems (like Linux's ext4) that are case-sensitive AND sensitive to the Unicode encoding (where UTF-8 is supported). They preserve whether the text was composed or not. Thus they distinguish between the UTF-8 byte sequences 6f cc 88 and c3 b6 even though both render as the same 'ö' (the same way they distinguish between 'A' and 'a' even though it's the same Latin letter).
So the 'ö' produced by your keyboard and the 'ö' on the server are not the same.
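The effect can be reproduced locally. The sketch below assumes a Linux file system that does not normalize names (ext4, tmpfs and the like) and assumes, for illustration only, that the stored name is the composed form while the typed name is the decomposed one; the scratch directory stands in for /home/me.

```shell
#!/bin/sh
# Hypothetical scratch directory standing in for /home/me on the server.
dir=$(mktemp -d)

stored=$(printf 'c\303\266mm\303\266n_file.jpg')     # composed: c3 b6
typed=$(printf 'co\314\210mmo\314\210n_file.jpg')    # decomposed: 6f cc 88

# Create the file under the composed name.
: > "$dir/$stored"

# Look it up under the decomposed name, as if typed on the other machine.
if [ -e "$dir/$typed" ]; then exact=found; else exact=missing; fi

# A wildcard sidesteps the question of how 'ö' is encoded.
set -- "$dir"/c*mm*n_file.jpg
if [ -e "$1" ]; then glob=found; else glob=missing; fi

echo "exact name: $exact"
echo "wildcard:   $glob"

rm -rf "$dir"
```

On such a file system the exact lookup should come back missing while the wildcard finds the file, which matches what scp did. A file system that normalizes names would behave differently.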



It so happens that Stack Exchange stores whatever Unicode encoding you throw at it as-is, leading to legendary answers such as the HTML regex parser one. (Thus your Mac betrayed itself through the specific byte sequence recorded for 'ö'.)











































edited Jan 16 at 19:41

























answered Jan 16 at 19:23









DrYak












  • Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as %C3%B6 so Terminal was giving the wrong character.

    – user3783243
    Jan 16 at 19:32











  • It was my pet peeve that I had sysops who insisted on using localized Unicode/UTF-8 characters in filenames, DNS and DHCP configurations... Best to avoid them altogether.

    – Rui F Ribeiro
    Jan 16 at 19:32



































