UTF8 Character Makes File Inaccessible
If I run:
scp me@example.com:/home/me/cömmön_file.jpg /home/me/
from my remote server I get:
scp: /home/me/cömmön_file.jpg: No such file or directory
If I swap out the UTF-8 characters for a wildcard, though, it works:
scp me@example.com:/home/me/c?mm?n_file.jpg /home/me/
and/or
scp me@example.com:/home/me/c*mm*n_file.jpg /home/me/
If I use the AWS CLI on my remote machine the behavior also replicates.
Running other commands with the explicit name on my remote machine works as I'd expect, e.g.:
ls -lha /home/me/cömmön_file.jpg
-rw-r--r--. 1 me me 1.1M Jan 15 21:58 /home/me/cömmön_file.jpg
I can rename the file as well with mv.
Is the problem with transmitting the file, or something underlying in my machine hosting the file?
The UTF-8 character causing the current issue is U+0308 (https://www.compart.com/en/unicode/U+0308), but I suspect other characters would also reproduce the issue. If I try to rename the file from ö to U+00F6 (https://www.compart.com/en/unicode/U+00F6), my machine tells me the files are the same.
mv: ‘/home/me/cömmön_file.jpg’ and ‘/home/me/cömmön_file.jpg’ are the same file
The server hosting the file is:
NAME="CentOS Linux"
VERSION="7 (Core)"
and its locale is:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
The server requesting the file is:
NAME="Amazon Linux"
VERSION="2"
and its locale is:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
linux filesystems unicode character-encoding
asked Jan 16 at 16:52 by user3783243, edited Jan 16 at 17:59
What's the locale running on both instances? (Type locale.) This type of problem might happen when both sides don't use the same locale (preferably one ending in .UTF-8).
– DrYak
Jan 16 at 17:54
@DrYak I've updated the question with that info, both look the same to me.
– user3783243
Jan 16 at 18:00
Damn, same locale... Does using rsync instead of scp also barf on that file?
– DrYak
Jan 16 at 18:28
Nope, your Mac betrayed you that you're a Mac user, and that's where the whole thing stems from. 'ö' and 'ö' just aren't the same Unicode codepoint and that's something that Linux conserves (just like it's also case sensitive).
– DrYak
Jan 16 at 19:25
1 Answer
Quick solution:
Do not type the accented letters on your keyboard; use tab completion instead (and set up your SSH key so that tab completion also works over the network with scp, rsync, etc.), or fall back to wildcards, because what you are experiencing is the normal, intended behaviour.
It doesn't work, because you did not type the same filename.
Seems crazy? That's UTF-8 for you.
Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.
More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.
When I copy-paste the command from your question:
# echo "scp me@example.com:/home/me/cömmön_file.jpg /home/me/" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a
Pay close attention to how the letter 'ö' is encoded: 6f cc 88. That is a literal 'o' followed by an extra code point, a combining character. (In fact, on my terminal it doesn't even display as 'ö' but as 'o'.)
When I (a Linux user) type:
echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b
Again, look closely at the 'ö' symbol: c3 b6, the UTF-8 encoding of an entirely different code point, with no separate literal ASCII 'o'.
Ultra-short explanation: Unicode normalization (composition vs. decomposition).
Longer explanation:
In Unicode, there are multiple ways to encode something that looks like 'ö'.
- The first way is a composed character: there is a code point that is literally 'ö', inherited from Latin-1 (ISO/IEC 8859-1:1998): Unicode code point U+00F6 (encoded as c3 b6 in UTF-8).
- The second way is a decomposed character: you first output the ASCII 'o', and then append a special code point that means 'please combine an umlaut with the preceding letter': Unicode code point U+0308 (encoded as cc 88 in UTF-8).
it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂
hum.
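As a rough illustration of the composed and decomposed forms described above, here is a minimal Python sketch that uses only the standard library's unicodedata module. The helper name utf8_hex is made up for this example; the code points and byte sequences are the ones from the hexdumps.

```python
import unicodedata


def utf8_hex(text):
    """Return the UTF-8 bytes of `text` as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))


composed = "\u00f6"      # single code point: LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"   # ASCII 'o' followed by COMBINING DIAERESIS

print(utf8_hex(composed))      # c3 b6
print(utf8_hex(decomposed))    # 6f cc 88
print(composed == decomposed)  # False: different code point sequences

# Normalizing both strings to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

NFC is the composed form and NFD the decomposed one; a plain string comparison sees two different strings until both are brought to the same form.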
The rest of the planet uses composed characters whenever possible (because they are more compact, and because they use the range of Unicode that is compatible with Latin-1, which simplifies backward compatibility) and only resorts to combining characters for things that don't have their own code point (mostly less common languages).
Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).
Typing the key that looks like 'ö' simply doesn't generate the same byte sequence on every computer.
Then another thing comes into play: most Unix systems use file systems (like Linux's ext4) that are sensitive to case AND to Unicode encoding (where UTF-8 is supported). They preserve whether the text was composed or not. Thus they distinguish between the UTF-8 byte sequences 6f cc 88 and c3 b6 even though both encode the same end result 'ö' (the same way they distinguish between 'A' and 'a' even though it is the same Latin letter).
So your 'ö' produced by your keyboard and the 'ö' on the server are not the same.
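If tab completion or wildcards are not an option, one possible workaround is to look the file up by its normalized name instead of its raw bytes. The following is only a sketch under stated assumptions: find_by_normalized_name is a hypothetical helper, NFC is an arbitrary choice of comparison form, and the demo recreates the file name from the question in a temporary directory.

```python
import os
import tempfile
import unicodedata


def find_by_normalized_name(directory, wanted):
    """Return the entries in `directory` whose names match `wanted` after NFC normalization."""
    target = unicodedata.normalize("NFC", wanted)
    return [
        name
        for name in os.listdir(directory)
        if unicodedata.normalize("NFC", name) == target
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        # Store the file under the decomposed spelling (o + U+0308).
        stored = "co\u0308mmo\u0308n_file.jpg"
        open(os.path.join(directory, stored), "w").close()

        # Look it up with the composed spelling (U+00F6).
        typed = "c\u00f6mm\u00f6n_file.jpg"
        for match in find_by_normalized_name(directory, typed):
            # Print the exact name as the file system reports it.
            print(match.encode("utf-8"))
```

The names returned are the ones the file system actually reports, so they can be passed unchanged to whatever needs to open or copy the file.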
It so happens that Stack Exchange just stores whatever Unicode encoding you throw at it as-is, which is what makes legendary answers like the HTML regex parser one possible. (Thus your Mac betrayed itself through the specific byte sequence recorded for 'ö'.)
Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as %C3%B6, so Terminal was giving the wrong character.
– user3783243
Jan 16 at 19:32
It was a pet peeve of mine that I had sysops who insisted on using localized Unicode/UTF-8 characters in filenames, DNS and DHCP configurations... Best to avoid them altogether.
– Rui F Ribeiro
Jan 16 at 19:32
add a comment |
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f494883%2futf8-character-makes-file-inaccessible%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Quick solution:
do not use accented letters on your keyboard, use tab-complete instead (and have your SSH key setup so that tab-complete also works with over the network scp
, rsync
, etc.) or fall back to wild cards, because what you experience is the normal intended behaviour.
It doesn't work, because you did not type the same filename.
Seems crazy ? That's UTF-8 to you.
Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.
More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.
While copy-pasting the answer above:
# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a
Please pay close attention to how the letter 'ö' is coded : 6f cc 88
. A litteral 'o' followed by an extra UTF-8 codepoint. (in fact, on my terminal it doesn't even display as 'ö' but as 'o')
When when I (=Linux user) type:
echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b
Again look closely at the 'ö' symbol : c3 b6
, an entirely different UTF-8 code point and no extra litteral ASCII.
Ultra short explanation : UTF-8 normalization (composition vs decomposition).
Longer explanation :
in Unicode, there are multiple way to code for something that looks like 'ö'.
- first way is composed characters : there's a code point that's litteraly 'ö' inherited from Latin-1 (ISO/IEC 8859-1:1998) code points, Unicode codepoint U+00f6 (coded as c3 b6 in UTF-8)
- second way is decomposed characters : you first output the ASCII o, and then append a special code point that means 'Please combine an umlaut to the preceding letter', Unicode codepoint U+0308 (coded as cc 88 in UTF-8)
it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂
hum.
The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resort to combining characters for thing that don't have their own code point (mostly less frequent languages).
Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).
Typing the keyboard letter that looks like 'ö' simply doesn't generate the same binary sequence depending on which computer you type the key.
Then comes into play another thing : most Unix tend to use file systems (like Linux' EXT4) which are sensitive to case AND sensitive to Unicode coding (where UTF-8 is supported). They try to preserve whether the text was composed or not. Thus they make a distinction between the UTF-8 binary sequence 6f cc 88
and c3 b6
even if they code for the same end result 'ö'. (the same way the make a distinction between 'A' and 'a' even if its the same latin letter).
So your 'ö' produced by your keyboard and the 'ö' on the server are not the same.
It happens that stack exchange just store whatever Unicode coding you throw at it as-is, leading to mythical answers as the HTML RegEx parser ones. (Thus your Mac betrayed itself by the specific byte sequence that recorded 'ö').
Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as%C3%B6
so Terminal was giving the wrong character.
– user3783243
Jan 16 at 19:32
It was my pet peeve that I had sysops that insisted in using localized unicode/UTF-8 chars in filenames, DNS and DHCP configurations...Best avoiding them altogether.
– Rui F Ribeiro
Jan 16 at 19:32
add a comment |
Quick solution:
do not use accented letters on your keyboard, use tab-complete instead (and have your SSH key setup so that tab-complete also works with over the network scp
, rsync
, etc.) or fall back to wild cards, because what you experience is the normal intended behaviour.
It doesn't work, because you did not type the same filename.
Seems crazy ? That's UTF-8 to you.
Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.
More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.
While copy-pasting the answer above:
# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a
Please pay close attention to how the letter 'ö' is coded : 6f cc 88
. A litteral 'o' followed by an extra UTF-8 codepoint. (in fact, on my terminal it doesn't even display as 'ö' but as 'o')
When when I (=Linux user) type:
echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b
Again look closely at the 'ö' symbol : c3 b6
, an entirely different UTF-8 code point and no extra litteral ASCII.
Ultra short explanation : UTF-8 normalization (composition vs decomposition).
Longer explanation :
in Unicode, there are multiple way to code for something that looks like 'ö'.
- first way is composed characters : there's a code point that's litteraly 'ö' inherited from Latin-1 (ISO/IEC 8859-1:1998) code points, Unicode codepoint U+00f6 (coded as c3 b6 in UTF-8)
- second way is decomposed characters : you first output the ASCII o, and then append a special code point that means 'Please combine an umlaut to the preceding letter', Unicode codepoint U+0308 (coded as cc 88 in UTF-8)
it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂
hum.
The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resort to combining characters for thing that don't have their own code point (mostly less frequent languages).
Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).
Typing the keyboard letter that looks like 'ö' simply doesn't generate the same binary sequence depending on which computer you type the key.
Then comes into play another thing : most Unix tend to use file systems (like Linux' EXT4) which are sensitive to case AND sensitive to Unicode coding (where UTF-8 is supported). They try to preserve whether the text was composed or not. Thus they make a distinction between the UTF-8 binary sequence 6f cc 88
and c3 b6
even if they code for the same end result 'ö'. (the same way the make a distinction between 'A' and 'a' even if its the same latin letter).
So your 'ö' produced by your keyboard and the 'ö' on the server are not the same.
It happens that stack exchange just store whatever Unicode coding you throw at it as-is, leading to mythical answers as the HTML RegEx parser ones. (Thus your Mac betrayed itself by the specific byte sequence that recorded 'ö').
Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as%C3%B6
so Terminal was giving the wrong character.
– user3783243
Jan 16 at 19:32
It was my pet peeve that I had sysops that insisted in using localized unicode/UTF-8 chars in filenames, DNS and DHCP configurations...Best avoiding them altogether.
– Rui F Ribeiro
Jan 16 at 19:32
add a comment |
Quick solution:
do not use accented letters on your keyboard, use tab-complete instead (and have your SSH key setup so that tab-complete also works with over the network scp
, rsync
, etc.) or fall back to wild cards, because what you experience is the normal intended behaviour.
It doesn't work, because you did not type the same filename.
Seems crazy ? That's UTF-8 to you.
Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.
More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.
While copy-pasting the answer above:
# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a
Please pay close attention to how the letter 'ö' is coded : 6f cc 88
. A litteral 'o' followed by an extra UTF-8 codepoint. (in fact, on my terminal it doesn't even display as 'ö' but as 'o')
When when I (=Linux user) type:
echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b
Again look closely at the 'ö' symbol : c3 b6
, an entirely different UTF-8 code point and no extra litteral ASCII.
Ultra short explanation : UTF-8 normalization (composition vs decomposition).
Longer explanation :
in Unicode, there are multiple way to code for something that looks like 'ö'.
- first way is composed characters : there's a code point that's litteraly 'ö' inherited from Latin-1 (ISO/IEC 8859-1:1998) code points, Unicode codepoint U+00f6 (coded as c3 b6 in UTF-8)
- second way is decomposed characters : you first output the ASCII o, and then append a special code point that means 'Please combine an umlaut to the preceding letter', Unicode codepoint U+0308 (coded as cc 88 in UTF-8)
it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂
hum.
The rest of the planet uses composed characters whenever possible (because it's more compact and also because it uses the range of Unicode that is compatible with Latin-1, simplifying backward compatibility) and only resort to combining characters for thing that don't have their own code point (mostly less frequent languages).
Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).
Typing the keyboard letter that looks like 'ö' simply doesn't generate the same binary sequence depending on which computer you type the key.
Then comes into play another thing : most Unix tend to use file systems (like Linux' EXT4) which are sensitive to case AND sensitive to Unicode coding (where UTF-8 is supported). They try to preserve whether the text was composed or not. Thus they make a distinction between the UTF-8 binary sequence 6f cc 88
and c3 b6
even if they code for the same end result 'ö'. (the same way the make a distinction between 'A' and 'a' even if its the same latin letter).
So your 'ö' produced by your keyboard and the 'ö' on the server are not the same.
It happens that stack exchange just store whatever Unicode coding you throw at it as-is, leading to mythical answers as the HTML RegEx parser ones. (Thus your Mac betrayed itself by the specific byte sequence that recorded 'ö').
Quick solution:
do not use accented letters on your keyboard, use tab-complete instead (and have your SSH key setup so that tab-complete also works with over the network scp
, rsync
, etc.) or fall back to wild cards, because what you experience is the normal intended behaviour.
It doesn't work, because you did not type the same filename.
Seems crazy ? That's UTF-8 to you.
Even more crazy: I can use my magical remote mind-reading psychic power to tell you that you have an Apple Mac.
More seriously: that's the crucial information you forgot to give when asking your question, but that you accidentally leaked when typing the question itself.
While copy-pasting the answer above:
# echo "scp me@example.com:/home/me/cömmön_file.jpg" | hexdump -C
00000000 73 63 70 20 6d 65 40 65 78 61 6d 70 6c 65 2e 63 |scp me@example.c|
00000010 6f 6d 3a 2f 68 6f 6d 65 2f 6d 65 2f 63 6f cc 88 |om:/home/me/co..|
00000020 6d 6d 6f cc 88 6e 5f 66 69 6c 65 2e 6a 70 67 20 |mmo..n_file.jpg |
00000030 2f 68 6f 6d 65 2f 6d 65 2f 0a |/home/me/.|
0000003a
Please pay close attention to how the letter 'ö' is coded : 6f cc 88
. A litteral 'o' followed by an extra UTF-8 codepoint. (in fact, on my terminal it doesn't even display as 'ö' but as 'o')
When when I (=Linux user) type:
echo /home/me/cömmön_file.jpg | hexdump -C
00000000 2f 68 6f 6d 65 2f 6d 65 2f 63 c3 b6 6d 6d c3 b6 |/home/me/c..mm..|
00000010 6e 5f 66 69 6c 65 2e 6a 70 67 0a |n_file.jpg.|
0000001b
Again, look closely at the 'ö' symbol: c3 b6, the UTF-8 encoding of a single, entirely different code point, with no extra literal ASCII letter.
Ultra-short explanation: Unicode normalization (composition vs. decomposition).
Longer explanation:
In Unicode, there are multiple ways to encode something that looks like 'ö'.
- The first way is a composed (precomposed) character: there is a code point that is literally 'ö', inherited from Latin-1 (ISO/IEC 8859-1): Unicode code point U+00F6 (encoded as c3 b6 in UTF-8).
- The second way is a decomposed sequence: you first output the ASCII 'o' and then append a special code point that means 'please put a diaeresis (umlaut) on the preceding letter': Unicode code point U+0308 (encoded as cc 88 in UTF-8).
it's this combining character that enable you to do all the̫ ͨcra̎zy shit̫ ĺiͭke̬̓ ̭Z͉̒a̅l̞gͩoͤ ̤͋aṅd̲ ̹ͨallͦ ̍ͅthͅe oͅt͔̅h̦̊e̠r ͔̋dḁŕ͕k̓ ̃m͍o͉ͅñ͎͖̉s̺͑tr̰͎̈́ỏ͖ͧsi̮͂͑t̚i͙̗ės͓̊̒ ̞ͯt̗͕ẖ̈ͩá̝ṱ̟͒ ͓͐ͦl̈́ṵ̿r͈̾k̼̝ͭ̍ ̹i͖̇̈́n͚̳ ͖̗ͦt͓h̿e͖ ̌m̳͌̽a̪ͥd̺͑n͕͌̐e̿͊s͇s̘͓͊ ̗̈́ö̫́f͕̞ ͕̰̓ìṅ̠sͤ̂a̬̝̿ͪn̘ͫ͆e̜ͯ ̩͓ͣẻ͛ḽ̞̃ḓ̺r̙̦ͥͬi̫̠̔ͮt̰̓̾ͅč͕ͦḧ̞̱͖́̒̽ ͇̳ḁ̖̊̈b̏͑o̳̙̍m̩̪̞ͦi̇ͮn̳͔ͨ̏ͤa̤̯ͣṱ̰ͥï̺̄o̞͖̿n͆ͦs̬̍ ̹ͩ͒th̞̄a̗̗͐͌ͪt͂ ̬̞iͭ̒s̘͇ ̱̯̐̆̒Ũ̺̞̘ͯT̩̀̔̚F̪͒̄-̪̘̈́8̮̆̍͂.̱͍̂
hum.
The rest of the planet uses composed characters whenever possible (because they are more compact, and also because they use the range of Unicode that is compatible with Latin-1, which simplifies backward compatibility) and only resorts to combining characters for things that don't have their own code point (mostly less common languages).
Apple lives apparently on a different planet, and they have decided that they try to always use combining characters (because they worship the dark lord Za͓̙̘͌l̦̖͉̃ͦ͆͊ͧ̀g͖̭̼̗͉̦̬̍̀̌ͬ̓ͥ҉o̧͉̗̱̥̣̯͍̗̲̩ͪ͋̾͑̈́ͦ̐̓͘͡ ?).
Typing the key that looks like 'ö' simply doesn't generate the same byte sequence on every computer.
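To see the two forms side by side without relying on any keyboard, both can be built from explicit bytes. This is just a small sketch; the helper name hexbytes is made up for the illustration.

```shell
# Build both spellings of 'ö' from explicit bytes (octal escapes are POSIX printf),
# so the result does not depend on what the keyboard or terminal produces.
composed=$(printf '\303\266')      # U+00F6, UTF-8 bytes c3 b6
decomposed=$(printf 'o\314\210')   # 'o' followed by U+0308, UTF-8 bytes 6f cc 88

# Hypothetical helper: print the bytes of a string as one run of hex digits.
hexbytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

hexbytes "$composed";   echo       # prints c3b6
hexbytes "$decomposed"; echo       # prints 6fcc88

# A plain string comparison looks at the bytes, not at what the glyph looks like.
if [ "$composed" = "$decomposed" ]; then
    echo "same"
else
    echo "different"               # this branch is taken
fi
```

Both strings render as 'ö' on most terminals, yet the comparison reports them as different.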
Then another thing comes into play: most Unix systems tend to use file systems (like Linux's ext4) that are sensitive to case AND to the Unicode encoding (where UTF-8 is supported). They preserve whether the text was composed or not. Thus they distinguish between the UTF-8 byte sequences 6f cc 88 and c3 b6, even though both code for the same end result 'ö' (the same way they distinguish between 'A' and 'a' even though it's the same Latin letter).
So the 'ö' produced by your keyboard and the 'ö' on the server are not the same.
It happens that Stack Exchange just stores whatever Unicode encoding you throw at it as-is, leading to mythical answers such as the HTML regex parser one. (Thus your Mac betrayed itself through the specific byte sequence it recorded for 'ö'.)
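To check which form a file name on the server actually uses, the bytes of each directory entry can be dumped. The following is a rough sketch, not a polished tool: the function name show_name_bytes and the temporary demo directory are made up for the illustration, and mktemp -d is assumed to be available (it is on common Linux systems).

```shell
# Hypothetical helper: print every entry of a directory followed by the hex bytes of its name.
show_name_bytes() {
    dir=$1
    for path in "$dir"/*; do
        [ -e "$path" ] || continue      # skip the unexpanded pattern in an empty directory
        name=${path##*/}                # strip the directory part
        printf '%s:' "$name"
        printf '%s' "$name" | od -An -tx1 | tr -d '\n'
        echo
    done
}

# Demo in a throwaway directory: one name with the composed 'ö', one with the decomposed form.
# On a normalization-sensitive file system such as ext4 these are two distinct files.
demo_dir=$(mktemp -d)
: > "$demo_dir/c$(printf '\303\266')mm$(printf '\303\266')n_file.jpg"
: > "$demo_dir/c$(printf 'o\314\210')mm$(printf 'o\314\210')n_file.jpg"
show_name_bytes "$demo_dir"
rm -r "$demo_dir"
```

On the real server this would be called as show_name_bytes /home/me; a name containing c3 b6 is composed, while one containing 6f cc 88 is decomposed.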
edited Jan 16 at 19:41
answered Jan 16 at 19:23
DrYak
Thanks, you are correct. The tab worked, and when I use debug mode on the AWS CLI I see it sending as %C3%B6, so Terminal was giving the wrong character.
– user3783243
Jan 16 at 19:32
It was a pet peeve of mine that I had sysops who insisted on using localized Unicode/UTF-8 characters in filenames, DNS and DHCP configurations... Best to avoid them altogether.
– Rui F Ribeiro
Jan 16 at 19:32
What's the locale running on both instances? (Type locale.) This type of problem might happen when both sides don't use the same locale (preferably one ending in .UTF-8).
– DrYak
Jan 16 at 17:54
@DrYak I've updated the question with that info, both look the same to me.
– user3783243
Jan 16 at 18:00
Damn, same locale... Does using rsync instead of scp also barf on that file?
– DrYak
Jan 16 at 18:28
Nope, your Mac betrayed that you're a Mac user, and that's where the whole thing stems from. 'ö' and 'ö' just aren't the same Unicode code point sequence, and that's something that Linux preserves (just like it's also case-sensitive).
– DrYak
Jan 16 at 19:25