Removing characters with sed [duplicate]
This question already has an answer here:
Match language range in shell, sed or awk
2 answers
I am working on AIX and trying to remove non-printable characters from a file. When I view the file in Notepad++ using UTF-8 encoding, the data looks like this:
Caucasian male lives in Arizona w/ fiancÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
When I view the file on Unix, I get ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâ instead of the special characters.
I want to replace all those special characters with a space.
I tried
sed 's/[^[:print:]]/ /g' file
but it does not remove those characters. These are the locales listed when I run locale -a:
C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US
I also tried
sed -e 's/[^ -~]/ /g' file
and it did not remove the characters either.
I see that other Stack Overflow answers used a UTF-8 locale with GNU sed, and that worked, but I do not have that locale.
I am using ksh.
text-processing sed ksh aix
edited Sep 25 at 19:29
asked Sep 25 at 19:13
Auguster
marked as duplicate by Isaac, Goro, RalfFriedl, Shadur, X Tian Sep 27 at 8:53
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
Ã and â look pretty printable to me. A UTF-8 Ã is encoded as 0xc3 0x83. As it happens, 0xc3 in ISO 8859-1 or ISO 8859-15 is also Ã, which is printable; 0x83 would be a control character in both, though.
– Stéphane Chazelas
Sep 25 at 19:53
Possible duplicate: unix.stackexchange.com/questions/201751/…
– Goro
Sep 25 at 20:05
@Goro Yes, at this point it is possibly a duplicate, now that I understand to use the C locale.
– Auguster
Sep 25 at 20:09
To actually show what the characters are, it is useful to show their hex values. Something like:
echo "fiancÃÃÃÃÃÃÃÃÃÃ" | od -tx1
or, if your sed supports it:
echo "fiancÃÃÃÃÃÃÃÃÃÃ" | sed -n l
– Isaac
Sep 25 at 21:08
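As a minimal sketch of the byte inspection suggested in the comment above: the helper name hex_dump is hypothetical, and the sample input is an assumption (the word "fianc" followed by the two-byte UTF-8 encoding of é), not the asker's actual file contents. Any byte shown as 80 or higher is outside ASCII, so a tool running in the C locale would not treat it as printable.

```shell
# Hypothetical helper: print each input byte as two hex digits, without offsets.
hex_dump() {
    od -An -tx1
}

# Assumed sample: "fianc" plus the UTF-8 bytes 0xc3 0xa9, written as octal escapes
# so that the example does not depend on the terminal's encoding.
printf 'fianc\303\251\n' | hex_dump
```

Running the same helper on the real file (hex_dump < file) would show which bytes actually need to be removed.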
2 Answers
1 vote, accepted
If the current locale already uses UTF-8 as the charset (and the file is written using that charset):
<file LC_ALL=C sed 's/[^ -~]//g'
Or, to also keep the TAB and CR control characters, with AIX sed:
<file LC_ALL=C sed "$(printf 's/[^[:print:]\t\r]//g')"
edited Sep 25 at 22:57
Stéphane Chazelas
answered Sep 25 at 21:55
Isaac
@Stéphane What does printf do here? If I am deleting all the characters and saving to another file, do I need to use printf?
– Auguster
Sep 26 at 13:50
@Auguster, printf is there to expand the \t into a TAB character and the \r into a CR character. If you are using ksh93 on AIX, you can also use $'s/[^[:print:]\t\r]//g'
– Stéphane Chazelas
Sep 26 at 15:16
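The two sed variants discussed above can be sketched as small helpers. This is only an illustration under stated assumptions: the function names blank_non_ascii and strip_non_printable are hypothetical, the sample inputs are made up, and the first helper replaces with a space (as the question asks) rather than deleting. It has been reasoned through for a sed that treats every byte as one character in the C locale; behaviour on AIX sed is assumed, not verified here.

```shell
# Hypothetical helper: replace every byte outside the ASCII range
# space..tilde with a single space (note that this also blanks TABs).
blank_non_ascii() {
    LC_ALL=C sed 's/[^ -~]/ /g'
}

# Hypothetical helper: delete non-printable bytes but keep TAB and CR.
# printf expands the \t and \r escapes into the literal characters
# before sed ever sees the expression.
strip_non_printable() {
    LC_ALL=C sed "$(printf 's/[^[:print:]\t\r]//g')"
}

# Assumed sample inputs, with the non-ASCII bytes given as octal escapes.
printf 'fianc\303\251 ok\n' | blank_non_ascii
printf 'a\tb\303\251c\n' | strip_non_printable
```

To save the result to another file, redirect as usual, for example blank_non_ascii < file > file.clean.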
3 votes
You can use the tr command as follows:
tr -cd '[:print:]\t\r\n'
Explanation:
[:print:] -- printable characters: everything in the [:graph:] class, plus the space character
\r -- carriage return
\t -- horizontal tab
\n -- newline
Examples on CentOS 7, where tr is GNU tr and the locale uses UTF-8 encoding:
$ echo "fiancÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n'
fianc
$ echo "get ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâ " | tr -cd '[:print:]\t\r\n'
get ^^^^^^
$ echo " Caucasian male lives in Arizona w/ fiancâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂ" | tr -cd '[:print:]\t\r\n'
Caucasian male lives in Arizona w/ fianc^^^^^^^^^^^^
edited Sep 25 at 22:58
answered Sep 25 at 19:23
Goro
That did not work for me. I tried
echo " Caucasian male lives in Arizona w/ fiancâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂâÂÂ^âÂÂ" | tr -d '[:print:]'
and got some unreadable text as output.
– Auguster
Sep 25 at 19:36
LC_ALL=C tr ...
– Jeff Schaller
Sep 25 at 19:38
LC_ALL=C tr -cd '[:print:]' < input works here.
– Jeff Schaller
Sep 25 at 19:43
echo "fiancÃÃÃÃÃÃÃÃÃÃ" | tr -cd '[:print:]\t\r\n' should return fiancÃÃÃÃÃÃÃÃÃÃ, as Ã is a printable character. GNU tr doesn't in UTF-8, as it doesn't support multi-byte characters yet, but it does in iso8859-1. In the C locale, on systems where the C locale charset is ASCII, that does remove Ã (or whatever bytes those are made of), as ASCII has no such character in the first place.
– Stéphane Chazelas
Sep 25 at 22:46
Because CentOS tr is GNU tr, and you probably tried it in a UTF-8 locale, where Ã is made of 2 bytes and GNU tr doesn't support multibyte characters. If you use LC_ALL=C as suggested by Auguster, it will work (at removing those Ã, however they're encoded) regardless of whether tr supports multibyte characters or not. In the C locale, all characters are single bytes, and on most systems, including AIX, the C locale charset is ASCII, which has no character with the 8th bit set (which each byte of the UTF-8 encoding of Ã has, as does its single-byte iso8859-1 encoding).
– Stéphane Chazelas
Sep 25 at 22:52
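The C-locale approach from these comments can be sketched as follows. The helper name keep_ascii_printable and the sample input are assumptions made for illustration; the sample is "fianc" followed by the bytes 0xc3 0x83 0xc2 0x83, which all have the 8th bit set. On a system whose C locale charset is ASCII, none of those bytes is in [:print:], so tr deletes them one byte at a time regardless of multibyte support.

```shell
# Hypothetical helper: keep ASCII printables plus TAB, CR and LF,
# and delete every other byte. LC_ALL=C makes tr work byte by byte.
keep_ascii_printable() {
    LC_ALL=C tr -cd '[:print:]\t\r\n'
}

# Assumed sample input, with the high-bit bytes given as octal escapes.
printf 'fianc\303\203\302\203 ok\n' | keep_ascii_printable
```

Unlike the sed variant in the question, this deletes the offending bytes instead of replacing them with spaces.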