Removing characters with sed [duplicate]

This question already has an answer here:

Match language range in shell, sed or awk

2 answers

I am working on AIX Unix and trying to remove non-printable characters from a file. When I view the file in Notepad++ using UTF-8 encoding, the data looks like Caucasian male lives in Arizona w/ fiancÃÂÃÂÃÂÃÂÃÂ. When I view the file on Unix, I get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ instead of the special characters.

I want to replace all those special characters with space.

I tried sed 's/[^[:print:]]/ /g' file, but it does not remove those characters. The locales listed when I run locale -a are:

C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US

I even tried sed -e 's/[^ -~]/ /g' file and it did not remove the characters.

I see that other Stack Overflow answers used a UTF-8 locale with GNU sed and that this worked, but I do not have that locale.

Also I am using ksh.

edited Sep 25 at 19:29

asked Sep 25 at 19:13

Auguster

marked as duplicate by Isaac, Goro, RalfFriedl, Shadur, X Tian Sep 27 at 8:53

This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.

Ã and ▒ look pretty printable to me. A UTF-8 Ã is encoded as 0xc3 0x83. As it happens, 0xc3 in iso8859-1 or iso8859-15 is also Ã, which is printable; 0x83 would be a control character in both, though.
– Stéphane Chazelas
Sep 25 at 19:53

Possible duplicate: unix.stackexchange.com/questions/201751/…
– Goro
Sep 25 at 20:05

@Goro Yes, at this point it is possibly a duplicate, now that I understand to use the C locale.
– Auguster
Sep 25 at 20:09

To actually show what the characters are, it is useful to show their hex values. Something like: echo "fiancÃÂÃÂÃÂÃÂÃÂ" | od -tx1, or, maybe, if your sed supports it: echo "fiancÃÂÃÂÃÂÃÂÃÂ" | sed -n l.
– Isaac
Sep 25 at 21:08
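
As a rough illustration of the hex-dump suggestion, the sketch below feeds known bytes to od rather than pasting characters into the terminal. The sample bytes (the UTF-8 encodings of Ã and Â) and the helper name hexbytes are assumptions made for this example, not data taken from the asker's file.

```shell
# hexbytes: hypothetical helper that prints standard input as two-digit
# hex values, one per byte, without the offset column (-An).
hexbytes() {
    od -An -tx1
}

# printf's octal escapes emit the exact bytes 0xc3 0x83 0xc3 0x82,
# so the result does not depend on the terminal's encoding.
printf 'fianc\303\203\303\202\n' | hexbytes
```

Any byte shown as 80 or higher has the 8th bit set and therefore lies outside ASCII.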

text-processing sed ksh aix

2 Answers

Accepted answer (score 1)

If the current locale already uses UTF-8 as the charset (and file is written using that charset):

<file LC_ALL=C sed 's/[^ -~]//g'

Or, to keep the TAB and CR control characters as well, in AIX sed:

<file LC_ALL=C sed "$(printf "s/[^[:print:]\t\r]//g")"
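
The question asked for a space in place of each unwanted byte rather than deletion. A minimal sketch of that variant follows, assuming a sed that treats every byte as one character in the C locale; the function name to_spaces and the sample bytes are made up for the illustration.

```shell
# to_spaces: hypothetical helper that replaces every byte outside the
# printable ASCII range (space through tilde) with a single space.
# LC_ALL=C makes sed work on single bytes instead of multibyte characters.
to_spaces() {
    LC_ALL=C sed 's/[^ -~]/ /g'
}

# Four non-ASCII bytes become four spaces; the ASCII text is untouched.
printf 'fianc\303\203\303\202 end\n' | to_spaces
```

Note that a TAB also lies outside the space-to-tilde range, so this form turns tabs into spaces too.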

edited Sep 25 at 22:57

Stéphane Chazelas

answered Sep 25 at 21:55

Isaac

@Stéphane What does printf do here? If I am deleting all the characters and saving to another file, do I need to use printf?
– Auguster
Sep 26 at 13:50

@Auguster, printf is there to expand the \t into a TAB character and the \r into a CR character. If using ksh93 on AIX, you can also use $'s/[^[:print:]\t\r]//g'
– Stéphane Chazelas
Sep 26 at 15:16
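
To see what printf contributes, one can dump the sed script it produces. This is only a sketch; the variable name script is chosen for the example.

```shell
# Build the sed script the same way the answer does: printf turns the
# two-character sequences \t and \r into real TAB and CR bytes.
script=$(printf 's/[^[:print:]\t\r]//g')

# od -c shows the TAB and CR as \t and \r between the brackets,
# confirming that sed receives the literal control characters.
printf '%s' "$script" | od -An -c
```

Without printf (or ksh93's $'...' quoting), sed would receive a backslash followed by a letter, which not every sed interprets as TAB or CR inside a bracket expression.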

Answer (score 3)

You can use the tr command as follows:

tr -cd '[:print:]\t\r\n'

Explanation:

[:print:] -- any printable character, that is, any character in the [:graph:] class plus the space character
\r -- carriage return
\t -- horizontal tab
\n -- newline

Examples on CentOS 7, where tr is GNU tr and the locale uses UTF-8 encoding:

$ echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n'
fianc

$ echo "get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ " | tr -cd '[:print:]\t\r\n'
get ^^^^^^

$ echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -cd '[:print:]\t\r\n'
 Caucasian male lives in Arizona w/ fianc^^^^^^^^^^^^

edited Sep 25 at 22:58

answered Sep 25 at 19:23

Goro

That did not work for me. I tried echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -d '[:print:]' and got some unreadable text as output.
– Auguster
Sep 25 at 19:36

LC_ALL=C tr ...
– Jeff Schaller
Sep 25 at 19:38

LC_ALL=C tr -cd '[:print:]' < input works here
– Jeff Schaller
Sep 25 at 19:43

echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n' should return fiancÃÂÃÂÃÂÃÂÃÂ, as Â is a printable character. GNU tr doesn't in UTF-8, as it doesn't support multi-byte characters yet, but it does in iso8859-1. In the C locale, on systems where the C locale charset is ASCII, that does remove Â (or whatever bytes those are made of), as ASCII has no such character in the first place.
– Stéphane Chazelas
Sep 25 at 22:46

Because CentOS tr is GNU tr and you probably tried it in a UTF-8 locale, where Ã is made of 2 bytes, and GNU tr doesn't support multibyte characters. If you use LC_ALL=C as suggested by Auguster, it will work (at removing those Ã, however they're encoded) regardless of whether tr supports multibyte characters or not. In the C locale, all characters are single bytes, and on most systems, including AIX, the C locale charset is ASCII, which has no character with the 8th bit set (which each byte of the UTF-8 encoding of Ã has, as does its single-byte iso8859-1 encoding).
– Stéphane Chazelas
Sep 25 at 22:52
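
The point about the C locale can be tried with a small sketch like the one below. It assumes a tr that understands the \t, \r and \n escapes and a C locale whose charset is ASCII; the function name strip_high_bytes and the sample bytes are hypothetical.

```shell
# strip_high_bytes: hypothetical helper that deletes every byte that is
# not printable ASCII, TAB, CR or newline. In the C locale each byte is
# one character, so bytes with the 8th bit set are never [:print:].
strip_high_bytes() {
    LC_ALL=C tr -cd '[:print:]\t\r\n'
}

# 0xc3 0x83 0xc3 0x82 are removed regardless of how the terminal shows them.
printf 'fianc\303\203\303\202\n' | strip_high_bytes
```

Because the decision is made per byte, the result is the same whether the input was meant as UTF-8 or as iso8859-1.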
