How to printf literal characters from/to file in bash?

I want to filter a file character by character (to remove invalid XML characters whose generation I cannot control), but I cannot even copy individual characters from one file to another. I have used printf to copy literal sections including carriage returns before, but now it copies a carriage return not as a carriage return but as a zero-length string. My code:
infile=$1
outfile=$2
touch $outfile
while IFS= read -r -n1 char
do
# display one character at a time
printf "%s" "$char" >> $outfile
done < "$infile"
diff $infile $outfile
I don't mind using sed or awk, but I would have to encode the allowed characters, which the XML specification defines as:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
bash xml newlines special-characters printf
Have you looked into iconv?
– DopeGhoti
Feb 1 at 20:32
Without looking deeper into your code: are you aware that there are much better and faster tools than bash available for "text stream filtering/conversion", like perl/sed/awk?
– Alex Stragies
Feb 1 at 20:42
I use sed and awk within bash, so sure, those work, but I don't know how to encode the characters into an argument for them.
– Timothy Swan
Feb 1 at 21:03
See also Why is using a shell loop to process text considered bad practice?
– Wildcard
Feb 1 at 21:58
Doing this in bash or sh is guaranteed to fail on many real-world Unicode input files (including all UTF-16 input) because the NUL byte (0x00) is a valid byte in many UTF-16 characters (and is also a valid character in standard UTF-8; modified UTF-8 uses 0xC0 0x80 in place of any actual NUL bytes, allowing NUL to be used as a terminator), and sh is completely incapable of storing NUL bytes in any variable. Use awk or perl instead, or some other Unicode-aware tool like iconv.
– cas
Feb 2 at 1:53
edited Feb 1 at 21:02
asked Feb 1 at 20:20
Timothy Swan
1 Answer
accepted
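Carriage return shouldn't be a problem; read should read it just fine. The newline (linefeed) is the problem, since it's the default delimiter for read. You can use the read -d '' trick to make it work:
printf '\r' | { IFS= read -r -n1 x; printf '%s' "$x" | xxd; }        # CR: xxd shows 0d
printf '\n' | { IFS= read -r -n1 x; printf '%s' "$x" | xxd; }        # LF fails: x is empty
printf '\n' | { IFS= read -r -d '' -n1 x; printf '%s' "$x" | xxd; }  # LF ok: xxd shows 0a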
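Applied to the loop from the question, the read -d '' trick might look like the sketch below. The function name copy_chars is made up for illustration, and the sketch assumes bash (read -n and -d are not POSIX sh). Because the delimiter is now NUL and bash variables cannot hold NUL anyway, any NUL bytes in the input are silently dropped.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a file one character at a time.
copy_chars() {
    local infile=$1 outfile=$2 char
    # -d '' makes NUL the delimiter, so LF is read like any other character.
    # IFS= keeps whitespace characters, -r keeps backslashes literal.
    while IFS= read -r -d '' -n1 char; do
        printf '%s' "$char"
    done < "$infile" > "$outfile"   # redirect once instead of reopening per character
}

# Usage: copy_chars in.txt out.txt && cmp in.txt out.txt
```

Redirecting the whole loop avoids appending to the output file once per character, but this is still far slower than a dedicated text tool.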
But, as the comments say, you probably don't want to do this kind of thing in the shell. tr would be just what you need for deleting a fixed set of characters, but at least GNU tr works on bytes, not characters, so it's not much use for Unicode.
I think this Perl should work, for UTF-8 data, if your locales are correctly set to UTF-8:
perl -C -pe 'tr/\x09\x0a\x0d\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}//cd' < in > out
But better test it, I'm not that used to Unicode quirks.
tr/abc//cd deletes the characters that are not listed in abc (tr/// is actually meant to transform characters to others, see perlop). It takes lists of characters, as well as ranges; \xHH means the character with hex value HH, and \x{HHHH} the one with value HHHH. So the above accepts 0x09, 0x0a, 0x0d, everything from 0x20 to 0xd7ff, etc.
The list above is taken directly from the list presented in the question. I'll leave it to the end user to evaluate if it should be changed.
Correct list for perl tr should be \x9-\xA\xD-\xD\x20-\x{D7FF}\x{E000}-\x{FDCF}\x{FE00}-\x{FFFD}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E0000}-\x{EFFFD}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}
– Isaac
Feb 2 at 0:39
I used the perl line which this answer recommended and it worked for the instance that I was dealing with. I may replace it with @isaac's line if that's the consensus. Can someone comment on the difference or recommend an official resource so I can understand that syntax for myself?
– Timothy Swan
Feb 2 at 17:43
That line is removing the non-characters U+FDD0-U+FDEF and fffe-ffff in each of the Unicode planes (fffe-ffff, 1fffe-1ffff, 2fffe-2ffff, … up to 10fffe-10ffff). Non-characters should not be used in interchange with other users.
– Isaac
Feb 2 at 17:58
@TimothySwan, yeah, I just copied the list from your question. The beginning of isaac's list looks the same (though I think the range \xD-\xD is just redundant, it's the same as just \x0d), but it does exclude #fdd0 to #fdff and all the higher ones where the low 16 bits are fffe or ffff.
– ilkkachu
Feb 3 at 0:31
@isaac, though do note that the link you refer to also has this question and answer: "Q: Can noncharacters simply be deleted from input text? A: No. Doing so can lead to security problems." So it would probably be better to do something else about them. Replace with some known placeholder? Just reject the whole input?
– ilkkachu
Feb 3 at 12:21
edited Feb 3 at 20:13
answered Feb 1 at 21:30
ilkkachu