How to printf literal characters from/to file in bash?

I want to filter a file character by character (to remove invalid XML characters whose generation I cannot control), but I cannot even copy individual characters from one file to another. I have used printf before to copy literal sections including carriage returns, but now it copies a carriage return not as a carriage return but as a zero-length string. My code:



infile=$1
outfile=$2
touch "$outfile"
while IFS= read -r -n1 char
do
    # copy one character at a time
    printf "%s" "$char" >> "$outfile"
done < "$infile"
diff "$infile" "$outfile"


I don't mind using sed or awk, but I would have to encode the allowed characters.

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */







  • Have you looked into iconv?
    – DopeGhoti
    Feb 1 at 20:32

  • Without looking deeper into your code, are you aware that there are much better and faster tools than bash available to you for "text stream filtering/conversion", like perl/sed/awk?
    – Alex Stragies
    Feb 1 at 20:42

  • I use sed and awk within bash, so sure, those work, but I don't know how to encode the characters into an argument for them.
    – Timothy Swan
    Feb 1 at 21:03

  • See also Why is using a shell loop to process text considered bad practice?
    – Wildcard
    Feb 1 at 21:58

  • Doing this in bash or sh is guaranteed to fail on many real-world Unicode input files (including all UTF-16 input) because the NUL byte (0x00) is a valid byte in many UTF-16 characters (and is also a valid character in standard UTF-8; modified UTF-8 uses 0xC0 0x80 in place of any actual NUL bytes, allowing NUL to be used as a terminator), and sh is completely incapable of storing NUL bytes in any variable. Use awk or perl instead, or some other Unicode-aware tool like iconv.
    – cas
    Feb 2 at 1:53
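
The point about NUL bytes can be checked directly. A minimal sketch, assuming bash (the variable names data and bytes are only illustrative): command substitution discards NUL bytes, so the value that ends up in the variable is shorter than the data that was written.

```shell
#!/usr/bin/env bash
# Three bytes go in: 'a', NUL, 'b'.
# Bash cannot keep the NUL in a variable; it is dropped during command
# substitution (newer bash versions also print a warning to stderr).
data=$(printf 'a\0b')
printf 'stored length: %s\n' "${#data}"

# Tools that work on the stream itself still see every byte.
bytes=$(printf 'a\0b' | wc -c)
printf 'bytes in stream: %s\n' "$((bytes))"
```

In bash the stored length comes out as 2 while the stream holds 3 bytes, which is why a character-by-character shell loop cannot reproduce such input faithfully.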















asked Feb 1 at 20:20 by Timothy Swan; edited Feb 1 at 21:02














1 Answer (accepted)

Carriage return shouldn't be a problem; read should read it just fine. The newline (linefeed) is the problem, since it's the default delimiter for read, so read -n1 returns an empty string when it meets one. You could use the read -d '' trick to make it work.



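echo $'\r' | { IFS= read -r -n1 x; printf '%s' "$x" | xxd; }        # CR is read fine
echo $'\n' | { IFS= read -r -n1 x; printf '%s' "$x" | xxd; }        # LF fails, x is empty
echo $'\n' | { IFS= read -d '' -r -n1 x; printf '%s' "$x" | xxd; }  # LF ok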
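
Applied to the loop from the question, the read -d '' trick could look like the following. This is only a sketch, assuming bash and input that is valid in the current locale's encoding; copy_chars is a made-up helper name, not something from the original post.

```shell
#!/usr/bin/env bash
# copy_chars INFILE OUTFILE  (hypothetical helper for illustration)
copy_chars() {
    local infile=$1 outfile=$2 char
    # -d '' sets the delimiter to NUL, so a newline is read like any other
    # character; IFS= keeps whitespace and -r keeps backslashes intact.
    while IFS= read -r -d '' -n1 char; do
        printf '%s' "$char"
    done < "$infile" > "$outfile"
}

# Possible use:  copy_chars in.xml out.xml && cmp in.xml out.xml
```

NUL bytes in the input are still lost, because they act as the delimiter and cannot be stored in a shell variable anyway, and the loop is slow on large files.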


But, like they say, you probably don't want to do stuff like this in the shell. tr would be just what you need for deleting a fixed set of characters, but at least GNU tr works on bytes, not characters, so it's not much use for Unicode.
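
For the ASCII control characters alone, byte-oriented tr is still usable, because in UTF-8 every byte below 0x80 stands only for itself and never appears inside a multi-byte sequence. A minimal sketch, assuming UTF-8 input (the function name strip_c0_controls is made up for illustration). It removes only the C0 controls that the Char production excludes and leaves everything else untouched, including U+FFFE, U+FFFF and any malformed sequences:

```shell
#!/usr/bin/env bash
strip_c0_controls() {
    # Octal ranges: 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F.
    # Tab (0x09), LF (0x0A) and CR (0x0D) are allowed by the Char production.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Possible use:  strip_c0_controls < in > out
```

This does not replace a Unicode-aware filter; it only covers the part of the problem that can be expressed in single bytes.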



I think this Perl should work, for UTF-8 data, if your locales are correctly set to UTF-8:



perl -C -pe 'tr/\x09\x0a\x0d\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}//cd' < in > out


But better test it, I'm not that used to Unicode quirks.



tr/abc//cd deletes the characters that are not listed in abc (tr/// is actually meant to transform characters to others, see perlop). It takes lists of characters, as well as ranges; \xHH means the character with hex value HH, and \x{HHHH} the one with value HHHH. So the above accepts 0x09, 0x0a, 0x0d, everything from 0x20 to 0xd7ff, etc.



The list above is taken directly from the list presented in the question. I'll leave it to the end user to evaluate if it should be changed.






  • Correct list for perl tr should be \x9-\xA\xD-\xD\x20-\x{D7FF}\x{E000}-\x{FDCF}\x{FE00}-\x{FFFD}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E0000}-\x{EFFFD}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD}
    – Isaac
    Feb 2 at 0:39

  • I used the perl line which this answer recommended and it worked for the instance that I was dealing with. I may replace it with @isaac's line if that's the consensus. Can someone comment on the difference or recommend an official resource so I can understand that syntax for myself?
    – Timothy Swan
    Feb 2 at 17:43

  • That line is removing the non-characters U+FDD0-U+FDEF and fffe-ffff in each of the Unicode planes (fffe-ffff, 1fffe-1ffff, 2fffe-2ffff, … up to 10fffe-10ffff). Non-characters should not be used in interchange with other users.
    – Isaac
    Feb 2 at 17:58

  • @TimothySwan, yeah, I just copied the list from your question. The beginning of isaac's list looks the same (though I think the range \xD-\xD is just redundant, it's the same as just \x0d), but it does exclude #fdd0 to #fdff and all the higher ones where the low 16 bits are fffe or ffff.
    – ilkkachu
    Feb 3 at 0:31

  • @isaac, though do note that the link you refer to also has this question and answer: "Q: Can noncharacters simply be deleted from input text? A: No. Doing so can lead to security problems." So it would probably be better to do something else about them. Replace with some known placeholder? Just reject the whole input?
    – ilkkachu
    Feb 3 at 12:21












answered Feb 1 at 21:30 by ilkkachu; edited Feb 3 at 20:13










