How can I remove the BOM from a UTF-8 file?

27 votes
I have a UTF-8 encoded file that starts with a BOM, and I want to remove the BOM. Are there any Linux command-line tools that can do this?
    $ file test.xml
    test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
Tags: command-line, files, unicode
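As a complement to file, the first three bytes can be inspected directly; a UTF-8 BOM is the byte sequence EF BB BF. This is only a minimal sketch: first3_hex is a hypothetical helper name, the sample file is a throwaway created for the demo, and it assumes od supports -N and that mktemp is available.

```shell
#!/bin/sh
# first3_hex FILE: print the first three bytes of FILE as lowercase hex digits.
# (first3_hex is a hypothetical helper name, not part of any existing tool.)
first3_hex() {
    od -An -tx1 -N3 < "$1" | tr -d ' \n'
}

# Demo on a throwaway sample file that starts with the UTF-8 BOM (EF BB BF).
sample=$(mktemp)
printf '\357\273\277<?xml version="1.0"?>\n' > "$sample"

if [ "$(first3_hex "$sample")" = "efbbbf" ]; then
    echo "BOM present"
else
    echo "no BOM"
fi
rm -f -- "$sample"
```

Reading the file through a redirection rather than passing its name to od avoids surprises with file names that begin with a dash.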
asked Jul 23 '17 at 10:05 by m13r; edited Jul 23 '17 at 10:06 by Michael Homer
Similar: AWK with BOM: Is there any cool way to handle Unicode BOM with regexp?
– Stéphane Chazelas, Jul 23 '17 at 10:40
I made a fairly simple tool to do just that a few months ago: oskog97.com/read/?path=/small-scripts/killbom&referer=/… Might be worth installing something like it in /usr/local/bin if you have many UTF-8 encoded files with BOMs.
– Oskar Skog, Jul 23 '17 at 11:24
6 Answers
38 votes, accepted
If you're not sure whether the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't:
    sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt
You can also overwrite the existing file with the -i option:
    sed -i '1s/^\xEF\xBB\xBF//' orig.txt
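The \xHH escapes are a GNU sed extension. Where they are not available, one possible workaround is to let printf produce the three BOM bytes and splice them into the sed expression, running sed in the C locale so the pattern is matched byte by byte. This is only a sketch under those assumptions; strip_bom_sed is a hypothetical helper name and the sample files are created just for the demo.

```shell
#!/bin/sh
# strip_bom_sed FILE: print FILE without a leading UTF-8 BOM, if it has one.
# (strip_bom_sed is a hypothetical helper name.)
strip_bom_sed() {
    bom=$(printf '\357\273\277')      # the BOM bytes EF BB BF, written in octal
    LC_ALL=C sed "1s/^$bom//" < "$1"  # C locale: sed matches bytes, not characters
}

# Demo: one sample with a BOM, one without.
with=$(mktemp)
without=$(mktemp)
printf '\357\273\277hello\n' > "$with"
printf 'hello\n' > "$without"
strip_bom_sed "$with"      # prints: hello
strip_bom_sed "$without"   # prints: hello (unchanged)
rm -f -- "$with" "$without"
```

Octal escapes in the printf format string are used here because they are specified by POSIX, whereas \x escapes in printf are not.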
This may not work in a UTF-8 locale, but prepending a locale override to C or POSIX will always work.
– hildred, Jul 23 '17 at 15:29
@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
– m13r, Jul 24 '17 at 6:55
@m13r, It depends on the version of sed and its compile options. In the failure case, a very new version of sed with Unicode character classes will read the three-byte sequence as a single character, which does not match the three-character sequence. In that case you can do a sixteen-bit character match instead, but this is a new feature and not universally present. If you want to test, I recommend compiling the latest version.
– hildred, Jul 24 '17 at 16:25
To make it work with a Unicode-enabled sed, use LC_ALL=C sed '1s/^\xEF\xBB\xBF//'
– Joshua, Jul 24 '17 at 17:41
@CSM Nice, but it does not work for one special case. Before: -<U+FEFF>chapterxxx After: +chapterxxx^M Explanation: I used MS Word to fix typos in a LaTeX file, and LaTeX under Linux now shows the errors mentioned. The output is from git. How could I alter the expression to catch this special case too?
– Cutton Eye, Feb 20 at 15:55
42 votes
A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.
dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.
    dos2unix test.xml
I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
– Johan Myréen, Jul 23 '17 at 14:02
What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
– ilkkachu, Jul 23 '17 at 14:09
Comments are not for extended discussion; this conversation has been moved to chat.
– terdon♦, Jul 24 '17 at 14:07
Is there a way to just remove the BOM with dos2unix, without converting the line endings?
– m13r, Jul 25 '17 at 7:55
@m13r Then use the sed script in this answer. That will remove only the BOM (if it exists); nothing else will be changed.
– Arrow, Jul 26 '17 at 5:51
15 votes
It is possible to remove the BOM from a file with the tail command:
    tail -c +4 withBOM.txt > withoutBOM.txt
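Note that tail -c +4 drops the first three bytes unconditionally, so on a file without a BOM it would remove real content. A guarded variant could check the leading bytes first. This is only a sketch: strip_bom_tail is a hypothetical helper name, and it assumes od supports -N and that mktemp is available for the demo.

```shell
#!/bin/sh
# strip_bom_tail FILE: print FILE, skipping the first three bytes only when
# they are the UTF-8 BOM (EF BB BF).
# (strip_bom_tail is a hypothetical helper name.)
strip_bom_tail() {
    if [ "$(od -An -tx1 -N3 < "$1" | tr -d ' \n')" = "efbbbf" ]; then
        tail -c +4 < "$1"   # output starts at the 4th byte
    else
        cat < "$1"          # no BOM: pass the file through untouched
    fi
}

# Demo: one sample with a BOM, one without.
with=$(mktemp)
without=$(mktemp)
printf '\357\273\277data\n' > "$with"
printf 'data\n' > "$without"
strip_bom_tail "$with"      # prints: data
strip_bom_tail "$without"   # prints: data (a bare tail -c +4 would print only "a")
rm -f -- "$with" "$without"
```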
Why 4? The BOM has 3 bytes.
– deviantfan, Jul 23 '17 at 17:12
@deviantfan Which is why you need to start at the 4th byte if you want to skip it.
– Stéphane Chazelas, Jul 23 '17 at 18:33
tail is using 1-based indexing?! WTF!
– CodesInChaos, Jul 23 '17 at 19:31
@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
– Stéphane Chazelas, Jul 23 '17 at 23:05
@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
– dave_thompson_085, Jul 24 '17 at 6:16
8 votes
Using VIM
Open file in VIM:
    vi text.xml
Remove BOM encoding:
    :set nobomb
Save and quit:
    :wq
4 votes
You can use
    LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename
to remove the byte order mark from the beginning of the file, if it has one, and to convert any CR LF newlines to LF only. The LANG=C LC_ALL=C prefix tells the shell to run the command in the default C locale (also known as the default POSIX locale), where the three bytes forming the byte order mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.
I personally like to have this as ~/bin/ms-fix; for example, as
    #!/bin/dash
    export LANG=C LC_ALL=C
    if [ $# -gt 0 ]; then
        # File arguments given: fix each file in place, stop at the first failure.
        for FILE in "$@" ; do
            sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
        done
    else
        # No arguments: act as a filter from standard input to standard output.
        exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
    fi
so that if I need to apply this to, say, all C source files and headers (my old code from the MS-DOS era, for example!), I just run
    find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix
or, if I just want to look at such a file, without modifying it, I can run
    ~/bin/ms-fix < filename | less
and not see the ugly <U+FEFF> in my UTF-8 terminal.
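The -i.old behaviour described in this answer can be tried out safely on a throwaway file. The sketch below is not the author's script: fix_ms_inplace is a hypothetical helper name, the BOM and CR bytes are produced with printf instead of GNU sed's \x and \r escapes, and it assumes a sed that accepts -i with an attached suffix (GNU and BSD sed do; -i is not specified by POSIX).

```shell
#!/bin/sh
# fix_ms_inplace FILE: strip a leading UTF-8 BOM and trailing CRs from FILE
# in place, keeping the original as FILE.old.
# (fix_ms_inplace is a hypothetical helper name.)
fix_ms_inplace() {
    bom=$(printf '\357\273\277')   # BOM bytes EF BB BF
    cr=$(printf '\r')              # carriage return
    LC_ALL=C sed -i.old -e "s/$cr\$//" -e "1s/^$bom//" -- "$1"
}

# Demo on a throwaway file with a BOM and CR LF line endings.
f=$(mktemp)
printf '\357\273\277a\r\nb\r\n' > "$f"
fix_ms_inplace "$f"
od -c < "$f"        # the fixed file
od -c < "$f.old"    # the untouched backup
rm -f -- "$f" "$f.old"
```

Keeping the .old copy makes it easy to compare or roll back if a file turns out not to be text after all.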
Why not simply sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@"?
– Stéphane Chazelas, Jul 24 '17 at 14:02
@StéphaneChazelas: Because I want the script to exit immediately if there is an issue with a replacement, which sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
– Nominal Animal, Jul 24 '17 at 14:24
@StéphaneChazelas: The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be considered options by sed. I edited those into my answer; thank you for the reminder!
– Nominal Animal, Jul 24 '17 at 14:27
0 votes
Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)
One small drawback: only the plain C++ source code is available for download. You have to create the makefile yourself (with CMake, for example) and compile it; binaries are not provided on this page.
Accepted answer (sed): answered Jul 23 '17 at 14:08 by CSM; edited Jul 24 '17 at 7:57 by Stéphane Chazelas
dos2unix answer: answered Jul 23 '17 at 10:42 by Stéphane Chazelas
tail answer: answered Jul 23 '17 at 10:05 by m13r; edited Jul 24 '17 at 5:49
VIM answer: answered Dec 24 '17 at 18:05 by Joshua Pinter; edited Jan 4 at 17:55
add a comment |Â
add a comment |Â
up vote
4
down vote
You can use
LANG=C LC_ALL=C sed -e 's/r$// ; 1 s/^xefxbbxbf//' -i -- filename
to remove the byte order mark from the beginning of the file, if it has any, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the Byte Order Mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.
I personally like to have this as ~/bin/fix-ms; for example, as
#!/bin/dash
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
for FILE in "$@" ; do
sed -e 's/r$// ; 1 s/^xefxbbxbf//' -i -- "$FILE" || exit 1
done
else
exec sed -e 's/r$// ; 1 s/^xefxbbxbf//'
fi
so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run
find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix
or, if I just want to look at such a file, without modifying it, I can run
~/bin/ms-fix < filename | less
and not see the ugly <U+FEFF> in my UTF-8 terminal.
Why not simply sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@"?
- Stéphane Chazelas
Jul 24 '17 at 14:02
@StéphaneChazelas: Because I want the script to exit immediately if there is an issue with a replacement, which sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
- Nominal Animal
Jul 24 '17 at 14:24
@StéphaneChazelas: The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be considered options by sed. I edited those into my answer; thank you for the reminder!
- Nominal Animal
Jul 24 '17 at 14:27
edited Jul 24 '17 at 14:25
answered Jul 23 '17 at 19:10
Nominal Animal
up vote
0
down vote
Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at github)
One small drawback: only the plain C++ source code is available for download. You have to generate the makefile (with CMake, for example) and compile it yourself; binaries are not provided on this page.
answered 15 mins ago
Wernfried Domscheit