How can I remove the BOM from a UTF-8 file?

27 votes

I have a file in UTF-8 encoding with a BOM and want to remove the BOM. Are there any Linux command-line tools to remove the BOM from the file?

$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines

edited Jul 23 '17 at 10:06 by Michael Homer

asked Jul 23 '17 at 10:05 by m13r

Similar: AWK with BOM: Is there any cool way to handle Unicode BOM with regexp?
– Stéphane Chazelas
Jul 23 '17 at 10:40

I made a fairly simple tool to do just that a few months ago: oskog97.com/read/?path=/small-scripts/killbom&referer=/… Might be worth installing something like it in /usr/local/bin if you have many UTF-8 encoded files with BOMs.
– Oskar Skog
Jul 23 '17 at 11:24

command-line files unicode


6 Answers

38 votes, accepted

If you're not sure whether the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't.

sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

sed -i '1s/^\xEF\xBB\xBF//' orig.txt
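
As a minimal sketch of the same substitution used as a filter: the function name `strip_bom` and the generated sample input are illustrative assumptions, not part of the answer. It assumes GNU sed (for the `\xHH` escapes) and forces the C locale so that the three BOM bytes are matched as individual bytes.

```shell
# strip_bom: hypothetical helper name. Copies stdin to stdout and drops a
# leading UTF-8 BOM (bytes EF BB BF) if the first line starts with one.
# Assumes GNU sed, which understands \xHH escapes in regular expressions.
strip_bom() {
    LC_ALL=C sed '1s/^\xEF\xBB\xBF//'
}

# Demo on generated input: the first stream starts with a BOM, the second
# does not. od -c prints the resulting bytes so the difference is visible.
printf '\357\273\277<root/>\n' | strip_bom | od -An -c
printf '<root/>\n' | strip_bom | od -An -c
```

Because the substitution is anchored to the start of line 1, input without a BOM passes through unchanged.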

edited Jul 24 '17 at 7:57 by Stéphane Chazelas

answered Jul 23 '17 at 14:08 by CSM

This may not work in a UTF-8 locale, but prepending a locale override to C or POSIX will always work.
– hildred
Jul 23 '17 at 15:29

@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
– m13r
Jul 24 '17 at 6:55

@m13r, It depends on the version of sed and compile options. In the failure case a very new version of sed with Unicode character classes will bring the three-byte sequence in as a single character, which does not match the three-character sequence. However, in such a case you can do a sixteen-bit character match. However, this is a new feature and not universally present. If you want to test, I recommend compiling the latest version.
– hildred
Jul 24 '17 at 16:25

To fix it to work with a Unicode-enabled sed, do LC_ALL=C sed '1s/^\xEF\xBB\xBF//'
– Joshua
Jul 24 '17 at 17:41

@CSM nice, but for one special case it does not work: Before: -<U+FEFF>chapterxxx After: +chapterxxx^M Explanation: Using MS Word for typos in a LaTeX file. LaTeX under Linux is showing the errors mentioned. Output is from a git system. How could I alter the expression to catch this special case too?
– Cutton Eye
Feb 20 at 15:55

42 votes

A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix test.xml

answered Jul 23 '17 at 10:42 by Stéphane Chazelas

I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
– Johan Myréen
Jul 23 '17 at 14:02

What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
– ilkkachu
Jul 23 '17 at 14:09

Comments are not for extended discussion; this conversation has been moved to chat.
– terdon♦
Jul 24 '17 at 14:07

Is there a way of not converting the line endings and just removing the BOM with dos2unix?
– m13r
Jul 25 '17 at 7:55

@m13r Then use the sed script in this answer. That will remove only the BOM (if it exists); nothing else will be changed.
– Arrow
Jul 26 '17 at 5:51

15 votes

It is possible to remove the BOM from a file with the tail command:

tail -c +4 withBOM.txt > withoutBOM.txt
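
Note that tail -c +4 skips the first three bytes whether or not they are a BOM. A minimal sketch of a guarded variant follows; the helper names `has_bom` and `strip_bom_file` are hypothetical, and it assumes a `head` that supports `-c` (GNU and BSD versions do).

```shell
# has_bom FILE: succeeds if FILE starts with the bytes EF BB BF.
# Hypothetical helper name; relies on head -c, od and tr.
has_bom() {
    [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_bom_file IN OUT: hypothetical wrapper around the tail command above.
# Skips the first three bytes only when they are a BOM; otherwise it copies
# the input unchanged.
strip_bom_file() {
    if has_bom "$1"; then
        tail -c +4 -- "$1" > "$2"
    else
        cat -- "$1" > "$2"
    fi
}
```

It could be called as strip_bom_file withBOM.txt withoutBOM.txt; a file that has no BOM is copied byte for byte instead of losing its first three bytes.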

edited Jul 24 '17 at 5:49

answered Jul 23 '17 at 10:05 by m13r

Why 4? The BOM has 3 bytes.
– deviantfan
Jul 23 '17 at 17:12

@deviantfan Which is why you need to start at the 4th byte if you want to skip it.
– Stéphane Chazelas
Jul 23 '17 at 18:33

tail is using 1-based indexing?! WTF!
– CodesInChaos
Jul 23 '17 at 19:31

@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
– Stéphane Chazelas
Jul 23 '17 at 23:05

@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
– dave_thompson_085
Jul 24 '17 at 6:16

8 votes

Using VIM

Open file in VIM:
```
vi text.xml
```

Remove BOM encoding:
```
:set nobomb
```

Save and quit:
```
:wq
```

edited Jan 4 at 17:55

answered Dec 24 '17 at 18:05 by Joshua Pinter

4 votes

You can use

LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename

to remove the byte order mark from the beginning of the file, if it has any, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the Byte Order Mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.

I personally like to have this as ~/bin/ms-fix; for example, as

#!/bin/dash
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
    for FILE in "$@" ; do
        sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
    done
else
    exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
fi

so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run

find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix

or, if I just want to look at such a file, without modifying it, I can run

~/bin/ms-fix < filename | less

and not see the ugly <U+FEFF> in my UTF-8 terminal.
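
The -i.old backup behaviour described above can be tried on a throwaway file first. The following is a rough sketch, assuming GNU sed; the function name `fix_with_backup` and the sample file are made up for illustration.

```shell
# fix_with_backup FILE: hypothetical wrapper around the sed command from this
# answer. Strips a leading UTF-8 BOM and trailing CRs in place, keeping the
# original as FILE.old. Assumes GNU sed (\r and \xHH escapes, -i with suffix).
fix_with_backup() {
    LC_ALL=C sed -e 's/\r$//' -e '1 s/^\xef\xbb\xbf//' -i.old -- "$1"
}

# Demo in a scratch directory: build a file with a BOM and CR LF line
# endings, fix it, then dump both the cleaned file and the backup.
workdir=$(mktemp -d)
printf '\357\273\277first\r\nsecond\r\n' > "$workdir/sample.txt"
fix_with_backup "$workdir/sample.txt"
od -An -c "$workdir/sample.txt"
od -An -c "$workdir/sample.txt.old"
rm -rf "$workdir"
```

Keeping the .old copy makes it easy to compare the two files, or to revert, before running the script over a whole source tree.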

edited Jul 24 '17 at 14:25

answered Jul 23 '17 at 19:10 by Nominal Animal

Why not simply sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@"?
– Stéphane Chazelas
Jul 24 '17 at 14:02

@StéphaneChazelas: Because I want the script to exit immediately if there is an issue with a replacement, which sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
– Nominal Animal
Jul 24 '17 at 14:24

@StéphaneChazelas: The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be considered options by sed. I edited those into my answer; thank you for the reminder!
– Nominal Animal
Jul 24 '17 at 14:27

0 votes

Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)

One small drawback: only the plain C++ source code is available for download. You have to create the makefile yourself (with CMake, for example) and compile it; binaries are not provided on this page.

answered 15 mins ago by Wernfried Domscheit


6 Answers
6

active

oldest

votes

6 Answers
6

active

oldest

votes

up vote
38
down vote

accepted

If you're not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't.

sed '1s/^xEFxBBxBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

sed -i '1s/^xEFxBBxBF//' orig.txt

edited Jul 24 '17 at 7:57

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 14:08

CSM

60244

4

this may not work in a utf8 locale, but prepending a locale override to c or posix will always work.
â€“Â hildred
Jul 23 '17 at 15:29

3

@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
â€“Â m13r
Jul 24 '17 at 6:55

2

@m13r, It depends on the version of sed and compile options. In the failure case a very new version of sed with Unicode character classes will bring the three byte sequence in as a single character which does not match the three character sequence. However in such case you can do a sixteen bit character match. However this is a new feature and not universally present. If you want to test I recommend compiling the latest version.
â€“Â hildred
Jul 24 '17 at 16:25

3

To fix it to work with a unicode-enabled sed do LC_ALL=C sed '1s/^xEFxBBxBF//'
â€“Â Joshua
Jul 24 '17 at 17:41

@CSM nice, but for one special case it does not work: Bevore: -<U+FEFF>chapterxxx After: +chapterxxx^M Explanation: Using MS-word for typos in latex-file. Latex under Linux is showing errors mentioned. Output is from a git system. How could I alter the expression to catch this special case too?
â€“Â Cutton Eye
Feb 20 at 15:55

Â |Â
show 2 more comments

up vote
38
down vote

accepted

If you're not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't.

sed '1s/^xEFxBBxBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

sed -i '1s/^xEFxBBxBF//' orig.txt

edited Jul 24 '17 at 7:57

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 14:08

CSM

60244

4

this may not work in a utf8 locale, but prepending a locale override to c or posix will always work.
â€“Â hildred
Jul 23 '17 at 15:29

3

@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
â€“Â m13r
Jul 24 '17 at 6:55

2

@m13r, It depends on the version of sed and compile options. In the failure case a very new version of sed with Unicode character classes will bring the three byte sequence in as a single character which does not match the three character sequence. However in such case you can do a sixteen bit character match. However this is a new feature and not universally present. If you want to test I recommend compiling the latest version.
â€“Â hildred
Jul 24 '17 at 16:25

3

To fix it to work with a unicode-enabled sed do LC_ALL=C sed '1s/^xEFxBBxBF//'
â€“Â Joshua
Jul 24 '17 at 17:41

@CSM nice, but for one special case it does not work: Bevore: -<U+FEFF>chapterxxx After: +chapterxxx^M Explanation: Using MS-word for typos in latex-file. Latex under Linux is showing errors mentioned. Output is from a git system. How could I alter the expression to catch this special case too?
â€“Â Cutton Eye
Feb 20 at 15:55

Â |Â
show 2 more comments

up vote
38
down vote

accepted

If you're not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't.

sed '1s/^xEFxBBxBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

sed -i '1s/^xEFxBBxBF//' orig.txt

edited Jul 24 '17 at 7:57

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 14:08

CSM

60244

If you're not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't.

sed '1s/^xEFxBBxBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

sed -i '1s/^xEFxBBxBF//' orig.txt

edited Jul 24 '17 at 7:57

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 14:08

CSM

60244

edited Jul 24 '17 at 7:57

StÃ©phane Chazelas

288k54535873

edited Jul 24 '17 at 7:57

StÃ©phane Chazelas

288k54535873

edited Jul 24 '17 at 7:57

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 14:08

CSM

60244

answered Jul 23 '17 at 14:08

CSM

60244

answered Jul 23 '17 at 14:08

CSM

60244

4

this may not work in a utf8 locale, but prepending a locale override to c or posix will always work.
â€“Â hildred
Jul 23 '17 at 15:29

3

@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
â€“Â m13r
Jul 24 '17 at 6:55

2

@m13r, It depends on the version of sed and compile options. In the failure case a very new version of sed with Unicode character classes will bring the three byte sequence in as a single character which does not match the three character sequence. However in such case you can do a sixteen bit character match. However this is a new feature and not universally present. If you want to test I recommend compiling the latest version.
â€“Â hildred
Jul 24 '17 at 16:25

3

To fix it to work with a unicode-enabled sed do LC_ALL=C sed '1s/^xEFxBBxBF//'
â€“Â Joshua
Jul 24 '17 at 17:41

@CSM nice, but for one special case it does not work: Bevore: -<U+FEFF>chapterxxx After: +chapterxxx^M Explanation: Using MS-word for typos in latex-file. Latex under Linux is showing errors mentioned. Output is from a git system. How could I alter the expression to catch this special case too?
â€“Â Cutton Eye
Feb 20 at 15:55

Â |Â
show 2 more comments

4

this may not work in a utf8 locale, but prepending a locale override to c or posix will always work.
â€“Â hildred
Jul 23 '17 at 15:29

3

@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
â€“Â m13r
Jul 24 '17 at 6:55

2

@m13r, It depends on the version of sed and compile options. In the failure case a very new version of sed with Unicode character classes will bring the three byte sequence in as a single character which does not match the three character sequence. However in such case you can do a sixteen bit character match. However this is a new feature and not universally present. If you want to test I recommend compiling the latest version.
â€“Â hildred
Jul 24 '17 at 16:25

3

To fix it to work with a unicode-enabled sed do LC_ALL=C sed '1s/^xEFxBBxBF//'
â€“Â Joshua
Jul 24 '17 at 17:41

@CSM nice, but for one special case it does not work: Bevore: -<U+FEFF>chapterxxx After: +chapterxxx^M Explanation: Using MS-word for typos in latex-file. Latex under Linux is showing errors mentioned. Output is from a git system. How could I alter the expression to catch this special case too?
â€“Â Cutton Eye
Feb 20 at 15:55

this may not work in a utf8 locale, but prepending a locale override to c or posix will always work.
â€“Â hildred
Jul 23 '17 at 15:29

@hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
â€“Â m13r
Jul 24 '17 at 6:55

@m13r, It depends on the version of sed and compile options. In the failure case a very new version of sed with Unicode character classes will bring the three byte sequence in as a single character which does not match the three character sequence. However in such case you can do a sixteen bit character match. However this is a new feature and not universally present. If you want to test I recommend compiling the latest version.
â€“Â hildred
Jul 24 '17 at 16:25

To fix it to work with a unicode-enabled sed do LC_ALL=C sed '1s/^xEFxBBxBF//'
â€“Â Joshua
Jul 24 '17 at 17:41

@CSM nice, but for one special case it does not work: Bevore: -<U+FEFF>chapterxxx After: +chapterxxx^M Explanation: Using MS-word for typos in latex-file. Latex under Linux is showing errors mentioned. Output is from a git system. How could I alter the expression to catch this special case too?
â€“Â Cutton Eye
Feb 20 at 15:55

Â |Â
show 2 more comments

up vote
42
down vote

A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix test.xml

answered Jul 23 '17 at 10:42

StÃ©phane Chazelas

288k54535873

12

I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
â€“Â Johan MyrÃ©en
Jul 23 '17 at 14:02

13

What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
â€“Â ilkkachu
Jul 23 '17 at 14:09

Comments are not for extended discussion; this conversation has been moved to chat.
â€“Â terdonâ™¦
Jul 24 '17 at 14:07

Is there a way of not converting the line endings and just remove the BOM with dos2unix?
â€“Â m13r
Jul 25 '17 at 7:55

2

@m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
â€“Â Arrow
Jul 26 '17 at 5:51

Â |Â
show 2 more comments

up vote
42
down vote

A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix test.xml

answered Jul 23 '17 at 10:42

StÃ©phane Chazelas

288k54535873

12

I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
â€“Â Johan MyrÃ©en
Jul 23 '17 at 14:02

13

What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
â€“Â ilkkachu
Jul 23 '17 at 14:09

Comments are not for extended discussion; this conversation has been moved to chat.
â€“Â terdonâ™¦
Jul 24 '17 at 14:07

Is there a way of not converting the line endings and just remove the BOM with dos2unix?
â€“Â m13r
Jul 25 '17 at 7:55

2

@m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
â€“Â Arrow
Jul 26 '17 at 5:51

Â |Â
show 2 more comments

up vote
42
down vote

A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix test.xml

answered Jul 23 '17 at 10:42

StÃ©phane Chazelas

288k54535873

A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix test.xml

answered Jul 23 '17 at 10:42

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 10:42

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 10:42

StÃ©phane Chazelas

288k54535873

answered Jul 23 '17 at 10:42

StÃ©phane Chazelas

288k54535873

12

I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
â€“Â Johan MyrÃ©en
Jul 23 '17 at 14:02

13

What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
â€“Â ilkkachu
Jul 23 '17 at 14:09

Comments are not for extended discussion; this conversation has been moved to chat.
â€“Â terdonâ™¦
Jul 24 '17 at 14:07

Is there a way of not converting the line endings and just remove the BOM with dos2unix?
â€“Â m13r
Jul 25 '17 at 7:55

2

@m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
â€“Â Arrow
Jul 26 '17 at 5:51

Â |Â
show 2 more comments

12

I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
â€“Â Johan MyrÃ©en
Jul 23 '17 at 14:02

13

What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
â€“Â ilkkachu
Jul 23 '17 at 14:09

Comments are not for extended discussion; this conversation has been moved to chat.
â€“Â terdonâ™¦
Jul 24 '17 at 14:07

Is there a way of not converting the line endings and just remove the BOM with dos2unix?
â€“Â m13r
Jul 25 '17 at 7:55

2

@m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
â€“Â Arrow
Jul 26 '17 at 5:51

I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
â€“Â Johan MyrÃ©en
Jul 23 '17 at 14:02

What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
â€“Â ilkkachu
Jul 23 '17 at 14:09

Comments are not for extended discussion; this conversation has been moved to chat.
â€“Â terdonâ™¦
Jul 24 '17 at 14:07

Is there a way of not converting the line endings and just remove the BOM with dos2unix?
â€“Â m13r
Jul 25 '17 at 7:55

@m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
â€“Â Arrow
Jul 26 '17 at 5:51

Â |Â
show 2 more comments

up vote
15
down vote

It is possible to remove the BOM from a file with the tail command:

tail -c +4 withBOM.txt > withoutBOM.txt

edited Jul 24 '17 at 5:49

answered Jul 23 '17 at 10:05

m13r

7741714

Why 4? The BOM has 3 byte.
â€“Â deviantfan
Jul 23 '17 at 17:12

5

@deviantfan Which is why you need to start at the 4th byte if you want to skip it.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 18:33

6

tail is using 1 based indexing?! WTF!
â€“Â CodesInChaos
Jul 23 '17 at 19:31

3

@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 23:05

1

@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
â€“Â dave_thompson_085
Jul 24 '17 at 6:16

Â |Â
show 2 more comments

up vote
15
down vote

It is possible to remove the BOM from a file with the tail command:

tail -c +4 withBOM.txt > withoutBOM.txt

edited Jul 24 '17 at 5:49

answered Jul 23 '17 at 10:05

m13r

7741714

Why 4? The BOM has 3 byte.
â€“Â deviantfan
Jul 23 '17 at 17:12

5

@deviantfan Which is why you need to start at the 4th byte if you want to skip it.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 18:33

6

tail is using 1 based indexing?! WTF!
â€“Â CodesInChaos
Jul 23 '17 at 19:31

3

@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 23:05

1

@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
â€“Â dave_thompson_085
Jul 24 '17 at 6:16

Â |Â
show 2 more comments

up vote
15
down vote

It is possible to remove the BOM from a file with the tail command:

tail -c +4 withBOM.txt > withoutBOM.txt

edited Jul 24 '17 at 5:49

answered Jul 23 '17 at 10:05

m13r

7741714

It is possible to remove the BOM from a file with the tail command:

tail -c +4 withBOM.txt > withoutBOM.txt

edited Jul 24 '17 at 5:49

answered Jul 23 '17 at 10:05

m13r

7741714

edited Jul 24 '17 at 5:49

answered Jul 23 '17 at 10:05

m13r

7741714

answered Jul 23 '17 at 10:05

m13r

7741714

answered Jul 23 '17 at 10:05

m13r

7741714

Why 4? The BOM has 3 byte.
â€“Â deviantfan
Jul 23 '17 at 17:12

5

@deviantfan Which is why you need to start at the 4th byte if you want to skip it.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 18:33

6

tail is using 1 based indexing?! WTF!
â€“Â CodesInChaos
Jul 23 '17 at 19:31

3

@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 23:05

1

@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
â€“Â dave_thompson_085
Jul 24 '17 at 6:16

Â |Â
show 2 more comments

Why 4? The BOM has 3 byte.
â€“Â deviantfan
Jul 23 '17 at 17:12

5

@deviantfan Which is why you need to start at the 4th byte if you want to skip it.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 18:33

6

tail is using 1 based indexing?! WTF!
â€“Â CodesInChaos
Jul 23 '17 at 19:31

3

@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 23:05

1

@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
â€“Â dave_thompson_085
Jul 24 '17 at 6:16

Why 4? The BOM has 3 byte.
â€“Â deviantfan
Jul 23 '17 at 17:12

@deviantfan Which is why you need to start at the 4th byte if you want to skip it.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 18:33

tail is using 1 based indexing?! WTF!
â€“Â CodesInChaos
Jul 23 '17 at 19:31

@CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
â€“Â StÃ©phane Chazelas
Jul 23 '17 at 23:05

@deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
â€“Â dave_thompson_085
Jul 24 '17 at 6:16

Â |Â
show 2 more comments

up vote
8
down vote

Using VIM

Open file in VIM:
```
vi text.xml
```

Remove BOM encoding:
```
:set nobomb
```

Save and quit:
```
:wq
```

edited Jan 4 at 17:55

answered Dec 24 '17 at 18:05

Joshua Pinter

18415


up vote
4
down vote

You can use

LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename

I personally like to have this as ~/bin/ms-fix; for example, as

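#!/bin/dash
# Strip trailing CRs and a leading UTF-8 BOM (\r, \xHH and -i as used here need GNU sed).
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
    # File arguments: edit each in place, stopping at the first failure.
    for FILE in "$@" ; do
        sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
    done
else
    # No arguments: act as a filter from stdin to stdout.
    exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
fi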

so that if I need to apply this to, say, all C source files and headers (my old code from the MS-DOS era, for example!), I just run

find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix

or, if I just want to look at such a file, without modifying it, I can run

~/bin/ms-fix < filename | less

and not see the ugly <U+FEFF> in my UTF-8 terminal.
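
To check what the filter does before pointing it at real files, it can be run on a throwaway sample and the bytes compared. This is only a sketch: it assumes GNU sed (for the \r and \xHH escapes), and both the function name ms_fix_filter and the sample content are made up for illustration.

```shell
#!/bin/sh
# Filter form of the expression above: strip a trailing CR from every line
# and a UTF-8 BOM from the start of line 1. Assumes GNU sed.
ms_fix_filter() {
    LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
}

# Hypothetical sample: BOM followed by two CRLF-terminated lines.
sample=$(mktemp)
printf '\357\273\277int x;\r\nint y;\r\n' > "$sample"

# Dump the bytes before and after so the difference is visible.
od -An -c < "$sample"
ms_fix_filter < "$sample" | od -An -c

rm -f "$sample"
```

In the second dump the leading 357 273 277 and every \r should be gone, while the sample file itself is left untouched because sed runs here without -i.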

edited Jul 24 '17 at 14:25

answered Jul 23 '17 at 19:10

Nominal Animal


@StéphaneChazelas: Because I want the script to exit immediately if there is an issue with a replacement, which sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
– Nominal Animal
Jul 24 '17 at 14:24

@StéphaneChazelas: The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be considered options by sed. I edited those into my answer; thank you for the reminder!
– Nominal Animal
Jul 24 '17 at 14:27


up vote
0
down vote

Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)

One small drawback: you can download only the plain C++ source code. You have to create the makefile (with CMake, for example) and compile it yourself; binaries are not provided on this page.

answered 15 mins ago

Wernfried Domscheit

