How can I remove the BOM from a UTF-8 file?

I have a UTF-8 encoded file with a BOM and want to remove the BOM. Are there any Linux command-line tools that can remove it from the file?



$ file test.xml
test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
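
Before changing anything, it can help to confirm the BOM by looking at the first three bytes of the file, which are EF BB BF for a UTF-8 BOM. The following is only a minimal sketch and not part of the original question: the has_bom helper and the sample file names are hypothetical, and it assumes a head that supports -c (GNU and BSD versions do).

```shell
# Work in a scratch directory so no real files are touched.
tmp=$(mktemp -d)

# Sample files: one starting with the UTF-8 BOM (octal 357 273 277), one without.
printf '\357\273\277<?xml version="1.0"?>\n' > "$tmp/with-bom.xml"
printf '<?xml version="1.0"?>\n' > "$tmp/no-bom.xml"

# has_bom FILE (hypothetical helper): succeed if FILE starts with the bytes EF BB BF.
has_bom() {
    [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

for f in "$tmp/with-bom.xml" "$tmp/no-bom.xml"; do
    if has_bom "$f"; then
        echo "$f: BOM found"
    else
        echo "$f: no BOM"
    fi
done
```

Comparing raw bytes avoids depending on how a particular version of file words its description.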









  • Similar: AWK with BOM: Is there any cool way to handle Unicode BOM with regexp?
    – Stéphane Chazelas
    Jul 23 '17 at 10:40






  • I've made a fairly simple tool to do just that a few months ago: oskog97.com/read/?path=/small-scripts/killbom&referer=/… Might be worth installing something like it in /usr/local/bin if you have many UTF-8 encoded files with BOMs.
    – Oskar Skog
    Jul 23 '17 at 11:24














command-line files unicode






edited Jul 23 '17 at 10:06 by Michael Homer

asked Jul 23 '17 at 10:05 by m13r











6 Answers
38 votes, accepted










If you're not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn't.



sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt


You can also overwrite the existing file with the -i option:



sed -i '1s/^\xEF\xBB\xBF//' orig.txt
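
To see that the substitution only touches a leading BOM, it can be tried on two throwaway files. This is a sketch under assumptions rather than part of the answer: it assumes GNU sed (for the \xHH escapes), forces the C locale so the three bytes are matched as bytes, and uses made-up file names.

```shell
# Scratch directory with one file that has a BOM and one that does not.
tmp=$(mktemp -d)
printf '\357\273\277hello\n' > "$tmp/with-bom.txt"
printf 'hello\n' > "$tmp/no-bom.txt"

# Run the same substitution on both files, writing the result next to each one.
for f in "$tmp/with-bom.txt" "$tmp/no-bom.txt"; do
    LC_ALL=C sed '1s/^\xEF\xBB\xBF//' < "$f" > "$f.new"
done

# Both results should now be identical to the BOM-free file.
cmp "$tmp/with-bom.txt.new" "$tmp/no-bom.txt" && echo "BOM removed"
cmp "$tmp/no-bom.txt.new" "$tmp/no-bom.txt" && echo "BOM-free file left unchanged"
```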





  • This may not work in a UTF-8 locale, but prepending a locale override to C or POSIX will always work.
    – hildred
    Jul 23 '17 at 15:29






  • @hildred I've tested it with the en_US.UTF-8 locale and it worked. When will it fail?
    – m13r
    Jul 24 '17 at 6:55







  • @m13r, it depends on the version of sed and its compile options. In the failure case, a very new version of sed with Unicode character classes will read the three-byte sequence as a single character, which does not match the three-character sequence. In that case you can do a sixteen-bit character match instead, but this is a new feature and not universally present. If you want to test, I recommend compiling the latest version.
    – hildred
    Jul 24 '17 at 16:25







  • To make it work with a Unicode-enabled sed, use LC_ALL=C sed '1s/^\xEF\xBB\xBF//'
    – Joshua
    Jul 24 '17 at 17:41










  • @CSM Nice, but it does not work for one special case. Before: -<U+FEFF>chapterxxx After: +chapterxxx^M. Explanation: I used MS Word to fix typos in a LaTeX file, and LaTeX under Linux now shows the errors mentioned. The output is from a git system. How could I alter the expression to catch this special case too?
    – Cutton Eye
    Feb 20 at 15:55

















42 votes













A BOM doesn't make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.



dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.



dos2unix test.xml
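
The effect can be checked on a throwaway file first. The sketch below is an illustration only: the file names are made up, it assumes a reasonably recent dos2unix in which BOM removal is the default when converting to Unix line endings, and it skips the conversion if dos2unix is not installed.

```shell
tmp=$(mktemp -d)

# Sample Windows-style file: UTF-8 BOM plus CR LF line endings.
printf '\357\273\277first\r\nsecond\r\n' > "$tmp/win.txt"
# What the file should look like after conversion.
printf 'first\nsecond\n' > "$tmp/expected.txt"

if command -v dos2unix >/dev/null 2>&1; then
    # Converts the file in place.
    dos2unix "$tmp/win.txt"
    if cmp -s "$tmp/win.txt" "$tmp/expected.txt"; then
        echo "BOM and carriage returns removed"
    else
        echo "file differs from the expected result"
    fi
else
    echo "dos2unix is not installed"
fi
```

Note that this changes the line endings as well as the BOM, which may or may not be what you want.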





  • I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
    – Johan Myréen
    Jul 23 '17 at 14:02






  • What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
    – ilkkachu
    Jul 23 '17 at 14:09










  • Comments are not for extended discussion; this conversation has been moved to chat.
    – terdon♦
    Jul 24 '17 at 14:07










  • Is there a way of not converting the line endings and just remove the BOM with dos2unix?
    – m13r
    Jul 25 '17 at 7:55







  • @m13r Then use the sed script in this answer. That will remove only the BOM (if it exists); nothing else will be changed.
    – Arrow
    Jul 26 '17 at 5:51

















15 votes













It is possible to remove the BOM from a file with the tail command:



tail -c +4 withBOM.txt > withoutBOM.txt
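
Note that tail -c +4 drops the first three bytes whether or not they are a BOM. A guarded variant could look like the sketch below; strip_bom_tail is a hypothetical helper name, the file names are made up, and it assumes a head that supports -c.

```shell
tmp=$(mktemp -d)
printf '\357\273\277hello\n' > "$tmp/withBOM.txt"
printf 'hello\n' > "$tmp/plain.txt"

# tail -c +N starts the output at byte N, counting from 1.
printf 'abcdef' | tail -c +4; echo    # prints "def"

# strip_bom_tail IN OUT (hypothetical helper): skip three bytes only if they are the BOM.
strip_bom_tail() {
    if [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        tail -c +4 -- "$1" > "$2"
    else
        cat -- "$1" > "$2"
    fi
}

strip_bom_tail "$tmp/withBOM.txt" "$tmp/withoutBOM.txt"
strip_bom_tail "$tmp/plain.txt" "$tmp/plain.out"
```

With the guard in place, running the helper on a file without a BOM simply copies it.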





  • Why 4? The BOM has 3 bytes.
    – deviantfan
    Jul 23 '17 at 17:12






  • @deviantfan Which is why you need to start at the 4th byte if you want to skip it.
    – Stéphane Chazelas
    Jul 23 '17 at 18:33






  • tail is using 1-based indexing?! WTF!
    – CodesInChaos
    Jul 23 '17 at 19:31







  • @CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
    – Stéphane Chazelas
    Jul 23 '17 at 23:05






  • @deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
    – dave_thompson_085
    Jul 24 '17 at 6:16

















8 votes













Using VIM




  1. Open file in VIM:



    vi text.xml



  2. Remove BOM encoding:



    :set nobomb



  3. Save and quit:



    :wq
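
The three steps above can probably also be run without opening the editor interactively. The following is a hedged sketch, not something from the answer: it assumes a Vim build with multi-byte support is installed as vim, uses a made-up sample file, and relies on Vim's Ex mode (-e -s) with -c to pass the same commands.

```shell
tmp=$(mktemp -d)
printf '\357\273\277<a/>\n' > "$tmp/text.xml"

if command -v vim >/dev/null 2>&1; then
    # -e -s: Ex mode, silent; each -c runs one Ex command after the file is loaded.
    vim -e -s -c 'set nobomb' -c 'wq' "$tmp/text.xml" </dev/null
    # Show the first three bytes of the rewritten file.
    head -c 3 "$tmp/text.xml" | od -An -tx1
else
    echo "vim is not installed"
fi
```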






    4 votes













    You can use



    LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename


    to remove the byte order mark from the beginning of the file, if it has any, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the Byte Order Mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.
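
As an illustration of the -i.old behaviour described above, here is a small sketch on a throwaway file. It assumes GNU sed, and the file name sample.txt is made up.

```shell
tmp=$(mktemp -d)

# Sample file with a BOM and CR LF line endings.
printf '\357\273\277line one\r\nline two\r\n' > "$tmp/sample.txt"

# Edit in place, keeping the original as sample.txt.old.
LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i.old -- "$tmp/sample.txt"

od -c "$tmp/sample.txt"        # cleaned copy
od -c "$tmp/sample.txt.old"    # untouched original
```

Keeping the .old copy makes it easy to compare or roll back if the result is not what was expected.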




    I personally like to have this as ~/bin/ms-fix; for example, as



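    #!/bin/dash
    # Strip a leading UTF-8 BOM and convert CR LF line endings to LF.
    export LANG=C LC_ALL=C
    if [ $# -gt 0 ]; then
        # File arguments: edit each file in place, stopping at the first error.
        for FILE in "$@" ; do
            sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
        done
    else
        # No arguments: act as a filter from standard input to standard output.
        exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
    fi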


    so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run



    find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix


    or, if I just want to look at such a file, without modifying it, I can run



    ~/bin/ms-fix < filename | less


    and not see the ugly <U+FEFF> in my UTF-8 terminal.
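
Before running an in-place fix over a whole tree, it may be worth listing which files actually start with a BOM. The following dry-run sketch is an addition, not part of the answer: the directory and file names are made up, and it assumes a head that supports -c.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/src"
printf '\357\273\277int x;\r\n' > "$tmp/src/old.c"
printf 'int y;\n' > "$tmp/src/new.h"

# List matching files whose first three bytes are EF BB BF, without changing them.
find "$tmp/src" -type f -name '*.[CHch]' -exec sh -c '
    for f do
        if [ "$(head -c 3 -- "$f" | od -An -tx1 | tr -d " \n")" = "efbbbf" ]; then
            printf "%s\n" "$f"
        fi
    done
' sh {} + > "$tmp/bom-files.txt"

cat "$tmp/bom-files.txt"
```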






    • Why not simply sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@"?
      – Stéphane Chazelas
      Jul 24 '17 at 14:02










    • @StéphaneChazelas: Because I want the script to exit immediately if there is an issue with a replacement, which sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
      – Nominal Animal
      Jul 24 '17 at 14:24











    • @StéphaneChazelas: The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be considered options by sed. I edited those into my answer; thank you for the reminder!
      – Nominal Animal
      Jul 24 '17 at 14:27

















    0 votes













    Recently I found this tiny command-line tool, which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)

    One small drawback: only the plain C++ source code is available for download. You have to create the makefile yourself (with CMake, for example) and compile it, since binaries are not provided on this page.






    Accepted sed answer (38 votes): answered Jul 23 '17 at 14:08 by CSM, edited Jul 24 '17 at 7:57 by Stéphane Chazelas







    dos2unix answer (42 votes): answered Jul 23 '17 at 10:42 by Stéphane Chazelas







      • 12




        I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
        – Johan Myréen
        Jul 23 '17 at 14:02






      • 13




        What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
        – ilkkachu
        Jul 23 '17 at 14:09










      • Comments are not for extended discussion; this conversation has been moved to chat.
        – terdon♦
        Jul 24 '17 at 14:07










      • Is there a way of not converting the line endings and just remove the BOM with dos2unix?
        – m13r
        Jul 25 '17 at 7:55







      • 2




        @m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
        – Arrow
        Jul 26 '17 at 5:51












      • 12




        I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
        – Johan Myréen
        Jul 23 '17 at 14:02






      • 13




        What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
        – ilkkachu
        Jul 23 '17 at 14:09










      • Comments are not for extended discussion; this conversation has been moved to chat.
        – terdon♦
        Jul 24 '17 at 14:07










      • Is there a way of not converting the line endings and just remove the BOM with dos2unix?
        – m13r
        Jul 25 '17 at 7:55







      • 2




        @m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
        – Arrow
        Jul 26 '17 at 5:51







      12




      12




      I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
      – Johan Myréen
      Jul 23 '17 at 14:02




      I agree that a UTF-8 encoded BOM does not make sense, but believe it or not, there are lots of people who think it is a great idea that helps differentiate UTF-8 from other 8-bit encodings. So it is a matter of taste. Windows Notepad adds a BOM on purpose.
      – Johan Myréen
      Jul 23 '17 at 14:02




      13




      13




      What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
      – ilkkachu
      Jul 23 '17 at 14:09




      What does it matter if it makes sense or not, when the context is just a question on how to remove it? According to Wikipedia, Notepad requires the BOM to recognize a file as UTF-8, and Google Docs also adds it while exporting a file as text. I doubt they all do it by mistake.
      – ilkkachu
      Jul 23 '17 at 14:09












      Comments are not for extended discussion; this conversation has been moved to chat.
      – terdon♦
      Jul 24 '17 at 14:07




      Comments are not for extended discussion; this conversation has been moved to chat.
      – terdon♦
      Jul 24 '17 at 14:07












      Is there a way of not converting the line endings and just remove the BOM with dos2unix?
      – m13r
      Jul 25 '17 at 7:55





      Is there a way of not converting the line endings and just remove the BOM with dos2unix?
      – m13r
      Jul 25 '17 at 7:55





      2




      2




      @m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
      – Arrow
      Jul 26 '17 at 5:51




      @m13r Then use the sed script in this answer. That will remove only the bom (if it exist), nothing else will be changed.
      – Arrow
      Jul 26 '17 at 5:51










      up vote
      15
      down vote













      It is possible to remove the BOM from a file with the tail command:



      tail -c +4 withBOM.txt > withoutBOM.txt





      share|improve this answer






















      • Why 4? The BOM has 3 byte.
        – deviantfan
        Jul 23 '17 at 17:12






      • 5




        @deviantfan Which is why you need to start at the 4th byte if you want to skip it.
        – Stéphane Chazelas
        Jul 23 '17 at 18:33






      • 6




        tail is using 1-based indexing?! WTF!
        – CodesInChaos
        Jul 23 '17 at 19:31







      • 3




        @CodesInChaos, tail -c -1 or tail -c 1 (what tail is generally used for) is the content starting with the last byte, tail -c +1 starting with the first byte. tail -c 0/tail -c +0 for that would be a lot more unintuitive.
        – Stéphane Chazelas
        Jul 23 '17 at 23:05






      • 1




        @deviantfan: (dd bs=1 count=3 of=/dev/null; cat) <input >output. Or with GNU (head -c3 >/dev/null; cat) -- even in UTF8 or other non-singlebyte locale; GNU head does 'char'=byte.
        – dave_thompson_085
        Jul 24 '17 at 6:16
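The 1-based offset discussed in these comments is easy to check on a short string, along with the dd variant. This is only a small demonstration; the sample string is arbitrary:

```shell
#!/bin/sh
# tail -c +N starts output AT byte N (counting from 1),
# so +4 skips exactly the first three bytes.
printf 'abcdef' | tail -c +4                                        # prints: def
echo
# +1 starts at the first byte, i.e. nothing is skipped.
printf 'abcdef' | tail -c +1                                        # prints: abcdef
echo
# Same effect as +4: read and discard three bytes with dd, then copy the rest.
printf 'abcdef' | (dd bs=1 count=3 of=/dev/null 2>/dev/null; cat)   # prints: def
echo
```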














      edited Jul 24 '17 at 5:49

























      answered Jul 23 '17 at 10:05









      m13r











      up vote
      8
      down vote













      Using VIM




      1. Open the file in Vim:

        vi test.xml

      2. Remove the BOM:

        :set nobomb

      3. Save and quit:

        :wq
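The same steps can probably be run without opening the editor interactively. A sketch, assuming Vim itself (not a minimal vi) is installed as vim and the session uses a UTF-8 locale; remove_bom_vim is a hypothetical wrapper name:

```shell
#!/bin/sh
# remove_bom_vim FILE...: hypothetical wrapper around the steps above.
# -e -s starts Vim in silent Ex mode; each -c runs one Ex command.
remove_bom_vim() {
    for f in "$@"; do
        # Clear the 'bomb' option, then write the file and quit.
        vim -e -s -c 'set nobomb' -c 'wq' -- "$f" </dev/null || return 1
    done
}
```

This rewrites each file in place, so it is worth trying on a copy first.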
































          edited Jan 4 at 17:55

























          answered Dec 24 '17 at 18:05









          Joshua Pinter





















              up vote
              4
              down vote













              You can use



              LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename


              to remove the byte order mark from the beginning of the file, if it has one, and to convert any CR LF line endings to LF only. The LANG=C LC_ALL=C prefix sets the environment so that the command runs in the C locale (also known as the POSIX locale), where the three bytes forming the byte order mark are treated as individual bytes. The -i option tells sed to edit the file in place. If you use -i.old instead, sed saves the original file as filename.old and writes the new file (with the modifications, if any) as filename. The \xHH escapes and this form of -i are GNU sed features.
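For a sed that lacks the \xHH escapes, the same two substitutions can be sketched by letting printf produce the bytes. This is a variant under stated assumptions, not the script from this answer: strip_bom_crlf is a hypothetical name, and it filters standard input to standard output instead of editing in place.

```shell
#!/bin/sh
# strip_bom_crlf: hypothetical filter, stdin -> stdout.
# Removes a leading UTF-8 BOM and turns CR LF line endings into LF.
strip_bom_crlf() {
    bom=$(printf '\357\273\277')   # EF BB BF written as octal escapes
    cr=$(printf '\r')              # a literal carriage return
    # C locale so that sed matches single bytes rather than characters.
    LC_ALL=C sed -e "s/$cr\$//" -e "1s/^$bom//"
}

# Check on a small sample: the hex dump should show only "hi" and LF.
printf '\357\273\277hi\r\n' | strip_bom_crlf | od -An -tx1
```

Piping the result through od makes it easy to confirm that neither the BOM bytes nor the carriage return remain.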




              I personally like to have this as ~/bin/ms-fix; for example, as



              #!/bin/dash
              export LANG=C LC_ALL=C
              if [ $# -gt 0 ]; then
                  for FILE in "$@" ; do
                      sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
                  done
              else
                  exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
              fi


              so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run



              find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix


              or, if I just want to look at such a file, without modifying it, I can run



              ~/bin/ms-fix < filename | less


              and not see the ugly <U+FEFF> in my UTF-8 terminal.
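Before rewriting a whole tree it can be useful to see which files actually start with a BOM. A rough sketch, assuming head -c is available; has_bom and list_bom_files are hypothetical helper names, and the '*.[CHch]' pattern is simply the one used in the find command above:

```shell
#!/bin/sh
# has_bom FILE: succeeds if FILE starts with the bytes EF BB BF.
has_bom() {
    [ "$(head -c 3 -- "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# list_bom_files DIR: print the C sources/headers under DIR that carry a BOM.
# Simplification: file names containing newlines are not handled.
list_bom_files() {
    find "$1" -type f -name '*.[CHch]' | while IFS= read -r f; do
        if has_bom "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running list_bom_files . first gives a dry run; only the listed files would be changed by the BOM substitution.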




























              • Why not simply sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@"?
                – Stéphane Chazelas
                Jul 24 '17 at 14:02










              • @StéphaneChazelas: Because I want the script to exit immediately if there is an issue with a replacement, which sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$@" does not do; it does return an exit code, but it processes all files listed in the argument list before exiting.
                – Nominal Animal
                Jul 24 '17 at 14:24











              • @StéphaneChazelas: The -- before the file name(s) is, of course, important: without it, file names beginning with a dash may be considered options by sed. I edited those into my answer; thank you for the reminder!
                – Nominal Animal
                Jul 24 '17 at 14:27














              edited Jul 24 '17 at 14:25

























              answered Jul 23 '17 at 19:10









              Nominal Animal











              up vote
              0
              down vote













              Recently I found this tiny command-line tool, which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)



              One small drawback: only the plain C++ source code is available for download. You have to create the makefile (with CMake, for example) and compile it yourself; binaries are not provided on that page.






























                  answered 15 mins ago









                  Wernfried Domscheit
