sed escaped character not matching in large file

I have a large (~180 MB) XML file with some bad characters in it, for example:

<Data ss:Type="String">7402953^@</Data>

The ^@ part should be removed. This was supposed to do the job:

sed -i 's/\^@//g' /tmp/large.xml

but for some reason it doesn't work as expected when the string is in my large XML file. If the file is only a few KB in size, sed works perfectly.

It looks like a bug, but I doubt it is one, because the task is so basic. Am I doing something wrong?

asked May 8 at 20:38
dMedia

  • Why do you say it is not working? Can you still see ^@ in the file after running the command? If so, try to isolate one example of a ^@ that was not replaced and take a small slice of the file containing it. Then make sure it really is ^@; you probably have something else in there. You could use xxd to be sure.
    – matsib.dev
    May 8 at 20:57

  • A null character (if that's what it is; they are displayed like that) can indicate a write error, and therefore an unspecified amount (possibly more than one line) of missing data. Use something like grep -C 10 -Pa '\x00' large.xml to test for null characters, and have a look at the surrounding lines of context; if there are unusually long lines or "jumps" in the file, you might have lost data during file creation.
    – Gaultheria
    May 8 at 22:03
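
The checks suggested in these comments can be sketched roughly as follows. This is a minimal sketch, not the commenters' exact commands: it builds a small hypothetical sample file with printf instead of reading /tmp/large.xml, and it uses od rather than xxd on the assumption that od is more widely installed.

```shell
# Hypothetical sample: one real NUL byte (0x00) right after the digits.
sample=$(mktemp)
printf '<Data ss:Type="String">7402953\000</Data>\n' > "$sample"

# Count real NUL bytes: keep only NULs, then count the bytes that remain.
nul_count=$(tr -cd '\000' < "$sample" | wc -c | tr -d ' ')

# Count lines containing the literal two-character text ^@ for comparison.
# grep -c exits non-zero when the count is 0, hence the "|| true".
literal_count=$(grep -a -c -F '^@' "$sample" || true)

echo "NUL bytes: $nul_count, lines with literal ^@: $literal_count"

# Dump the bytes; od -c displays a real NUL byte as \0.
od -c "$sample"

rm -f "$sample"
```

If the NUL count is non-zero while the literal count is zero, the ^@ seen in a viewer is most likely just the display form of a single NUL byte.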












2 Answers
Judging by your question (since it includes no examples), I would say that the ^@ in the big file is not actually the two characters ^ and @, but a single unprintable character.

You can type that unprintable character in the terminal with Ctrl+V followed by Ctrl+2.

Use that in sed instead of the characters ^ and @ and it should be fine.

Also remove the backslash escape, because it is not needed for the unprintable character.

edited May 8 at 21:07
answered May 8 at 20:54
Iskustvo

  • Yes, it was actually \x00 that was displayed as ^@ (and it got copied to my smaller test file as those two literal characters). Typing the unprintable character did not work for me (or I didn't understand how to press Ctrl+V, Ctrl+2), so I just used sed -i 's/\x00//g' /tmp/large.xml and now it works as expected.
    – dMedia
    May 9 at 7:21

  • I am not sure why Ctrl+V followed by Ctrl+2 did not work. That is the default behavior in bash and zsh, so I assumed it was the same in every shell. Maybe you changed the key bindings, or maybe you are using some other shell that doesn't have the same key bindings. Anyway, I'm glad you solved your issue, so this is not that important now :D
    – Iskustvo
    May 9 at 8:14
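
The difference between the two cases in this thread can be reproduced on small files. The following is only a sketch under some assumptions: the sample contents and temporary file names are made up for illustration, and the \x00 escape assumes GNU sed (other sed implementations may not support it). A command-line argument cannot contain a NUL byte, which may be why typing the character directly did not help; tr is shown as an alternative that does not depend on sed escapes.

```shell
# Two hypothetical samples: one with a real NUL byte, one with the
# literal characters ^ and @ (what a copy and paste from a viewer produces).
with_nul=$(mktemp)
with_literal=$(mktemp)
printf 'id=7402953\000;\n' > "$with_nul"
printf 'id=7402953^@;\n'   > "$with_literal"

# The escaped pattern \^@ matches the two literal characters...
literal_fixed=$(sed 's/\^@//g' "$with_literal")

# ...but leaves the file with the real NUL byte unchanged.
if sed 's/\^@//g' "$with_nul" | cmp -s - "$with_nul"; then
    unchanged=yes
else
    unchanged=no
fi

# GNU sed's \x00 escape matches the real NUL byte.
sed_fixed=$(sed 's/\x00//g' "$with_nul")

# tr deletes a single byte value without relying on sed escapes.
tr_fixed=$(tr -d '\000' < "$with_nul")

echo "literal: $literal_fixed | unchanged: $unchanged | sed: $sed_fixed | tr: $tr_fixed"

rm -f "$with_nul" "$with_literal"
```

This matches what dMedia reported: the small test file contained the literal characters, so the original command worked there but not on the large file.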












awk

If a solution using awk is acceptable, this will remove all non-printable characters.

This works in GNU awk (Linux) and BSD awk (Mac).

awk '{ gsub(/[^[:print:][:blank:]]/,"",$0) ; print $0 }' input.xml > output.xml

  • gsub(/[^[:print:][:blank:]]/,"",$0)
    From each line of input, remove any unwanted characters.

    • [:print:]
      Any printable character.

    • [:blank:]
      Space or tab.

    • [^[:print:][:blank:]]
      Any character not included in those two classes.

  • print $0
    Print each line of input.

  • > output.xml
    Save the output to a file instead of printing it to the screen.

Do the same thing with fewer keystrokes (it's just a little harder to read):

awk '{gsub(/[^[:print:][:blank:]]/,"")}1' input.xml > output.xml

  • You don't need to specify ,$0 (the entire line of input) in gsub, because gsub works on the entire line by default.

  • The 1 at the end is an always-true pattern with no action, so awk performs the default action (i.e., print) for every line.
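
To see what this filter keeps and drops, here is a small sketch on a made-up line rather than on the XML file. It assumes an awk with POSIX character class support (gawk, BSD awk, or a recent mawk) and uses only ASCII input; the sample bytes are hypothetical. Note that the filter removes every non-printable character, not only NUL, so the carriage return of a CRLF line ending is dropped as well.

```shell
# Made-up input: "a", tab, "b", ESC (octal 033), "c", carriage return, newline.
cleaned=$(printf 'a\tb\033c\r\n' |
    awk '{ gsub(/[^[:print:][:blank:]]/, "") } 1')

# The tab survives because it is in [:blank:]; ESC and CR are removed.
printf '%s\n' "$cleaned" | od -c
```

If only NUL bytes should be removed and everything else left untouched, a narrower tool such as tr -d '\000' may be a better fit than this broad filter.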






edited May 9 at 4:48
answered May 9 at 4:22
Gaultheria