sed escaped character not matching in large file
I have a large (~180 MB) XML file with some wrong characters in it, for example
<Data ss:Type="String">7402953^@</Data>
The ^@ part should be removed. The job is supposed to be done with
sed -i 's/\^@//g' /tmp/large.xml
but for some unknown reason it doesn't work when the string is in my large XML file. If the file is only a few KB in size, sed works perfectly.
It looks like a bug, but I doubt it is one, because the task is so basic. Am I doing something wrong?
linux sed regular-expression
Why do you say it is not working? Can you still see ^@ in the file after running the command? If so, try to isolate one example of a ^@ that was not replaced, take a small slice of the file containing it, and make sure it really is ^@. You probably have something in between; you could use xxd to be sure.
- matsib.dev, May 8 at 20:57
A null character (if that's what it is; they appear like that) can indicate a write error, and therefore an unspecified amount (possibly more than one line) of missing data. Use something like
grep -C 10 -Pa '\x00' large.xml
to test for null characters, and have a look at the surrounding lines of context; if there are unusually long lines or "jumps" in the file, you might have lost data during file creation.
- Gaultheria, May 8 at 22:03
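To make the check suggested in these comments concrete, here is a minimal sketch that builds a tiny sample file (the temporary file and its contents are made up for illustration, not taken from the real XML), counts the NUL bytes in it, and dumps the raw bytes. It uses od -c in place of xxd only because od is more likely to be installed everywhere; the idea is the same.

```shell
# Build a small sample with an embedded NUL byte (printf reads \000 as octal).
sample=$(mktemp)
printf '<Data ss:Type="String">7402953\000</Data>\n' > "$sample"

# Count NUL bytes: tr -cd keeps only NUL, wc -c counts the bytes that remain.
nul_count=$(tr -cd '\000' < "$sample" | wc -c | tr -d ' ')
echo "NUL bytes found: $nul_count"

# Dump the raw bytes; a NUL byte is shown as \0 in od -c output.
od -c "$sample" | head -n 5

rm -f "$sample"
```

If the count is zero for the real file, the ^@ on screen is probably something other than a NUL byte and the hex dump should show what it actually is.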
asked May 8 at 20:38 by dMedia
2 Answers
Accepted answer (5 votes)
Judging by your question (there are no examples), I would say that the ^@ in the big file is not actually the two characters ^ and @, but a single unprintable character.
You can input that unprintable character in the terminal with Ctrl + v followed by Ctrl + 2.
Use that in sed instead of the characters ^ and @ and it should be fine.
Also remove the backslash escape, because it is not needed for the unprintable character.
Yes, it was actually \x00 that was displayed as ^@ (and copied to the smaller test file as those two literal characters). Entering the unprintable character did not work for me (or I didn't understand how to press Ctrl + v followed by Ctrl + 2), so I just used
sed -i 's/\x00//g' /tmp/large.xml
and now it works as expected.
- dMedia, May 9 at 7:21
I am not sure why Ctrl + v followed by Ctrl + 2 did not work. That is the default behavior for bash and zsh, so I guessed it would be the same in every shell. Maybe you changed the keybindings, or maybe you are using another shell that doesn't have the same keybindings. Anyway, I'm glad you solved your issue, so this is not that important now :D
- Iskustvo, May 9 at 8:14
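One likely reason the typed character did not help: command-line arguments are NUL-terminated strings, so a literal NUL byte cannot be passed inside the sed script, whereas the \x00 escape is interpreted by sed itself. The sketch below compares the two patterns on a made-up sample file. It assumes GNU sed, since \xHH is a GNU extension, and adds tr -d as a portable alternative that was not part of the thread.

```shell
# Sample line with a real NUL byte where the terminal would show ^@.
sample=$(mktemp)
printf '<Data ss:Type="String">7402953\000</Data>\n' > "$sample"

# \^@ looks for a literal caret followed by an at sign. Neither is in the
# file, so nothing is removed and the byte count stays the same.
before=$(sed 's/\^@//g' "$sample" | wc -c | tr -d ' ')

# GNU sed reads \x00 as the NUL byte (an extension, not POSIX).
after_sed=$(sed 's/\x00//g' "$sample" | wc -c | tr -d ' ')

# Portable alternative: tr deletes every NUL byte.
after_tr=$(tr -d '\000' < "$sample" | wc -c | tr -d ' ')

echo "bytes: unchanged=$before sed=$after_sed tr=$after_tr"
rm -f "$sample"
```

Both cleaned results should be exactly one byte shorter than the unchanged one. The small test file in the question behaved differently because the ^@ had been copied into it as two ordinary characters.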
awk
If a solution using awk is acceptable, this will remove all non-printable characters.
This works in GNU awk (Linux) and BSD awk (Mac).
awk '{ gsub(/[^[:print:][:blank:]]/,"",$0) ; print $0 }' input.xml > output.xml
- gsub(/[^[:print:][:blank:]]/,"",$0)
  From each line of input, remove any unwanted characters.
  - [:print:]
    Any printable character.
  - [:blank:]
    Space or tab.
  - [^[:print:][:blank:]]
    Any character not included in those two classes.
- print $0
  Print each line of input.
- > output.xml
  Save the output to a file instead of printing it to the screen.
Do the same thing with fewer keystrokes (it's just a little harder to read):
awk '{gsub(/[^[:print:][:blank:]]/,"")}1' input.xml > output.xml
- You don't need to specify ,$0 (the entire line of input) in gsub if you're examining the entire line.
- The 1 at the end means "now do the default action (i.e., print) for every line".
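For comparison, the same character-class filter can be sketched with tr. This is an alternative to the awk command above, not something from the answer itself, and the sample data is invented. Because tr works on the whole byte stream rather than line by line, the newline has to be added to the set of characters to keep. Forcing LC_ALL=C is an assumption made here to get predictable byte-wise classes; it also means non-ASCII bytes are removed, which may not be wanted for UTF-8 XML.

```shell
# Sample line with a tab, a NUL byte (\000) and an SOH control byte (\001).
sample=$(mktemp)
printf 'id\t7402953\000\001 ok\n' > "$sample"

# -c complements the set and -d deletes: every byte that is not printable,
# not a blank (space or tab) and not a newline is removed.
# LC_ALL=C makes the classes byte-oriented; non-ASCII bytes are dropped too.
cleaned=$(LC_ALL=C tr -cd '[:print:][:blank:]\n' < "$sample")

printf '%s\n' "$cleaned"
rm -f "$sample"
```

The tab survives because it belongs to [:blank:], while the two control bytes are dropped.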
Accepted answer: answered May 8 at 20:54, edited May 8 at 21:07, by Iskustvo
awk answer: answered May 9 at 4:22, edited May 9 at 4:48, by Gaultheria