sed escaped character not matching in large file

I have a large (~180 MB) XML file with some bad characters in it, for example:

<Data ss:Type="String">7402953^@</Data>

The ^@ part should be removed. This is supposed to do the job:

sed -i 's/\^@//g' /tmp/large.xml

but for some reason it doesn't work as expected when the string is in my large XML file. If the file is only a few KB in size, sed works perfectly.

It looks like a bug, but I doubt it is, because the task is so simple. Am I doing something wrong?

asked May 8 at 20:38

dMedia

Why do you say it isn't working? Can you still see ^@ in the file after running the command? If so, try to isolate one example of a ^@ that was not replaced and take a small slice of the file containing it. Then make sure it really is ^@; you probably have something in the middle. You could use xxd to be sure.
– matsib.dev
May 8 at 20:57

A null character (if that's what it is; they appear like that) can indicate a write error, and therefore an unspecified amount (possibly more than one line) of missing data. Use something like grep -C 10 -Pa '\x00' large.xml to test for null characters, and look at the surrounding lines of context; if there are unusually long lines or "jumps" in the file, you might have lost data when the file was created.
– Gaultheria
May 8 at 22:03

2 Answers

accepted

Judging by your question alone (there are no examples to check), I would say that the ^@ in the big file is not actually the two characters ^ and @, but a single unprintable character that is displayed that way.

You can type that unprintable character in the terminal with Ctrl+V followed by Ctrl+2.

Use that in sed instead of the characters ^ and @ and it should be fine.

Also drop the escaping, because the unprintable character doesn't need it.
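The difference between a real NUL byte and the two literal characters ^ and @ can be sketched as follows. This is only an illustration under some assumptions: GNU sed (for `-i` and the `\x00` escape), bash, and a throwaway sample file created with `mktemp`; the variable names are hypothetical. Since command-line arguments are NUL-terminated strings, a typed NUL cannot be passed inside the sed expression itself, so the sketch uses the escape form instead.

```shell
#!/usr/bin/env bash
# Sketch: a NUL byte (often displayed as ^@) versus the literal characters ^ and @.
# Assumes GNU sed; the sample file and variable names are hypothetical.
set -eu

sample=$(mktemp)

# Line 1 holds a real NUL byte; line 2 holds a literal caret followed by an at-sign.
printf '<Data>7402953\000</Data>\n<Data>7402953^@</Data>\n' > "$sample"

# Count NUL bytes: delete every byte that is not NUL, then count what remains.
nul_before=$(tr -cd '\000' < "$sample" | wc -c)

# The escaped caret matches only the literal two-character sequence.
sed -i 's/\^@//g' "$sample"
nul_after_literal=$(tr -cd '\000' < "$sample" | wc -c)

# GNU sed interprets \x00 as the NUL byte itself.
sed -i 's/\x00//g' "$sample"
nul_after_escape=$(tr -cd '\000' < "$sample" | wc -c)

echo "NUL bytes: before=$nul_before after_literal=$nul_after_literal after_escape=$nul_after_escape"

rm -f "$sample"
```

Where GNU sed is not available, `tr -d '\000' < input.xml > output.xml` is a portable way to drop NUL bytes, at the cost of writing a second file.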

edited May 8 at 21:07

answered May 8 at 20:54

Iskustvo

Yes, it was actually \x00 that was displayed as ^@ (and it was copied into my smaller test file as those two literal characters). Typing the unprintable character didn't work for me (or I didn't understand how to press Ctrl + v + Ctrl + 2), so I just used sed -i 's/\x00//g' /tmp/large.xml and now it works as expected.
– dMedia
May 9 at 7:21

I am not sure why Ctrl+V followed by Ctrl+2 did not work. That is the default behavior in bash and zsh, so I assumed it was the same in every shell. Maybe you changed the keybindings, or maybe you are using some other shell that doesn't have the same keybindings. Anyway, I'm glad you solved your issue, so this is not that important now :D
– Iskustvo
May 9 at 8:14

awk

If a solution using awk is acceptable, this will remove all non-printable characters.

This works in GNU awk (Linux) and BSD awk (Mac).

awk '{ gsub(/[^[:print:][:blank:]]/,"",$0) ; print $0 }' input.xml > output.xml

gsub(/[^[:print:][:blank:]]/,"",$0)
From each line of input, remove any unwanted characters.
- [:print:]
  Any printable character.
- [:blank:]
  Space or tab.
- [^[:print:][:blank:]]
  Any character not included in those two classes.

print $0
Print each line of input.

> output.xml
Save the output to a file instead of printing it to the screen.

Do the same thing with fewer keystrokes (it's just a little harder to read):

awk '{ gsub(/[^[:print:][:blank:]]/,"") } 1' input.xml > output.xml

You don't need to specify ,$0 (the entire line of input) in gsub if you're examining the entire line.

The 1 at the end means "now do the default action (i.e., print) for every line".
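To try the filter on something small before running it on a 180 MB file, a minimal check might look like the sketch below. The sample line, the temporary files, and the variable names are all hypothetical, and a bell character (\007) stands in for the stray non-printable byte; the expectation is that the tab survives and the bell character does not.

```shell
#!/usr/bin/env bash
# Sketch: run the awk filter on a tiny, hypothetical sample and compare sizes.
set -eu

input=$(mktemp)
output=$(mktemp)

# The sample line holds a tab (blank, should be kept) and a bell character,
# \007 (non-printable, should be removed).
printf '<Data>\t7402953\007</Data>\n' > "$input"

# Same filter as above: strip anything that is neither printable nor blank.
awk '{ gsub(/[^[:print:][:blank:]]/, ""); print }' "$input" > "$output"

# Compare byte counts; the difference is the number of bytes removed.
bytes_in=$(wc -c < "$input")
bytes_out=$(wc -c < "$output")
echo "bytes: in=$bytes_in out=$bytes_out"

result=$(cat "$output")
rm -f "$input" "$output"
```

Which bytes count as printable depends on the current locale, so for XML containing non-ASCII text it is worth checking the byte-count difference against the number of characters you expected to lose.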

edited May 9 at 4:48

answered May 9 at 4:22

Gaultheria



