Trying to find files that contain only NULs, but getting some others

The files I am trying to find/list are:

Any size (0 bytes accepted)

Consist only of ASCII NUL characters (0x00)

If there are any characters other than 0x00, the file shouldn't be listed.

The command I have now is:

grep -RLP '[^\x00]' .

This works, but it also lists a file that consists of only two bytes: 0xFF, 0xFE. I don't know why.

Is there a better command to find such files?

edited Aug 17 at 1:32 by muru

asked Aug 16 at 22:27 by pbies

Note that the default system encoding for Ubuntu is UTF-8, not ASCII, though the two are identical up to byte 0x7F.
- wjandrea, Aug 17 at 0:12

Tags: command-line, text-processing

2 Answers

Accepted answer (score 8)

In short, grep is interpreting your file as Unicode text, because your locale uses UTF-8. The sequence 0xFF, 0xFE is a byte order mark (BOM) for little-endian UTF-16, and neither byte can occur in valid UTF-8.

(In my testing, other sequences such as two 0xFF bytes or two 0xFE bytes also failed to match the '[^\x00]' regex, since they are not valid characters when the data is read as UTF-8.)

Using a locale that doesn't use Unicode for character types should fix this; you can select one by setting the LC_CTYPE environment variable (if LC_ALL is set, it takes precedence, so set LC_ALL instead). Use the C locale to force byte-oriented ASCII handling, with no Unicode decoding:

LC_CTYPE=C grep -RLP '[^\x00]' .
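To make the locale effect concrete, here is a minimal Python sketch. It is not part of the original answer and does not call grep; the function names are made up for illustration. It only shows that the two bytes 0xFF, 0xFE cannot be decoded as UTF-8, while a byte-oriented view (comparable in spirit to the C locale) sees two bytes that are not NUL.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a valid UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def has_non_nul_byte(raw: bytes) -> bool:
    """Byte-oriented check: is any byte different from 0x00?"""
    return any(raw)


data = b"\xff\xfe"  # the two bytes from the question

print(decodes_as_utf8(data))   # False: no valid UTF-8 character here
print(has_non_nul_byte(data))  # True: both bytes are non-NUL
```

A matcher that only considers valid UTF-8 characters has nothing to match in such a file, whereas a byte-level check flags it immediately.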

UPDATE: As pointed out by @steeldriver, grep still works line by line, so files containing only NUL bytes and newlines will still be listed.

@DavidFoerster's solution using grep's -z option solves this problem: using NUL bytes as the line separators does the trick.

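Alternatively, I came up with a short Python 3 script (allzeroes.py) that checks whether a file's contents are all zeroes:

#!/usr/bin/python3
# Exit with status 0 if the file is empty or all NUL bytes, 1 otherwise.
import sys
assert len(sys.argv) == 2
with open(sys.argv[1], 'rb') as f:
    # Read in 4096-byte blocks until read() returns b'' at end of file.
    for block in iter(lambda: f.read(4096), b''):
        # any() is true as soon as one byte in the block is non-zero.
        if any(block):
            sys.exit(1)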

You can use it with find to locate all matching files recursively (assuming allzeroes.py is executable and on your PATH):

$ find . -type f -exec allzeroes.py {} \; -print
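The find invocation above starts one Python process per file. As a rough alternative, the directory walk can be done inside Python itself. The sketch below reuses the same block-reading logic as allzeroes.py; the function names is_all_nul and find_all_nul are hypothetical, and silently skipping unreadable files is an assumption rather than something the original answer specifies.

```python
import os


def is_all_nul(path, block_size=4096):
    """Return True if the file is empty or contains only 0x00 bytes."""
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            if any(block):
                return False
    return True


def find_all_nul(root):
    """Yield regular files under root that contain only NUL bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Like find's -type f, skip symlinks and non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if is_all_nul(path):
                    yield path
            except OSError:
                # Assumption: unreadable files are skipped, not reported.
                continue
```

Calling `for p in find_all_nul("."): print(p)` prints the matching paths, one per line.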

I hope that helps.

edited Aug 17 at 16:16

answered Aug 16 at 23:23 by Filipe Brandenburger

+1, although since grep is line-based, this will also output files that consist entirely of newlines. You may be able to work around that by specifying null-terminated mode using -z (although that will slurp any regular text files wholly into memory). Also, I don't think -P is required here?
- steeldriver, Aug 17 at 1:23

Answer (score 2)

You can abuse grep's alternative null-terminated line mode and thus search for files that contain only empty lines:

grep -L -z -e . ...

Replace ... with the file set that you want to scan (here: -R .).

Explanation

-z, --null-data: Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline.¹

-e .: Use . as the search pattern, i.e. match any character.

-L, --files-without-match: Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match.¹

Test case

Set-up:

: > empty
truncate -s 100 zero
printf '%s' foo bar > foobar

Run test:

$ grep -L -z -e . empty zero foobar
empty
zero

¹ From the grep(1) manual page.
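The same test case can be mirrored in Python to see why the approach works. This is only a byte-level model of the idea behind grep -L -z -e . (split the data on NUL and require every resulting record to be empty); it ignores grep's locale-dependent matching, and the helper name only_empty_nul_records is made up for this sketch.

```python
import os
import tempfile


def only_empty_nul_records(raw: bytes) -> bool:
    """Split on NUL bytes and check that every record is empty."""
    return all(record == b"" for record in raw.split(b"\x00"))


# Same three files as in the test case: empty, 100 NUL bytes, "foobar".
samples = {
    "empty": b"",
    "zero": b"\x00" * 100,
    "foobar": b"foobar",
}

listed = []
with tempfile.TemporaryDirectory() as tmp:
    for name, content in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as f:
            f.write(content)
        with open(path, "rb") as f:
            if only_empty_nul_records(f.read()):
                listed.append(name)

print(listed)  # ['empty', 'zero']
```

Any byte other than NUL, including a newline, ends up inside a non-empty record, so such a file is not listed.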

answered Aug 17 at 9:18 by David Foerster
