Trying to find files that contain only NULs, but getting some others
The files I am trying to find/list are:
- Any size (0 bytes accepted)
- Consist only of ASCII NUL characters (0x00)
- If there are any characters other than 0x00, the file shouldn't be listed.
The command I have now is:
grep -RLP '[^\x00]' .
This works, but it also lists a file that consists of only two bytes: 0xFF, 0xFE. I don't know why.
Is there a better command to find such files?
command-line text-processing
asked Aug 16 at 22:27 by pbies
edited Aug 17 at 1:32 by muru
Note that the default system encoding for Ubuntu is UTF-8, not ASCII, though the two are identical up to byte 0x7F.
- wjandrea, Aug 17 at 0:12
2 Answers
In short, what is happening here is that grep is trying to interpret your file as Unicode data. The sequence 0xFF, 0xFE is a Byte Order Mark for UTF-16.
(In my testing, other sequences involving two 0xFF's or two 0xFE's, etc. also failed to match the '[^\x00]' regex, since even when interpreted as UTF-8 these bytes would not be considered valid characters.)
Using a locale that doesn't use Unicode for character types should fix this, which you can accomplish by setting the LC_CTYPE environment variable. Use the C locale to force ASCII encoding (so no Unicode enabled):
LC_CTYPE=C grep -RLP '[^\x00]' .
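The effect of the character encoding can be illustrated outside grep. The following Python sketch is only an analogy and does not reproduce grep's or PCRE's internals: it shows that the two bytes 0xFF, 0xFE cannot be decoded as UTF-8, while a byte-oriented pattern (comparable to what the C locale gives you) sees two ordinary non-NUL bytes.

```python
import re

data = b"\xff\xfe"  # the two-byte file from the question

# Interpreted as UTF-8, the bytes do not form any valid character.
try:
    data.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Interpreted byte by byte (analogous to LC_CTYPE=C), both bytes are
# simply "not 0x00", so the negated character class matches.
bytewise_match = re.search(rb"[^\x00]", data) is not None

print(decodes_as_utf8)  # False
print(bytewise_match)   # True
```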
UPDATE: As pointed out by @steeldriver, grep still works line by line, so files containing only NUL bytes and newlines will still be listed.
@DavidFoerster's solution using grep's -z option does a good job of solving this problem: using NUL bytes as the line separators does the trick.
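To see why line-based matching lets such files through, here is a small Python sketch that imitates the two record separators (newline by default, NUL with -z). The helper name has_match is made up for illustration, and the logic is a simplification of what grep does.

```python
import re

def has_match(data: bytes, sep: bytes) -> bool:
    """Return True if any record (split on sep) contains a non-NUL byte."""
    return any(re.search(rb"[^\x00]", rec) for rec in data.split(sep))

sample = b"\x00\n\x00\x00\n"  # a file holding only NULs and newlines

# Newline-separated records: the newlines are consumed as separators,
# so no record contains a non-NUL byte and -L would list the file.
print(has_match(sample, b"\n"))    # False

# NUL-separated records: the newlines now sit inside records and match.
print(has_match(sample, b"\x00"))  # True
```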
Alternatively, I came up with a short Python 3 script (allzeroes.py) to check whether the file's contents are all zeroes:
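#!/usr/bin/python3
import sys

# Expect exactly one argument: the file to check.
assert len(sys.argv) == 2
with open(sys.argv[1], 'rb') as f:
    # Read in 4 KiB blocks until EOF (read() returns b'').
    for block in iter(lambda: f.read(4096), b''):
        # any() is true as soon as a block holds a non-zero byte.
        if any(block):
            sys.exit(1)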
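You can then use it with find to locate all matches recursively (the script must be executable and on your PATH, or be referenced by its path, e.g. ./allzeroes.py):
$ find . -type f -exec allzeroes.py {} \; -print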
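If starting one Python process per file is too slow for a large tree, the same check can be folded into a single script that walks the directory itself. This is a minimal sketch under that assumption, not part of the original script; the function names is_all_zeroes and find_all_zero_files are made up for illustration.

```python
import os

def is_all_zeroes(path, blocksize=4096):
    """Return True if the file at path contains no byte other than 0x00."""
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            if any(block):
                return False
    return True  # also reached for empty files

def find_all_zero_files(root):
    """Yield regular files under root that are empty or consist only of NULs."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files, similar to find's -type f.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                zero = is_all_zeroes(path)
            except OSError:
                # Unreadable file: skip it (an assumption; adjust as needed).
                continue
            if zero:
                yield path
```

Calling it as `for p in find_all_zero_files("."): print(p)` would print one matching path per line, much like the find command above.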
I hope that helps.
answered Aug 16 at 23:23 by Filipe Brandenburger (accepted answer), edited Aug 17 at 16:16
+1, although since grep is line-based, this will also output files that consist entirely of newlines. You may be able to work around that by specifying null-terminated mode using -z (although that will slurp any regular text files wholly into memory). Also, I don't think -P is required here?
- steeldriver, Aug 17 at 1:23
You can abuse grep's alternative null-terminated line mode and thus search for files that contain only empty lines:
grep -L -z -e . ...
Replace ... with the file set that you want to scan (here: -R .).
Explanation
- -z, --null-data: Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. [1]
- -e .: Use . as the search pattern, i.e. match any character.
- -L, --files-without-match: Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match. [1]
Test case
Set-up:
: > empty
truncate -s 100 zero
printf '%s' foo bar > foobar
Run test:
$ grep -L -z -e . empty zero foobar
empty
zero
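The same three test files can be used to sketch the underlying logic in Python. This is only an illustration of the idea (a file is listed exactly when every NUL-terminated record is empty), not a reimplementation of grep; the helper name only_empty_records is hypothetical.

```python
def only_empty_records(data: bytes) -> bool:
    """True if every NUL-separated record is empty, i.e. "." never matches."""
    # Splitting on NUL mirrors -z; a non-empty record is what "." would match.
    return not any(data.split(b"\x00"))

# In-memory stand-ins for the files created in the set-up above.
files = {
    "empty": b"",
    "zero": b"\x00" * 100,
    "foobar": b"foobar",
}

listed = [name for name, content in files.items() if only_empty_records(content)]
print(listed)  # ['empty', 'zero']
```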
[1] From the grep(1) manual page.
answered Aug 17 at 9:18 by David Foerster