Trying to find files that contain only NULs, but getting some others

The files I am trying to find/list are:

  • Any size (0 bytes accepted)
  • Consist only of ASCII NUL characters (0x00)
  • If there are any characters other than 0x00, the file shouldn't be listed.

The command I have now is:

grep -RLP '[^\x00]' .

This works, but it also lists a file that consists of only two bytes: 0xFF, 0xFE. I don't know why.

Is there a better command to find such files?










  • Note the default system encoding for Ubuntu is UTF-8, not ASCII. Though up to byte 0x7F, they're identical.
    – wjandrea
    Aug 17 at 0:12














command-line text-processing






edited Aug 17 at 1:32 by muru
asked Aug 16 at 22:27 by pbies











2 Answers



accepted










In short, grep is trying to interpret your file as Unicode text. The sequence 0xFF, 0xFE is the byte order mark (BOM) for little-endian UTF-16, and it is not a valid byte sequence in UTF-8.

(In my testing, other sequences of two 0xFF or 0xFE bytes also failed to match the '[^\x00]' regex, since in UTF-8 these bytes do not form valid characters.)
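
As a quick illustration of why these two bytes are special, here is a small Python sketch (not part of the original answer). It checks that 0xFF 0xFE equals the UTF-16 little-endian BOM and that a strict UTF-8 decoder rejects the same bytes. Python's decoder is only a stand-in here; grep's internal handling may differ in detail.

```python
import codecs

data = b"\xff\xfe"

# The two bytes are exactly the UTF-16 little-endian byte order mark.
is_utf16_le_bom = (data == codecs.BOM_UTF16_LE)

# Neither 0xFF nor 0xFE can appear in well-formed UTF-8,
# so a strict UTF-8 decoder rejects the sequence.
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(is_utf16_le_bom, valid_utf8)  # True False
```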



Using a locale that doesn't use Unicode for character types should fix this, which you can do by setting the LC_CTYPE environment variable. Use the C locale to make grep treat the input as single bytes rather than Unicode characters (note that LC_ALL, if set, overrides LC_CTYPE):

LC_CTYPE=C grep -RLP '[^\x00]' .



UPDATE: As pointed out by @steeldriver, grep still works line by line, so files that contain only NUL bytes and newlines will still be listed.

@DavidFoerster's solution using grep's -z option solves this problem by using NUL bytes as the line separators.
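
The line-based pitfall can be reproduced with a rough byte-level model in Python. This is only a sketch: the function name listed_by_line_mode is hypothetical, and it assumes the C locale, where every byte counts as one character.

```python
import re

def listed_by_line_mode(data: bytes) -> bool:
    # Rough model of grep -L with the pattern [^\x00]: the file is listed
    # when no newline-delimited line contains a non-NUL byte.
    return not any(re.search(rb"[^\x00]", line) for line in data.split(b"\n"))

nul_and_newlines = b"\x00\x00\n\x00\n"

print(listed_by_line_mode(nul_and_newlines))  # True: the file would be listed
print(all(b == 0 for b in nul_and_newlines))  # False: the newlines are not NULs
```

The newline bytes are consumed as line terminators and are never tested against the pattern, which is why such a file slips through.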



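Alternatively, I came up with a short Python 3 script (allzeroes.py) that checks whether a file's contents are all zeroes. It exits with status 0 if they are and 1 otherwise:

#!/usr/bin/python3
import sys

# Expect exactly one argument: the path of the file to check.
assert len(sys.argv) == 2

with open(sys.argv[1], 'rb') as f:
    # Read in 4096-byte blocks so large files are not loaded whole.
    for block in iter(lambda: f.read(4096), b''):
        # any() is true as soon as one byte in the block is non-zero.
        if any(block):
            sys.exit(1)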


You can then use it with find to locate all matching files recursively (this assumes allzeroes.py is executable and on your PATH):

$ find . -type f -exec allzeroes.py {} \; -print
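
If you would rather avoid starting one process per file, the same check can be folded into a single Python program. The following is a minimal sketch, not the original author's code: the names is_all_nul and find_all_nul_files are hypothetical, and skipping symlinks and unreadable files is my own assumption about the desired behavior (it roughly mirrors find's -type f).

```python
import os

def is_all_nul(path, block_size=4096):
    """Return True if the file is empty or contains only 0x00 bytes."""
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            if any(block):  # a non-zero byte was found
                return False
    return True

def find_all_nul_files(root="."):
    """Yield paths of regular files under root that contain only NULs."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if is_all_nul(path):
                    yield path
            except OSError:
                # Unreadable file: skip it rather than abort the scan.
                continue
```

Calling `for path in find_all_nul_files("."): print(path)` prints one matching path per line, much like the find command above.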


I hope that helps.






    +1 although since grep is line-based, this will also output files that consist entirely of newlines - you may be able to work around that by specifying null-terminated mode using -z (although that will slurp any regular text files wholly into memory). Also I don't think -P is required here?
    – steeldriver
    Aug 17 at 1:23






























You can abuse grep’s alternative null-terminated line mode and thus search for files that contain only empty lines:



grep -L -z -e . ...


Replace ... with the file set that you want to scan (here: -R .).



Explanation




  • -z, --null-data – Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. [1]

  • -e . – Use . as the search pattern, i.e. match any character.

  • -L, --files-without-match – Suppress normal output; instead print the name of each input file from which no output would normally have been printed. The scanning will stop on the first match. [1]
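
The interplay of the three options can be modelled at the byte level in a few lines of Python. This is a sketch for illustration only: the function name listed_by_null_data_mode is hypothetical, and it deliberately ignores locale-dependent handling of invalid byte sequences.

```python
def listed_by_null_data_mode(data: bytes) -> bool:
    # -z: split the input into records at NUL bytes instead of newlines.
    records = data.split(b"\x00")
    # -e . matches any record that holds at least one character;
    # -L lists the file only when no record matches.
    return all(len(record) == 0 for record in records)

print(listed_by_null_data_mode(b""))            # True  (empty file)
print(listed_by_null_data_mode(b"\x00" * 100))  # True  (100 NUL bytes)
print(listed_by_null_data_mode(b"foobar"))      # False (ordinary text)
print(listed_by_null_data_mode(b"\x00\n\x00"))  # False (NULs plus a newline)
```

Because a newline is an ordinary character in this mode, a file of NULs and newlines is no longer listed.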

Test case



Set-up:



: > empty
truncate -s 100 zero
printf '%s' foo bar > foobar


Run test:



$ grep -L -z -e . empty zero foobar
empty
zero



[1] From the grep(1) manual page.






answered Aug 16 at 23:23 by Filipe Brandenburger (accepted answer), edited Aug 17 at 16:16







answered Aug 17 at 9:18 by David Foerster (grep -z answer)



























             
