Checking identical files in Linux and deleting according to location

I use fdupes to find and delete identical files.



But I want to be able to say something like this ...



  • find all the files that are duplicated in directory A or its subdirectories

  • if there's a duplicated file in subdirs B and C then always delete the file in B

In other words, keep all the files in C that are not already in B. And note that the directory structures are not the same so rsync isn't useful here.



I don't think fdupes offers this functionality. I have to manually choose which to delete / keep for each pair.



So I was thinking of writing a quick Python script to do the same thing. But is there a quick system command I can call from Python that gives me some kind of unique ID for each file, as a reliable way of checking whether two files are identical? I'm thinking of something that doesn't involve loading the files into Python and hashing their contents.










      deduplication
















      asked Nov 29 at 12:16









      interstar





















          1 Answer
          accepted










          No, a hash is the only fast way to know whether multiple files match, but you can speed it up by comparing only files of the same size. You can also pick a fast hash such as MD5, as long as no one is deliberately trying to produce collisions. Tools such as git and ZFS already do this for you.
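A minimal Python sketch of the approach described above, assuming two directory trees are being compared: group files by size first, hash only the files whose size occurs in both trees, and report the files under the first tree whose content also exists in the second. The names `file_digest`, `files_by_size`, `duplicates_to_delete`, `delete_from` and `keep_in` are illustrative and not part of fdupes or any library. The sketch only returns candidates; deleting them is left to the caller.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_by_size(root):
    """Map file size -> list of regular-file paths found under root."""
    sizes = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                sizes[os.path.getsize(path)].append(path)
    return sizes


def duplicates_to_delete(delete_from, keep_in):
    """Return sorted paths under delete_from whose content also exists under keep_in."""
    keep_sizes = files_by_size(keep_in)
    doomed = []
    for size, candidates in files_by_size(delete_from).items():
        if size not in keep_sizes:
            continue  # no kept file has this size, so hashing is unnecessary
        kept_digests = {file_digest(p) for p in keep_sizes[size]}
        doomed.extend(p for p in candidates if file_digest(p) in kept_digests)
    return sorted(doomed)
```

Calling `duplicates_to_delete("A/B", "A/C")` would list the files in B that also exist in C; each path can be passed to `os.remove` once the list looks right. The two directories should not overlap, otherwise every file would match itself. MD5 is used for speed as suggested above; `hashlib.sha256` is a drop-in replacement if deliberate collisions are a concern.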



          Or just



          fdupes -r A B | grep B | xargs -I{} rm "{}"
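Note that the pipeline above removes every listed path that matches `B`, including sets where all the copies live in B. A hedged Python alternative, assuming fdupes prints each set of duplicates as one path per line with a blank line between sets: parse that output and select a path under the delete directory only when the same set also holds a copy somewhere else. `select_for_deletion` and its parameters are hypothetical names used for illustration.

```python
import os


def select_for_deletion(fdupes_output, delete_dir):
    """Pick paths under delete_dir from each duplicate set that has a copy elsewhere."""
    prefix = os.path.join(os.path.normpath(delete_dir), "")
    selected = []
    # Assumption: sets of identical files are separated by a blank line.
    for block in fdupes_output.strip().split("\n\n"):
        paths = [os.path.normpath(line) for line in block.splitlines() if line.strip()]
        inside = [p for p in paths if p.startswith(prefix)]
        outside = [p for p in paths if not p.startswith(prefix)]
        if inside and outside:  # at least one copy survives outside delete_dir
            selected.extend(inside)
    return selected
```

The output could be captured with `subprocess.run(["fdupes", "-r", "A"], capture_output=True, text=True).stdout`, and each selected path passed to `os.remove` after reviewing the list.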





                edited Nov 29 at 14:45

























                answered Nov 29 at 12:35









                user1133275
