Fast way to get checksum for all files within a huge nested directory

We have a requirement to screen user-uploaded content. However, I've noticed that most of our user-uploaded content actually originates from our own system: for example, someone downloads a PDF from our document library, renames it to suit their needs, and re-uploads it into their "custom content" section, which can be shared with other users.



I'd like to mark these files as trusted without anyone having to look at them, and I thought I could do this using file size and some kind of checksum, e.g.:

  • for a given new file

    • find all files in our resource library folder with the same file extension and the same file size

    • for all the ones with the same extension and size, do some kind of checksum comparison.

    • If we find a match, then declare the new file as trusted.
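
The steps above could be sketched roughly as follows. This is only an illustration, assuming Python and SHA-256 as the checksum (the question names neither); `file_digest` and `find_match` are hypothetical names. It still walks the whole library for every new file, so it shows the comparison logic rather than a fast solution.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_match(new_file: Path, library_root: Path) -> Optional[Path]:
    """Return a library file with identical content, or None if there is none."""
    size = new_file.stat().st_size
    suffix = new_file.suffix.lower()
    new_digest = None  # computed lazily, only once a candidate turns up
    for candidate in library_root.rglob("*"):
        if not candidate.is_file():
            continue
        # Cheap filters first: extension and size need no file reads.
        if candidate.suffix.lower() != suffix:
            continue
        if candidate.stat().st_size != size:
            continue
        if new_digest is None:
            new_digest = file_digest(new_file)
        # Expensive step last: hash only the surviving candidates.
        if file_digest(candidate) == new_digest:
            return candidate
    return None
```

A new file would be declared trusted when `find_match(upload, library_root)` returns a path rather than `None`.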


Now, our resource library directory is 132 GB - quite large. So, any solution that involves looking at every file in there (even every file with the same extension) is going to be quite slow.



It seems like the sensible thing to do is keep some kind of database (not necessarily using a literal DBMS) of file checksums, which is automatically updated when the contents change, or perhaps just run with a scheduler once a day. Then, for any given new file, I can get the checksum and look it up in the db.
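
One way such a checksum database might look is sketched below. It is not a finished tool: it assumes Python's built-in `sqlite3` module and SHA-256, and the table layout and the names `open_index`, `refresh_index` and `is_trusted` are made up for illustration. A scheduled refresh re-hashes only files whose size or modification time has changed, so after the first full scan a daily run mostly just walks the directory tree.

```python
import hashlib
import os
import sqlite3


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def open_index(db_path):
    """Open (or create) the index database; the schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, sha256 TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sha256 ON files (sha256)")
    return conn


def refresh_index(conn, library_root):
    """Hash new or changed files and drop rows for files that are gone."""
    known = {
        path: (size, mtime_ns)
        for path, size, mtime_ns in conn.execute(
            "SELECT path, size, mtime_ns FROM files"
        )
    }
    seen = set()
    for dirpath, _dirs, names in os.walk(library_root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # e.g. a broken symlink or a file removed mid-scan
            seen.add(path)
            if known.get(path) == (st.st_size, st.st_mtime_ns):
                continue  # unchanged since the last scan, no need to re-hash
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (path, st.st_size, st.st_mtime_ns, sha256_of(path)),
            )
    for path in known.keys() - seen:
        conn.execute("DELETE FROM files WHERE path = ?", (path,))
    conn.commit()


def is_trusted(conn, upload_path):
    """True if some indexed library file has the same checksum as the upload."""
    row = conn.execute(
        "SELECT 1 FROM files WHERE sha256 = ? LIMIT 1",
        (sha256_of(upload_path),),
    ).fetchone()
    return row is not None
```

With this layout, `refresh_index(conn, library_root)` would be the scheduled job and `is_trusted(conn, upload_path)` the per-upload lookup.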



This feels like it must be a solved problem. Does anyone have any ideas?



thanks, Max







edited Jul 5 at 11:59 by Tomasz
asked Jul 5 at 11:47 by Max Williams

2 Answers






You could look at file integrity monitoring software.

These tools are designed mainly to detect the introduction of rootkits to filesystems, but at their core they keep a database of files with metadata (checksums, hashes) and monitor a set of directories for files that have been changed or added, which is what you want.

The oldest one I've heard of is Tripwire; an open-source alternative called AIDE was created later. A more recent one is OSSEC, recommended in https://serverfault.com/questions/141800/recommend-alternative-to-tripwire.







answered Jul 5 at 13:11 by tk421

This may be a solved problem, but it's too specific to have a standard tool in the Unix/Linux world. Your question contains a large part of the answer: you need a database, or more precisely, an index of checksums. You also need a component that adds and updates entries and checks new files against this index. I think you will have to implement it yourself, and the natural place for it is the upload mechanism (e.g. a web page).
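
As a rough idea of what the check inside the upload mechanism could look like, here is a small sketch. It assumes Python, SHA-256, and that the known checksums are available as a set of hex digests; `save_and_check` is a hypothetical name and not part of any framework. The upload is hashed while it is being written out, so the file is read only once.

```python
import hashlib
import io


def save_and_check(upload_stream, dest, known_digests, chunk_size=1 << 16):
    """Copy an upload to dest while hashing it.

    Returns True if the uploaded content matches a known checksum.
    """
    h = hashlib.sha256()
    for chunk in iter(lambda: upload_stream.read(chunk_size), b""):
        h.update(chunk)
        dest.write(chunk)
    return h.hexdigest() in known_digests


if __name__ == "__main__":
    # Stand-in for the real index: one known library document.
    known = {hashlib.sha256(b"library document").hexdigest()}
    print(save_and_check(io.BytesIO(b"library document"), io.BytesIO(), known))  # True
    print(save_and_check(io.BytesIO(b"something new"), io.BytesIO(), known))     # False
```

In a real upload handler the set would be replaced by a lookup in whatever index is used, and `dest` would be the file the upload is stored in.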







answered Jul 5 at 11:58 by Tomasz

• Thanks Tomasz. I thought that someone might have made a tool for "Is this file somewhere in this massive nested directory under a different name?"
  – Max Williams, Jul 5 at 12:29

• Do you have any suggestions for how to store an index of checksums, btw?
  – Max Williams, Jul 5 at 12:30

• @MaxWilliams As you wrote in the question, you need a database: file -> checksum. Build an index on the checksums and that's it. You also mentioned size and extension, but I'm not sure this is necessary or helpful. A checksum is enough to represent a file; all you need is a clever way of making use of it.
  – Tomasz, Jul 5 at 12:35















