How do I get the MD5 sum of a directory's contents as one sum?

The md5sum program does not provide checksums for directories. I want to get a single MD5 checksum for the entire contents of a directory, including files in sub-directories. That is, one combined checksum made out of all the files. Is there a way to do this?










      directory checksum hashsum
















      asked Apr 5 '12 at 19:48







      user17429



























          14 Answers













          The right way depends on exactly why you're asking:



          Option 1: Compare Data Only



          If you just need a hash of the tree's file contents, this will do the trick:



          $ find -s somedir -type f -exec md5sum {} \; | md5sum


          This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.



          Unfortunately, find -s only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:



          $ find somedir -type f -exec md5sum {} \; | sort -k 2 | md5sum


          We've replaced find -s with a call to sort. The -k 2 bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort's reckoning.



          There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because it'll look like multiple lines to the sort call. The find -s variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find.



          In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain the directory listings in a stable, predictable order; you might not realize this from using ls and such, which silently sort the directory contents for you. find without -s or a sort call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.



          You might need to change the md5sum commands to md5 or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the sort command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum program for this because its output doesn't include the file name.



          This method is somewhat inefficient, calling md5sum N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
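
          For readers who prefer a script over a pipeline, here is a minimal Python sketch of the same idea: hash every regular file, sort the results by path, then hash that list. The function names file_md5 and tree_md5 are made up for illustration, and paths are taken relative to the root, so the value will not match the output of the shell commands above.

```python
import hashlib
import os


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_md5(root):
    """Hash a sorted list of 'hash  relative/path' lines, one per regular file."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file
            if os.path.isfile(full) and not os.path.islink(full):
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                lines.append((rel, file_md5(full)))
    lines.sort()  # fixed order, independent of the filesystem's listing order
    combined = hashlib.md5()
    for rel, digest in lines:
        combined.update(f"{digest}  {rel}\n".encode("utf-8", "surrogateescape"))
    return combined.hexdigest()
```

          Because the paths are relative, two copies of a tree in different locations should give the same value, which is the point Abbafei raises in the comments.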



          Option 2: Compare Data and Metadata



          If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar to pack the directory contents up for you, then send it to md5sum:



          $ tar -cf - somedir | md5sum


          Because tar also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.



          This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.



          As with the find-based method above, tar processes file names in the order the underlying filesystem returns them, so a change in that order alone changes the hash. It may well be that in your application, you can be sure this won't happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)



          If you find yourself getting false positives, I'd recommend going with the find | cpio option in Gilles' answer.
























          • I think it is best to navigate to the directory being compared and use find . instead of find somedir. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
            – Abbafei
            Jun 24 '14 at 6:50











          • Should we sort the files too?
            – CMCDragonkai
            Jan 19 '16 at 2:52










          • @CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not because part of the emphasized anything in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
            – Warren Young
            Jan 19 '16 at 3:45










          • @WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
            – Robin Winslow
            Aug 17 '16 at 7:51










          • Option 1 alternative: find somedir -type f -exec sh -c "openssl dgst -sha1 -binary {} | xxd -p" \; | sort | openssl dgst -sha1 to ignore all filenames (should work with newlines)
            – windm
            Oct 22 '17 at 9:50






























          The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
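
          To make the "unambiguous" requirement concrete, here is a small Python sketch. It is only an illustration: the two in-memory trees and the function names naive_digest and framed_digest are invented for the example. Hashing concatenated contents loses the file boundaries, so two different trees collide, while prefixing each name and content with its length keeps them apart.

```python
import hashlib


def naive_digest(files):
    """Hash the concatenated contents only; file boundaries are lost."""
    h = hashlib.sha256()
    for _name, data in sorted(files.items()):
        h.update(data)
    return h.hexdigest()


def framed_digest(files):
    """Hash name and length before each content, so boundaries are explicit."""
    h = hashlib.sha256()
    for name, data in sorted(files.items()):
        encoded = name.encode("utf-8")
        h.update(len(encoded).to_bytes(8, "big") + encoded)
        h.update(len(data).to_bytes(8, "big") + data)
    return h.hexdigest()


# Two different trees whose concatenated contents are both b"helloworld"
tree_a = {"a": b"hello", "b": b"world"}
tree_b = {"a": b"hell", "b": b"oworld"}

print(naive_digest(tree_a) == naive_digest(tree_b))    # True: ambiguous
print(framed_digest(tree_a) == framed_digest(tree_b))  # False
```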



          Data and metadata



          Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is not deterministic, because its representation depends on the order in which the files are stored, and if applicable on the compression.



          A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run find | sort to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX pax, GNU tar and cpio:



          find . | LC_ALL=C sort | pax -w -d | md5sum
          find . | LC_ALL=C sort | tar -cf - --no-recursion -T - | md5sum
          find . | LC_ALL=C sort | cpio -o | md5sum
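
          The same "sort the names, then archive without recursion" idea can be sketched with Python's standard tarfile module. This is an illustrative sketch, not a replacement for the commands above: HashWriter and archive_digest are names I made up, SHA-256 is used instead of MD5, and the digest will differ from what pax, tar or cpio produce because the archive bytes differ.

```python
import hashlib
import os
import tarfile


class HashWriter:
    """File-like sink that feeds everything written to it into a hash."""

    def __init__(self, digest):
        self.digest = digest

    def write(self, data):
        self.digest.update(data)
        return len(data)


def archive_digest(root):
    """SHA-256 of a tar stream of `root`, with members added in sorted order."""
    names = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            names.append(os.path.relpath(os.path.join(dirpath, name), root))
    digest = hashlib.sha256()
    # "w|" writes an uncompressed tar stream to a non-seekable file object
    with tarfile.open(fileobj=HashWriter(digest), mode="w|") as tar:
        for rel in sorted(names):
            # recursive=False: each directory entry is added exactly once
            tar.add(os.path.join(root, rel), arcname=rel, recursive=False)
    return digest.hexdigest()
```

          Since the tar headers carry modes, owners and modification times, this variant reacts to metadata changes as well as content changes.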


          Names and contents only, the low-tech way



          If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:



          { export LC_ALL=C;
            find . -type f -exec wc -c {} \; | sort; echo;
            find . -type f -exec md5sum {} + | sort; echo;
            find . -type d | sort;
          } | md5sum


          We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted (in a specific, reproducible locale — thanks to Peter.O for reminding me of that). echo separates the two parts (without this, you could make some empty directories whose name look like md5sum output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.



          By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.



          Names and data, supporting newlines in names



          Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.



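          {
            export LC_ALL=C;
            du -0ab | sort -z; # file lengths, including directories (with length 0)
            echo | tr '\n' '\000'; # separator
            find . -type f -exec sha256sum {} + | sort -z; # file hashes
            echo | tr '\n' '\000'; # separator
            echo "End of hashed data."; # end of input
          } | sha256sum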


          A more robust approach



          Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and exits with a fatal error if any file can't be read.



          #! /usr/bin/env python3
          import hashlib, os, stat, sys

          ## Return the hash of the contents of the specified file, as a hex string
          def file_hash(name):
              h = hashlib.sha256()
              with open(name, 'rb') as f:  # binary mode: hash the raw bytes
                  while True:
                      buf = f.read(16384)
                      if len(buf) == 0: break
                      h.update(buf)
              return h.hexdigest()

          ## Traverse the specified path and update the hash with a description of its
          ## name and contents
          def traverse(h, path):
              rs = os.lstat(path)
              quoted_name = repr(path)
              if stat.S_ISDIR(rs.st_mode):
                  h.update(('dir ' + quoted_name + '\n').encode())
                  for entry in sorted(os.listdir(path)):
                      traverse(h, os.path.join(path, entry))
              elif stat.S_ISREG(rs.st_mode):
                  h.update(('reg ' + quoted_name + ' ').encode())
                  h.update((str(rs.st_size) + ' ').encode())
                  h.update((file_hash(path) + '\n').encode())
              else: pass # silently skip symlinks and other special files

          h = hashlib.sha256()
          for root in sys.argv[1:]: traverse(h, root)
          h.update(b'end\n')
          print(h.hexdigest())



























          • OK, this works, thanks. But is there any way to do it without including any metadata? Right now I need it for just the actual contents.
            – user17429
            Apr 6 '12 at 1:12










          • How about LC_ALL=C sort for checking from different environments...(+1 btw)
            – Peter.O
            Apr 6 '12 at 6:16











          • You made a whole Python program for this? Thanks! This is really more than what I had expected. :-) Anyway, I will check these methods as well as the new option 1 by Warren.
            – user17429
            Apr 6 '12 at 17:33










          • Good answer. Setting the sort order with LC_ALL=C is essential if running on multiple machines and OSs.
            – Davor Cubranic
            Aug 3 '16 at 20:52











          • What does cpio -o - mean? Doesn’t cpio use stdin/out by default? GNU cpio 2.12 produces cpio: Too many arguments
            – Jan Tojnar
            Aug 12 '16 at 12:40






























          Have a look at md5deep. Some of the features of md5deep that may interest you:




          Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.



          Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match.



          ...



























          • Nice, but can't get it to work, it says .../foo: Is a directory, what gives?
            – Camilo Martin
            Oct 2 '14 at 1:21






          • On its own md5deep doesn't solve the OP's problem as it doesn't print a consolidated md5sum, it just prints the md5sum for each file in the directory. That said, you can md5sum the output of md5deep - not quite what the OP wanted, but is close! e.g. for the current directory: md5deep -r -l -j0 . | md5sum (where -r is recursive, -l means "use relative paths" so that the absolute path of the files doesn't interfere when trying to compare the content of two directories, and -j0 means use 1 thread to prevent non-determinism due to individual md5sums being returned in different orders).
            – Stevie
            Oct 14 '15 at 12:34










          • How to ignore some files/directories in the path?
            – Sandeepan Nath
            Oct 21 '16 at 13:17






























          If your goal is just to find differences between two directories, consider using diff.



          Try this:



          diff -qr dir1 dir2



























          • Yes, this is useful as well. I think you meant dir1 dir2 in that command.
            – user17429
            Apr 6 '12 at 17:35






          • I don't usually use GUIs when I can avoid them, but for directory diffing kdiff3 is great and also works on many platforms.
            – sinelaw
            Apr 17 '12 at 2:21










          • Differing files are reported as well with this command.
            – Serge Stroobandt
            Apr 2 '14 at 15:02






























          You can hash every file recursively and then hash the resulting text:



          > md5deep -r -l . | sort | md5sum
          d43417958e47758c6405b5098f151074 *-


          md5deep is required.






















          • Instead of md5deep, use hashdeep on Ubuntu 16.04, because the md5deep package is just a transitional dummy for hashdeep.
            – palik
            Nov 8 '17 at 15:22






          • I've tried hashdeep. It outputs not only hashes but also a header that includes ## Invoked from: /home/myuser/dev/ (your current path) and ## $ hashdeep -s -r -l ~/folder/ (your command line). These lines are passed to sort as well, so the final hash will be different if you change your current folder or command line.
            – truf
            Aug 23 at 8:28






























          File contents only, excluding filenames



          I needed a version that only checked the file contents, not the filenames, because the files reside in different directories.



          This version (Warren Young's answer) helped a lot, but my version of md5sum outputs the filename (relative to the path I ran the command from), and the folder names were different, therefore even though the individual file checksums matched, the final checksum didn't.



          To fix that, in my case, I just needed to strip off the filename from each line of the find output (select only the first word as separated by spaces using cut):



          find -s somedir -type f -exec md5sum {} \; | cut -d" " -f1 | md5sum



























          • You might need to sort the checksums as well to get a reproducible list.
            – eckes
            Mar 22 '16 at 21:34






























          A good tree check-sum is the tree-id of Git.



          There is unfortunately no stand-alone tool available which can do that (at least I don't know of one), but if you have Git handy you can just pretend to set up a new repository and add the files you want to check to the index.



          This allows you to produce the (reproducible) tree hash - which includes only content, file names and some reduced file modes (executable).
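
          If Git is not at hand, the tree id can be approximated in plain Python. The sketch below follows Git's object naming as I understand it: a blob is the SHA-1 of "blob <size>", a NUL byte and the content, and a tree is the SHA-1 over sorted "<mode> <name>" entries, each followed by a NUL byte and the raw id. The function names object_id, blob_id and tree_id are my own, symlinks and special files are skipped, .gitignore and line-ending conversion are ignored, and I have not checked it against git write-tree beyond simple cases, so treat it as an assumption-laden illustration.

```python
import hashlib
import os
import stat


def object_id(kind, body):
    """SHA-1 over '<kind> <size>\\0' + body, the way Git names its objects."""
    return hashlib.sha1(b"%s %d\0" % (kind.encode("ascii"), len(body)) + body).digest()


def blob_id(path):
    """Blob id of one file, streamed so large files need little memory."""
    digest = hashlib.sha1(b"blob %d\0" % os.path.getsize(path))
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.digest()


def tree_id(path):
    """Return the raw 20-byte tree id of `path`, or None if it holds no files."""
    entries = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        mode = os.lstat(full).st_mode
        raw_name = os.fsencode(name)
        if stat.S_ISDIR(mode):
            sub = tree_id(full)
            if sub is not None:  # empty directories are not recorded
                # Sub-trees sort as if their name ended with a slash
                entries.append((raw_name + b"/", b"40000", raw_name, sub))
        elif stat.S_ISREG(mode):
            git_mode = b"100755" if mode & stat.S_IXUSR else b"100644"
            entries.append((raw_name, git_mode, raw_name, blob_id(full)))
        # symlinks and special files are skipped in this sketch
    if not entries:
        return None
    body = b"".join(m + b" " + n + b"\0" + oid for _key, m, n, oid in sorted(entries))
    return object_id("tree", body)
```

          Calling tree_id("somedir").hex() gives a hex string. Note that a .git directory inside the tree would be hashed like any other directory here, unlike in Git itself.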



















            I use this snippet for moderate volumes:



            find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat | md5sum -



            and this one for XXXL:



            find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 tail -qc100 | md5sum -


























            • What does the -xdev flag do?
              – czerasz
              May 4 '17 at 6:35










            • It calls for you to type in: man find and read that fine manual ;)
              – poige
              May 4 '17 at 12:43











            • Good point :-). -xdev Don't descend directories on other filesystems.
              – czerasz
              May 4 '17 at 16:31






            • Note that this ignores new, empty files (like if you touch a file).
              – RonJohn
              May 12 at 23:08










            • Thanks. I think I see how to fix
              – poige
              May 13 at 2:14






























            solution:



            $ pip install checksumdir
            $ checksumdir -a md5 assets/js
            981ac0bc890de594a9f2f40e00f13872
            $ checksumdir -a sha1 assets/js
            88cd20f115e31a1e1ae381f7291d0c8cd3b92fad


            It works fast and is an easier solution than bash scripting.



            see doc: https://pypi.python.org/pypi/checksumdir/1.0.5


























            • if you don't have pip you may need to install it with yum -y install python-pip (or dnf/apt-get)
              – DmitrySemenov
              Mar 8 '16 at 2:55






























            nix-hash from the Nix package manager




            The command nix-hash computes the cryptographic hash of the contents of each path and prints it on standard output. By default, it computes an MD5 hash, but other hash algorithms are available as well. The hash is printed in hexadecimal.

            The hash is computed over a serialisation of each path: a dump of the file system tree rooted at the path. This allows directories and symlinks to be hashed as well as regular files. The dump is in the NAR format produced by nix-store --dump. Thus, nix-hash path yields the same cryptographic hash as nix-store --dump path | md5sum.




















              I didn't want new executables nor clunky solutions so here's my take:



              #!/bin/bash
              # md5dir.sh by Camilo Martin, 2014-10-01.
              # Give this a parameter and it will calculate an md5 of the directory's contents.
              # It only takes into account file contents and paths relative to the directory's root.
              # This means that two dirs with different names and locations can hash equally.

              if [[ ! -d "$1" ]]; then
                  echo "Usage: md5dir.sh <dir_name>"
                  exit 1
              fi

              # Normalize the path: backslashes to slashes, squeeze repeats, drop a trailing slash
              d="$(tr '\\' / <<< "$1" | tr -s / | sed 's-/$--')"
              # Column where the path relative to $d starts in md5sum's output
              c=$((${#d} + 35))
              find "$d" -type f -exec md5sum {} \; | cut -c 1-33,$c- | sort | md5sum | cut -c 1-32


              Hope it helps you :)



















                If you want a well-tested script that supports a number of operations, including finding duplicates, comparing both data and metadata, and showing additions as well as changes and removals, you might like Fingerprint.



                Fingerprint right now doesn't produce a single checksum for a directory, but a transcript file which includes checksums for all files in that directory.



                fingerprint analyze


                This will generate index.fingerprint in the current directory which includes checksums, filenames and file sizes. By default it uses both MD5 and SHA1.256.



                In the future, I hope to add support for Merkle Trees into Fingerprint which will give you a single top-level checksum. Right now, you need to retain that file for doing verification.



















                  A robust and clean approach



                  • First things first, don't hog the available memory! Hash a file in chunks rather than feeding the entire file.

                  • Different approaches for different needs/purposes (all of the below, or pick whatever applies):

                    • Hash only the entry name of all entries in the directory tree

                    • Hash the file contents of all entries (leaving the meta like, inode number, ctime, atime, mtime, size, etc., you get the idea)

                    • For a symbolic link, its content is the referent name. Hash it or choose to skip

                    • Follow or not to follow(resolved name) the symlink while hashing the contents of the entry

                    • If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually but should the directory entry names of that level be hashed to tag this directory? Helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file's name changes but the rest of the contents remain the same and they are all fairly large files

                    • Handle large files well(again, mind the RAM)

                    • Handle very deep directory trees (mind the open file descriptors)

                    • Handle non standard file names

                    • How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must hash them as well?

                    • Don't update the access time of any entry while traversing because this will be a side effect and counter-productive(intuitive?) for certain use cases.


                  This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.



                  Here's a tool (disclaimer: I'm a contributor to it), dtreetrawl, very light on memory, which addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.




                  Usage:
                  dtreetrawl [OPTION...] "/trawl/me" [path2,...]

                  Help Options:
                  -h, --help Show help options

                  Application Options:
                  -t, --terse Produce a terse output; parsable.
                  -d, --delim=: Character or string delimiter/separator for terse output(default ':')
                  -l, --max-level=N Do not traverse tree beyond N level(s)
                  --hash Hash the files to produce checksums(default is MD5).
                  -c, --checksum=md5 Valid hashing algorithms: md5, sha1, sha256, sha512.
                  -s, --hash-symlink Include symbolic links' referent name while calculating the root checksum
                  -R, --only-root-hash Output only the root hash. Blank line if --hash is not set
                  -N, --no-name-hash Exclude path name while calculating the root checksum
                  -F, --no-content-hash Do not hash the contents of the file



                  An example human friendly output:




                  ...
                  ... //clipped
                  ...
                  /home/lab/linux-4.14-rc8/CREDITS
                  Base name : CREDITS
                  Level : 1
                  Type : regular file
                  Referent name :
                  File size : 98443 bytes
                  I-node number : 290850
                  No. directory entries : 0
                  Permission (octal) : 0644
                  Link count : 1
                  Ownership : UID=0, GID=0
                  Preferred I/O block size : 4096 bytes
                  Blocks allocated : 200
                  Last status change : Tue, 21 Nov 17 21:28:18 +0530
                  Last file access : Thu, 28 Dec 17 00:53:27 +0530
                  Last file modification : Tue, 21 Nov 17 21:28:18 +0530
                  Hash : 9f0312d130016d103aa5fc9d16a2437e

                  Stats for /home/lab/linux-4.14-rc8:
                  Elapsed time : 1.305767 s
                  Start time : Sun, 07 Jan 18 03:42:39 +0530
                  Root hash : 434e93111ad6f9335bb4954bc8f4eca4
                  Hash type : md5
                  Depth : 8
                  Total,
                  size : 66850916 bytes
                  entries : 12484
                  directories : 763
                  regular files : 11715
                  symlinks : 6
                  block devices : 0
                  char devices : 0
                  sockets : 0
                  FIFOs/pipes : 0




























                  • General advice is always welcome but the best answers are specific and with code where appropriate. If you have experience of using the tool you refer to then please include it.
                    – bu5hman
                    Jan 7 at 11:54










                  • @bu5hman Sure! I wasn't quite comfortable saying(gloating?) more about how well it works since I'm involved in its development.
                    – six-k
                    Jan 7 at 13:56






























                  Doing individually for all files in each directory.



                  # Calculating
                  find dir1 -type f -print0 | xargs -0 md5sum > dir1.md5
                  find dir2 -type f -print0 | xargs -0 md5sum > dir2.md5
                  # Comparing (and showing the difference)
                  paste <(sort -k2 dir1.md5) <(sort -k2 dir2.md5) | awk '$1 != $3'
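
                  The same per-file comparison can be sketched in Python. This is a rough illustration with made-up function names (hash_map, differing_paths). It keys the hashes by path relative to each directory, so it also reports files that exist on only one side.

```python
import hashlib
import os


def hash_map(root):
    """Map each regular file's path relative to `root` to its MD5 hex digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files
            digest = hashlib.md5()
            with open(full, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            result[os.path.relpath(full, root)] = digest.hexdigest()
    return result


def differing_paths(dir1, dir2):
    """Return sorted relative paths that are missing on one side or differ."""
    left, right = hash_map(dir1), hash_map(dir2)
    return sorted(p for p in left.keys() | right.keys() if left.get(p) != right.get(p))
```

                  An empty list from differing_paths("dir1", "dir2") means the regular files in both trees match by name and content.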






























                    Your Answer








                    StackExchange.ready(function()
                    var channelOptions =
                    tags: "".split(" "),
                    id: "106"
                    ;
                    initTagRenderer("".split(" "), "".split(" "), channelOptions);

                    StackExchange.using("externalEditor", function()
                    // Have to fire editor after snippets, if snippets enabled
                    if (StackExchange.settings.snippets.snippetsEnabled)
                    StackExchange.using("snippets", function()
                    createEditor();
                    );

                    else
                    createEditor();

                    );

                    function createEditor()
                    StackExchange.prepareEditor(
                    heartbeatType: 'answer',
                    convertImagesToLinks: false,
                    noModals: true,
                    showLowRepImageUploadWarning: true,
                    reputationToPostImages: null,
                    bindNavPrevention: true,
                    postfix: "",
                    imageUploader:
                    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
                    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
                    allowUrls: true
                    ,
                    onDemand: true,
                    discardSelector: ".discard-answer"
                    ,immediatelyShowMarkdownHelp:true
                    );



                    );













                     

                    draft saved


                    draft discarded


















                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f35832%2fhow-do-i-get-the-md5-sum-of-a-directorys-contents-as-one-sum%23new-answer', 'question_page');

                    );

                    Post as a guest





























                    14 Answers
                    14






                    active

                    oldest

                    votes








                    14 Answers
                    14






                    active

                    oldest

                    votes









                    active

                    oldest

                    votes






                    active

                    oldest

                    votes








                    up vote
                    154
                    down vote













                    The right way depends on exactly why you're asking:



                    Option 1: Compare Data Only



                    If you just need a hash of the tree's file contents, this will do the trick:



                    $ find -s somedir -type f -exec md5sum ; | md5sum


                    This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.



                    Unfortunately, find -s only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:



                    $ find somedir -type f -exec md5sum ; | sort -k 2 | md5sum


                    We've replaced find -s with a call to sort. The -k 2 bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort's reckoning.



                    There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because it'll look like multiple lines to the sort call. The find -s variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find.



                    In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain the directory listings in a stable, predictable order; you might not realize this from using ls and such, which silently sort the directory contents for you. find without -s or a sort call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
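You can see the same thing from Python, where os.listdir is documented as returning names in arbitrary order. A minimal sketch, with `stable_listing` being a name I made up:

```python
import os


def stable_listing(path):
    """Return the directory's entries in a reproducible order."""
    names = os.listdir(path)  # order is whatever the filesystem hands back
    return sorted(names)      # sorting makes it independent of that order
```

Anything that feeds file names into a hash should go through a step like this first.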



                    You might need to change the md5sum commands to md5 or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the sort command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum program for this because its output doesn't include the file name.



                    This method is somewhat inefficient, calling md5sum N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.



                    Option 2: Compare Data and Metadata



                    If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar to pack the directory contents up for you, then send it to md5sum:



                    $ tar -cf - somedir | md5sum


                    Because tar also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
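A quick way to convince yourself of this is to build the archive in memory and hash it before and after a chmod. This is only a sketch using Python's tarfile module rather than tar(1); `tar_md5` is a hypothetical helper name, and the digest will not match the command above because the two tools write their headers differently.

```python
import hashlib
import io
import tarfile


def tar_md5(path):
    """MD5 of an uncompressed tar archive of path, built in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # add() recurses into directories and stores mode, owner and mtime
        # next to the file data, so metadata changes alter the archive bytes.
        tar.add(path)
    return hashlib.md5(buf.getvalue()).hexdigest()
```

Calling `tar_md5` on a directory, changing one file's permissions, and calling it again should give two different values even though no file content changed.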



                    This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.



                    As with the find-based method above, tar processes file names in the order the underlying filesystem returns them, so the hash can change when nothing but that order has changed. It may well be that in your application you can be sure this won't happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)



                    If you find yourself getting false positives, I'd recommend going with the find | cpio option in Gilles' answer.
























                    • 6




                      I think it is best to navigate to the directory being compared and use find . instead of find somedir. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
                      – Abbafei
                      Jun 24 '14 at 6:50











                    • Should we sort the files too?
                      – CMCDragonkai
                      Jan 19 '16 at 2:52










                    • @CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not, because part of the emphasized "anything" in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
                      – Warren Young
                      Jan 19 '16 at 3:45










                    • @WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
                      – Robin Winslow
                      Aug 17 '16 at 7:51










                    • Option 1 alternative: find somedir -type f -exec sh -c "openssl dgst -sha1 -binary {} | xxd -p" \; | sort | openssl dgst -sha1 to ignore all filenames (should work with newlines)
                      – windm
                      Oct 22 '17 at 9:50






















                    edited Apr 1 '16 at 20:37

























                    answered Apr 5 '12 at 19:57









                    Warren Young




















                    up vote
                    32
                    down vote













                    The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
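To see why "unambiguous" matters, consider what happens if you simply concatenate file contents before hashing. The sketch below is my own illustration, not part of any tool: `naive` and `framed` are hypothetical names, and the length-prefix framing is just one simple way to keep items apart.

```python
import hashlib


def naive(contents):
    """Ambiguous: hashes the plain concatenation of all items."""
    return hashlib.sha256(b"".join(contents)).hexdigest()


def framed(contents):
    """Unambiguous: each item is preceded by its length and a colon."""
    h = hashlib.sha256()
    for data in contents:
        h.update(str(len(data)).encode("ascii") + b":" + data)
    return h.hexdigest()


# Two different sets of "files" whose concatenation is the same bytes.
set_one = [b"ab", b"c"]
set_two = [b"a", b"bc"]
print(naive(set_one) == naive(set_two))
print(framed(set_one) == framed(set_two))
```

The naive version cannot tell the two sets apart, while the framed one can, which is the property the representation fed to the checksum needs.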



                    Data and metadata



                    Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is not deterministic, because its representation depends on the order in which the files are stored and, if applicable, on the compression.



                    A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run find | sort to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX pax, GNU tar and cpio:



                    find | LC_ALL=C sort | pax -w -d | md5sum
                    find | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum
                    find | LC_ALL=C sort | cpio -o | md5sum
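The same "sort first, then archive without recursion" idea can be sketched with Python's tarfile module. Treat this as an illustration under assumptions: `sorted_tar_sha256` is a name I invented, I have swapped in SHA-256 for MD5, and the resulting digest will not match any of the command lines above because each archiver writes its own header format.

```python
import hashlib
import io
import os
import tarfile


def sorted_tar_sha256(root):
    """Archive everything under root in sorted name order and hash the archive."""
    names = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            names.append(os.path.relpath(os.path.join(dirpath, name), root))
    names.sort()  # code-point order, comparable to LC_ALL=C sort
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for rel in names:
            # recursive=False plays the role of --no-recursion: every entry
            # is added exactly once, in the order chosen above.
            tar.add(os.path.join(root, rel), arcname=rel, recursive=False)
    return hashlib.sha256(buf.getvalue()).hexdigest()
```

Because the archive still records modification times and ownership, two trees with identical contents but different timestamps will hash differently, just as with the commands above.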


                    Names and contents only, the low-tech way



                    If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:



                    { export LC_ALL=C;
                      find . -type f -exec wc -c {} \; | sort; echo;
                      find . -type f -exec md5sum {} + | sort; echo;
                      find . -type d | sort;
                    } | md5sum


                    We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted (in a specific, reproducible locale; thanks to Peter.O for reminding me of that). echo separates the two parts (without this, you could make some empty directories whose names look like md5sum output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.



                    By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.



                    Names and data, supporting newlines in names



                    Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.



                     sort -z; # file hashes
                    echo | sha256sum


                    A more robust approach



                    Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and aborts with an error if any file can't be read.



                    #!/usr/bin/env python3
                    import hashlib, os, stat, sys

                    ## Return the hash of the contents of the specified file, as a hex string
                    def file_hash(name):
                        h = hashlib.sha256()
                        with open(name, 'rb') as f:  # binary mode: hash the raw bytes
                            while True:
                                buf = f.read(16384)
                                if len(buf) == 0: break
                                h.update(buf)
                        return h.hexdigest()

                    ## Traverse the specified path and update the hash with a description of its
                    ## name and contents
                    def traverse(h, path):
                        rs = os.lstat(path)
                        quoted_name = repr(path)
                        if stat.S_ISDIR(rs.st_mode):
                            h.update(('dir ' + quoted_name + '\n').encode())
                            # Sorted, so the result does not depend on directory order
                            for entry in sorted(os.listdir(path)):
                                traverse(h, os.path.join(path, entry))
                        elif stat.S_ISREG(rs.st_mode):
                            h.update(('reg ' + quoted_name + ' ').encode())
                            h.update((str(rs.st_size) + ' ').encode())
                            h.update((file_hash(path) + '\n').encode())
                        else: pass # silently skip symlinks and other special files

                    h = hashlib.sha256()
                    for root in sys.argv[1:]: traverse(h, root)
                    h.update(b'end\n')
                    print(h.hexdigest())



























                    • OK, this works, thanks. But is there any way to do it without including any metadata? Right now I need it for just the actual contents.
                      – user17429
                      Apr 6 '12 at 1:12










                    • How about LC_ALL=C sort for checking from different environments...(+1 btw)
                      – Peter.O
                      Apr 6 '12 at 6:16











                    • You made a whole Python program for this? Thanks! This is really more than what I had expected. :-) Anyway, I will check these methods as well as the new option 1 by Warren.
                      – user17429
                      Apr 6 '12 at 17:33










                    • Good answer. Setting the sort order with LC_ALL=C is essential if running on multiple machines and OSs.
                      – Davor Cubranic
                      Aug 3 '16 at 20:52











                    • What does cpio -o - mean? Doesn’t cpio use stdin/out by default? GNU cpio 2.12 produces cpio: Too many arguments
                      – Jan Tojnar
                      Aug 12 '16 at 12:40














                    up vote
                    32
                    down vote













                    The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.



                    Data and metadata



                    Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is ambiguous, because its representation depends on the order in which the files are stored, and if applicable on the compression.



                    A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run find | sort to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX pax, GNU tar and cpio:



                    find | LC_ALL=C sort | pax -w -d | md5sum
                    find | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum
                    find | LC_ALL=C sort | cpio -o | md5sum


                    Names and contents only, the low-tech way



                    If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:



                     sort; echo;
                    find -type f -exec md5sum + | md5sum


                    We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted (in a specific, reproducible locale — thanks to Peter.O for reminding me of that). echo separates the two parts (without this, you could make some empty directories whose name look like md5sum output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.



                    By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.



                    Names and data, supporting newlines in names



                    Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.



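                    { export LC_ALL=C;
                      find . -type f -printf '%s %p\0' | sort -z; # file sizes
                      echo | tr '\n' '\000'; # separator
                      find . -type f -print0 | sort -z | xargs -0 -r sha256sum; # file hashes
                      echo | tr '\n' '\000'; # separator
                      find . -type d -print0 | sort -z; # directory names
                    } | sha256sum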


                    A more robust approach



                    Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and fails with a fatal error if any file can't be read.



                    #! /usr/bin/env python3
                    import hashlib, os, stat, sys

                    ## Return the hash of the contents of the specified file, as a hex string
                    def file_hash(name):
                        h = hashlib.sha256()
                        with open(name, 'rb') as f:  # binary mode: hash the raw bytes
                            while True:
                                buf = f.read(16384)
                                if len(buf) == 0: break
                                h.update(buf)
                        return h.hexdigest()

                    ## Traverse the specified path and update the hash with a description of its
                    ## name and contents
                    def traverse(h, path):
                        rs = os.lstat(path)
                        quoted_name = repr(path)
                        if stat.S_ISDIR(rs.st_mode):
                            h.update(('dir ' + quoted_name + '\n').encode())
                            for entry in sorted(os.listdir(path)):
                                traverse(h, os.path.join(path, entry))
                        elif stat.S_ISREG(rs.st_mode):
                            h.update(('reg ' + quoted_name + ' ').encode())
                            h.update((str(rs.st_size) + ' ').encode())
                            h.update((file_hash(path) + '\n').encode())
                        else: pass # silently skip symlinks and other special files

                    h = hashlib.sha256()
                    for root in sys.argv[1:]: traverse(h, root)
                    h.update(b'end\n')
                    print(h.hexdigest())



























                    • OK, this works, thanks. But is there any way to do it without including any metadata? Right now I need it for just the actual contents.
                      – user17429
                      Apr 6 '12 at 1:12










                    • How about LC_ALL=C sort for checking from different environments...(+1 btw)
                      – Peter.O
                      Apr 6 '12 at 6:16











                    • You made a whole Python program for this? Thanks! This is really more than what I had expected. :-) Anyway, I will check these methods as well as the new option 1 by Warren.
                      – user17429
                      Apr 6 '12 at 17:33










                    • Good answer. Setting the sort order with LC_ALL=C is essential if running on multiple machines and OSs.
                      – Davor Cubranic
                      Aug 3 '16 at 20:52











                    • What does cpio -o - mean? Doesn’t cpio use stdin/out by default? GNU cpio 2.12 produces cpio: Too many arguments
                      – Jan Tojnar
                      Aug 12 '16 at 12:40




















                    edited Aug 12 '16 at 13:07

























                    answered Apr 6 '12 at 0:53









                    Gilles
























                    Have a look at md5deep. Some of the features of md5deep that may interest you:




                    Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.



                    Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match.



                    ...



























                    • Nice, but can't get it to work, it says .../foo: Is a directory, what gives?
                      – Camilo Martin
                      Oct 2 '14 at 1:21






                    • On its own, md5deep doesn't solve the OP's problem, as it doesn't print a consolidated md5sum; it just prints the md5sum for each file in the directory. That said, you can md5sum the output of md5deep, which is not quite what the OP wanted but is close. For example, for the current directory: md5deep -r -l -j0 . | md5sum (where -r is recursive, -l means "use relative paths" so that the absolute path of the files doesn't interfere when comparing the contents of two directories, and -j0 means use 1 thread, to prevent non-determinism due to individual md5sums being returned in different orders).
                      – Stevie
                      Oct 14 '15 at 12:34










                    • How to ignore some files/directories in the path?
                      – Sandeepan Nath
                      Oct 21 '16 at 13:17
























                    answered Apr 10 '12 at 16:19









                    faultyserver
























                    If your goal is just to find differences between two directories, consider using diff.



                    Try this:



                    diff -qr dir1 dir2
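If a script is more convenient than the command line, a rough Python counterpart can be built on the standard filecmp module. This is a sketch under some assumptions: the helper name differing_paths is made up, and filecmp.dircmp by default treats files with the same type, size and modification time as equal without reading them, so it is less strict than diff. It also skips a default list of names such as RCS and CVS.

```python
import filecmp
import os


def differing_paths(dir1, dir2):
    """Return sorted relative paths that differ or exist on only one side."""
    found = []

    def walk(cmp, prefix):
        # Entries present on one side only, files that differ, and
        # entries that could not be compared.
        for name in (cmp.left_only + cmp.right_only
                     + cmp.diff_files + cmp.funny_files):
            found.append(os.path.join(prefix, name))
        # Recurse into subdirectories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(dir1, dir2), "")
    return sorted(found)
```

An empty result means no differences were detected under those comparison rules.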



























                    • Yes, this is useful as well. I think you meant dir1 dir2 in that command.
                      – user17429
                      Apr 6 '12 at 17:35






                    • I don't usually use GUIs when I can avoid them, but for directory diffing kdiff3 is great and also works on many platforms.
                      – sinelaw
                      Apr 17 '12 at 2:21










                    • Differing files are reported as well with this command.
                      – Serge Stroobandt
                      Apr 2 '14 at 15:02






















                    edited Apr 10 '12 at 16:06









                    Paŭlo Ebermann











                    answered Apr 6 '12 at 5:24









                    Deepak Mittal











                    You can hash every file recursively and then hash the resulting text:



                    > md5deep -r -l . | sort | md5sum
                    d43417958e47758c6405b5098f151074 *-


                    md5deep is required.
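If md5deep is not installed, roughly the same idea can be sketched with find and md5sum alone. This is a stand-in under stated assumptions, not an equivalent of the command above: the per-file line format may differ from md5deep's, so the two pipelines should not be expected to print the same digest. The function name dir_md5 and the demo files are made up for illustration, and GNU coreutils md5sum is assumed:

```shell
# dir_md5 DIR: hash every regular file under DIR, then hash the sorted list.
# Hypothetical helper; assumes GNU coreutils md5sum and a POSIX find.
dir_md5() {
    (
        cd "$1" || exit 1
        # Paths are relative to DIR, so the result does not depend on where DIR lives.
        # LC_ALL=C makes the sort order independent of the current locale.
        find . -type f -exec md5sum {} + | LC_ALL=C sort | md5sum | cut -d ' ' -f 1
    )
}

# Demo on throwaway trees.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/sub" "$tmp/b/sub"
printf 'hello\n' > "$tmp/a/sub/f.txt"
printf 'hello\n' > "$tmp/b/sub/f.txt"
sum_a=$(dir_md5 "$tmp/a")
sum_b=$(dir_md5 "$tmp/b")
printf 'world\n' > "$tmp/b/sub/g.txt"
sum_b_changed=$(dir_md5 "$tmp/b")
echo "$sum_a  a"
echo "$sum_b  b (same contents as a)"
echo "$sum_b_changed  b (after adding a file)"
rm -rf "$tmp"
```

Because the hashed list contains relative paths as well as checksums, renaming a file changes the result even when its contents stay the same.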








                    answered Apr 14 '16 at 13:34









                    Pavel Vlasov


                    • On Ubuntu 16.04, use hashdeep instead of md5deep, because the md5deep package is just a transitional dummy for hashdeep.
                      – palik
                      Nov 8 '17 at 15:22

                    • I've tried hashdeep. It outputs not only hashes but also a header, including ## Invoked from: /home/myuser/dev/ (your current path) and ## $ hashdeep -s -r -l ~/folder/. These lines are passed to sort as well, so the final hash will differ if you change your current folder or the command line.
                      – truf
                      Aug 23 at 8:28













                    File contents only, excluding filenames



                    I needed a version that checked only the file contents, not the filenames, because the files reside in different directories.

                    This version (Warren Young's answer) helped a lot, but my version of md5sum outputs the filename (relative to the path I ran the command from), and the folder names were different. So even though the individual file checksums matched, the final checksum didn't.

                    To fix that, in my case, I just needed to strip the filename from each line of the find output (using cut to select only the first space-separated word):

                    find -s somedir -type f -exec md5sum {} \; | cut -d" " -f1 | md5sum
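Note that -s (sorted traversal) is an option of BSD find; GNU find does not have it. On a GNU system, one possible sketch of the same contents-only idea is to sort the per-file checksums instead, which also makes the result independent of the order in which find visits the files. The function name content_md5 and the demo directories are made up for illustration, and GNU coreutils md5sum is assumed:

```shell
# content_md5 DIR: combined checksum over file contents only (names ignored).
# Hypothetical helper; assumes GNU coreutils md5sum.
content_md5() {
    find "$1" -type f -exec md5sum {} + |
        cut -d ' ' -f 1 |      # keep the checksum, drop the file name
        LC_ALL=C sort |        # order by checksum, not by traversal order
        md5sum | cut -d ' ' -f 1
}

# Demo: same contents under different directory and file names.
tmp=$(mktemp -d)
mkdir -p "$tmp/one/x" "$tmp/two/y"
printf 'alpha\n' > "$tmp/one/x/a.txt"
printf 'beta\n'  > "$tmp/one/b.txt"
printf 'alpha\n' > "$tmp/two/renamed.txt"
printf 'beta\n'  > "$tmp/two/y/other.txt"
sum_one=$(content_md5 "$tmp/one")
sum_two=$(content_md5 "$tmp/two")
echo "$sum_one  one"
echo "$sum_two  two"
rm -rf "$tmp"
```

Because names are dropped entirely, renaming or moving a file inside the tree does not change the result.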







                    edited Apr 13 '17 at 12:36









                    Community♦


                    answered May 11 '13 at 0:34









                    Nicole


                    • You might need to sort the checksums as well to get a reproducible list.
                      – eckes
                      Mar 22 '16 at 21:34

















                        A good tree checksum is Git's tree id.

                        Unfortunately, there is no stand-alone tool that can compute it (at least none that I know of), but if you have Git handy, you can just pretend to set up a new repository and add the files you want to check to the index.

                        This lets you produce a reproducible tree hash, which covers only content, file names and a reduced set of file modes (the executable bit).
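The procedure described here could be sketched roughly as follows. It assumes git is installed; the function name git_tree_id is made up for illustration, and the throwaway repository lives in a temporary directory so that the checked directory itself is not modified. Git's usual rules apply: empty directories are not recorded, and any .gitignore files or a nested .git directory in the tree affect what gets added. The result is a Git object id (SHA-1 by default), not an MD5 sum.

```shell
# git_tree_id DIR: print the Git tree id for the contents of DIR.
# Hypothetical helper; the repository is a throwaway one outside DIR.
git_tree_id() {
    dir=$(cd "$1" && pwd) || return 1
    repo=$(mktemp -d) || return 1
    rc=0
    (
        cd "$dir" &&
        git init -q "$repo" &&
        # Use the throwaway repository, but treat DIR as the work tree.
        git --git-dir="$repo/.git" --work-tree="$dir" add -A . &&
        git --git-dir="$repo/.git" write-tree
    ) || rc=$?
    rm -rf "$repo"
    return "$rc"
}

# Demo on throwaway trees; skipped when git is not installed.
if command -v git > /dev/null 2>&1; then
    tmp=$(mktemp -d)
    mkdir -p "$tmp/a/sub" "$tmp/b/sub"
    printf 'hello\n' > "$tmp/a/sub/f.txt"
    printf 'hello\n' > "$tmp/b/sub/f.txt"
    id_a=$(git_tree_id "$tmp/a")
    id_b=$(git_tree_id "$tmp/b")
    printf 'changed\n' > "$tmp/b/sub/f.txt"
    id_b_changed=$(git_tree_id "$tmp/b")
    echo "$id_a  a"
    echo "$id_b  b (same contents as a)"
    echo "$id_b_changed  b (after editing a file)"
    rm -rf "$tmp"
fi
```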








                        answered Aug 11 '13 at 1:37









                        eckes


                            I use this snippet of mine for moderate volumes:

                            find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat | md5sum -

                            and this one for very large (XXXL) volumes, where only the last 100 bytes of each file are hashed:

                            find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 tail -qc100 | md5sum -
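Because the first pipeline hashes only the concatenated contents, it cannot see file names or file boundaries, so adding an empty file or renaming a file leaves the checksum unchanged. One possible way to cover that, sketched here as an assumption rather than as this answer's method, is to feed the sorted name list into the hash along with the contents. GNU find, sort and xargs are assumed, and both function names are made up for illustration:

```shell
# contents_md5 DIR: the contents-only pipeline, run inside DIR.
contents_md5() {
    (
        cd "$1" || exit 1
        find . -xdev -type f -print0 | LC_COLLATE=C sort -z |
            xargs -0 cat | md5sum - | cut -d ' ' -f 1
    )
}

# names_and_contents_md5 DIR: hypothetical variant that also hashes
# the sorted, NUL-separated list of file names.
names_and_contents_md5() {
    (
        cd "$1" || exit 1
        {
            find . -xdev -type f -print0 | LC_COLLATE=C sort -z
            find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat
        } | md5sum - | cut -d ' ' -f 1
    )
}

# Demo: add a new, empty file and compare both checksums before and after.
tmp=$(mktemp -d)
printf 'data\n' > "$tmp/file1"
before_plain=$(contents_md5 "$tmp")
before_named=$(names_and_contents_md5 "$tmp")
: > "$tmp/empty"    # same effect as touching a new file
after_plain=$(contents_md5 "$tmp")
after_named=$(names_and_contents_md5 "$tmp")
echo "contents only:      $before_plain -> $after_plain"
echo "names and contents: $before_named -> $after_named"
rm -rf "$tmp"
```

The contents-only checksum stays the same after the empty file is added, while the variant that includes names changes.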








                            answered Apr 10 '12 at 17:26









                            poige


                            • What does the -xdev flag do?
                              – czerasz
                              May 4 '17 at 6:35










                            • It calls for you to type in: man find and read that fine manual ;)
                              – poige
                              May 4 '17 at 12:43











                            • Good point :-). -xdev Don't descend directories on other filesystems.
                              – czerasz
                              May 4 '17 at 16:31






                            • Note that this ignores new, empty files (like if you touch a file).
                              – RonJohn
                              May 12 at 23:08










                            • Thanks. I think I see how to fix it.
                              – poige
                              May 13 at 2:14

















                            Solution:

                            $ pip install checksumdir
                            $ checksumdir -a md5 assets/js
                            981ac0bc890de594a9f2f40e00f13872
                            $ checksumdir -a sha1 assets/js
                            88cd20f115e31a1e1ae381f7291d0c8cd3b92fad

                            It works fast and is an easier solution than bash scripting.

                            See the docs: https://pypi.python.org/pypi/checksumdir/1.0.5








                            answered Mar 8 '16 at 2:53









                            DmitrySemenov


                            • If you don't have pip, you may need to install it with yum -y install python-pip (or dnf/apt-get).
                              – DmitrySemenov
                              Mar 8 '16 at 2:55

















                                nix-hash from the Nix package manager




                                The command nix-hash computes the cryptographic hash of the contents of each path and prints it on standard output. By default, it computes an MD5 hash, but other hash algorithms are available as well. The hash is printed in hexadecimal.

                                The hash is computed over a serialisation of each path: a dump of the file system tree rooted at the path. This allows directories and symlinks to be hashed as well as regular files. The dump is in the NAR format produced by nix-store --dump. Thus, nix-hash path yields the same cryptographic hash as nix-store --dump path | md5sum.









                                answered Jul 27 '16 at 16:48









                                Igor


































                                    I didn't want new executables or clunky solutions, so here's my take:



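                                    #!/bin/bash
                                    # md5dir.sh by Camilo Martin, 2014-10-01.
                                    # Give this a parameter and it will calculate an md5 of the directory's contents.
                                    # It only takes into account file contents and paths relative to the directory's root.
                                    # This means that two dirs with different names and locations can hash equally.
                                    # Uses [[ ]] and <<<, so it needs bash rather than a plain POSIX sh.

                                    if [[ ! -d "$1" ]]; then
                                        echo "Usage: md5dir.sh <dir_name>"
                                        exit 1
                                    fi

                                    # Normalize the path: backslashes to slashes, squeeze repeated slashes, drop a trailing slash.
                                    d="$(tr '\\' / <<< "$1" | tr -s / | sed 's-/$--')"
                                    # md5sum prints 32 hex digits and 2 separator characters before the path;
                                    # c is the column where the part of the path after "$d" begins.
                                    c=$((${#d} + 35))
                                    find "$d" -type f -exec md5sum {} \; | cut -c 1-33,$c- | sort | md5sum | cut -c 1-32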


                                    Hope it helps you :)
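A variant of the same idea that also copes with spaces and newlines in file names might look like the sketch below. It is not the author's script: md5dir_safe is a hypothetical name, and GNU find, sort, xargs and md5sum are assumed (for -print0, -z, -0 and -r). Instead of cutting the directory prefix out of each line, it changes into the directory so the recorded paths are already relative.

```shell
#!/bin/sh
# md5dir_safe is a hypothetical name for this variant; it assumes GNU find,
# sort, xargs and md5sum.
# Usage: md5dir_safe DIR
md5dir_safe() (
    [ -d "$1" ] || { echo "Usage: md5dir_safe <dir_name>" >&2; exit 1; }
    # Work from inside DIR so the recorded paths are relative ("./sub/file").
    cd -- "$1" || exit 1
    # Fixed locale so the sort order is the same on every machine.
    export LC_ALL=C
    # NUL-separated names survive spaces and newlines in file names.
    find . -type f -print0 | sort -z | xargs -0 -r md5sum | md5sum | cut -c 1-32
)
```

Because the function body runs in a subshell, the cd and the exported locale do not leak into the calling shell.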
















                                        answered Oct 2 '14 at 2:13









                                        Camilo Martin


































                                            If you want a well-tested script that supports a number of operations, including finding duplicates, comparing both data and metadata, and showing additions as well as changes and removals, you might like Fingerprint.



                                            Fingerprint doesn't currently produce a single checksum for a directory, but rather a transcript file that includes checksums for all files in that directory.



                                            fingerprint analyze


                                            This will generate index.fingerprint in the current directory, which includes checksums, filenames and file sizes. By default it uses both MD5 and SHA2.256.



                                            In the future, I hope to add support for Merkle trees to Fingerprint, which will give you a single top-level checksum. Right now, you need to retain that file to do verification.
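The transcript idea can be imitated with plain md5sum. The sketch below is not Fingerprint's file format or command line: write_manifest, verify_manifest and manifest_sum are hypothetical helpers, and GNU coreutils and findutils are assumed. Hashing the manifest itself gives one top-level value, although that is a flat hash of the listing rather than a Merkle tree.

```shell
#!/bin/sh
# write_manifest DIR MANIFEST: one "md5  ./relative/path" line per regular file.
# Keep MANIFEST outside DIR so it does not list itself.
write_manifest() {
    ( cd -- "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r md5sum ) > "$2"
}

# verify_manifest DIR MANIFEST: succeeds only if every listed file still matches.
# Files added after the manifest was written are not detected.
verify_manifest() {
    ( cd -- "$1" && md5sum --check --status ) < "$2"
}

# manifest_sum MANIFEST: a single checksum standing in for the whole listing.
manifest_sum() {
    md5sum < "$1" | cut -c 1-32
}
```

A typical round trip would be write_manifest ./data ../data.md5 once, then verify_manifest ./data ../data.md5 whenever the tree needs checking.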
















                                                answered Jul 7 '16 at 0:15









                                                ioquatix


































                                                    A robust and clean approach



                                                    • First things first: don't hog the available memory! Hash a file in chunks rather than reading the entire file into memory.

                                                    • Different approaches suit different needs (apply all of the below, or pick whatever applies):

                                                      • Hash only the entry names of all entries in the directory tree.

                                                      • Hash the file contents of all entries, leaving out metadata such as inode number, ctime, atime, mtime, size, etc.

                                                      • For a symbolic link, the content is the referent name. Hash it or choose to skip it.

                                                      • Decide whether or not to follow the symlink (to its resolved name) while hashing the contents of the entry.

                                                      • If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level also be hashed to tag this directory? This is helpful in use cases where the hash is needed to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file whose name changes while the rest of the contents remain the same and are all fairly large files.

                                                      • Handle large files well (again, mind the RAM).

                                                      • Handle very deep directory trees (mind the open file descriptors).

                                                      • Handle non-standard file names.

                                                      • Decide how to proceed with files that are sockets, pipes/FIFOs, block devices, or char devices. Must they be hashed as well?

                                                      • Don't update the access time of any entry while traversing, because this is a side effect and is counter-productive (counter-intuitive?) for certain use cases.
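A few of the choices in the list above can be sketched with standard tools. This is only an illustration, not how any particular tool works: names_hash, content_hash and links_hash are hypothetical helper names, and GNU find is assumed for -print0 and -printf.

```shell
#!/bin/sh
# Hypothetical helpers; GNU find is assumed for -print0 and -printf.

# names_hash DIR: hash only the entry names, so any rename, addition or
# removal changes the result while content edits do not.
names_hash() {
    ( cd -- "$1" && find . -print0 | LC_ALL=C sort -z | md5sum | cut -c 1-32 )
}

# content_hash DIR: hash only regular-file contents, ignoring names and
# metadata. -type f skips sockets, FIFOs, devices and symlinks; md5sum reads
# each file as a stream, so large files are not loaded into memory at once.
content_hash() {
    ( cd -- "$1" &&
      find . -type f -exec sh -c 'md5sum < "$1"' sh {} \; |
      LC_ALL=C sort | md5sum | cut -c 1-32 )
}

# links_hash DIR: hash each symlink's path and referent name without
# following the link.
links_hash() {
    ( cd -- "$1" && find . -type l -printf '%p -> %l\0' | LC_ALL=C sort -z | md5sum | cut -c 1-32 )
}
```

Keeping the three hashes separate makes it possible to tell a rename (names_hash changes, content_hash does not) from an edit (the reverse).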


                                                    This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.



                                                    Here's a tool, dtreetrawl (disclaimer: I'm a contributor to it). It is very light on memory and addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.




                                                    Usage:
                                                      dtreetrawl [OPTION...] "/trawl/me" [path2,...]

                                                    Help Options:
                                                      -h, --help                 Show help options

                                                    Application Options:
                                                      -t, --terse                Produce a terse output; parsable.
                                                      -d, --delim=:              Character or string delimiter/separator for terse output (default ':')
                                                      -l, --max-level=N          Do not traverse tree beyond N level(s)
                                                      --hash                     Hash the files to produce checksums (default is MD5).
                                                      -c, --checksum=md5         Valid hashing algorithms: md5, sha1, sha256, sha512.
                                                      -s, --hash-symlink         Include symbolic links' referent name while calculating the root checksum
                                                      -R, --only-root-hash       Output only the root hash. Blank line if --hash is not set
                                                      -N, --no-name-hash         Exclude path name while calculating the root checksum
                                                      -F, --no-content-hash      Do not hash the contents of the file
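Going only by the option list above, getting a single root hash for a directory might look like the sketch below. It assumes dtreetrawl is installed and on the PATH; root_hash is a hypothetical wrapper name, and the exact output format has not been verified here.

```shell
#!/bin/sh
# root_hash is a hypothetical wrapper; the flags are taken from the usage text.
# Usage: root_hash DIR [ALGO]   (ALGO: md5, sha1, sha256 or sha512)
root_hash() {
    if ! command -v dtreetrawl >/dev/null 2>&1; then
        echo "dtreetrawl not found on PATH" >&2
        return 127
    fi
    # --hash enables checksumming, --checksum picks the algorithm,
    # -R prints only the root hash.
    dtreetrawl --hash --checksum="${2:-md5}" -R "$1"
}
```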



                                                    An example of the human-friendly output:




                                                    ...
                                                    ... //clipped
                                                    ...
                                                    /home/lab/linux-4.14-rc8/CREDITS
                                                    Base name : CREDITS
                                                    Level : 1
                                                    Type : regular file
                                                    Referent name :
                                                    File size : 98443 bytes
                                                    I-node number : 290850
                                                    No. directory entries : 0
                                                    Permission (octal) : 0644
                                                    Link count : 1
                                                    Ownership : UID=0, GID=0
                                                    Preferred I/O block size : 4096 bytes
                                                    Blocks allocated : 200
                                                    Last status change : Tue, 21 Nov 17 21:28:18 +0530
                                                    Last file access : Thu, 28 Dec 17 00:53:27 +0530
                                                    Last file modification : Tue, 21 Nov 17 21:28:18 +0530
                                                    Hash : 9f0312d130016d103aa5fc9d16a2437e

                                                    Stats for /home/lab/linux-4.14-rc8:
                                                    Elapsed time : 1.305767 s
                                                    Start time : Sun, 07 Jan 18 03:42:39 +0530
                                                    Root hash : 434e93111ad6f9335bb4954bc8f4eca4
                                                    Hash type : md5
                                                    Depth : 8
                                                    Total,
                                                    size : 66850916 bytes
                                                    entries : 12484
                                                    directories : 763
                                                    regular files : 11715
                                                    symlinks : 6
                                                    block devices : 0
                                                    char devices : 0
                                                    sockets : 0
                                                    FIFOs/pipes : 0




























                                                    • General advice is always welcome but the best answers are specific and with code where appropriate. If you have experience of using the tool you refer to then please include it.
                                                      – bu5hman
                                                      Jan 7 at 11:54










                                                    • @bu5hman Sure! I wasn't quite comfortable saying(gloating?) more about how well it works since I'm involved in its development.
                                                      – six-k
                                                      Jan 7 at 13:56














                                                    entries : 12484
                                                    directories : 763
                                                    regular files : 11715
                                                    symlinks : 6
                                                    block devices : 0
                                                    char devices : 0
                                                    sockets : 0
                                                    FIFOs/pipes : 0














                                                    edited Jan 7 at 13:50

























                                                    answered Jan 7 at 11:27









                                                    six-k

                                                    13















                                                    • General advice is always welcome but the best answers are specific and with code where appropriate. If you have experience of using the tool you refer to then please include it.
                                                      – bu5hman
                                                      Jan 7 at 11:54










                                                    • @bu5hman Sure! I wasn't quite comfortable saying (gloating?) more about how well it works, since I'm involved in its development.
                                                      – six-k
                                                      Jan 7 at 13:56


























                                                    up vote
                                                    0
                                                    down vote













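                                                    This does not produce one combined sum; instead it computes a checksum for every file in each directory and compares the two lists.



                                                    # Calculating (regular files only; NUL-delimited so names with spaces survive)
                                                    find dir1 -type f -print0 | xargs -0 md5sum > dir1.md5
                                                    find dir2 -type f -print0 | xargs -0 md5sum > dir2.md5
                                                    # Comparing (and showing the difference)
                                                    # Lines are paired by position, so both trees must contain the same file names
                                                    paste <(sort -k2 dir1.md5) <(sort -k2 dir2.md5) | awk '$1 != $3'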
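If a single combined sum is wanted, as the question asks, the sorted per-file list can itself be fed to md5sum. This is a minimal sketch, assuming bash and GNU find, sort, xargs and md5sum; `dir_md5` is a hypothetical helper name, not an existing command. It covers regular files only, so empty directories, symlinks and metadata do not affect the result.

```shell
#!/usr/bin/env bash
# Hypothetical helper; assumes GNU find, sort, xargs and md5sum.

# Print one MD5 for all regular files under the directory given as $1.
dir_md5() {
    (
        # Work inside the directory so the recorded paths are relative and
        # two copies of the same tree in different places hash alike.
        cd -- "$1" || exit 1
        # Sort the NUL-delimited names so the order is deterministic, hash
        # each file, then hash the resulting "sum  path" list.
        find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r md5sum | md5sum | cut -d ' ' -f 1
    )
}
```

For example, `[ "$(dir_md5 dir1)" = "$(dir_md5 dir2)" ] && echo same` reports whether two trees hold the same files with the same contents under the same relative names.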













                                                        answered 10 mins ago









                                                        Leandro Lima

                                                        12































                                                             
