How do I get the MD5 sum of a directory's contents as one sum?
The md5sum program does not provide checksums for directories. I want to get a single MD5 checksum for the entire contents of a directory, including files in sub-directories. That is, one combined checksum made out of all the files. Is there a way to do this?
Tags: directory, checksum, hashsum
asked Apr 5 '12 at 19:48
user17429
14 Answers
Answer (score: 154)
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum {} \; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, `find -s` only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum {} \; | sort -k 2 | md5sum
We've replaced `find -s` with a call to `sort`. The `-k 2` bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by `sort`'s reckoning.
There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because they'll look like multiple lines to the `sort` call. The `find -s` variant doesn't have that problem, because the tree traversal and sorting happen within the same program, `find`.
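If you're on a GNU system and want to keep the second form, one way to sidestep the newline weakness (a sketch, assuming GNU `find`, `sort`, and `xargs`, which all support null-separated records) is:

```shell
# Sort null-terminated file names, so a name containing a newline
# cannot split into multiple records, then hash the hash list.
find somedir -type f -print0 | sort -z | xargs -0 md5sum | md5sum
```
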
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain directory listings in a stable, predictable order; you might not realize this from using `ls` and such, which silently sort the directory contents for you. `find` without `-s` or a `sort` call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the `md5sum` commands to `md5` or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the `sort` command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix `sum` program for this, because its output doesn't include the file name.
This method is somewhat inefficient, calling `md5sum` N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask `tar` to pack the directory contents up for you, then send it to `md5sum`:
$ tar -cf - somedir | md5sum
Because `tar` also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
As with the `find`-based method above, `tar` is going to process file names in the order the underlying filesystem returns them. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
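If you have a sufficiently new GNU tar (a sketch, assuming GNU tar 1.28 or later, which added the `--sort` option), you can ask tar itself to impose a stable order:

```shell
# --sort=name makes tar emit archive entries in a fixed order, so the
# hash no longer depends on the filesystem's directory ordering.
tar --sort=name -cf - somedir | md5sum
```

Metadata such as mtimes still affects the hash, which is the point of this option.
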
If you find yourself getting false positives, I'd recommend going with the `find | cpio` option in Gilles' answer.
I think it is best to navigate to the directory being compared and use `find .` instead of `find somedir`. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
– Abbafei, Jun 24 '14 at 6:50
Should we sort the files too?
– CMCDragonkai, Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not, because part of the emphasized anything in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
– Warren Young, Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
– Robin Winslow, Aug 17 '16 at 7:51
Option 1 alternative: `find somedir -type f -exec sh -c "openssl dgst -sha1 -binary | xxd -p" \; | sort | openssl dgst -sha1` to ignore all filenames (should work with newlines)
– windm, Oct 22 '17 at 9:50
Answer (score: 32)
The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
Data and metadata
Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is ambiguous, because its representation depends on the order in which the files are stored, and if applicable on the compression.
A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run `find | sort` to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX `pax`, GNU tar and cpio:
find | LC_ALL=C sort | pax -w -d | md5sum
find | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum
find | LC_ALL=C sort | cpio -o | md5sum
Names and contents only, the low-tech way
If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:
{ export LC_ALL=C;
  find . -type f -exec wc -c {} \; | sort; echo;
  find . -type d -print | sort; echo;
  find . -type f -exec md5sum {} + | sort;
} | md5sum
We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted (in a specific, reproducible locale – thanks to Peter.O for reminding me of that). `echo` separates the two parts (without this, you could make some empty directories whose names look like `md5sum` output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.
By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.
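For instance, swapping SHA-256 into one of the archive pipelines above is just a matter of replacing the final digest command (a sketch, assuming GNU tar and coreutils' `sha256sum`):

```shell
# Same deterministic archive as before, hashed with SHA-256 instead of MD5.
find . | LC_ALL=C sort | tar -cf - -T - --no-recursion | sha256sum
```
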
Names and data, supporting newlines in names
Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.
{ export LC_ALL=C;
  find . -type f -print0 | sort -z | xargs -0 sha256sum;   # file hashes
  echo;
  find . -type d -print0 | sort -z | tr '\0' '\n';         # directory listing
  echo;
} | sha256sum
A more robust approach
Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and reports a fatal error if any file can't be read.
#! /usr/bin/env python3
import hashlib, os, stat, sys

## Return the hash of the contents of the specified file, as a hex string
def file_hash(name):
    h = hashlib.sha256()
    with open(name, 'rb') as f:
        while True:
            buf = f.read(16384)
            if not buf: break
            h.update(buf)
    return h.hexdigest()

## Traverse the specified path and update the hash with a description of its
## name and contents
def traverse(h, path):
    rs = os.lstat(path)
    quoted_name = repr(path)
    if stat.S_ISDIR(rs.st_mode):
        h.update(('dir ' + quoted_name + '\n').encode())
        for entry in sorted(os.listdir(path)):
            traverse(h, os.path.join(path, entry))
    elif stat.S_ISREG(rs.st_mode):
        h.update(('reg ' + quoted_name + ' ').encode())
        h.update((str(rs.st_size) + ' ').encode())
        h.update((file_hash(path) + '\n').encode())
    else: pass # silently ignore symlinks and other special files

h = hashlib.sha256()
for root in sys.argv[1:]: traverse(h, root)
h.update(b'end\n')
print(h.hexdigest())
OK, this works, thanks. But is there any way to do it without including any metadata? Right now I need it for just the actual contents.
– user17429, Apr 6 '12 at 1:12
How about `LC_ALL=C sort` for checking from different environments... (+1 btw)
– Peter.O, Apr 6 '12 at 6:16
You made a whole Python program for this? Thanks! This is really more than what I had expected. :-) Anyway, I will check these methods as well as the new option 1 by Warren.
– user17429, Apr 6 '12 at 17:33
Good answer. Setting the sort order with `LC_ALL=C` is essential if running on multiple machines and OSs.
– Davor Cubranic, Aug 3 '16 at 20:52
What does `cpio -o -` mean? Doesn't cpio use stdin/out by default? GNU cpio 2.12 produces `cpio: Too many arguments`
– Jan Tojnar, Aug 12 '16 at 12:40
Answer (score: 12)
Have a look at md5deep. Some of the features of md5deep that may interest you:
Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match.
...
Nice, but can't get it to work, it says `.../foo: Is a directory`, what gives?
– Camilo Martin, Oct 2 '14 at 1:21
On its own md5deep doesn't solve the OP's problem, as it doesn't print a consolidated md5sum, it just prints the md5sum for each file in the directory. That said, you can md5sum the output of md5deep - not quite what the OP wanted, but it's close! E.g. for the current directory: `md5deep -r -l -j0 . | md5sum` (where `-r` is recursive, `-l` means "use relative paths" so that the absolute path of the files doesn't interfere when trying to compare the content of two directories, and `-j0` means use 1 thread to prevent non-determinism due to individual md5sums being returned in different orders).
– Stevie, Oct 14 '15 at 12:34
How to ignore some files/directories in the path?
– Sandeepan Nath, Oct 21 '16 at 13:17
Answer (score: 7)
If your goal is just to find differences between two directories, consider using diff.
Try this:
diff -qr dir1 dir2
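For example, with two slightly different trees (illustrative setup; exact wording of the report varies by diff implementation):

```shell
# Build two trees that differ in one file, plus one file present only in dir1.
mkdir -p dir1 dir2
echo one > dir1/a.txt
echo two > dir2/a.txt
echo extra > dir1/only-here.txt

# List differences without showing contents; diff exits 1 when trees differ.
diff -qr dir1 dir2 || true
# Files dir1/a.txt and dir2/a.txt differ
# Only in dir1: only-here.txt
```
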
Yes, this is useful as well. I think you meant dir1 dir2 in that command.
– user17429, Apr 6 '12 at 17:35
I don't usually use GUIs when I can avoid them, but for directory diffing kdiff3 is great and also works on many platforms.
– sinelaw, Apr 17 '12 at 2:21
Differing files are reported as well with this command.
– Serge Stroobandt, Apr 2 '14 at 15:02
Answer (score: 5)
You can hash every file recursively and then hash the resulting text:
> md5deep -r -l . | sort | md5sum
d43417958e47758c6405b5098f151074 *-
md5deep is required.
Instead of `md5deep`, use `hashdeep` on Ubuntu 16.04, because the md5deep package is just a transitional dummy for hashdeep.
– palik, Nov 8 '17 at 15:22
I've tried hashdeep. It outputs not only hashes but also a header including `## Invoked from: /home/myuser/dev/`, which is your current path, and `## $ hashdeep -s -r -l ~/folder/`. This gets sorted too, so the final hash will be different if you change your current folder or command line.
– truf, Aug 23 at 8:28
Answer (score: 3)
File contents only, excluding filenames
I needed a version that checked only the file contents, because the contents reside in different directories.
This version (Warren Young's answer) helped a lot, but my version of `md5sum` outputs the filename (relative to the path I ran the command from), and the folder names were different; therefore, even though the individual file checksums matched, the final checksum didn't.
To fix that, in my case, I just needed to strip the filename off each line of the `find` output (select only the first word, as separated by spaces, using `cut`):
find -s somedir -type f -exec md5sum {} \; | cut -d" " -f1 | md5sum
You might need to sort the checksums as well to get a reproducible list.
– eckes, Mar 22 '16 at 21:34
Answer (score: 3)
A good tree checksum is the tree-id of Git.
There is unfortunately no stand-alone tool available that can do that (at least none I know of), but if you have Git handy you can just pretend to set up a new repository and add the files you want to check to the index.
This allows you to produce the (reproducible) tree hash - which includes only content, file names, and some reduced file modes (executable).
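A minimal sketch of that recipe (the directory name here is illustrative): initialize a throwaway repository over the directory, stage everything, and ask Git for the hash of the staged tree.

```shell
cd somedir
git init --quiet .   # throwaway repository; delete .git afterwards
git add --all        # stage every file into the index
git write-tree       # prints the SHA-1 tree-id of the staged contents
```

Two directories with identical contents, names, and executable bits produce the same tree-id regardless of where they live; `rm -rf .git` removes the scratch repository when you're done.
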
Answer (score: 2)
I use this snippet for moderate volumes:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat | md5sum -
and this one for XXXL:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 tail -qc100 | md5sum -
What does the `-xdev` flag do?
– czerasz, May 4 '17 at 6:35
It calls for you to type in `man find` and read that fine manual ;)
– poige, May 4 '17 at 12:43
Good point :-). `-xdev` Don't descend directories on other filesystems.
– czerasz, May 4 '17 at 16:31
Note that this ignores new, empty files (like if you touch a file).
– RonJohn, May 12 at 23:08
Thanks. I think I see how to fix
– poige, May 13 at 2:14
Answer (score: 2)
solution:
$ pip install checksumdir
$ checksumdir -a md5 assets/js
981ac0bc890de594a9f2f40e00f13872
$ checksumdir -a sha1 assets/js
88cd20f115e31a1e1ae381f7291d0c8cd3b92fad
It works fast and is an easier solution than bash scripting.
see doc: https://pypi.python.org/pypi/checksumdir/1.0.5
If you don't have pip, you may need to install it with `yum -y install python-pip` (or dnf/apt-get).
– DmitrySemenov, Mar 8 '16 at 2:55
Answer (score: 2)
`nix-hash` from the Nix package manager:
The command nix-hash computes the cryptographic hash of the contents of each path and prints it on standard output. By default, it computes an MD5 hash, but other hash algorithms are available as well. The hash is printed in hexadecimal.
The hash is computed over a serialisation of each path: a dump of the file system tree rooted at the path. This allows directories and symlinks to be hashed as well as regular files. The dump is in the NAR format produced by `nix-store --dump`. Thus, `nix-hash path` yields the same cryptographic hash as `nix-store --dump path | md5sum`.
Answer (score: 1)
I didn't want new executables or clunky solutions, so here's my take:
#!/bin/bash
# md5dir.sh by Camilo Martin, 2014-10-01.
# Give this a parameter and it will calculate an md5 of the directory's contents.
# It only takes into account file contents and paths relative to the directory's root.
# This means that two dirs with different names and locations can hash equally.
if [[ ! -d "$1" ]]; then
    echo "Usage: md5dir.sh <dir_name>"
    exit
fi
d="$(tr '\\' / <<< "$1" | tr -s / | sed 's-/$--')"
c=$((${#d} + 35))
find "$d" -type f -exec md5sum {} \; | cut -c 1-33,$c- | sort | md5sum | cut -c 1-32
Hope it helps you :)
Answer (score: 1)
If you'd like a well-tested script that supports a number of operations, including finding duplicates, doing comparisons on both data and metadata, and showing additions as well as changes and removals, you might like Fingerprint.
Fingerprint right now doesn't produce a single checksum for a directory, but a transcript file that includes checksums for all files in that directory.
fingerprint analyze
This will generate `index.fingerprint` in the current directory, which includes checksums, filenames and file sizes. By default it uses both MD5 and SHA1.256.
In the future, I hope to add support for Merkle Trees into Fingerprint which will give you a single top-level checksum. Right now, you need to retain that file for doing verification.
Answer (score: 0)
A robust and clean approach
- First things first, don't hog the available memory! Hash a file in chunks rather than feeding it the entire file.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
- Hash only the entry name of all entries in the directory tree
- Hash the file contents of all entries (leaving out the metadata like inode number, ctime, atime, mtime, size, etc. - you get the idea)
- For a symbolic link, its content is the referent name. Hash it, or choose to skip it
- Follow or don't follow (resolved name) the symlink while hashing the contents of the entry
- If it's a directory, its contents are just directory entries. While traversing recursively, they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? Helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file's name changing while the rest of the contents remain the same, and they are all fairly large files
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must we hash them as well?
- Don't update the access time of any entry while traversing, because this would be a side effect and counter-productive (counter-intuitive?) for certain use cases.
That is what I have off the top of my head; anyone who has spent some time working on this practically would have caught other gotchas and corner cases.
Here's a tool, dtreetrawl (disclaimer: I'm a contributor to it), that is very light on memory and addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.
Usage:
dtreetrawl [OPTION...] "/trawl/me" [path2,...]
Help Options:
-h, --help Show help options
Application Options:
-t, --terse Produce a terse output; parsable.
-d, --delim=: Character or string delimiter/separator for terse output(default ':')
-l, --max-level=N Do not traverse tree beyond N level(s)
--hash Hash the files to produce checksums(default is MD5).
-c, --checksum=md5 Valid hashing algorithms: md5, sha1, sha256, sha512.
-s, --hash-symlink Include symbolic links' referent name while calculating the root checksum
-R, --only-root-hash Output only the root hash. Blank line if --hash is not set
-N, --no-name-hash Exclude path name while calculating the root checksum
-F, --no-content-hash Do not hash the contents of the file
An example human friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
Base name : CREDITS
Level : 1
Type : regular file
Referent name :
File size : 98443 bytes
I-node number : 290850
No. directory entries : 0
Permission (octal) : 0644
Link count : 1
Ownership : UID=0, GID=0
Preferred I/O block size : 4096 bytes
Blocks allocated : 200
Last status change : Tue, 21 Nov 17 21:28:18 +0530
Last file access : Thu, 28 Dec 17 00:53:27 +0530
Last file modification : Tue, 21 Nov 17 21:28:18 +0530
Hash : 9f0312d130016d103aa5fc9d16a2437e
Stats for /home/lab/linux-4.14-rc8:
Elapsed time : 1.305767 s
Start time : Sun, 07 Jan 18 03:42:39 +0530
Root hash : 434e93111ad6f9335bb4954bc8f4eca4
Hash type : md5
Depth : 8
Total,
size : 66850916 bytes
entries : 12484
directories : 763
regular files : 11715
symlinks : 6
block devices : 0
char devices : 0
sockets : 0
FIFOs/pipes : 0
General advice is always welcome, but the best answers are specific and come with code where appropriate. If you have experience of using the tool you refer to, then please include it.
– bu5hman, Jan 7 at 11:54
@bu5hman Sure! I wasn't quite comfortable saying (gloating?) more about how well it works, since I'm involved in its development.
– six-k, Jan 7 at 13:56
Answer (score: 0)
Compute checksums individually for all files in each directory, then compare:
# Calculating
find dir1 -type f | xargs md5sum > dir1.md5
find dir2 -type f | xargs md5sum > dir2.md5
# Comparing (and showing the difference)
paste <(sort -k2 dir1.md5) <(sort -k2 dir2.md5) | awk '$1 != $3'
14 Answers
14
active
oldest
votes
14 Answers
14
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
154
down vote
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum ; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, find -s
only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum ; | sort -k 2 | md5sum
We've replaced find -s
with a call to sort
. The -k 2
bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort
's reckoning.
There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because it'll look like multiple lines to the sort
call. The find -s
variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find
.
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain the directory listings in a stable, predictable order; you might not realize this from using ls
and such, which silently sort the directory contents for you. find
without -s
or a sort
call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the md5sum
commands to md5
or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the sort
command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum
program for this because its output doesn't include the file name.
This method is somewhat inefficient, calling md5sum
N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar
to pack the directory contents up for you, then send it to md5sum
:
$ tar -cf - somedir | md5sum
Because tar
also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
As with the find
based method above, tar
is going to process file names in the order the underlying filesystem returns them. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
If you find yourself getting false positives, I'd recommend going with the find | cpio
option in Gilles' answer.
6
I think it is best to navigate to the directory being compared and usefind .
instead offind somedir
. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
â Abbafei
Jun 24 '14 at 6:50
Should we sort the files too?
â CMCDragonkai
Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not because part of the emphasized anything in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
â Warren Young
Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
â Robin Winslow
Aug 17 '16 at 7:51
Option 1 alternative:find somedir -type f -exec sh -c "openssl dgst -sha1 -binary | xxd -p" ; | sort | openssl dgst -sha1
to ignore all filenames (should work with newlines)
â windm
Oct 22 '17 at 9:50
add a comment |Â
up vote
154
down vote
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum ; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, find -s
only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum ; | sort -k 2 | md5sum
We've replaced find -s
with a call to sort
. The -k 2
bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort
's reckoning.
There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because it'll look like multiple lines to the sort
call. The find -s
variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find
.
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain the directory listings in a stable, predictable order; you might not realize this from using ls
and such, which silently sort the directory contents for you. find
without -s
or a sort
call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the md5sum
commands to md5
or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the sort
command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum
program for this because its output doesn't include the file name.
This method is somewhat inefficient, calling md5sum
N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar
to pack the directory contents up for you, then send it to md5sum
:
$ tar -cf - somedir | md5sum
Because tar
also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
As with the find
based method above, tar
is going to process file names in the order the underlying filesystem returns them. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
If you find yourself getting false positives, I'd recommend going with the find | cpio
option in Gilles' answer.
6
I think it is best to navigate to the directory being compared and usefind .
instead offind somedir
. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
â Abbafei
Jun 24 '14 at 6:50
Should we sort the files too?
â CMCDragonkai
Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not because part of the emphasized anything in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
â Warren Young
Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
â Robin Winslow
Aug 17 '16 at 7:51
Option 1 alternative:find somedir -type f -exec sh -c "openssl dgst -sha1 -binary | xxd -p" ; | sort | openssl dgst -sha1
to ignore all filenames (should work with newlines)
â windm
Oct 22 '17 at 9:50
add a comment |Â
up vote
154
down vote
up vote
154
down vote
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum ; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, find -s only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum {} \; | sort -k 2 | md5sum
We've replaced find -s with a call to sort. The -k 2 bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort's reckoning.
There's a weakness with this version of the command: it's liable to become confused if you have any filenames with newlines in them, because they'll look like multiple lines to the sort call. The find -s variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find.
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain directory listings in a stable, predictable order; you might not realize this from using ls and such, which silently sort the directory contents for you. Without -s or a sort call, find will print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the md5sum commands to md5 or some other hash function. If you choose another hash function and need the second form of the command for your system, you may need to adjust the sort command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum program for this, because its output doesn't include the file name.
This method is somewhat inefficient, calling md5sum N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
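As an illustration of the portable form of Option 1, here is a small sketch assuming GNU find and coreutils; the tree_md5 helper name is made up for this example:

```shell
# Hypothetical wrapper around the portable Option 1 pipeline.
# Assumes GNU find and coreutils; tree_md5 is a made-up name.
tree_md5() {
    find "$1" -type f -exec md5sum {} \; | LC_ALL=C sort -k 2 | md5sum | awk '{print $1}'
}

dir=$(mktemp -d)
mkdir -p "$dir/sub"
echo hello > "$dir/a.txt"
echo world > "$dir/sub/b.txt"

h1=$(tree_md5 "$dir")         # hash of the whole tree
h2=$(tree_md5 "$dir")         # rerun on identical content: same hash
echo changed > "$dir/a.txt"
h3=$(tree_md5 "$dir")         # a file's content changed: different hash

[ "$h1" = "$h2" ] && [ "$h1" != "$h3" ] && echo OK
rm -rf "$dir"
```

Note that the hash also covers file names, since they appear on md5sum's output lines, so renaming a file changes the result as well.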
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar to pack the directory contents up for you, then send it to md5sum:
$ tar -cf - somedir | md5sum
Because tar also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
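A quick sketch of that metadata sensitivity, assuming GNU tar and md5sum: even a permission change with identical file contents alters the archive bytes, and hence the hash.

```shell
dir=$(mktemp -d)
echo data > "$dir/f"
chmod 644 "$dir/f"

# Hash the archive bytes before and after a chmod; contents never change.
h1=$(tar -cf - -C "$dir" . | md5sum)
chmod 600 "$dir/f"
h2=$(tar -cf - -C "$dir" . | md5sum)

[ "$h1" != "$h2" ] && echo "metadata change detected"
rm -rf "$dir"
```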
As with the find-based method above, tar is going to process file names in the order the underlying filesystem returns them, so it can report a change when only the directory order has changed. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified-behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
If you find yourself getting false positives, I'd recommend going with the find | cpio option in Gilles' answer.
edited Apr 1 '16 at 20:37
answered Apr 5 '12 at 19:57
Warren Young
53.8k8140144
6
I think it is best to navigate to the directory being compared and use find . instead of find somedir. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
– Abbafei
Jun 24 '14 at 6:50
Should we sort the files too?
– CMCDragonkai
Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not, because part of the emphasized "anything" in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
– Warren Young
Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
– Robin Winslow
Aug 17 '16 at 7:51
Option 1 alternative: find somedir -type f -exec sh -c "openssl dgst -sha1 -binary {} | xxd -p" \; | sort | openssl dgst -sha1 to ignore all filenames (should work with newlines)
– windm
Oct 22 '17 at 9:50
up vote
32
down vote
The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
Data and metadata
Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is ambiguous, because its representation depends on the order in which the files are stored, and if applicable on the compression.
A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run find | sort to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX pax, GNU tar and cpio:
find | LC_ALL=C sort | pax -w -d | md5sum
find | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum
find | LC_ALL=C sort | cpio -o | md5sum
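As a sketch of the GNU tar variant above (assuming GNU find, tar, and md5sum), run from inside the directory so the listed paths are relative; because the name list is sorted in a fixed locale, the archive bytes, and thus the hash, are reproducible across runs:

```shell
dir=$(mktemp -d)
mkdir "$dir/sub"
echo a > "$dir/sub/x"

# Same tree, two runs: the sorted name list makes the archive reproducible.
h1=$(cd "$dir" && find . | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum)
h2=$(cd "$dir" && find . | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum)

[ "$h1" = "$h2" ] && echo reproducible
rm -rf "$dir"
```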
Names and contents only, the low-tech way
If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:
{ export LC_ALL=C;
  find . -type d -print | sort;
  echo;
  find . -type f -exec wc -c {} \; | sort;
  echo;
  find . -type f -exec md5sum {} + | sort;
} | md5sum
We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted in a specific, reproducible locale (thanks to Peter.O for reminding me of that). echo separates the two parts (without this, you could make some empty directories whose names look like md5sum output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.
By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.
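For example, the find-based recipe from the accepted answer carries over to SHA-256 unchanged apart from the digest command (a sketch assuming GNU coreutils):

```shell
dir=$(mktemp -d)
echo hi > "$dir/f"

# Same pipeline shape as the md5sum version, with sha256sum swapped in.
h=$(find "$dir" -type f -exec sha256sum {} \; | LC_ALL=C sort -k 2 | sha256sum | awk '{print $1}')

# A SHA-256 digest is 64 hex characters.
[ "${#h}" -eq 64 ] && echo OK
rm -rf "$dir"
```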
Names and data, supporting newlines in names
Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.
{ export LC_ALL=C;
  find . -type d -print0 | sort -z;
  echo;
  find . -type f -print0 | sort -z | xargs -0 sha256sum;   # file hashes
  echo;
} | sha256sum
A more robust approach
Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and exits with a fatal error if any file can't be read.
#! /usr/bin/env python
import hashlib, os, stat, sys

## Return the hash of the contents of the specified file, as a hex string
def file_hash(name):
    f = open(name, 'rb')
    h = hashlib.sha256()
    while True:
        buf = f.read(16384)
        if len(buf) == 0: break
        h.update(buf)
    f.close()
    return h.hexdigest()

## Traverse the specified path and update the hash with a description of its
## name and contents
def traverse(h, path):
    rs = os.lstat(path)
    quoted_name = repr(path)
    if stat.S_ISDIR(rs.st_mode):
        h.update('dir ' + quoted_name + '\n')
        for entry in sorted(os.listdir(path)):
            traverse(h, os.path.join(path, entry))
    elif stat.S_ISREG(rs.st_mode):
        h.update('reg ' + quoted_name + ' ')
        h.update(str(rs.st_size) + ' ')
        h.update(file_hash(path) + '\n')
    else: pass # silently ignore symlinks and other special files

h = hashlib.sha256()
for root in sys.argv[1:]: traverse(h, root)
h.update('end\n')
print h.hexdigest()
OK, this works, thanks. But is there any way to do it without including any metadata? Right now I need it for just the actual contents.
– user17429
Apr 6 '12 at 1:12
How about LC_ALL=C sort for checking from different environments... (+1 btw)
– Peter.O
Apr 6 '12 at 6:16
You made a whole Python program for this? Thanks! This is really more than what I had expected. :-) Anyway, I will check these methods as well as the new option 1 by Warren.
– user17429
Apr 6 '12 at 17:33
Good answer. Setting the sort order with LC_ALL=C is essential if running on multiple machines and OSs.
– Davor Cubranic
Aug 3 '16 at 20:52
What does cpio -o - mean? Doesn't cpio use stdin/out by default? GNU cpio 2.12 produces cpio: Too many arguments
– Jan Tojnar
Aug 12 '16 at 12:40
edited Aug 12 '16 at 13:07
answered Apr 6 '12 at 0:53
Gilles
517k12410321561
up vote
12
down vote
Have a look at md5deep. Some of the features of md5deep that may interest you:
Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match.
...
Nice, but can't get it to work, it says .../foo: Is a directory, what gives?
– Camilo Martin
Oct 2 '14 at 1:21
3
On its own md5deep doesn't solve the OP's problem as it doesn't print a consolidated md5sum, it just prints the md5sum for each file in the directory. That said, you can md5sum the output of md5deep - not quite what the OP wanted, but it's close! e.g. for the current directory: md5deep -r -l -j0 . | md5sum (where -r is recursive, -l means "use relative paths" so that the absolute path of the files doesn't interfere when trying to compare the content of two directories, and -j0 means use 1 thread to prevent non-determinism due to individual md5sums being returned in different orders).
– Stevie
Oct 14 '15 at 12:34
How to ignore some files/directories in the path?
– Sandeepan Nath
Oct 21 '16 at 13:17
answered Apr 10 '12 at 16:19
faultyserver
22114
up vote
7
down vote
If your goal is just to find differences between two directories, consider using diff.
Try this:
diff -qr dir1 dir2
Yes, this is useful as well. I think you meant dir1 dir2 in that command.
– user17429, Apr 6 '12 at 17:35

I don't usually use GUIs when I can avoid them, but for directory diffing kdiff3 is great and also works on many platforms.
– sinelaw, Apr 17 '12 at 2:21

Differing files are reported as well with this command.
– Serge Stroobandt, Apr 2 '14 at 15:02
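diff also sets its exit status (0 when the trees are identical, 1 when they differ), so the comparison is easy to script. A small sketch using two example trees built on the spot:

```shell
mkdir -p dir1 dir2                       # two example trees
printf 'same\n' > dir1/a.txt
printf 'same\n' > dir2/a.txt
# -q reports only whether files differ; -r recurses into subdirectories.
if diff -qr dir1 dir2 >/dev/null; then
    echo "trees are identical"
else
    echo "trees differ"
fi
```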
edited Apr 10 '12 at 16:06
Paŭlo Ebermann
32028

answered Apr 6 '12 at 5:24
Deepak Mittal
1,111914
up vote
5
down vote
You can hash every file recursively and then hash the resulting text:
> md5deep -r -l . | sort | md5sum
d43417958e47758c6405b5098f151074 *-
md5deep is required.
instead of md5deep use hashdeep on ubuntu 16.04, because the md5deep package is just a transitional dummy for hashdeep.
– palik, Nov 8 '17 at 15:22

I've tried hashdeep. It outputs not only hashes but also a header including ## Invoked from: /home/myuser/dev/ (your current path) and ## $ hashdeep -s -r -l ~/folder/. This header gets sorted in too, so the final hash will differ if you change your current folder or command line.
– truf, Aug 23 at 8:28
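If you do use hashdeep, truf's header problem can be worked around by dropping the comment lines before sorting, e.g. `hashdeep -r -l . | grep -v '^[#%]' | sort | md5sum` (hedged sketch: hashdeep prefixes its header lines with `##` and `%%%%`). The filtering step itself can be demonstrated without hashdeep installed, using two simulated runs whose headers differ but whose file records are the same:

```shell
# Simulated hashdeep output: same records, different headers and order.
out1='%%%% HASHDEEP-1.0
## Invoked from: /home/a
12,aaa,./f1
34,bbb,./f2'
out2='%%%% HASHDEEP-1.0
## Invoked from: /home/b
34,bbb,./f2
12,aaa,./f1'
# Strip header lines, sort the records, hash the result.
h1=$(printf '%s\n' "$out1" | grep -v '^[#%]' | sort | md5sum)
h2=$(printf '%s\n' "$out2" | grep -v '^[#%]' | sort | md5sum)
[ "$h1" = "$h2" ] && echo "header-independent"
```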
answered Apr 14 '16 at 13:34
Pavel Vlasov
178126
up vote
3
down vote
File contents only, excluding filenames
I needed a version that only checked the file contents, because the files reside in different directories.
This version (Warren Young's answer) helped a lot, but my version of md5sum outputs the filename (relative to the path I ran the command from), and the folder names were different, therefore even though the individual file checksums matched, the final checksum didn't.
To fix that, in my case, I just needed to strip off the filename from each line of the find output (select only the first word as separated by spaces using cut):
find -s somedir -type f -exec md5sum {} \; | cut -d" " -f1 | md5sum
You might need to sort the checksums as well to get a reproducible list.
– eckes, Mar 22 '16 at 21:34
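Combining the answer with eckes' comment, a content-only, order-independent variant might look like this (a sketch using plain GNU find, so the BSD-only -s flag isn't needed; the demo tree is created just for illustration):

```shell
mkdir -p demo/sub                       # small example tree
printf 'hello\n' > demo/a.txt
printf 'world\n' > demo/sub/b.txt
# Per-file content hashes only (names cut away), sorted, then hashed again.
find demo -type f -exec md5sum {} \; | cut -d' ' -f1 | sort | md5sum
```

Because the filenames are cut away, renaming or moving a file within the tree does not change the final checksum; only content changes do.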
edited Apr 13 '17 at 12:36
Community♦
1

answered May 11 '13 at 0:34
Nicole
1615
up vote
3
down vote
A good tree checksum is the tree-id of Git.
There is unfortunately no stand-alone tool available which can do that (at least I don't know of one), but if you have Git handy you can just pretend to set up a new repository and add the files you want to check to the index.
This allows you to produce the (reproducible) tree hash - which includes only content, file names and some reduced file modes (executable).
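For instance, something like this might work (a sketch assuming git is installed; the throwaway GIT_DIR keeps the hashed directory itself untouched, and the `somedir` tree is created just for the demo):

```shell
mkdir -p somedir && printf 'hi\n' > somedir/file.txt   # example tree
(
  cd somedir || exit 1
  export GIT_DIR="$(mktemp -d)" GIT_WORK_TREE="$PWD"
  git init -q                 # throwaway repository outside the tree
  git add -A                  # stage content, names and exec bits
  git write-tree              # prints the reproducible tree hash
  rm -rf "$GIT_DIR"
)
```

Two trees with identical content, names and modes produce the same 40-character tree hash, regardless of the directory's own name or location.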
answered Aug 11 '13 at 1:37
eckes
1477
up vote
2
down vote
I use this snippet of mine for moderate volumes:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat | md5sum -
and this one for XXXL:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 tail -qc100 | md5sum -
What does the -xdev flag do?
– czerasz, May 4 '17 at 6:35

It calls for you to type in: man find and read that fine manual ;)
– poige, May 4 '17 at 12:43

Good point :-). -xdev: Don't descend directories on other filesystems.
– czerasz, May 4 '17 at 16:31

Note that this ignores new, empty files (like if you touch a file).
– RonJohn, May 12 at 23:08

Thanks. I think I see how to fix it.
– poige, May 13 at 2:14
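A possible fix for RonJohn's empty-file case (my sketch, not poige's): hash the md5sum output lines, which include the file names, instead of the raw concatenated bytes. That way touching an empty file changes the final checksum:

```shell
mkdir -p demo && printf 'data\n' > demo/a.txt && touch demo/empty
# Hash the per-file "digest  name" lines rather than cat'ing the contents,
# so empty files (and renames) still influence the result.
( cd demo && find . -xdev -type f -print0 | LC_COLLATE=C sort -z \
    | xargs -0 md5sum | md5sum )
```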
answered Apr 10 '12 at 17:26
poige
3,8621541
up vote
2
down vote
solution:
$ pip install checksumdir
$ checksumdir -a md5 assets/js
981ac0bc890de594a9f2f40e00f13872
$ checksumdir -a sha1 assets/js
88cd20f115e31a1e1ae381f7291d0c8cd3b92fad
Works fast and is an easier solution than bash scripting.
see doc: https://pypi.python.org/pypi/checksumdir/1.0.5
if you don't have pip you may need to install it with yum -y install python-pip (or dnf/apt-get)
– DmitrySemenov, Mar 8 '16 at 2:55
answered Mar 8 '16 at 2:53
DmitrySemenov
23419
up vote
2
down vote
nix-hash from the Nix package manager:
The command nix-hash computes the cryptographic hash of the contents of each path and prints it on standard output. By default, it computes an MD5 hash, but other hash algorithms are available as well. The hash is printed in hexadecimal.
The hash is computed over a serialisation of each path: a dump of the file system tree rooted at the path. This allows directories and symlinks to be hashed as well as regular files. The dump is in the NAR format produced by nix-store --dump. Thus, nix-hash path yields the same cryptographic hash as nix-store --dump path | md5sum.
answered Jul 27 '16 at 16:48
Igor
1212
up vote
1
down vote
I didn't want new executables nor clunky solutions so here's my take:
#!/bin/bash
# md5dir.sh by Camilo Martin, 2014-10-01.
# Give this a parameter and it will calculate an md5 of the directory's contents.
# It only takes into account file contents and paths relative to the directory's root.
# This means that two dirs with different names and locations can hash equally.
if [[ ! -d "$1" ]]; then
    echo "Usage: md5dir.sh <dir_name>"
    exit 1
fi
d="$(tr '\\' / <<< "$1" | tr -s / | sed 's-/$--')"
c=$((${#d} + 35))
find "$d" -type f -exec md5sum {} \; | cut -c 1-33,$c- | sort | md5sum | cut -c 1-32
Hope it helps you :)
answered Oct 2 '14 at 2:13
Camilo Martin
36639
up vote
1
down vote
If you want a script that is well tested and supports a number of operations, including finding duplicates, doing comparisons on both data and metadata, and showing additions as well as changes and removals, you might like Fingerprint.
Fingerprint right now doesn't produce a single checksum for a directory, but a transcript file which includes checksums for all files in that directory.
fingerprint analyze
This will generate index.fingerprint
in the current directory which includes checksums, filenames and file sizes. By default it uses both MD5
and SHA1.256
.
In the future, I hope to add support for Merkle Trees into Fingerprint which will give you a single top-level checksum. Right now, you need to retain that file for doing verification.
answered Jul 7 '16 at 0:15
ioquatix
1113
up vote
0
down vote
A robust and clean approach
- First things first: don't hog the available memory! Hash a file in chunks rather than feeding in the entire file.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
- Hash only the entry name of all entries in the directory tree
- Hash the file contents of all entries (leaving out the meta like inode number, ctime, atime, mtime, size, etc., you get the idea)
- For a symbolic link, its content is the referent name. Hash it or choose to skip it
- Follow or don't follow (resolved name) the symlink while hashing the contents of the entry
- If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? Helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file's name changing while the rest of the contents remain the same, and they are all fairly large files
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must they be hashed as well?
- Don't update the access time of any entry while traversing, because this will be a side effect and counter-productive (counter-intuitive?) for certain use cases.
This is what I have off the top of my head; anyone who has spent some time working on this practically would have caught other gotchas and corner cases.
Here's a tool (disclaimer: I'm a contributor to it), dtreetrawl, very light on memory, which addresses most cases. It might be a bit rough around the edges but has been quite helpful.
Usage:
dtreetrawl [OPTION...] "/trawl/me" [path2,...]
Help Options:
-h, --help Show help options
Application Options:
-t, --terse Produce a terse output; parsable.
-d, --delim=: Character or string delimiter/separator for terse output(default ':')
-l, --max-level=N Do not traverse tree beyond N level(s)
--hash Hash the files to produce checksums(default is MD5).
-c, --checksum=md5 Valid hashing algorithms: md5, sha1, sha256, sha512.
-s, --hash-symlink Include symbolic links' referent name while calculating the root checksum
-R, --only-root-hash Output only the root hash. Blank line if --hash is not set
-N, --no-name-hash Exclude path name while calculating the root checksum
-F, --no-content-hash Do not hash the contents of the file
An example of human-friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
Base name : CREDITS
Level : 1
Type : regular file
Referent name :
File size : 98443 bytes
I-node number : 290850
No. directory entries : 0
Permission (octal) : 0644
Link count : 1
Ownership : UID=0, GID=0
Preferred I/O block size : 4096 bytes
Blocks allocated : 200
Last status change : Tue, 21 Nov 17 21:28:18 +0530
Last file access : Thu, 28 Dec 17 00:53:27 +0530
Last file modification : Tue, 21 Nov 17 21:28:18 +0530
Hash : 9f0312d130016d103aa5fc9d16a2437e
Stats for /home/lab/linux-4.14-rc8:
Elapsed time : 1.305767 s
Start time : Sun, 07 Jan 18 03:42:39 +0530
Root hash : 434e93111ad6f9335bb4954bc8f4eca4
Hash type : md5
Depth : 8
Total,
size : 66850916 bytes
entries : 12484
directories : 763
regular files : 11715
symlinks : 6
block devices : 0
char devices : 0
sockets : 0
FIFOs/pipes : 0
edited Jan 7 at 13:50
answered Jan 7 at 11:27
six-k
13
General advice is always welcome but the best answers are specific and with code where appropriate. If you have experience of using the tool you refer to then please include it.
– bu5hman
Jan 7 at 11:54
@bu5hman Sure! I wasn't quite comfortable saying (gloating?) more about how well it works since I'm involved in its development.
– six-k
Jan 7 at 13:56
add a comment |
up vote
0
down vote
Doing it individually for all files in each directory:
# Calculating
find dir1 -type f -print0 | xargs -0 md5sum > dir1.md5
find dir2 -type f -print0 | xargs -0 md5sum > dir2.md5
# Comparing (and showing the difference)
paste <(sort -k2 dir1.md5) <(sort -k2 dir2.md5) | awk '$1 != $3'
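One caveat with paste: if the two trees don't contain exactly the same set of file names, the line-up silently misaligns. A sketch of a more robust comparison (assuming GNU find/sort and bash process substitution) diffs the two name-sorted checksum lists with the top-level directory stripped:

```shell
# diff exits non-zero when the lists differ, and its output shows which
# relative paths have differing checksums (or exist on only one side).
compare_trees() {
    diff <(cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 md5sum) \
         <(cd "$2" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 md5sum)
}

# demo: file 'a' is identical in both trees, file 'b' differs
d1=$(mktemp -d); d2=$(mktemp -d)
echo same > "$d1/a"; echo same > "$d2/a"
echo old  > "$d1/b"; echo new  > "$d2/b"
out=$(compare_trees "$d1" "$d2" || true)
echo "$out"
```

Identical trees produce no output and exit status 0, so the function also works directly in an `if`.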
answered 10 mins ago
Leandro Lima
12
New contributor
add a comment |