How do I get the MD5 sum of a directory's contents as one sum?
The md5sum program does not provide checksums for directories. I want to get a single MD5 checksum for the entire contents of a directory, including files in sub-directories. That is, one combined checksum made out of all the files. Is there a way to do this?
Tags: directory, checksum, hashsum
asked Apr 5 '12 at 19:48
user17429
14 Answers
Answer (score: 154)
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum {} \; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, `find -s` only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum {} \; | sort -k 2 | md5sum
We've replaced `find -s` with a call to `sort`. The `-k 2` bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by `sort`'s reckoning.
There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because they'll look like multiple lines to the `sort` call. The `find -s` variant doesn't have that problem, because the tree traversal and sorting happen within the same program, `find`.
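If you're on a GNU system and want to keep the second form, one way to sidestep the newline weakness (a sketch, assuming GNU `find`, `sort`, and `xargs`, which all support null-separated records) is:

```shell
# Sort null-terminated file names, so a name containing a newline
# cannot split into multiple records, then hash the hash list.
find somedir -type f -print0 | sort -z | xargs -0 md5sum | md5sum
```
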
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain directory listings in a stable, predictable order; you might not realize this from using `ls` and such, which silently sort the directory contents for you. `find` without `-s` or a `sort` call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the `md5sum` commands to `md5` or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the `sort` command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix `sum` program for this, because its output doesn't include the file name.
This method is somewhat inefficient, calling `md5sum` N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask `tar` to pack the directory contents up for you, then send it to `md5sum`:
$ tar -cf - somedir | md5sum
Because `tar` also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
As with the `find`-based method above, `tar` is going to process file names in the order the underlying filesystem returns them. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
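If you have a sufficiently new GNU tar (a sketch, assuming GNU tar 1.28 or later, which added the `--sort` option), you can ask tar itself to impose a stable order:

```shell
# --sort=name makes tar emit archive entries in a fixed order, so the
# hash no longer depends on the filesystem's directory ordering.
tar --sort=name -cf - somedir | md5sum
```

Metadata such as mtimes still affects the hash, which is the point of this option.
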
If you find yourself getting false positives, I'd recommend going with the `find | cpio` option in Gilles' answer.
I think it is best to navigate to the directory being compared and use `find .` instead of `find somedir`. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
– Abbafei, Jun 24 '14 at 6:50
Should we sort the files too?
– CMCDragonkai, Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not, because part of the emphasized anything in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
– Warren Young, Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
– Robin Winslow, Aug 17 '16 at 7:51
Option 1 alternative: `find somedir -type f -exec sh -c "openssl dgst -sha1 -binary | xxd -p" \; | sort | openssl dgst -sha1` to ignore all filenames (should work with newlines)
– windm, Oct 22 '17 at 9:50
Answer (score: 32)
The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
Data and metadata
Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is ambiguous, because its representation depends on the order in which the files are stored, and if applicable on the compression.
A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run `find | sort` to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX `pax`, GNU tar and cpio:
find | LC_ALL=C sort | pax -w -d | md5sum
find | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum
find | LC_ALL=C sort | cpio -o | md5sum
Names and contents only, the low-tech way
If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:
{ export LC_ALL=C;
  find . -type f -exec wc -c {} \; | sort; echo;
  find . -type d -print | sort; echo;
  find . -type f -exec md5sum {} + | sort;
} | md5sum
We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted (in a specific, reproducible locale – thanks to Peter.O for reminding me of that). `echo` separates the two parts (without this, you could make some empty directories whose names look like `md5sum` output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.
By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.
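For instance, swapping SHA-256 into one of the archive pipelines above is just a matter of replacing the final digest command (a sketch, assuming GNU tar and coreutils' `sha256sum`):

```shell
# Same deterministic archive as before, hashed with SHA-256 instead of MD5.
find . | LC_ALL=C sort | tar -cf - -T - --no-recursion | sha256sum
```
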
Names and data, supporting newlines in names
Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.
{ export LC_ALL=C;
  find . -type f -print0 | sort -z | xargs -0 sha256sum;   # file hashes
  echo;
  find . -type d -print0 | sort -z | tr '\0' '\n';         # directory listing
  echo;
} | sha256sum
A more robust approach
Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and reports a fatal error if any file can't be read.
#! /usr/bin/env python3
import hashlib, os, stat, sys

## Return the hash of the contents of the specified file, as a hex string
def file_hash(name):
    h = hashlib.sha256()
    with open(name, 'rb') as f:
        while True:
            buf = f.read(16384)
            if not buf: break
            h.update(buf)
    return h.hexdigest()

## Traverse the specified path and update the hash with a description of its
## name and contents
def traverse(h, path):
    rs = os.lstat(path)
    quoted_name = repr(path)
    if stat.S_ISDIR(rs.st_mode):
        h.update(('dir ' + quoted_name + '\n').encode())
        for entry in sorted(os.listdir(path)):
            traverse(h, os.path.join(path, entry))
    elif stat.S_ISREG(rs.st_mode):
        h.update(('reg ' + quoted_name + ' ').encode())
        h.update((str(rs.st_size) + ' ').encode())
        h.update((file_hash(path) + '\n').encode())
    else: pass # silently ignore symlinks and other special files

h = hashlib.sha256()
for root in sys.argv[1:]: traverse(h, root)
h.update(b'end\n')
print(h.hexdigest())
OK, this works, thanks. But is there any way to do it without including any metadata? Right now I need it for just the actual contents.
– user17429, Apr 6 '12 at 1:12
How about `LC_ALL=C sort` for checking from different environments... (+1 btw)
– Peter.O, Apr 6 '12 at 6:16
You made a whole Python program for this? Thanks! This is really more than what I had expected. :-) Anyway, I will check these methods as well as the new option 1 by Warren.
– user17429, Apr 6 '12 at 17:33
Good answer. Setting the sort order with `LC_ALL=C` is essential if running on multiple machines and OSs.
– Davor Cubranic, Aug 3 '16 at 20:52
What does `cpio -o -` mean? Doesn't cpio use stdin/out by default? GNU cpio 2.12 produces `cpio: Too many arguments`
– Jan Tojnar, Aug 12 '16 at 12:40
Answer (score: 12)
Have a look at md5deep. Some of the features of md5deep that may interest you:
Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match.
...
Nice, but can't get it to work, it says `.../foo: Is a directory`, what gives?
– Camilo Martin, Oct 2 '14 at 1:21
On its own md5deep doesn't solve the OP's problem, as it doesn't print a consolidated md5sum, it just prints the md5sum for each file in the directory. That said, you can md5sum the output of md5deep - not quite what the OP wanted, but it's close! E.g. for the current directory: `md5deep -r -l -j0 . | md5sum` (where `-r` is recursive, `-l` means "use relative paths" so that the absolute path of the files doesn't interfere when trying to compare the content of two directories, and `-j0` means use 1 thread to prevent non-determinism due to individual md5sums being returned in different orders).
– Stevie, Oct 14 '15 at 12:34
How to ignore some files/directories in the path?
– Sandeepan Nath, Oct 21 '16 at 13:17
Answer (score: 7)
If your goal is just to find differences between two directories, consider using diff.
Try this:
diff -qr dir1 dir2
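For example, with two slightly different trees (illustrative setup; exact wording of the report varies by diff implementation):

```shell
# Build two trees that differ in one file, plus one file present only in dir1.
mkdir -p dir1 dir2
echo one > dir1/a.txt
echo two > dir2/a.txt
echo extra > dir1/only-here.txt

# List differences without showing contents; diff exits 1 when trees differ.
diff -qr dir1 dir2 || true
# Files dir1/a.txt and dir2/a.txt differ
# Only in dir1: only-here.txt
```
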
Yes, this is useful as well. I think you meant dir1 dir2 in that command.
– user17429, Apr 6 '12 at 17:35
I don't usually use GUIs when I can avoid them, but for directory diffing kdiff3 is great and also works on many platforms.
– sinelaw, Apr 17 '12 at 2:21
Differing files are reported as well with this command.
– Serge Stroobandt, Apr 2 '14 at 15:02
Answer (score: 5)
You can hash every file recursively and then hash the resulting text:
> md5deep -r -l . | sort | md5sum
d43417958e47758c6405b5098f151074 *-
md5deep is required.
Instead of `md5deep`, use `hashdeep` on Ubuntu 16.04, because the md5deep package is just a transitional dummy for hashdeep.
– palik, Nov 8 '17 at 15:22
I've tried hashdeep. It outputs not only hashes but also a header including `## Invoked from: /home/myuser/dev/`, which is your current path, and `## $ hashdeep -s -r -l ~/folder/`. This gets sorted too, so the final hash will be different if you change your current folder or command line.
– truf, Aug 23 at 8:28
Answer (score: 3)
File contents only, excluding filenames
I needed a version that checked only the file contents, because the contents reside in different directories.
This version (Warren Young's answer) helped a lot, but my version of `md5sum` outputs the filename (relative to the path I ran the command from), and the folder names were different; therefore, even though the individual file checksums matched, the final checksum didn't.
To fix that, in my case, I just needed to strip the filename off each line of the `find` output (select only the first word, as separated by spaces, using `cut`):
find -s somedir -type f -exec md5sum {} \; | cut -d" " -f1 | md5sum
You might need to sort the checksums as well to get a reproducible list.
– eckes, Mar 22 '16 at 21:34
Answer (score: 3)
A good tree checksum is the tree-id of Git.
There is unfortunately no stand-alone tool available that can do that (at least none I know of), but if you have Git handy you can just pretend to set up a new repository and add the files you want to check to the index.
This allows you to produce the (reproducible) tree hash - which includes only content, file names, and some reduced file modes (executable).
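A minimal sketch of that recipe (the directory name here is illustrative): initialize a throwaway repository over the directory, stage everything, and ask Git for the hash of the staged tree.

```shell
cd somedir
git init --quiet .   # throwaway repository; delete .git afterwards
git add --all        # stage every file into the index
git write-tree       # prints the SHA-1 tree-id of the staged contents
```

Two directories with identical contents, names, and executable bits produce the same tree-id regardless of where they live; `rm -rf .git` removes the scratch repository when you're done.
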
Answer (score: 2)
I use this snippet for moderate volumes:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat | md5sum -
and this one for XXXL:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 tail -qc100 | md5sum -
What does the `-xdev` flag do?
– czerasz, May 4 '17 at 6:35
It calls for you to type in `man find` and read that fine manual ;)
– poige, May 4 '17 at 12:43
Good point :-). `-xdev` Don't descend directories on other filesystems.
– czerasz, May 4 '17 at 16:31
Note that this ignores new, empty files (like if you touch a file).
– RonJohn, May 12 at 23:08
Thanks. I think I see how to fix
– poige, May 13 at 2:14
Answer (score: 2)
solution:
$ pip install checksumdir
$ checksumdir -a md5 assets/js
981ac0bc890de594a9f2f40e00f13872
$ checksumdir -a sha1 assets/js
88cd20f115e31a1e1ae381f7291d0c8cd3b92fad
It works fast and is an easier solution than bash scripting.
see doc: https://pypi.python.org/pypi/checksumdir/1.0.5
If you don't have pip, you may need to install it with `yum -y install python-pip` (or dnf/apt-get).
– DmitrySemenov, Mar 8 '16 at 2:55
Answer (score: 2)
`nix-hash` from the Nix package manager:
The command nix-hash computes the cryptographic hash of the contents of each path and prints it on standard output. By default, it computes an MD5 hash, but other hash algorithms are available as well. The hash is printed in hexadecimal.
The hash is computed over a serialisation of each path: a dump of the file system tree rooted at the path. This allows directories and symlinks to be hashed as well as regular files. The dump is in the NAR format produced by `nix-store --dump`. Thus, `nix-hash path` yields the same cryptographic hash as `nix-store --dump path | md5sum`.
Answer (score: 1)
I didn't want new executables or clunky solutions, so here's my take:
#!/bin/bash
# md5dir.sh by Camilo Martin, 2014-10-01.
# Give this a parameter and it will calculate an md5 of the directory's contents.
# It only takes into account file contents and paths relative to the directory's root.
# This means that two dirs with different names and locations can hash equally.
if [[ ! -d "$1" ]]; then
    echo "Usage: md5dir.sh <dir_name>"
    exit
fi
d="$(tr '\\' / <<< "$1" | tr -s / | sed 's-/$--')"
c=$((${#d} + 35))
find "$d" -type f -exec md5sum {} \; | cut -c 1-33,$c- | sort | md5sum | cut -c 1-32
Hope it helps you :)
Answer (score: 1)
If you'd like a well-tested script that supports a number of operations, including finding duplicates, doing comparisons on both data and metadata, and showing additions as well as changes and removals, you might like Fingerprint.
Fingerprint right now doesn't produce a single checksum for a directory, but a transcript file that includes checksums for all files in that directory.
fingerprint analyze
This will generate `index.fingerprint` in the current directory, which includes checksums, filenames and file sizes. By default it uses both MD5 and SHA1.256.
In the future, I hope to add support for Merkle Trees into Fingerprint which will give you a single top-level checksum. Right now, you need to retain that file for doing verification.
Answer (score: 0)
A robust and clean approach
- First things first, don't hog the available memory! Hash a file in chunks rather than feeding it the entire file.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
- Hash only the entry name of all entries in the directory tree
- Hash the file contents of all entries (leaving out the metadata like inode number, ctime, atime, mtime, size, etc. - you get the idea)
- For a symbolic link, its content is the referent name. Hash it, or choose to skip it
- Follow or don't follow (resolved name) the symlink while hashing the contents of the entry
- If it's a directory, its contents are just directory entries. While traversing recursively, they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? Helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file's name changing while the rest of the contents remain the same, and they are all fairly large files
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must we hash them as well?
- Don't update the access time of any entry while traversing, because this would be a side effect and counter-productive (counter-intuitive?) for certain use cases.
That is what I have off the top of my head; anyone who has spent some time working on this practically would have caught other gotchas and corner cases.
Here's a tool, dtreetrawl (disclaimer: I'm a contributor to it), that is very light on memory and addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.
Usage:
dtreetrawl [OPTION...] "/trawl/me" [path2,...]
Help Options:
-h, --help Show help options
Application Options:
-t, --terse Produce a terse output; parsable.
-d, --delim=: Character or string delimiter/separator for terse output(default ':')
-l, --max-level=N Do not traverse tree beyond N level(s)
--hash Hash the files to produce checksums(default is MD5).
-c, --checksum=md5 Valid hashing algorithms: md5, sha1, sha256, sha512.
-s, --hash-symlink Include symbolic links' referent name while calculating the root checksum
-R, --only-root-hash Output only the root hash. Blank line if --hash is not set
-N, --no-name-hash Exclude path name while calculating the root checksum
-F, --no-content-hash Do not hash the contents of the file
An example human friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
Base name : CREDITS
Level : 1
Type : regular file
Referent name :
File size : 98443 bytes
I-node number : 290850
No. directory entries : 0
Permission (octal) : 0644
Link count : 1
Ownership : UID=0, GID=0
Preferred I/O block size : 4096 bytes
Blocks allocated : 200
Last status change : Tue, 21 Nov 17 21:28:18 +0530
Last file access : Thu, 28 Dec 17 00:53:27 +0530
Last file modification : Tue, 21 Nov 17 21:28:18 +0530
Hash : 9f0312d130016d103aa5fc9d16a2437e
Stats for /home/lab/linux-4.14-rc8:
Elapsed time : 1.305767 s
Start time : Sun, 07 Jan 18 03:42:39 +0530
Root hash : 434e93111ad6f9335bb4954bc8f4eca4
Hash type : md5
Depth : 8
Total,
size : 66850916 bytes
entries : 12484
directories : 763
regular files : 11715
symlinks : 6
block devices : 0
char devices : 0
sockets : 0
FIFOs/pipes : 0
General advice is always welcome, but the best answers are specific and come with code where appropriate. If you have experience of using the tool you refer to, then please include it.
– bu5hman, Jan 7 at 11:54
@bu5hman Sure! I wasn't quite comfortable saying (gloating?) more about how well it works, since I'm involved in its development.
– six-k, Jan 7 at 13:56
Answer (score: 0)
Compute checksums individually for all files in each directory, then compare:
# Calculating
find dir1 -type f | xargs md5sum > dir1.md5
find dir2 -type f | xargs md5sum > dir2.md5
# Comparing (and showing the difference)
paste <(sort -k2 dir1.md5) <(sort -k2 dir2.md5) | awk '$1 != $3'
14 Answers
14
active
oldest
votes
14 Answers
14
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
154
down vote
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum ; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, find -s
only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum ; | sort -k 2 | md5sum
We've replaced find -s
with a call to sort
. The -k 2
bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort
's reckoning.
There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because it'll look like multiple lines to the sort
call. The find -s
variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find
.
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain the directory listings in a stable, predictable order; you might not realize this from using ls
and such, which silently sort the directory contents for you. find
without -s
or a sort
call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the md5sum
commands to md5
or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the sort
command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum
program for this because its output doesn't include the file name.
This method is somewhat inefficient, calling md5sum
N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar
to pack the directory contents up for you, then send it to md5sum
:
$ tar -cf - somedir | md5sum
Because tar
also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
As with the find
based method above, tar
is going to process file names in the order the underlying filesystem returns them. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
If you find yourself getting false positives, I'd recommend going with the find | cpio
option in Gilles' answer.
6
I think it is best to navigate to the directory being compared and usefind .
instead offind somedir
. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
â Abbafei
Jun 24 '14 at 6:50
Should we sort the files too?
â CMCDragonkai
Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not because part of the emphasized anything in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
â Warren Young
Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
â Robin Winslow
Aug 17 '16 at 7:51
Option 1 alternative:find somedir -type f -exec sh -c "openssl dgst -sha1 -binary | xxd -p" ; | sort | openssl dgst -sha1
to ignore all filenames (should work with newlines)
â windm
Oct 22 '17 at 9:50
add a comment |Â
up vote
154
down vote
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum ; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, find -s
only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum ; | sort -k 2 | md5sum
We've replaced find -s
with a call to sort
. The -k 2
bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort
's reckoning.
There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because it'll look like multiple lines to the sort
call. The find -s
variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find
.
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain the directory listings in a stable, predictable order; you might not realize this from using ls
and such, which silently sort the directory contents for you. find
without -s
or a sort
call is going to print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the md5sum
commands to md5
or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the sort
command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum
program for this because its output doesn't include the file name.
This method is somewhat inefficient, calling md5sum
N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar
to pack the directory contents up for you, then send it to md5sum
:
$ tar -cf - somedir | md5sum
Because tar
also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
As with the find
based method above, tar
is going to process file names in the order the underlying filesystem returns them. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
If you find yourself getting false positives, I'd recommend going with the find | cpio
option in Gilles' answer.
6
I think it is best to navigate to the directory being compared and usefind .
instead offind somedir
. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
â Abbafei
Jun 24 '14 at 6:50
Should we sort the files too?
â CMCDragonkai
Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not because part of the emphasized anything in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
â Warren Young
Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
â Robin Winslow
Aug 17 '16 at 7:51
Option 1 alternative:find somedir -type f -exec sh -c "openssl dgst -sha1 -binary | xxd -p" ; | sort | openssl dgst -sha1
to ignore all filenames (should work with newlines)
â windm
Oct 22 '17 at 9:50
add a comment |Â
up vote
154
down vote
up vote
154
down vote
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
$ find -s somedir -type f -exec md5sum ; | md5sum
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that should only change when the content of one of the files in the tree changes.
Unfortunately, find -s only works with BSD find(1), used in Mac OS X, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
$ find somedir -type f -exec md5sum {} \; | sort -k 2 | md5sum
We've replaced find -s with a call to sort. The -k 2 bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort's reckoning.
There's a weakness with this version of the command: it's liable to become confused if you have any filenames with newlines in them, because they'll look like multiple lines to the sort call. The find -s variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find.
In either case, the sorting is necessary to avoid false positives. *ix filesystems don't maintain directory listings in a stable, predictable order; you might not realize this from using ls and such, which silently sort the directory contents for you. Without -s or a sort call, find will print out files in whatever order the underlying filesystem returns them, which could cause this command to give a changed hash value when all that's changed is the order of files in a directory.
You might need to change the md5sum commands to md5 or some other hash function. If you choose another hash function and need the second form of the command for your system, you may need to adjust the sort command if its output line doesn't have a hash followed by the file name, separated by whitespace. For instance, you cannot use the old Unix sum program for this, because its output doesn't include the file name.
This method is somewhat inefficient, calling md5sum N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
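As an illustration of the portable form of Option 1, here is a small sketch assuming GNU find and coreutils; the tree_md5 helper name is made up for this example:

```shell
# Hypothetical wrapper around the portable Option 1 pipeline.
# Assumes GNU find and coreutils; tree_md5 is a made-up name.
tree_md5() {
    find "$1" -type f -exec md5sum {} \; | LC_ALL=C sort -k 2 | md5sum | awk '{print $1}'
}

dir=$(mktemp -d)
mkdir -p "$dir/sub"
echo hello > "$dir/a.txt"
echo world > "$dir/sub/b.txt"

h1=$(tree_md5 "$dir")         # hash of the whole tree
h2=$(tree_md5 "$dir")         # rerun on identical content: same hash
echo changed > "$dir/a.txt"
h3=$(tree_md5 "$dir")         # a file's content changed: different hash

[ "$h1" = "$h2" ] && [ "$h1" != "$h3" ] && echo OK
rm -rf "$dir"
```

Note that the hash also covers file names, since they appear on md5sum's output lines, so renaming a file changes the result as well.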
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar to pack the directory contents up for you, then send it to md5sum:
$ tar -cf - somedir | md5sum
Because tar also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
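A quick sketch of that metadata sensitivity, assuming GNU tar and md5sum: even a permission change with identical file contents alters the archive bytes, and hence the hash.

```shell
dir=$(mktemp -d)
echo data > "$dir/f"
chmod 644 "$dir/f"

# Hash the archive bytes before and after a chmod; contents never change.
h1=$(tar -cf - -C "$dir" . | md5sum)
chmod 600 "$dir/f"
h2=$(tar -cf - -C "$dir" . | md5sum)

[ "$h1" != "$h2" ] && echo "metadata change detected"
rm -rf "$dir"
```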
As with the find-based method above, tar is going to process file names in the order the underlying filesystem returns them, so it can report a change when only the directory order has changed. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified-behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
If you find yourself getting false positives, I'd recommend going with the find | cpio option in Gilles' answer.
edited Apr 1 '16 at 20:37
answered Apr 5 '12 at 19:57
Warren Young
53.8k8140144
6
I think it is best to navigate to the directory being compared and use find . instead of find somedir. This way the file names are the same when providing different path-specs to find; this can be tricky :-)
– Abbafei
Jun 24 '14 at 6:50
Should we sort the files too?
– CMCDragonkai
Jan 19 '16 at 2:52
@CMCDragonkai: What do you mean? In the first case, we do sort the list of file names. In the second case, we purposely do not, because part of the emphasized "anything" in the first sentence is that the order of files in a directory has changed, so you wouldn't want to sort anything.
– Warren Young
Jan 19 '16 at 3:45
@WarrenYoung Can you explain a bit more thoroughly why option 2 isn't always better? It seems to be quicker, simpler and more cross-platform. In which case shouldn't it be option 1?
– Robin Winslow
Aug 17 '16 at 7:51
Option 1 alternative: find somedir -type f -exec sh -c "openssl dgst -sha1 -binary {} | xxd -p" \; | sort | openssl dgst -sha1 to ignore all filenames (should work with newlines)
– windm
Oct 22 '17 at 9:50
up vote
32
down vote
The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
Data and metadata
Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is ambiguous, because its representation depends on the order in which the files are stored, and if applicable on the compression.
A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run find | sort to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX pax, GNU tar and cpio:
find | LC_ALL=C sort | pax -w -d | md5sum
find | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum
find | LC_ALL=C sort | cpio -o | md5sum
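As a sketch of the GNU tar variant above (assuming GNU find, tar, and md5sum), run from inside the directory so the listed paths are relative; because the name list is sorted in a fixed locale, the archive bytes, and thus the hash, are reproducible across runs:

```shell
dir=$(mktemp -d)
mkdir "$dir/sub"
echo a > "$dir/sub/x"

# Same tree, two runs: the sorted name list makes the archive reproducible.
h1=$(cd "$dir" && find . | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum)
h2=$(cd "$dir" && find . | LC_ALL=C sort | tar -cf - -T - --no-recursion | md5sum)

[ "$h1" = "$h2" ] && echo reproducible
rm -rf "$dir"
```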
Names and contents only, the low-tech way
If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:
{ export LC_ALL=C;
  find . -type d -print | sort;
  echo;
  find . -type f -exec wc -c {} \; | sort;
  echo;
  find . -type f -exec md5sum {} + | sort;
} | md5sum
We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted in a specific, reproducible locale (thanks to Peter.O for reminding me of that). echo separates the two parts (without this, you could make some empty directories whose names look like md5sum output that could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.
By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.
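For example, the find-based recipe from the accepted answer carries over to SHA-256 unchanged apart from the digest command (a sketch assuming GNU coreutils):

```shell
dir=$(mktemp -d)
echo hi > "$dir/f"

# Same pipeline shape as the md5sum version, with sha256sum swapped in.
h=$(find "$dir" -type f -exec sha256sum {} \; | LC_ALL=C sort -k 2 | sha256sum | awk '{print $1}')

# A SHA-256 digest is 64 hex characters.
[ "${#h}" -eq 64 ] && echo OK
rm -rf "$dir"
```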
Names and data, supporting newlines in names
Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.
{ export LC_ALL=C;
  find . -type d -print0 | sort -z;
  echo;
  find . -type f -print0 | sort -z | xargs -0 sha256sum;   # file hashes
  echo;
} | sha256sum
A more robust approach
Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other special files, and exits with a fatal error if any file can't be read.
#! /usr/bin/env python
import hashlib, os, stat, sys

## Return the hash of the contents of the specified file, as a hex string
def file_hash(name):
    f = open(name, 'rb')
    h = hashlib.sha256()
    while True:
        buf = f.read(16384)
        if len(buf) == 0: break
        h.update(buf)
    f.close()
    return h.hexdigest()

## Traverse the specified path and update the hash with a description of its
## name and contents
def traverse(h, path):
    rs = os.lstat(path)
    quoted_name = repr(path)
    if stat.S_ISDIR(rs.st_mode):
        h.update('dir ' + quoted_name + '\n')
        for entry in sorted(os.listdir(path)):
            traverse(h, os.path.join(path, entry))
    elif stat.S_ISREG(rs.st_mode):
        h.update('reg ' + quoted_name + ' ')
        h.update(str(rs.st_size) + ' ')
        h.update(file_hash(path) + '\n')
    else: pass # silently ignore symlinks and other special files

h = hashlib.sha256()
for root in sys.argv[1:]: traverse(h, root)
h.update('end\n')
print h.hexdigest()
OK, this works, thanks. But is there any way to do it without including any metadata? Right now I need it for just the actual contents.
– user17429
Apr 6 '12 at 1:12
How about LC_ALL=C sort for checking from different environments... (+1 btw)
– Peter.O
Apr 6 '12 at 6:16
You made a whole Python program for this? Thanks! This is really more than what I had expected. :-) Anyway, I will check these methods as well as the new option 1 by Warren.
– user17429
Apr 6 '12 at 17:33
Good answer. Setting the sort order with LC_ALL=C is essential if running on multiple machines and OSs.
– Davor Cubranic
Aug 3 '16 at 20:52
What does cpio -o - mean? Doesn't cpio use stdin/out by default? GNU cpio 2.12 produces cpio: Too many arguments
– Jan Tojnar
Aug 12 '16 at 12:40
edited Aug 12 '16 at 13:07
answered Apr 6 '12 at 0:53
Gilles
517k12410321561
up vote
12
down vote
Have a look at md5deep. Some of the features of md5deep that may interest you:
Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match.
...
Nice, but can't get it to work, it says .../foo: Is a directory, what gives?
– Camilo Martin
Oct 2 '14 at 1:21
3
On its own md5deep doesn't solve the OP's problem as it doesn't print a consolidated md5sum, it just prints the md5sum for each file in the directory. That said, you can md5sum the output of md5deep - not quite what the OP wanted, but it's close! e.g. for the current directory: md5deep -r -l -j0 . | md5sum (where -r is recursive, -l means "use relative paths" so that the absolute path of the files doesn't interfere when trying to compare the content of two directories, and -j0 means use 1 thread to prevent non-determinism due to individual md5sums being returned in different orders).
– Stevie
Oct 14 '15 at 12:34
How to ignore some files/directories in the path?
– Sandeepan Nath
Oct 21 '16 at 13:17
answered Apr 10 '12 at 16:19
faultyserver
22114
up vote
7
down vote
If your goal is just to find differences between two directories, consider using diff.
Try this:
diff -qr dir1 dir2
Yes, this is useful as well. I think you meant dir1 dir2 in that command.
– user17429, Apr 6 '12 at 17:35

I don't usually use GUIs when I can avoid them, but for directory diffing kdiff3 is great and also works on many platforms.
– sinelaw, Apr 17 '12 at 2:21

Differing files are reported as well with this command.
– Serge Stroobandt, Apr 2 '14 at 15:02
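diff also sets its exit status (0 when the trees are identical, 1 when they differ), so the comparison is easy to script. A small sketch using two example trees built on the spot:

```shell
mkdir -p dir1 dir2                       # two example trees
printf 'same\n' > dir1/a.txt
printf 'same\n' > dir2/a.txt
# -q reports only whether files differ; -r recurses into subdirectories.
if diff -qr dir1 dir2 >/dev/null; then
    echo "trees are identical"
else
    echo "trees differ"
fi
```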
edited Apr 10 '12 at 16:06
Paŭlo Ebermann
32028

answered Apr 6 '12 at 5:24
Deepak Mittal
1,111914
up vote
5
down vote
You can hash every file recursively and then hash the resulting text:
> md5deep -r -l . | sort | md5sum
d43417958e47758c6405b5098f151074 *-
md5deep is required.
instead of md5deep use hashdeep on ubuntu 16.04, because the md5deep package is just a transitional dummy for hashdeep.
– palik, Nov 8 '17 at 15:22

I've tried hashdeep. It outputs not only hashes but also a header including ## Invoked from: /home/myuser/dev/ (your current path) and ## $ hashdeep -s -r -l ~/folder/. This header gets sorted in too, so the final hash will differ if you change your current folder or command line.
– truf, Aug 23 at 8:28
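If you do use hashdeep, truf's header problem can be worked around by dropping the comment lines before sorting, e.g. `hashdeep -r -l . | grep -v '^[#%]' | sort | md5sum` (hedged sketch: hashdeep prefixes its header lines with `##` and `%%%%`). The filtering step itself can be demonstrated without hashdeep installed, using two simulated runs whose headers differ but whose file records are the same:

```shell
# Simulated hashdeep output: same records, different headers and order.
out1='%%%% HASHDEEP-1.0
## Invoked from: /home/a
12,aaa,./f1
34,bbb,./f2'
out2='%%%% HASHDEEP-1.0
## Invoked from: /home/b
34,bbb,./f2
12,aaa,./f1'
# Strip header lines, sort the records, hash the result.
h1=$(printf '%s\n' "$out1" | grep -v '^[#%]' | sort | md5sum)
h2=$(printf '%s\n' "$out2" | grep -v '^[#%]' | sort | md5sum)
[ "$h1" = "$h2" ] && echo "header-independent"
```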
answered Apr 14 '16 at 13:34
Pavel Vlasov
178126
up vote
3
down vote
File contents only, excluding filenames
I needed a version that only checked the file contents, because the files reside in different directories.
This version (Warren Young's answer) helped a lot, but my version of md5sum outputs the filename (relative to the path I ran the command from), and the folder names were different, therefore even though the individual file checksums matched, the final checksum didn't.
To fix that, in my case, I just needed to strip off the filename from each line of the find output (select only the first word as separated by spaces using cut):
find -s somedir -type f -exec md5sum {} \; | cut -d" " -f1 | md5sum
You might need to sort the checksums as well to get a reproducible list.
– eckes, Mar 22 '16 at 21:34
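Combining the answer with eckes' comment, a content-only, order-independent variant might look like this (a sketch using plain GNU find, so the BSD-only -s flag isn't needed; the demo tree is created just for illustration):

```shell
mkdir -p demo/sub                       # small example tree
printf 'hello\n' > demo/a.txt
printf 'world\n' > demo/sub/b.txt
# Per-file content hashes only (names cut away), sorted, then hashed again.
find demo -type f -exec md5sum {} \; | cut -d' ' -f1 | sort | md5sum
```

Because the filenames are cut away, renaming or moving a file within the tree does not change the final checksum; only content changes do.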
edited Apr 13 '17 at 12:36
Community♦
1

answered May 11 '13 at 0:34
Nicole
1615
up vote
3
down vote
A good tree checksum is the tree-id of Git.
There is unfortunately no stand-alone tool available which can do that (at least I don't know of one), but if you have Git handy you can just pretend to set up a new repository and add the files you want to check to the index.
This allows you to produce the (reproducible) tree hash - which includes only content, file names and some reduced file modes (executable).
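For instance, something like this might work (a sketch assuming git is installed; the throwaway GIT_DIR keeps the hashed directory itself untouched, and the `somedir` tree is created just for the demo):

```shell
mkdir -p somedir && printf 'hi\n' > somedir/file.txt   # example tree
(
  cd somedir || exit 1
  export GIT_DIR="$(mktemp -d)" GIT_WORK_TREE="$PWD"
  git init -q                 # throwaway repository outside the tree
  git add -A                  # stage content, names and exec bits
  git write-tree              # prints the reproducible tree hash
  rm -rf "$GIT_DIR"
)
```

Two trees with identical content, names and modes produce the same 40-character tree hash, regardless of the directory's own name or location.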
answered Aug 11 '13 at 1:37
eckes
1477
up vote
2
down vote
I use this snippet of mine for moderate volumes:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat | md5sum -
and this one for XXXL:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 tail -qc100 | md5sum -
What does the -xdev flag do?
– czerasz, May 4 '17 at 6:35

It calls for you to type in: man find and read that fine manual ;)
– poige, May 4 '17 at 12:43

Good point :-). -xdev: Don't descend directories on other filesystems.
– czerasz, May 4 '17 at 16:31

Note that this ignores new, empty files (like if you touch a file).
– RonJohn, May 12 at 23:08

Thanks. I think I see how to fix it.
– poige, May 13 at 2:14
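A possible fix for RonJohn's empty-file case (my sketch, not poige's): hash the md5sum output lines, which include the file names, instead of the raw concatenated bytes. That way touching an empty file changes the final checksum:

```shell
mkdir -p demo && printf 'data\n' > demo/a.txt && touch demo/empty
# Hash the per-file "digest  name" lines rather than cat'ing the contents,
# so empty files (and renames) still influence the result.
( cd demo && find . -xdev -type f -print0 | LC_COLLATE=C sort -z \
    | xargs -0 md5sum | md5sum )
```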
answered Apr 10 '12 at 17:26
poige
3,8621541
up vote
2
down vote
solution:
$ pip install checksumdir
$ checksumdir -a md5 assets/js
981ac0bc890de594a9f2f40e00f13872
$ checksumdir -a sha1 assets/js
88cd20f115e31a1e1ae381f7291d0c8cd3b92fad
Works fast and is an easier solution than bash scripting.
see doc: https://pypi.python.org/pypi/checksumdir/1.0.5
if you don't have pip you may need to install it with yum -y install python-pip (or dnf/apt-get)
– DmitrySemenov, Mar 8 '16 at 2:55
answered Mar 8 '16 at 2:53
DmitrySemenov
23419
up vote
2
down vote
nix-hash from the Nix package manager:
The command nix-hash computes the cryptographic hash of the contents of each path and prints it on standard output. By default, it computes an MD5 hash, but other hash algorithms are available as well. The hash is printed in hexadecimal.
The hash is computed over a serialisation of each path: a dump of the file system tree rooted at the path. This allows directories and symlinks to be hashed as well as regular files. The dump is in the NAR format produced by nix-store --dump. Thus, nix-hash path yields the same cryptographic hash as nix-store --dump path | md5sum.
answered Jul 27 '16 at 16:48
Igor
1212
up vote
1
down vote
I didn't want new executables nor clunky solutions so here's my take:
#!/bin/bash
# md5dir.sh by Camilo Martin, 2014-10-01.
# Give this a parameter and it will calculate an md5 of the directory's contents.
# It only takes into account file contents and paths relative to the directory's root.
# This means that two dirs with different names and locations can hash equally.
if [[ ! -d "$1" ]]; then
    echo "Usage: md5dir.sh <dir_name>"
    exit 1
fi
d="$(tr '\\' / <<< "$1" | tr -s / | sed 's-/$--')"
c=$((${#d} + 35))
find "$d" -type f -exec md5sum {} \; | cut -c 1-33,$c- | sort | md5sum | cut -c 1-32
Hope it helps you :)
answered Oct 2 '14 at 2:13
Camilo Martin
36639
up vote
1
down vote
If you want a script that is well tested and supports a number of operations, including finding duplicates, doing comparisons on both data and metadata, and showing additions as well as changes and removals, you might like Fingerprint.
Fingerprint right now doesn't produce a single checksum for a directory, but a transcript file which includes checksums for all files in that directory.
fingerprint analyze
This will generate index.fingerprint
in the current directory which includes checksums, filenames and file sizes. By default it uses both MD5
and SHA1.256
.
In the future, I hope to add support for Merkle Trees into Fingerprint which will give you a single top-level checksum. Right now, you need to retain that file for doing verification.
answered Jul 7 '16 at 0:15
ioquatix
1113
up vote
0
down vote
A robust and clean approach
- First things first: don't hog the available memory! Hash a file in chunks rather than feeding in the entire file.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
- Hash only the entry name of all entries in the directory tree
- Hash the file contents of all entries (leaving out the meta like inode number, ctime, atime, mtime, size, etc., you get the idea)
- For a symbolic link, its content is the referent name. Hash it or choose to skip it
- Follow or don't follow (resolved name) the symlink while hashing the contents of the entry
- If it's a directory, its contents are just directory entries. While traversing recursively they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? Helpful in use cases where the hash is required to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file's name changing while the rest of the contents remain the same, and they are all fairly large files
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, char devices? Must they be hashed as well?
- Don't update the access time of any entry while traversing, because this will be a side effect and counter-productive (counter-intuitive?) for certain use cases.
This is what I have off the top of my head; anyone who has spent some time working on this practically would have caught other gotchas and corner cases.
Here's a tool (disclaimer: I'm a contributor to it), dtreetrawl, very light on memory, which addresses most cases. It might be a bit rough around the edges but has been quite helpful.
Usage:
dtreetrawl [OPTION...] "/trawl/me" [path2,...]
Help Options:
-h, --help Show help options
Application Options:
-t, --terse Produce a terse output; parsable.
-d, --delim=: Character or string delimiter/separator for terse output(default ':')
-l, --max-level=N Do not traverse tree beyond N level(s)
--hash Hash the files to produce checksums(default is MD5).
-c, --checksum=md5 Valid hashing algorithms: md5, sha1, sha256, sha512.
-s, --hash-symlink Include symbolic links' referent name while calculating the root checksum
-R, --only-root-hash Output only the root hash. Blank line if --hash is not set
-N, --no-name-hash Exclude path name while calculating the root checksum
-F, --no-content-hash Do not hash the contents of the file
An example of human-friendly output:
...
... //clipped
...
/home/lab/linux-4.14-rc8/CREDITS
Base name : CREDITS
Level : 1
Type : regular file
Referent name :
File size : 98443 bytes
I-node number : 290850
No. directory entries : 0
Permission (octal) : 0644
Link count : 1
Ownership : UID=0, GID=0
Preferred I/O block size : 4096 bytes
Blocks allocated : 200
Last status change : Tue, 21 Nov 17 21:28:18 +0530
Last file access : Thu, 28 Dec 17 00:53:27 +0530
Last file modification : Tue, 21 Nov 17 21:28:18 +0530
Hash : 9f0312d130016d103aa5fc9d16a2437e
Stats for /home/lab/linux-4.14-rc8:
Elapsed time : 1.305767 s
Start time : Sun, 07 Jan 18 03:42:39 +0530
Root hash : 434e93111ad6f9335bb4954bc8f4eca4
Hash type : md5
Depth : 8
Total,
size : 66850916 bytes
entries : 12484
directories : 763
regular files : 11715
symlinks : 6
block devices : 0
char devices : 0
sockets : 0
FIFOs/pipes : 0
edited Jan 7 at 13:50
answered Jan 7 at 11:27
six-k
13
General advice is always welcome but the best answers are specific and with code where appropriate. If you have experience of using the tool you refer to then please include it.
– bu5hman
Jan 7 at 11:54
@bu5hman Sure! I wasn't quite comfortable saying (gloating?) more about how well it works since I'm involved in its development.
– six-k
Jan 7 at 13:56
add a comment |
up vote
0
down vote
Doing it individually for all files in each directory:
# Calculating
find dir1 -type f -print0 | xargs -0 md5sum > dir1.md5
find dir2 -type f -print0 | xargs -0 md5sum > dir2.md5
# Comparing (and showing the difference)
paste <(sort -k2 dir1.md5) <(sort -k2 dir2.md5) | awk '$1 != $3'
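One caveat with paste: if the two trees don't contain exactly the same set of file names, the line-up silently misaligns. A sketch of a more robust comparison (assuming GNU find/sort and bash process substitution) diffs the two name-sorted checksum lists with the top-level directory stripped:

```shell
# diff exits non-zero when the lists differ, and its output shows which
# relative paths have differing checksums (or exist on only one side).
compare_trees() {
    diff <(cd "$1" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 md5sum) \
         <(cd "$2" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 md5sum)
}

# demo: file 'a' is identical in both trees, file 'b' differs
d1=$(mktemp -d); d2=$(mktemp -d)
echo same > "$d1/a"; echo same > "$d2/a"
echo old  > "$d1/b"; echo new  > "$d2/b"
out=$(compare_trees "$d1" "$d2" || true)
echo "$out"
```

Identical trees produce no output and exit status 0, so the function also works directly in an `if`.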
answered 10 mins ago
Leandro Lima
12
New contributor
add a comment |