Find duplicate files

Is it possible to find duplicate files on my disk which are bit-for-bit identical but have different file names?

files duplicate-files

asked Apr 4 '13 at 13:18 by student; edited Feb 28 at 1:48 by Jeff Schaller

  • 3





    Note that any possible method of doing this will invariably have to compare every single file on your system to every single other file. So this is going to take a long time, even when taking shortcuts.

    – Shadur
    Apr 4 '13 at 14:02






  • 4





    @Shadur if one is OK with checksums, it boils down to comparing just the hashes - which on most systems means on the order of 10^(5±1) entries, each under 64 bytes. Of course, you have to read the data at least once. :)

    – peterph
    Apr 4 '13 at 14:57







  • 15





    @Shadur That's not true. You can reduce the time by checking for matching st_sizes, eliminating any size that occurs only once, and then only calculating md5sums for files with matching st_sizes.

    – Chris Down
    Apr 4 '13 at 16:36






  • 6





    @Shadur even an incredibly silly approach disallowing any hash operations could do this in Θ(n log n) compares—not Θ(n²)—using any of several sort algorithms (based on file content).

    – derobert
    Apr 4 '13 at 17:09






  • 1

    @ChrisDown Yes, size matching would be one of the shortcuts I had in mind.

    – Shadur
    Apr 4 '13 at 19:38

7 Answers

101 votes

fdupes can do this. From man fdupes:




Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.
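
If you also want to remove the extra copies interactively, fdupes has a delete mode; a minimal sketch (~/photos is just an example path):

# Prompt for which copy in each duplicate set to keep and delete the rest;
# nothing is removed without confirmation.
fdupes -r -d ~/photos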



As asked in the comments, you can get the largest duplicates by doing the following:



fdupes -r . | {
  while IFS= read -r file; do
    [[ $file ]] && du "$file"
  done
} | sort -n


This will break if your filenames contain newlines.
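
If you'd rather see human-readable sizes (as asked in the comments below), the same loop can use du -h together with GNU sort -h, a small variation on the snippet above:

fdupes -r . | {
  while IFS= read -r file; do
    [[ $file ]] && du -h "$file"   # -h prints sizes like 4.0K, 12M, 1.1G
  done
} | sort -h                        # GNU sort -h orders those suffixed sizes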






answered Apr 4 '13 at 13:24 by Chris Down (edited Aug 14 '17 at 17:38 by genpfault)

  • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

    – student
    Apr 5 '13 at 9:31











  • @student: use something along the lines of (make sure fdupes just outputs the filenames with no extra information, or use cut or sed to keep just that): fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in human-readable format and only those with a size in megabytes or gigabytes. Change the command to suit the real outputs.

    – Olivier Dulac
    Apr 5 '13 at 12:27







  • 2





    @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

    – Chris Down
    Apr 5 '13 at 13:13











  • @student - Once you have the filenames, du piped to sort will tell you.

    – Chris Down
    Apr 5 '13 at 13:14











  • @ChrisDown: it's true it's a bad habit, and it can give false positives. But in that case (interactive use, for display only, with no "rm" or anything of the sort directly relying on it) it's fine and quick ^^. I love those pages you link to, btw (I've been reading them for a few months; they're full of useful info).

    – Olivier Dulac
    Apr 5 '13 at 14:05


23 votes

Another good tool is fslint:




fslint is a toolset to find various problems with filesystems,
including duplicate files and problematic filenames
etc.



Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
$PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory has a
--help option which further details its parameters.



 findup - find DUPlicate files



On Debian-based systems, you can install it with:



sudo apt-get install fslint
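
After installing, the command-line findup script can be run directly from the directory mentioned above; a minimal sketch (/home is just an example search path):

# List sets of duplicate files under /home using fslint's findup.
/usr/share/fslint/fslint/findup /home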



You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



find / -type f -exec md5sum {} \; > md5sums
gawk '{print $1}' md5sums | sort | uniq -d > dupes
while read d; do echo "---"; grep "$d" md5sums | cut -d ' ' -f 2-; done < dupes


Sample output (the file names in this example are the same, but it will also work when they are different):



$ while read d; do echo "---"; grep "$d" md5sums | cut -d ' ' -f 2-; done < dupes
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
/usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
/usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---


This will be much slower than the dedicated tools already mentioned, but it will work.
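
As the comments below point out, you can speed this up considerably by only hashing files that share their size with at least one other file. A minimal sketch of that optimisation, assuming GNU find, sort and uniq, and filenames without newlines:

# 1. List every file as "<size> <path>", sorted by size.
find / -type f -printf '%s %p\n' 2>/dev/null | sort -n > sizes

# 2. Keep only the sizes that occur more than once.
awk '{print $1}' sizes | uniq -d > dupe_sizes

# 3. Hash only files in those size groups, then group identical checksums.
while read -r s; do
    grep "^$s " sizes | cut -d ' ' -f 2-
done < dupe_sizes |
while IFS= read -r f; do md5sum "$f"; done |
sort | uniq -w32 --all-repeated=separate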






answered Apr 4 '13 at 16:00 by terdon (last edited Apr 4 '13 at 16:06)

  • 3





    It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

    – Chris Down
    Apr 4 '13 at 16:34












  • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

    – terdon
    Apr 4 '13 at 16:37


8 votes

Short answer: yes.



Longer version: have a look at the Wikipedia fdupes entry, it sports quite a nice list of ready-made solutions. Of course you can write your own; it's not that difficult - standard tools like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.
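
For instance, a minimal one-liner in that spirit (it trusts MD5 alone and assumes GNU uniq):

# Hash everything once, then print groups of files whose first 32
# characters (the MD5 digest) are identical.
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate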






answered Apr 4 '13 at 13:25 by peterph


    5 votes

    If you believe a hash function (here MD5) is collision-free on your domain:



    find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 \
    | cut --characters=35-


    Want identical files grouped together? Write a simple script not_uniq.sh to format the output:



    #!/bin/bash

    last_checksum=0
    while read -r line; do
        checksum=${line:0:32}
        filename=${line:34}
        if [ "$checksum" == "$last_checksum" ]; then
            if [ "${last_filename:-0}" != '0' ]; then
                echo "$last_filename"
                unset last_filename
            fi
            echo "$filename"
        else
            if [ "${last_filename:-0}" == '0' ]; then
                echo "======="
            fi
            last_filename=$filename
        fi

        last_checksum=$checksum
    done


    Then change the find command to use your script:



    chmod +x not_uniq.sh
    find $target -type f -exec md5sum '{}' + | sort | ./not_uniq.sh


    This is the basic idea (the sort step matters, because the script only compares each checksum with the one on the previous line). You should probably adjust the find command if your file names contain certain characters (e.g. spaces).






    answered Apr 13 '13 at 15:39 by xin (edited Feb 21 '17 at 18:15 by Wayne Werner)


      3 votes

      I'd like to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature-rich than fdupes (e.g. a size filter):



      jdupes . -rS -X size-:50m > myjdups.txt


      This will recursively find duplicated files bigger than 50 MB in the current directory and output the resulting list to myjdups.txt.



      Note that the output is not sorted by size; since this appears not to be built in, I have adapted @Chris_Down's answer above to achieve it:



      jdupes -r . -X size-:50m | {
        while IFS= read -r file; do
          [[ $file ]] && du "$file"
        done
      } | sort -n > myjdups_sorted.txt





      answered Nov 23 '17 at 17:27 by Sebastian Müller


        2 votes

        Wikipedia had an article (http://en.wikipedia.org/wiki/List_of_duplicate_file_finders) with a list of available open-source software for this task, but it has now been deleted.



        I will add that the GUI version of fslint is very interesting, allowing you to use a mask to select which files to delete. Very useful for cleaning up duplicated photos.



        On Linux you can use:



        - FSLint: http://www.pixelbeat.org/fslint/

        - FDupes: https://en.wikipedia.org/wiki/Fdupes

        - DupeGuru: https://www.hardcoded.net/dupeguru/


        The last two work on many systems (Windows, Mac and Linux); I have not checked FSLint.






        edited Jul 3 '17 at 10:09

        • 5





          It is better to provide actual information here and not just a link; the link might change and then the answer has no value left.

          – Anthon
          Jan 29 '14 at 11:22






        • 2





          Wikipedia page is empty.

          – ihor_dvoretskyi
          Sep 10 '15 at 9:01











        • yes, it has been cleaned, what a pity shake...

          – MordicusEtCubitus
          Dec 21 '15 at 16:23











        • I've edited it with these 3 tools

          – MordicusEtCubitus
          Dec 21 '15 at 16:30


        0 votes

        Here's my take on that:



        find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
          echo -n '.'
          if grep -q "$i" md5-partial.txt; then echo -e "\n$i ---- Already counted, skipping."; continue; fi
          MD5=`dd bs=1M count=1 if="$i" status=noxfer 2>/dev/null | md5sum`
          MD5=`echo $MD5 | cut -d' ' -f1`
          if grep "$MD5" md5-partial.txt; then echo -e "\n$i ---- Possible duplicate"; fi
          echo $MD5 $i >> md5-partial.txt
        done


        It's different in that it only hashes up to the first 1 MB of each file.

        This has a few issues / features:



        • There might be a difference after the first 1 MB, so the result is rather a candidate to check. I might fix that later.

        • Checking by file size first could speed this up.

        • Only takes files larger than 3 MB.

        I use it to compare video clips so this is enough for me.
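
        Since a matching 1 MB prefix only gives a candidate, a full byte-by-byte check can confirm it afterwards; a minimal sketch, where $file_a and $file_b stand for any candidate pair:

        # cmp exits with status 0 only when the two files are byte-for-byte identical.
        if cmp --silent "$file_a" "$file_b"; then
          echo "identical"
        else
          echo "different"
        fi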






          protected by Community Jan 14 '16 at 12:14



          Thank you for your interest in this question.
          Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



          Would you like to answer one of these unanswered questions instead?














          7 Answers
          7






          active

          oldest

          votes








          7 Answers
          7






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          101














          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | 
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          | sort -n


          This will break if your filenames contain newlines.






          share|improve this answer

























          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27







          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05















          101














          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | 
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          | sort -n


          This will break if your filenames contain newlines.






          share|improve this answer

























          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27







          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05













          101












          101








          101







          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | 
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          | sort -n


          This will break if your filenames contain newlines.






          share|improve this answer















          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | 
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          | sort -n


          This will break if your filenames contain newlines.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Aug 14 '17 at 17:38









          genpfault

          1457




          1457










          answered Apr 4 '13 at 13:24









          Chris DownChris Down

          81.4k15190204




          81.4k15190204












          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27







          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05

















          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27







          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05
















          Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

          – student
          Apr 5 '13 at 9:31





          Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

          – student
          Apr 5 '13 at 9:31













          @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

          – Olivier Dulac
          Apr 5 '13 at 12:27






          @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

          – Olivier Dulac
          Apr 5 '13 at 12:27





          2




          2





          @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

          – Chris Down
          Apr 5 '13 at 13:13





          @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

          – Chris Down
          Apr 5 '13 at 13:13













          @student - Once you have the filenames, du piped to sort will tell you.

          – Chris Down
          Apr 5 '13 at 13:14





          @student - Once you have the filenames, du piped to sort will tell you.

          – Chris Down
          Apr 5 '13 at 13:14













          @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

          – Olivier Dulac
          Apr 5 '13 at 14:05





          @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

          – Olivier Dulac
          Apr 5 '13 at 14:05













          23














          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



           findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint



          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum ; > md5sums
          gawk 'print $1' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.






          share|improve this answer




















          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34












          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37















          23














          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



           findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint



          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum ; > md5sums
          gawk 'print $1' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.






          share|improve this answer




















          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34












          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37













          23












          23








          23







          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



           findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint



          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum ; > md5sums
          gawk 'print $1' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.






          share|improve this answer















          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



           findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint



          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum ; > md5sums
          gawk 'print $1' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Apr 4 '13 at 16:06

























          answered Apr 4 '13 at 16:00









          terdonterdon

          133k32264444




          133k32264444







          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34












          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37












          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34












          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37







          3




          3





          It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

          – Chris Down
          Apr 4 '13 at 16:34






          It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

          – Chris Down
          Apr 4 '13 at 16:34














          @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

          – terdon
          Apr 4 '13 at 16:37





          @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

          – terdon
          Apr 4 '13 at 16:37











          8














          Short answer: yes.



          Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.






          share|improve this answer



























            8














            Short answer: yes.



            Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.






            share|improve this answer

























              8












              8








              8







              Short answer: yes.



              Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.






              share|improve this answer













              Short answer: yes.



              Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Apr 4 '13 at 13:25









              peterphpeterph

              23.8k24558




              23.8k24558





















                  5














                  If you believe a hash function (here MD5) is collision-free on your domain:



                  find $target -type f -exec md5sum '' + | sort | uniq --all-repeated --check-chars=32 
                  | cut --characters=35-


                  Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                  #!/bin/bash

                  last_checksum=0
                  while read line; do
                  checksum=$line:0:32
                  filename=$line:34
                  if [ $checksum == $last_checksum ]; then
                  if [ $last_filename:-0 != '0' ]; then
                  echo $last_filename
                  unset last_filename
                  fi
                  echo $filename
                  else
                  if [ $last_filename:-0 == '0' ]; then
                  echo "======="
                  fi
                  last_filename=$filename
                  fi

                  last_checksum=$checksum
                  done


                  Then change find command to use your script:



                  chmod +x not_uniq.sh
                  find $target -type f -exec md5sum '' + | sort | not_uniq.sh


                  This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)






                  share|improve this answer





























                    5














                    If you believe a hash function (here MD5) is collision-free on your domain:



                    find $target -type f -exec md5sum '' + | sort | uniq --all-repeated --check-chars=32 
                    | cut --characters=35-


                    Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                    #!/bin/bash

                    last_checksum=0
                    while read line; do
                    checksum=$line:0:32
                    filename=$line:34
                    if [ $checksum == $last_checksum ]; then
                    if [ $last_filename:-0 != '0' ]; then
                    echo $last_filename
                    unset last_filename
                    fi
                    echo $filename
                    else
                    if [ $last_filename:-0 == '0' ]; then
                    echo "======="
                    fi
                    last_filename=$filename
                    fi

                    last_checksum=$checksum
                    done


                    Then change find command to use your script:



                    chmod +x not_uniq.sh
                    find $target -type f -exec md5sum '' + | sort | not_uniq.sh


                    This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)






                    share|improve this answer



























                      5












                      5








                      5







                      If you believe a hash function (here MD5) is collision-free on your domain:



                      find $target -type f -exec md5sum '' + | sort | uniq --all-repeated --check-chars=32 
                      | cut --characters=35-


                      Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                      #!/bin/bash

                      last_checksum=0
                      while read line; do
                      checksum=$line:0:32
                      filename=$line:34
                      if [ $checksum == $last_checksum ]; then
                      if [ $last_filename:-0 != '0' ]; then
                      echo $last_filename
                      unset last_filename
                      fi
                      echo $filename
                      else
                      if [ $last_filename:-0 == '0' ]; then
                      echo "======="
                      fi
                      last_filename=$filename
                      fi

                      last_checksum=$checksum
                      done


                      Then change find command to use your script:



                      chmod +x not_uniq.sh
                      find $target -type f -exec md5sum '' + | sort | not_uniq.sh


                      This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)






                      share|improve this answer















                      If you believe a hash function (here MD5) is collision-free on your domain:



                      find $target -type f -exec md5sum '' + | sort | uniq --all-repeated --check-chars=32 
                      | cut --characters=35-


                      Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                      #!/bin/bash

                      last_checksum=0
                      while read line; do
                      checksum=$line:0:32
                      filename=$line:34
                      if [ $checksum == $last_checksum ]; then
                      if [ $last_filename:-0 != '0' ]; then
                      echo $last_filename
                      unset last_filename
                      fi
                      echo $filename
                      else
                      if [ $last_filename:-0 == '0' ]; then
                      echo "======="
                      fi
                      last_filename=$filename
                      fi

                      last_checksum=$checksum
                      done


                      Then change find command to use your script:



                      chmod +x not_uniq.sh
                      find $target -type f -exec md5sum '' + | sort | not_uniq.sh


                      This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Feb 21 '17 at 18:15









                      Wayne Werner

                      6,34851936




                      6,34851936










                      answered Apr 13 '13 at 15:39









                      xinxin

                      29929




                      29929





















                          3














                          I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                          jdupes . -rS -X size-:50m > myjdups.txt


                          This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                          Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                          jdupes -r . -X size-:50m | 
                          while IFS= read -r file; do
                          [[ $file ]] && du "$file"
                          done
                          | sort -n > myjdups_sorted.txt





                          share|improve this answer



























                            3














                            I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                            jdupes . -rS -X size-:50m > myjdups.txt


                            This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                            Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                            jdupes -r . -X size-:50m | 
                            while IFS= read -r file; do
                            [[ $file ]] && du "$file"
                            done
                            | sort -n > myjdups_sorted.txt





                            share|improve this answer

























                              3












                              3








                              3







                              I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                              jdupes . -rS -X size-:50m > myjdups.txt


                              This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                              Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                              jdupes -r . -X size-:50m | 
                              while IFS= read -r file; do
                              [[ $file ]] && du "$file"
                              done
                              | sort -n > myjdups_sorted.txt





                              share|improve this answer













                              I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                              jdupes . -rS -X size-:50m > myjdups.txt


                              This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                              Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                              jdupes -r . -X size-:50m | 
                              while IFS= read -r file; do
                              [[ $file ]] && du "$file"
                              done
                              | sort -n > myjdups_sorted.txt






                              share|improve this answer












                              share|improve this answer



                              share|improve this answer










                              answered Nov 23 '17 at 17:27









                              Sebastian MüllerSebastian Müller

                              1814




                              1814





















                                  2














                                  Wikipedia had an article (http://en.wikipedia.org/wiki/List_of_duplicate_file_finders), with a list of available open source software for this task, but it's now been deleted.



                                  I will add that the GUI version of fslint is very interesting, allowing to use mask to select which files to delete. Very useful to clean duplicated photos.



                                  On Linux you can use:



                                  - FSLint: http://www.pixelbeat.org/fslint/

                                  - FDupes: https://en.wikipedia.org/wiki/Fdupes

                                  - DupeGuru: https://www.hardcoded.net/dupeguru/


                                  The 2 last work on many systems (windows, mac and linux) I 've not checked for FSLint






                                  share|improve this answer




















                                  • 5





                                    It is better to provide actual information here and not just a link, the link might change and then the answer has no value left

                                    – Anthon
                                    Jan 29 '14 at 11:22






                                  • 2





                                    Wikipedia page is empty.

                                    – ihor_dvoretskyi
                                    Sep 10 '15 at 9:01











                                  • yes, it has been cleaned, what a pity shake...

                                    – MordicusEtCubitus
                                    Dec 21 '15 at 16:23











                                  • I've edited it with these 3 tools

                                    – MordicusEtCubitus
                                    Dec 21 '15 at 16:30















Here's my take on that:

find . -type f -size +3M -print0 | while IFS= read -r -d '' i; do
    echo -n '.'
    # skip files that have already been hashed
    if grep -qF "$i" md5-partial.txt; then echo -e "\n$i ---- Already counted, skipping."; continue; fi
    # hash only the first 1 MB of the file
    MD5=$(dd bs=1M count=1 if="$i" status=noxfer | md5sum)
    MD5=$(echo "$MD5" | cut -d' ' -f1)
    if grep "$MD5" md5-partial.txt; then echo -e "\n$i ---- Possible duplicate"; fi
    echo "$MD5 $i" >> md5-partial.txt
done


It differs in that it only hashes up to the first 1 MB of each file.

This has a few issues / features:

• Files might differ after the first 1 MB, so the result is rather a list of candidates to check. I might fix that later.

• Checking by file size first could speed this up (see the sketch below).

• It only takes files larger than 3 MB.

I use it to compare video clips, so this is enough for me.
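Building on the "checking by file size first" point above, here is a rough sketch of that refinement (untested, assuming GNU find/coreutils and file names without tabs or newlines): only files whose size occurs more than once get their first 1 MiB hashed.

# 1) list files over 3 MB as "size<TAB>path"
# 2) keep only files whose size occurs more than once
# 3) hash just the first 1 MiB of those candidates and group equal hashes
find . -type f -size +3M -printf '%s\t%p\n' |
awk -F'\t' '{ n[$1]++; files[$1] = files[$1] $2 "\n" }
            END { for (s in n) if (n[s] > 1) printf "%s", files[s] }' |
while IFS= read -r f; do
    printf '%s  %s\n' "$(head -c 1M "$f" | md5sum | cut -d" " -f1)" "$f"
done |
sort | uniq -w32 --all-repeated=separate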






answered Jun 2 '17 at 1:50 – Ondra Žižka