Help with script/rsync command to move file with md5 sum comparison before deleting the source file [closed]























Referencing this post to find and delete duplicate files based on checksum, I would like to modify the approach to perform a copy operation followed by a file integrity check on the destination file.



SOURCE = /path/to/Source
DEST = /path/to/Destination
# filecksums containing the md5 of the copied files
declare -A filecksums

for file in "$@"
do
[[ -f "$file" ]] || continue

# Generate the checksum
cksum=$(cksum <"$file" | tr ' ' _)

# Can an exact duplicate be found in the destination directory?
if [[ -n "$filecksums[$cksum]" ]] && [[ "$filecksums[$cksum]" != "$file" ]]
then
rm -f "$file"
else
echo " '$file' is not in '$DEST'" >&2
fi
done


I want to use the result of the md5 checksum comparison to allow rm -f of the source file only if the checksums are equivalent. If there is a difference, I want to echo the result and exit. rsync might be another option, but I think I would have problems forcing a checksum comparison for a local-to-local transfer.
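A minimal sketch of that copy/verify/delete flow, assuming a hypothetical helper name and example paths (not the asker's actual directories):

```shell
#!/bin/bash
# Hypothetical sketch: copy one file into a destination directory,
# verify the copy's md5 sum, and delete the source only on a match.
move_verified() {
    local file=$1 dest_dir=$2
    local base src_sum dest_sum
    base=$(basename -- "$file")
    cp -- "$file" "$dest_dir/$base" || return 1
    src_sum=$(md5sum < "$file")
    dest_sum=$(md5sum < "$dest_dir/$base")
    if [[ $src_sum == "$dest_sum" ]]; then
        rm -f -- "$file"                       # checksums match: safe to drop the source
    else
        echo "checksum mismatch for '$file'" >&2
        return 1                               # keep the source and report the difference
    fi
}
```

A loop over the source directory would then call it per file, e.g. `for f in "$SOURCE"/*; do move_verified "$f" "$DEST" || break; done`.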



UPDATE
I have looked into using rsync per @Lucas's answer. It appears that there are options to transfer files more stably, with checks, rather than doing a bulk mv /data1/* /data2/, and to report what was done and delete the sources after a check. This might narrow the scope of the question as requested by community members.




















closed as too broad by Rui F Ribeiro, G-Man, DopeGhoti, Kusalananda, slm♦ Jul 10 at 20:21


Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.














  • This is not a script-writing forum. Please tell us what you have done so far and how we might assist you in improving your Unix knowledge.
    – Rui F Ribeiro
    Jul 10 at 18:27











  • Using rsync with --checksum would force checksumming even for local transfers. This is not an integrity check, though, but a way for rsync to figure out what has changed between source and target files, and since you're planning to delete the source files after the transfer, this would be kinda useless.
    – Kusalananda
    Jul 10 at 18:34










  • That script can't be running correctly - the first two lines are wrong. I'd also strongly recommend getting into the habit of starting every script with a #! line to define the interpreter. In your case here I think #!/bin/bash could be appropriate.
    – roaima
    Jul 10 at 22:06














asked Jul 10 at 18:25 by brawny84 (85), edited Jul 10 at 21:08
1 Answer























accepted










Implementing something like this might be hard on a first try if you care about the files and don't want to mess up. So here are some alternatives to writing a full script in bash: more or less complex command lines (one-liners) that might help in your situation.



There is one ambiguity in your question: do you want to compare each file in source with every file in dest, or only those with "matching" file names? (That is, compare /path/to/src/a with /path/to/dest/a and /path/to/src/b with /path/to/dest/b, but not /path/to/src/a with /path/to/dest/b, and so on.)



I will assume that you only want to compare files with matching paths!!



first idea: diff



The good old diff can compare directories recursively. Add the -q option to see only which files differ, not how they differ.



diff -r -q /path/to/source /path/to/dest


cons



  • This can take a long time depending on the size of your hard disk.

  • This doesn't delete the old files.

  • The output is not easily parseable

pros



  • This doesn't delete any files :)

So after you have manually/visually confirmed that there are no differences in any files you care about, you have to delete the source yourself with rm -rf /path/to/source.
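If the whole tree should only ever be deleted when nothing differs, the manual check can be collapsed into one guarded step. This is a sketch with hypothetical example paths; diff exits 0 only when no file differs, so the delete is skipped otherwise:

```shell
# Hypothetical paths: delete the source tree only when a recursive,
# quiet diff finds no differences at all.
src=/path/to/source
dest=/path/to/dest
if diff -r -q "$src" "$dest"; then
    rm -rf "$src"
else
    echo "trees differ (or diff failed); keeping $src" >&2
fi
```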



second idea: rsync (edit: this might be the best now)



rsync is the master of all copying command line tools (in my opinion ;). As mentioned in the comments to your question, it has a --checksum option, but it has a boatload of other options as well. It can transfer files from local to remote, from remote to local, and from local to local. One of its most important features, in my opinion, is that with the right options you can abort and restart the command (execute the same command line again) and it will continue where it left off!



For your purpose the following options can be interesting:




  • -v: verbose; show what happens. Can be given several times, but normally once is enough


  • -n: dry run; very important for testing, changes nothing (combine with -v)!!


  • -c: use checksum to decide what should be copied


  • --remove-source-files: removes files that were successfully transferred (pointed out by @brawny84; I did not know it and did not find it in the man page on my first read)

So this command will overwrite all files in dest that have a checksum different from the corresponding file in source (corresponding by name).



rsync -a -c -v --remove-source-files -n /path/to/source/ /path/to/dest   # dry run first
rsync -a -c -v --remove-source-files /path/to/source/ /path/to/dest      # real run
(Note the trailing slash on source/: it makes rsync copy the contents of source into dest instead of creating dest/source, so files really correspond by name.)


pros



  • works with checksums

  • has a dry run mode

  • will actually copy all missing files and files that differ from source to dest

  • can be aborted and restarted

  • has an exclude option to ignore some files in src if you don't want to copy all files

  • can delete transferred source files

cons



  • ??

third idea: fdupes



The program fdupes is designed to list duplicate files. It compares md5 sums by default.



pros



  • it uses md5 to compare files

  • it has a --delete option to delete one of the duplicates

cons



  • it compares each file to every other file, so if there are duplicate files inside dest itself, it will also list them

  • delete mode seems to be interactive; you have to confirm each set of equal files, which might not be feasible for large directory trees

  • the non-interactive mode deletes all but the first file of each set of equal files, but I have no idea which file counts as the first (the one in source or in dest?)

last idea: go through the pain of actually writing and debugging your own shell script



I would start with something like this if it has to be done manually. I did not test this; try it with the ls first, and try to figure out whether it will break something!!



#!/bin/bash
# first require that the source and dest dirs
# are given as arguments to the script.
src="${1:?Please give the source dir as first argument}"
dest="${2:?Please give the destination dir as second argument}"
# go to the source directory (abort if that fails)
cd "$src" || exit 1
# This assumes that there are no newlines (or spaces) in filenames!
# first find all plain files in the current dir
# (which should be $src)
# then use xargs to hand the filenames to md5sum
# pipe the md5 sums into a subshell
# go to the dest in the subshell
# read the md5sums from stdin and use md5sum -c to check them
# After the subshell, filter lines to keep only those ending in ": OK"
# and at the same time remove the ": OK" suffix after the file name
# use xargs to hand these file names to ls or rm.
find . -type f |
    xargs md5sum |
    ( cd "$dest" && md5sum -c ) |
    sed -n 's/: OK$//p' |
    xargs ls


The ls in the last line is to list all files that passed the check. If you replace it with rm they are removed from the source dir (the current dir after the cd "$src").
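Since the pipeline above breaks on file names containing whitespace (plain xargs splits on it), a null-delimited variant is safer. This is a sketch under the same idea, wrapped in a hypothetical function; it assumes GNU find/xargs/md5sum and that the dest argument is an absolute path:

```shell
#!/bin/bash
# Hypothetical helper: list files under $1 whose copies in $2 have
# matching md5 sums. Swap the final `ls` for `rm` to delete the
# verified source files. Assumes GNU find/xargs/md5sum.
verified_in_dest() {
    local src=$1 dest=$2          # dest should be an absolute path
    ( cd "$src" || exit 1
      find . -type f -print0 |
          xargs -0 md5sum |               # checksum every file under src
          ( cd "$dest" && md5sum -c ) |   # re-check those sums inside dest
          sed -n 's/: OK$//p' |           # keep only names whose check passed
          tr '\n' '\0' |                  # back to NUL delimiters
          xargs -0 -r ls )                # -r: do nothing if no file passed
}
```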





























  • Thank you! The assumption you made is correct - I only want to compare /path/to/source/a with /path/to/dest/a and so on...
    – brawny84
    Jul 10 at 20:26










    For rsync, it looks like there is a --remove-source-files flag as well. If you think it's correct, maybe you could add this information to your answer? It's already very thorough and I greatly appreciate the time you took to answer my overly-broad question! This might be how I proceed: rsync --remove-source-files --log-file=transfers.log -nv /path/to/src/ /path/to/dest/ where -n is dry-run first and -v is verbose output per usual.
    – brawny84
    Jul 10 at 21:00

















# use xargs to hand these file names to ls or rm.
find . -type f |
xargs md5sum |
( cd "$dest" && md5sum -c ) |
sed -n 's/: OK$//p' |
xargs ls


The ls in the last line is to list all files that passed the check. If you replace it with rm they are removed from the source dir (the current dir after the cd "$src").






share|improve this answer















Implementing something like this might be hard as a first try if you care about the files and don't want to mess up. So here are some alternatives to writing a full script in bash. These are more or less complex command lines (one-liners) that might help in your situation.



There is one uncertainty in your question: do you want to compare each file in source with every file in dest, or only files with matching names? (That is, compare /path/to/src/a with /path/to/dest/a and /path/to/src/b with /path/to/dest/b, but not /path/to/src/a with /path/to/dest/b, and so on.)



I will assume that you only want to compare files with matching paths!!



first idea: diff



The good old diff can compare directories recursively. Use the -q option to report only which files differ, not how they differ.



diff -r -q /path/to/source /path/to/dest


cons



  • This can take a long time depending on the size of your hard disk.

  • This doesn't delete the old files.

  • The output is not easily parseable

pros



  • This doesn't delete any files :)

So after you have manually/visually confirmed that there are no differences in any files you care about, you still have to delete the source yourself with rm -rf /path/to/source.



second idea: rsync (edit: this might be the best now)



rsync is the master of all copying command line tools (in my opinion ;). As mentioned in the comments to your question, it has a --checksum option, but it has a boatload of other options as well. It can transfer files from local to remote, from remote to local, and from local to local. One of its most important features, in my opinion, is that if you give the correct options you can abort and restart the command (execute the same command line again) and it will continue where it left off!



For your purpose the following options can be interesting:




  • -v: verbose; shows what happens. Can be given several times, but normally once is enough


  • -n: dry run; very important for testing, since nothing is actually changed (combine with -v)!!


  • -c: use checksum to decide what should be copied


  • --remove-source-files: removes files that were successfully transferred (pointed out by @brawny84; I did not know it and did not find it in the man page on my first read)

So this command will overwrite all files in dest that have a different checksum than the corresponding file in source (corresponding by name):



# dry run first to see what would happen, then the real thing.
# Note the trailing slash on source/: it makes rsync copy the
# directory's contents, so names in source and dest correspond directly.
rsync -a -c -v --remove-source-files -n /path/to/source/ /path/to/dest/
rsync -a -c -v --remove-source-files /path/to/source/ /path/to/dest/


pros



  • works with checksums

  • has a dry run mode

  • will actually copy all missing files and files that differ from source to dest

  • can be aborted and restarted

  • has an exclude option to ignore some files in src if you don't want to copy all files

  • can delete transferred source files

cons



  • ??

third idea: fdupes



The program fdupes is designed to list duplicate files. It checks md5sums by default.



pros



  • it uses md5 to compare files

  • it has a --delete option to delete one of the duplicates

cons



  • it compares each file to every other file so if there are duplicate files inside dest itself it will also list them

  • delete mode seems to be interactive: you have to confirm every set of equal files, which might not be feasible for large directory trees

  • the non-interactive mode will delete all but the first file from each set of equal files, but I have no idea which file counts as the first (the one in source or in dest?)
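
A quick, non-destructive way to see what fdupes would consider duplicates is to run it without --delete first. This sketch uses throwaway directories with made-up names, and simply skips if fdupes is not installed:

```shell
#!/bin/bash
# Demo: list duplicates across two throwaway directories with fdupes.
command -v fdupes >/dev/null 2>&1 || { echo "fdupes not installed"; exit 0; }
tmp=$(mktemp -d)
mkdir -p "$tmp/source" "$tmp/dest"
echo "same content" > "$tmp/source/a.txt"
cp "$tmp/source/a.txt" "$tmp/dest/a.txt"
# -r recurses into both trees; without --delete this only
# prints the sets of duplicate files, deleting nothing.
fdupes -r "$tmp/source" "$tmp/dest"
rm -rf "$tmp"
```

Only once the listed sets look right would you consider rerunning with --delete, given the interactivity caveats above.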

last idea: go through the pain of actually writing and debugging your own shell script



I would start with something like this if it has to be done manually. I did not test it; try it with the ls first and figure out whether it will break something!!



#!/bin/bash
# first require that the source and dest dirs
# are given as arguments to the script.
src=${1:?Please give the source dir as first argument}
dest=${2:?Please give the destination dir as second argument}
# go to the source directory (abort if that fails)
cd "$src" || exit 1
# This assumes that there are no newlines in filenames!
# first find all plain files in the current dir
# (which should be $src)
# then use xargs to hand the filenames to md5sum
# (-d '\n' so that spaces in filenames don't break the list)
# pipe the md5 sums into a subshell
# go to the dest in the subshell
# read the md5sums from stdin and use md5sum -c to check them
# After the subshell filter lines to only keep those that end in "OK"
# and at the same time remove the "OK" stuff after the file name
# use xargs to hand these file names to ls or rm.
find . -type f |
xargs -d '\n' md5sum |
( cd "$dest" && md5sum -c ) |
sed -n 's/: OK$//p' |
xargs -d '\n' ls


The ls in the last line is to list all files that passed the check. If you replace it with rm they are removed from the source dir (the current dir after the cd "$src").
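
Before pointing the pipeline at real data, you can exercise it on throwaway directories. In this demo (file names made up) one file matches between source and dest and one differs, so only the matching one should survive the "OK" filter and be listed:

```shell
#!/bin/bash
# Build two throwaway trees: one file identical, one differing.
tmp=$(mktemp -d)
src="$tmp/source"; dest="$tmp/dest"
mkdir -p "$src" "$dest"
echo "identical" > "$src/same.txt"; cp "$src/same.txt" "$dest/same.txt"
echo "old" > "$src/diff.txt"; echo "new" > "$dest/diff.txt"

cd "$src" || exit 1
# The same pipeline, ending in ls: files whose md5sum matches in
# dest pass the "OK" filter; the differing one is filtered out.
# (md5sum -c's mismatch warning on stderr is silenced for the demo.)
find . -type f |
xargs -d '\n' md5sum |
( cd "$dest" && md5sum -c 2>/dev/null ) |
sed -n 's/: OK$//p' |
xargs -d '\n' ls
cd / && rm -rf "$tmp"
```

Only ./same.txt should be printed; ./diff.txt fails the checksum check and is never handed to ls (or, in the destructive version, to rm).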







edited Jul 11 at 5:15
answered Jul 10 at 20:05
Lucas
  • Thank you! The assumption you made is correct - I only want to compare /path/to/source/a with /path/to/dest/a and so on...
    – brawny84
    Jul 10 at 20:26






  • For rsync, it looks like there is a --remove-source-files flag as well. If you think it's correct, maybe you could add this information to your answer? It's already very thorough and I greatly appreciate the time you took to answer my overly-broad question! This might be how I proceed: rsync --remove-source-files --log-file=transfers.log -nv /path/to/src/ /path/to/dest/ where -n is dry-run first and -v is verbose output per usual.
    – brawny84
    Jul 10 at 21:00