Help with script/rsync command to move files with md5 sum comparison before deleting the source file [closed]
Referencing this post to find and delete duplicate files based on checksum, I would like to modify the approach to perform a copy operation followed by a file integrity check on the destination file.
SOURCE = /path/to/Source
DEST = /path/to/Destination
# filecksums containing the md5 of the copied files
declare -A filecksums
for file in "$@"
do
[[ -f "$file" ]] || continue
# Generate the checksum
cksum=$(cksum <"$file" | tr ' ' _)
# Can an exact duplicate be found in the destination directory?
if [[ -n "$filecksums[$cksum]" ]] && [[ "$filecksums[$cksum]" != "$file" ]]
then
rm -f "$file"
else
echo " '$file' is not in '$DEST'" >&2
fi
done
I want to use the result of the md5 checksum comparison to allow rm -f of the source file only if the checksums are equivalent. If there is a difference, I want to echo the result and exit. rsync might be another option, but I think I would have problems forcing a checksum comparison for a local-to-local file transfer.
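A minimal sketch of the copy-verify-delete loop described above (the function name copy_verify_rm and all paths are mine, not from the referenced post; this compares md5 sums of source and copy before removing the source):

```shell
#!/bin/bash
# Hedged sketch: copy each regular file from $1 into $2, compare the md5
# of the copy against the source, and delete the source only on a match.
copy_verify_rm() {
    local src=$1 dest=$2 file name src_sum dst_sum
    for file in "$src"/*; do
        [ -f "$file" ] || continue
        name=${file##*/}
        cp -- "$file" "$dest/$name" || return 1
        # md5sum reads stdin here, so the output is "<hash>  -" for both
        src_sum=$(md5sum < "$file") || return 1
        dst_sum=$(md5sum < "$dest/$name") || return 1
        if [ "$src_sum" = "$dst_sum" ]; then
            rm -f -- "$file"
        else
            echo "checksum mismatch, keeping '$file'" >&2
            return 1
        fi
    done
}

# hypothetical usage:
# copy_verify_rm /path/to/Source /path/to/Destination
```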
UPDATE
I have looked into using rsync per @Lucas's answer. It appears that there are options to transfer files more robustly, with checks, rather than doing a bulk mv /data1/* /data2/, and to report what was done and delete after a check. This might narrow the question as indicated by community members.
shell-script rsync backup file-copy checksum
closed as too broad by Rui F Ribeiro, G-Man, DopeGhoti, Kusalananda, slm… Jul 10 at 20:21
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
This is not a script-writing forum. Please tell us what you have done until now and how we might assist you in improving your Unix knowledge.
– Rui F Ribeiro
Jul 10 at 18:27
Using rsync with --checksum would force checksumming even for local transfers. This is not an integrity check, though, but a way for rsync to figure out what has changed between source and target files, and since you're planning to delete the source files after transfer, this would be kinda useless.
– Kusalananda
Jul 10 at 18:34
That script can't be running correctly - the first two lines are wrong. I'd also strongly recommend getting into the habit of starting every script with a #! line to define the interpreter. In your case here I think #!/bin/bash could be appropriate.
– roaima
Jul 10 at 22:06
edited Jul 10 at 21:08
asked Jul 10 at 18:25
brawny84
85
1 Answer
accepted
Implementing something like this might be hard as a first try if you care about the files and don't want to mess up. So here are some alternatives to writing a full script in bash. These are more or less complex command lines (one-liners) that might help in your situation.
There is one uncertainty in your question: do you want to compare each file in source with every file in dest, or only those with "matching" file names? (That would be comparing /path/to/src/a with /path/to/dest/a and /path/to/src/b with /path/to/dest/b, but not /path/to/src/a with /path/to/dest/b, and so on.)
I will assume that you only want to compare files with matching paths!!
first idea: diff
The good old diff can compare directories recursively. Use the -q option as well, to just see which files differ and not how they differ.
diff -r -q /path/to/source /path/to/dest
cons
- This can take a long time depending on the size of your hard disk.
- This doesn't delete the old files.
- The output is not easily parsable
pros
- This doesn't delete any files :)
So after you manually/visually confirmed that there are no differences in any files you care about, you have to manually delete the source with rm -rf /path/to/source.
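diff's exit status can drive the deletion decision instead of a visual check. A sketch under my own naming (trees_identical is not a standard command):

```shell
#!/bin/bash
# Hedged sketch: succeed (exit 0) only when no file in either tree
# differs or is missing on the other side.
trees_identical() {
    diff -r -q "$1" "$2" > /dev/null
}

# hypothetical usage:
# trees_identical /path/to/source /path/to/dest && rm -rf /path/to/source
```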
second idea: rsync
(edit: this might be the best now)
rsync is the master of all copying command line tools (in my opinion ;). As mentioned in the comments to your question, it has a --checksum option, but it has a boatload of other options as well. It can transfer files from local to remote, from remote to local, and from local to local. One of the most important features in my opinion is that if you give the correct options, you can abort and restart the command (execute the same command line again) and it will continue where it left off!
For your purpose the following options can be interesting:
- -v : verbose, show what happens; can be given several times, but normally one is enough
- -n : dry run, very important to test stuff without actually doing anything (combine with -v)!!
- -c : use checksums to decide what should be copied
- --remove-source-files : removes files that were successfully transferred (pointed out by @brawny84; I did not know it and did not find it in the man page on my first read)
So this command will overwrite all files in dest which have a different checksum than the corresponding file in source (corresponding by name):
rsync -a -c -v --remove-source-files -n /path/to/source /path/to/dest   # dry run first
rsync -a -c -v --remove-source-files /path/to/source /path/to/dest      # then the real transfer
pros
- works with checksums
- has a dry run mode
- will actually copy all missing files and files that differ from source to dest
- can be aborted and restarted
- has an exclude option to ignore some files in src if you don't want to copy all files
- can delete transferred source files
cons
- ??
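The dry-run-then-transfer pattern above can be wrapped in a small helper; checked_move is my name for it, and whether the dry run's success is a sufficient gate is an assumption:

```shell
#!/bin/bash
# Hedged sketch: preview the transfer with -n, and only if that succeeds,
# run the real transfer that removes each source file after rsync has
# verified the copy it just made.
checked_move() {
    rsync -a -c -v -n "$1" "$2" &&
    rsync -a -c --remove-source-files "$1" "$2"
}

# hypothetical usage (trailing slash on the source copies its contents):
# checked_move /path/to/source/ /path/to/dest/
```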
third idea: fdupes
The program fdupes is designed to list duplicate files. It checks the md5sums by default.
pros
- it uses md5 to compare files
- it has a --delete option to delete one of the duplicates
cons
- it compares each file to every other file so if there are duplicate files inside dest itself it will also list them
- delete mode seems to be interactive; you have to confirm every set of equal files, which might not be feasible for large directory trees
- the non interactive mode will delete all but the first file from each set of equal files. But I have no idea which the first file is (in source or in dest?)
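A rough fdupes-like duplicate listing can also be had with standard tools (GNU uniq is assumed for -w/--all-repeated; list_dupes is my name, not part of fdupes):

```shell
#!/bin/bash
# Hedged sketch: group files from both trees by md5 and print every file
# whose digest occurs more than once. md5sum lines start with a 32-char
# hex digest, so comparing only the first 32 characters groups by content.
list_dupes() {
    find "$1" "$2" -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated
}

# hypothetical usage:
# list_dupes /path/to/source /path/to/dest
```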
last idea: go through the pain of actually writing and debugging your own shell script
I would start with something like this if it has to be done manually. I did not test this; try it with the ls first and try to figure out if it will break something!!
#!/bin/bash
# first require that the source and dest dirs
# are given as arguments to the script.
src=${1:?Please give the source dir as first argument}
dest=${2:?Please give the destination dir as second argument}
# go to the source directory
cd "$src" || exit 1
# This assumes that there are no newlines in filenames!
# first find all plain files in the current dir
# (which should be $src)
# then use xargs to hand the filenames to md5sum
# (-d '\n' so names with spaces survive)
# pipe the md5 sums into a subshell
# go to the dest in the subshell
# read the md5sums from stdin and use md5sum -c to check them
# After the subshell, filter lines to only keep those that end in "OK"
# and at the same time remove the "OK" stuff after the file name
# use xargs to hand these file names to ls or rm.
find . -type f |
xargs -d '\n' md5sum |
( cd "$dest" && md5sum -c ) |
sed -n 's/: OK$//p' |
xargs -d '\n' ls
The ls in the last line is to list all files that passed the check. If you replace it with rm, they are removed from the source dir (the current dir after the cd "$src").
Thank you! The assumption you made is correct - I only want to compare /path/to/source/a with /path/to/dest/a and so on...
– brawny84
Jul 10 at 20:26
For rsync, it looks like there is a --remove-source-files flag as well. If you think it's correct, maybe you could add this information to your answer? It's already very thorough and I greatly appreciate the time you took to answer my overly-broad question! This might be how I proceed: rsync --remove-source-files --log-file=transfers.log -nv /path/to/src/ /path/to/dest/ where -n is dry-run first and -v is verbose output per usual.
– brawny84
Jul 10 at 21:00
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
Implementing something like this might be hard as a first try if you care about the files and don't want to mess up. So here are some alternatives to writing a full script in bash. These are more or less complex command lines (oneliners) that might help in your situation.
There is one uncertainty in your question: do you want to compare each file in source with every file in dest or only those with "matching" file names? (That would be compare /path/to/src/a
with /path/to/dest/a
and /path/to/src/b
with /path/to/dest/b
but not /path/to/src/a
with /path/to/dest/b
and so on)
I will assume that you only want to compare files with matching paths!!
first idea: diff
The good old diff
can compare directories recursively. Also use the -q
option to just see which files differ and not how they differ.
diff -r -q /path/to/source /path/to/dest
cons
- This can take a long time depending on the size of your hard disk.
- This doesn't delete the old files.
- The output i not easily parseable
pros
- This doesn't delete any files :)
So after you manually/visually confirmed that there are no differences in any files you care about you have to manually delete the source with rm -rf /path/to/source
.
second idea: rsync
(edit: this might be the best now)
rsync
is the master of all copying command line tools (in my opinion ;). As mentioned in the comments to your question it has a --checksum
option but it has a bulkload of other options as well. It can transfer files from local to remote from remote to local and from local to local. One of the most important features in my opinion is that if you give the correct options you can abort and restart the command (execute the same command line again) and it will continue where it left of!
For your purpose the following options can be interesting:
-v
: verbose, show what happens can be given several times but normally one is enough-n
: dry run, very important to test stuff but don't do anything (combine with-v
)!!-c
: use checksum to decide what should be copied--remove-source-files
: removes files that where successfully transfered (pointed out by @brawny84, I did not know it and did not find it in the man page on my first read)
So this command will overwrite all files in dest
which have a different checksum than the corresponding file in source
(corresponding by name).
rsync -a -c -v --remove-source-files -n /path/to/source /path/to/dest
rsync -a -c -v --remove-source-files /path/to/source /path/to/dest
pros
- works with checksums
- has a dry run mode
- will actually copy all missing files and files that differ from source to dest
- can be aborted and restarted
- has an exclude option to ignore some files in src if you don't want to copy all files
- can delete transferred source files
cons
- ??
third idea: fdupes
The program fdupes
I designed to list duplicate files. It checks the md5sums by default.
pros
- it uses md5 to compare files
- it has a
--delete
option to delete one of the duplicates
cons
- it compares each file to every other file so if there are duplicate files inside dest itself it will also list them
- delete mode seems to be interactive, you have to confirm for every set of equal files, that might not be feasible for large directory trees
- the non interactive mode will delete all but the first file from each set of equal files. But I have no idea which the first file is (in source or in dest?)
last idea: go through the pain of actually writing and debugging your own shell script
I would start with something like this if it has to be done manually. I did not test this, try it with the ls
first and try to figure out if it will brake something!!
#!/bin/bash
# first require that the source and dest dirs
# are given as arguments to the script.
src=$1:?Please give the source dir as first argument
dest=$2:?Please give the destination dir as second argument
# go to the source directory
cd "$src"
# This assumes that there are no newlines in filenames!
# first find all plain files in the current dir
# (which should be $src)
# then use xargs to hand the filenames to md5sum
# pipe the md5 sums into a subshell
# go to the dest in the subshell
# read the md5sums from stdin and use md5sum -c to check them
# After the subshell filter lines to only keep those that end in "OK"
# and at the same time remove the "OK" stuff after the file name
# use xargs to hand these file names to ls or rm.
find . -type f |
xargs md5sum |
( cd "$dest" && md5sum -c ) |
sed -n 's/: OK$//p' |
xargs ls
The ls
in the last line is to list all files that passed the check. If you replace it with rm
they are removed from the source dir (the current dir after the cd "$src"
).
Thank you! The assumption you made is correct - I only want to compare /path/to/source/a with /path/to/dest/a and so on...
â brawny84
Jul 10 at 20:26
1
For rsync, it looks like there is a --remove-source-files flag as well. If you think it's correct, maybe you could add this information to your answer? It's already very thorough and I greatly appreciate the time you took to answer my overly-broad question! This might be how I proceed:rsync --remove-source-files --log-file=transfers.log -nv /path/to/src/ /path/to/dest/
where -n is dry-run first and -v is verbose output per usual.
â brawny84
Jul 10 at 21:00
add a comment |Â
up vote
1
down vote
accepted
Implementing something like this might be hard as a first try if you care about the files and don't want to mess up. So here are some alternatives to writing a full script in bash. These are more or less complex command lines (oneliners) that might help in your situation.
There is one uncertainty in your question: do you want to compare each file in source with every file in dest or only those with "matching" file names? (That would be compare /path/to/src/a
with /path/to/dest/a
and /path/to/src/b
with /path/to/dest/b
but not /path/to/src/a
with /path/to/dest/b
and so on)
I will assume that you only want to compare files with matching paths!!
first idea: diff
The good old diff
can compare directories recursively. Also use the -q
option to just see which files differ and not how they differ.
diff -r -q /path/to/source /path/to/dest
cons
- This can take a long time depending on the size of your hard disk.
- This doesn't delete the old files.
- The output i not easily parseable
pros
- This doesn't delete any files :)
So after you manually/visually confirmed that there are no differences in any files you care about you have to manually delete the source with rm -rf /path/to/source
.
second idea: rsync
(edit: this might be the best now)
rsync
is the master of all copying command line tools (in my opinion ;). As mentioned in the comments to your question it has a --checksum
option but it has a bulkload of other options as well. It can transfer files from local to remote from remote to local and from local to local. One of the most important features in my opinion is that if you give the correct options you can abort and restart the command (execute the same command line again) and it will continue where it left of!
For your purpose the following options can be interesting:
-v
: verbose, show what happens can be given several times but normally one is enough-n
: dry run, very important to test stuff but don't do anything (combine with-v
)!!-c
: use checksum to decide what should be copied--remove-source-files
: removes files that where successfully transfered (pointed out by @brawny84, I did not know it and did not find it in the man page on my first read)
So this command will overwrite all files in dest
which have a different checksum than the corresponding file in source
(corresponding by name).
rsync -a -c -v --remove-source-files -n /path/to/source /path/to/dest
rsync -a -c -v --remove-source-files /path/to/source /path/to/dest
pros
- works with checksums
- has a dry run mode
- will actually copy all missing files and files that differ from source to dest
- can be aborted and restarted
- has an exclude option to ignore some files in src if you don't want to copy all files
- can delete transferred source files
cons
- ??
third idea: fdupes
The program fdupes
I designed to list duplicate files. It checks the md5sums by default.
pros
- it uses md5 to compare files
- it has a
--delete
option to delete one of the duplicates
cons
- it compares each file to every other file so if there are duplicate files inside dest itself it will also list them
- delete mode seems to be interactive, you have to confirm for every set of equal files, that might not be feasible for large directory trees
- the non interactive mode will delete all but the first file from each set of equal files. But I have no idea which the first file is (in source or in dest?)
last idea: go through the pain of actually writing and debugging your own shell script
I would start with something like this if it has to be done manually. I did not test this, try it with the ls
first and try to figure out if it will brake something!!
#!/bin/bash
# first require that the source and dest dirs
# are given as arguments to the script.
src=$1:?Please give the source dir as first argument
dest=$2:?Please give the destination dir as second argument
# go to the source directory
cd "$src"
# This assumes that there are no newlines in filenames!
# first find all plain files in the current dir
# (which should be $src)
# then use xargs to hand the filenames to md5sum
# pipe the md5 sums into a subshell
# go to the dest in the subshell
# read the md5sums from stdin and use md5sum -c to check them
# After the subshell filter lines to only keep those that end in "OK"
# and at the same time remove the "OK" stuff after the file name
# use xargs to hand these file names to ls or rm.
find . -type f |
xargs md5sum |
( cd "$dest" && md5sum -c ) |
sed -n 's/: OK$//p' |
xargs ls
The ls
in the last line is to list all files that passed the check. If you replace it with rm
they are removed from the source dir (the current dir after the cd "$src"
).
Thank you! The assumption you made is correct - I only want to compare /path/to/source/a with /path/to/dest/a and so on...
â brawny84
Jul 10 at 20:26
1
For rsync, it looks like there is a --remove-source-files flag as well. If you think it's correct, maybe you could add this information to your answer? It's already very thorough and I greatly appreciate the time you took to answer my overly-broad question! This might be how I proceed:rsync --remove-source-files --log-file=transfers.log -nv /path/to/src/ /path/to/dest/
where -n is dry-run first and -v is verbose output per usual.
â brawny84
Jul 10 at 21:00
add a comment |Â
up vote
1
down vote
accepted
up vote
1
down vote
accepted
Implementing something like this might be hard as a first try if you care about the files and don't want to mess up. So here are some alternatives to writing a full script in bash. These are more or less complex command lines (oneliners) that might help in your situation.
There is one uncertainty in your question: do you want to compare each file in source with every file in dest or only those with "matching" file names? (That would be compare /path/to/src/a
with /path/to/dest/a
and /path/to/src/b
with /path/to/dest/b
but not /path/to/src/a
with /path/to/dest/b
and so on)
I will assume that you only want to compare files with matching paths!!
first idea: diff
The good old diff
can compare directories recursively. Also use the -q
option to just see which files differ and not how they differ.
diff -r -q /path/to/source /path/to/dest
cons
- This can take a long time depending on the size of your hard disk.
- This doesn't delete the old files.
- The output i not easily parseable
pros
- This doesn't delete any files :)
So after you manually/visually confirmed that there are no differences in any files you care about you have to manually delete the source with rm -rf /path/to/source
.
second idea: rsync
(edit: this might be the best now)
rsync
is the master of all copying command line tools (in my opinion ;). As mentioned in the comments to your question it has a --checksum
option but it has a bulkload of other options as well. It can transfer files from local to remote from remote to local and from local to local. One of the most important features in my opinion is that if you give the correct options you can abort and restart the command (execute the same command line again) and it will continue where it left of!
For your purpose the following options can be interesting:
-v
: verbose, show what happens can be given several times but normally one is enough-n
: dry run, very important to test stuff but don't do anything (combine with-v
)!!-c
: use checksum to decide what should be copied--remove-source-files
: removes files that where successfully transfered (pointed out by @brawny84, I did not know it and did not find it in the man page on my first read)
So this command will overwrite all files in dest
which have a different checksum than the corresponding file in source
(corresponding by name).
rsync -a -c -v --remove-source-files -n /path/to/source /path/to/dest
rsync -a -c -v --remove-source-files /path/to/source /path/to/dest
pros
- works with checksums
- has a dry run mode
- will actually copy all missing files and files that differ from source to dest
- can be aborted and restarted
- has an exclude option to ignore some files in src if you don't want to copy all files
- can delete transferred source files
cons
- ??
third idea: fdupes
The program fdupes
I designed to list duplicate files. It checks the md5sums by default.
pros
- it uses md5 to compare files
- it has a
--delete
option to delete one of the duplicates
cons
- it compares each file to every other file so if there are duplicate files inside dest itself it will also list them
- delete mode seems to be interactive, you have to confirm for every set of equal files, that might not be feasible for large directory trees
- the non interactive mode will delete all but the first file from each set of equal files. But I have no idea which the first file is (in source or in dest?)
last idea: go through the pain of actually writing and debugging your own shell script
I would start with something like this if it has to be done manually. I did not test this, try it with the ls
first and try to figure out if it will brake something!!
#!/bin/bash
# first require that the source and dest dirs
# are given as arguments to the script.
src=$1:?Please give the source dir as first argument
dest=$2:?Please give the destination dir as second argument
# go to the source directory
cd "$src"
# This assumes that there are no newlines in filenames!
# first find all plain files in the current dir
# (which should be $src)
# then use xargs to hand the filenames to md5sum
# pipe the md5 sums into a subshell
# go to the dest in the subshell
# read the md5sums from stdin and use md5sum -c to check them
# After the subshell filter lines to only keep those that end in "OK"
# and at the same time remove the "OK" stuff after the file name
# use xargs to hand these file names to ls or rm.
find . -type f |
xargs md5sum |
( cd "$dest" && md5sum -c ) |
sed -n 's/: OK$//p' |
xargs ls
The ls
in the last line is to list all files that passed the check. If you replace it with rm
they are removed from the source dir (the current dir after the cd "$src"
).
Implementing something like this might be hard as a first try if you care about the files and don't want to mess up. So here are some alternatives to writing a full script in bash. These are more or less complex command lines (oneliners) that might help in your situation.
There is one uncertainty in your question: do you want to compare each file in source with every file in dest, or only those with "matching" file names? (That would compare /path/to/src/a with /path/to/dest/a and /path/to/src/b with /path/to/dest/b, but not /path/to/src/a with /path/to/dest/b, and so on.) I will assume that you only want to compare files with matching paths!
first idea: diff
The good old diff
can compare directories recursively. Also use the -q
option to just see which files differ and not how they differ.
diff -r -q /path/to/source /path/to/dest
cons
- This can take a long time depending on the size of your hard disk.
- This doesn't delete the old files.
- The output is not easily parseable
pros
- This doesn't delete any files :)
So after you have manually/visually confirmed that there are no differences in any files you care about, you have to manually delete the source with rm -rf /path/to/source.
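If the visual check should be scripted, diff's exit status can gate the deletion: it exits 0 only when no differences were found. A minimal sketch on throwaway directories (the demo layout is mine, not from the original post):

```shell
# Create two identical throwaway trees for the demo.
src=$(mktemp -d); dest=$(mktemp -d)
echo data > "$src/f"; echo data > "$dest/f"

# diff -r -q exits 0 only when the trees are identical,
# so the source is removed only after a clean comparison.
if diff -r -q "$src" "$dest" >/dev/null; then
    rm -rf "$src"
else
    echo "trees differ, keeping $src" >&2
fi
```

The same `if` gate works unchanged on real paths; nothing is deleted unless the recursive comparison came back clean.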
second idea: rsync
(edit: this might be the best now)
rsync
is the master of all copying command line tools (in my opinion ;). As mentioned in the comments to your question it has a --checksum
option, but it has a boatload of other options as well. It can transfer files from local to remote, from remote to local, and from local to local. One of the most important features in my opinion is that if you give the correct options you can abort and restart the command (execute the same command line again) and it will continue where it left off!
For your purpose the following options can be interesting:
- -v: verbose, show what happens; can be given several times but normally one is enough
- -n: dry run, very important for testing things without actually doing anything (combine with -v)!
- -c: use checksums to decide what should be copied
- --remove-source-files: removes files that were successfully transferred (pointed out by @brawny84; I did not know it and did not find it in the man page on my first read)
So these commands will overwrite all files in dest that have a different checksum than the corresponding file in source (corresponding by name); run the first (with -n) as a dry run, then the second for the real transfer. Note the trailing slash on the source path: without it, rsync copies the source directory itself into dest rather than its contents.
rsync -a -c -v --remove-source-files -n /path/to/source/ /path/to/dest/
rsync -a -c -v --remove-source-files /path/to/source/ /path/to/dest/
pros
- works with checksums
- has a dry run mode
- will actually copy all missing files and files that differ from source to dest
- can be aborted and restarted
- has an exclude option to ignore some files in src if you don't want to copy all files
- can delete transferred source files
cons
- ??
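One caveat for the --remove-source-files route (worth verifying against your rsync's man page): it deletes only the transferred files, never directories, so an emptied directory skeleton stays behind in the source. A follow-up find can prune it; a minimal sketch on throwaway directories (the demo layout is mine, not from the original post):

```shell
# Demo layout: one file inside a subdirectory of a throwaway source tree.
src=$(mktemp -d); dest=$(mktemp -d)
mkdir "$src/sub"; echo x > "$src/sub/file"

# Transfer with checksums and delete the transferred source files...
rsync -a -c --remove-source-files "$src/" "$dest/"

# ...then prune the now-empty directories left behind in the source.
find "$src" -mindepth 1 -type d -empty -delete
```

On real paths, run the find only after a dry run of the rsync has confirmed the transfer does what you expect.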
third idea: fdupes
The program fdupes
is designed to find and list duplicate files. It checks md5 sums by default.
pros
- it uses md5 to compare files
- it has a
--delete
option to delete one of the duplicates
cons
- it compares each file to every other file so if there are duplicate files inside dest itself it will also list them
- delete mode seems to be interactive, you have to confirm for every set of equal files, that might not be feasible for large directory trees
- the non interactive mode will delete all but the first file from each set of equal files. But I have no idea which the first file is (in source or in dest?)
last idea: go through the pain of actually writing and debugging your own shell script
I would start with something like this if it has to be done manually. I did not test this; try it with the ls
first and make sure it won't break anything!
#!/bin/bash
# first require that the source and dest dirs
# are given as arguments to the script.
src=${1:?Please give the source dir as first argument}
dest=${2:?Please give the destination dir as second argument}
# go to the source directory
cd "$src" || exit 1
# This assumes that there are no newlines in filenames!
# first find all plain files in the current dir
# (which should be $src)
# then use xargs to hand the filenames to md5sum
# pipe the md5 sums into a subshell
# go to the dest in the subshell
# read the md5sums from stdin and use md5sum -c to check them
# After the subshell filter lines to only keep those that end in "OK"
# and at the same time remove the "OK" stuff after the file name
# use xargs to hand these file names to ls or rm.
find . -type f |
xargs -d '\n' md5sum |
( cd "$dest" && md5sum -c ) |
sed -n 's/: OK$//p' |
xargs -d '\n' ls
The ls in the last line lists all files that passed the check. If you replace it with rm, they are removed from the source dir (the current dir after the cd "$src").
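For completeness, the question's original per-file idea (copy, verify the destination's md5, only then rm the source) could be sketched roughly like this; the copy_verify_rm name and the demo layout are mine, not from the post, and this is a sketch to adapt rather than a finished tool:

```shell
#!/bin/bash
# Sketch: copy each regular file, compare md5 sums of the source and
# destination copies, and delete the source only when they match.
copy_verify_rm() {
    local src=$1 dest=$2 file name
    for file in "$src"/*; do
        [[ -f $file ]] || continue
        name=${file##*/}
        cp -- "$file" "$dest/$name" || continue
        # Hashing via stdin keeps the file name out of md5sum's output,
        # so the two one-line results can be compared directly.
        if [[ $(md5sum < "$file") == $(md5sum < "$dest/$name") ]]; then
            rm -f -- "$file"
        else
            echo "checksum mismatch, keeping '$file'" >&2
        fi
    done
}

# Demo on throwaway directories:
src=$(mktemp -d); dest=$(mktemp -d)
echo hello > "$src/a.txt"
copy_verify_rm "$src" "$dest"
```

After the demo run, a.txt exists only in the destination; a file whose copy came out different would be reported and left in place.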
edited Jul 11 at 5:15
answered Jul 10 at 20:05
Lucas
1,908617
Thank you! The assumption you made is correct - I only want to compare /path/to/source/a with /path/to/dest/a and so on...
– brawny84
Jul 10 at 20:26
For rsync, it looks like there is a --remove-source-files flag as well. If you think it's correct, maybe you could add this information to your answer? It's already very thorough and I greatly appreciate the time you took to answer my overly-broad question! This might be how I proceed:
rsync --remove-source-files --log-file=transfers.log -nv /path/to/src/ /path/to/dest/
where -n is dry-run first and -v is verbose output per usual.
– brawny84
Jul 10 at 21:00
This is not a script asking forum. Please tell us what you have done until now and how we might assist you in improving your Unix knowledge.
– Rui F Ribeiro
Jul 10 at 18:27
Using rsync with --checksum would force checksumming even for local transfers. This is not an integrity check though, but a way for rsync to figure out what has changed between source and target files, and since you're planning to delete the source files after transfer, this would be kinda useless.
– Kusalananda
Jul 10 at 18:34
That script can't be running correctly - the first two lines are wrong. I'd also strongly recommend getting into the habit of starting every script with a #! line to define the interpreter. In your case here I think #!/bin/bash could be appropriate.
– roaima
Jul 10 at 22:06