A lot of duplicates… no fdupes, I want to make a script
No fdupes please; I want to make a script.
I have a lot of duplicate files, more than 200.
I made a bash script (still under construction) which runs md5sum on every file, then uses uniq to put the duplicate md5s in another file, then looks those duplicates up again and puts the entire matching lines in a third, final file.
Now the problem: I can remove the dups one by one. But my question is: is it possible to find only the dups and put them in a 4th file so they can be deleted safely?
This is the script:
#!/bin/bash
# Script is "under construction"
# First we make the md5sums
find mp3 -type f -print0 | xargs -0 md5sum | tee firstfile.txt
# Then we find all identical md5sums and put them in secondfile.txt
# (sort on the hash field so uniq -d sees identical hashes adjacent)
sort -k1,1 firstfile.txt | awk '{print $1}' | uniq -d > secondfile.txt
# Then, for each duplicate hash, we extract md5sum and name from firstfile.txt
while read -r line; do grep -i "$line" firstfile.txt; done < secondfile.txt > thirdfinal.txt
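For comparison, a more compact sketch that derives the delete list in one pass, assuming the same "hash  path" layout that md5sum writes (the sample file name and hashes below are made up for illustration):

```shell
# Illustrative sample in firstfile.txt's format ("hash  path"); hashes made up.
printf '%s\n' \
  'aaa  mp3/1.mp3' \
  'aaa  mp3/3.mp3' \
  'bbb  mp3/2.mp3' > sample.txt

# Sort by the hash field so identical hashes are adjacent, then print every
# line whose hash was already seen -- i.e. every copy except the first.
sort -k1,1 sample.txt | awk 'seen[$1]++'
# -> aaa  mp3/3.mp3
```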
Now the problem: thirdfinal.txt contains many lines similar to these:
625e8fd5f878b19b39826db539e01cda mp3/16.mp3
625e8fd5f878b19b39826db539e01cda mp3/12.mp3
625e8fd5f878b19b39826db539e01cda mp3/20.mp3
625e8fd5f878b19b39826db539e01cda mp3/21.mp3
625e8fd5f878b19b39826db539e01cda mp3/19.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/9.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/5.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/7.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/10.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/8.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/3.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/2.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/1.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/11.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/6.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/4.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3
My question is: how do I filter the third file to obtain a 4th file which includes ALL duplicates except the first line of each group (otherwise you delete ALL the files, including the original)? That way each group of duplicates can be deleted while the original is preserved.
The 4th file must look like this:
625e8fd5f878b19b39826db539e01cda mp3/12.mp3
625e8fd5f878b19b39826db539e01cda mp3/20.mp3
625e8fd5f878b19b39826db539e01cda mp3/21.mp3
625e8fd5f878b19b39826db539e01cda mp3/19.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/5.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/7.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/10.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/8.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/3.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/2.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/1.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/11.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/6.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/4.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3
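Once such a 4th file exists, the deletion step can also stay in plain bash. A sketch of a dry run (the name `fourthfile.txt` and its contents here are placeholders): it only prints the paths, so the list can be inspected before swapping the printf for an actual `rm`.

```shell
# Hypothetical fourthfile.txt in the "hash  path" format shown above.
printf '%s\n' \
  'aaa  mp3/12.mp3' \
  'aaa  mp3/20.mp3' > fourthfile.txt

# Dry run: print what would be removed; replace printf with `rm -- "$path"`
# once the list looks right. Assumes paths contain no newlines.
while read -r sum path; do
    printf 'would remove: %s\n' "$path"
done < fourthfile.txt
```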
Suggestions? Please don't answer "use fdupes" or other external programs; I prefer bash automation.
Tags: bash, uniq, duplicate
Please format your question for readability and include your attempts so far.
– l0b0, Jan 15 at 0:32
I noticed that you have the same file path listed twice, meaning you could end up removing your only copy of mp3/25.mp3
– Jeff Schaller, Jan 15 at 0:38
If you don't want to use external programs why are you yourself using find, awk, md5sum, etc.? What is wrong with using a tool such as fdupes?
– roaima, Jan 15 at 1:27
Also, paragraphs are a good thing. Please use them instead of this half sentence per line nonsense. And how many problems do you have? grep -c 'Now the problem'
– muru, Jan 15 at 1:35
asked Jan 15 at 0:03 by elbarna
1 Answer
Accepted answer (3 votes):
awk '{ if (seen[$1]++) print }' < file3 > file4
This builds up an awk array keyed by the md5sums in column 1; if the array value for a particular md5sum has already been seen (i.e. this is not the first time it appears), then it prints the line. Either way, it increments the array value for that md5sum, starting from the default of zero.
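A quick self-contained check of that behaviour (the hashes and names below are made up): only the first line of each duplicate group is suppressed.

```shell
# Two duplicate groups; awk keeps every line after the first in each group.
printf '%s\n' 'h1  a.mp3' 'h1  b.mp3' 'h1  c.mp3' 'h2  d.mp3' 'h2  e.mp3' \
  | awk '{ if (seen[$1]++) print }'
# -> h1  b.mp3
#    h1  c.mp3
#    h2  e.mp3
```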
Another way, using bash associative arrays:
unset md5sums
declare -A md5sums
while read -r md5sum path
do
((md5sums[$md5sum]++))
[[ ${md5sums[$md5sum]} -gt 1 ]] && printf '%s %s\n' "$md5sum" "$path"
done < file3 > file4
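Run against a small stand-in for thirdfinal.txt (the hashes and file names are placeholders), the loop writes only the non-first members of each hash group. This variant avoids post-increment so the arithmetic never returns a nonzero status under `set -e`:

```shell
# Stand-in for the third file (placeholder data); requires bash 4+ for declare -A.
printf '%s\n' 'h1 a.mp3' 'h1 b.mp3' 'h2 c.mp3' > file3

declare -A md5sums
while read -r md5sum path; do
    md5sums[$md5sum]=$(( ${md5sums[$md5sum]:-0} + 1 ))   # count occurrences per hash
    if (( md5sums[$md5sum] > 1 )); then
        printf '%s %s\n' "$md5sum" "$path"               # emit all but the first
    fi
done < file3 > file4

cat file4    # -> h1 b.mp3
```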
answered Jan 15 at 1:13 by Jeff Schaller, edited Jan 15 at 1:30