A lot of duplicates… no fdupes, I want to make a script

No fdupes please; I want to make a script.

I have a lot of duplicate files, more than 200 of them.

I made a bash script (still under construction) that runs md5sum on every file, then uses uniq to collect the duplicated md5 sums in a second file, and finally looks those sums up again to put the matching full lines in a third file.

Now the problem: I can remove the dups one by one, but is it possible to isolate only the redundant copies and put them in a 4th file so that they can be deleted safely?

This is the script:

#!/bin/bash

# Script is "under construction"

# First we make the md5sum of every file
find mp3 -type f -print0 | xargs -0 md5sum | tee firstfile.txt

# Then we find all the identical md5sums and put them in secondfile.txt
# (sort on the checksum field so uniq -d sees duplicates on adjacent lines)
sort -k1,1 firstfile.txt | awk '{print $1}' | uniq -d > secondfile.txt

# Then we extract the matching "md5sum  name" lines from firstfile.txt
while IFS= read -r line; do grep "^$line" firstfile.txt; done < secondfile.txt > thirdfinal.txt


Now the problem: thirdfinal.txt contains a lot of lines similar to these:



625e8fd5f878b19b39826db539e01cda mp3/16.mp3
625e8fd5f878b19b39826db539e01cda mp3/12.mp3
625e8fd5f878b19b39826db539e01cda mp3/20.mp3
625e8fd5f878b19b39826db539e01cda mp3/21.mp3
625e8fd5f878b19b39826db539e01cda mp3/19.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/9.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/5.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/7.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/10.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/8.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/3.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/2.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/1.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/11.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/6.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/4.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3


My question is: how do I filter the third file to obtain a 4th file that contains ALL the duplicates except the first line of each group (otherwise you delete ALL the files, including the original!)? That way each group of duplicates loses its extra copies but the original is preserved.

The 4th file must look like this:



625e8fd5f878b19b39826db539e01cda mp3/12.mp3
625e8fd5f878b19b39826db539e01cda mp3/20.mp3
625e8fd5f878b19b39826db539e01cda mp3/21.mp3
625e8fd5f878b19b39826db539e01cda mp3/19.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/5.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/7.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/10.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/8.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/3.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/2.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/1.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/11.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/6.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/4.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3
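
The idea is that once the 4th file exists, everything listed in it could be removed with a simple loop over its second column. A minimal sketch (fourthfile.txt is a placeholder name, and this assumes filenames contain no newlines or leading whitespace):

# Each line is "md5sum  path"; the original of each group is NOT listed,
# so removing every path here keeps one copy of every file.
while read -r sum path; do
    rm -v -- "$path"    # -v reports each removal; -- guards against odd names
done < fourthfile.txt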


Suggestions? Please don't answer "use fdupes" or other external programs; I prefer bash automation.

asked Jan 15 at 0:03 by elbarna
Comments:

  • Please format your question for readability and include your attempts so far. – l0b0, Jan 15 at 0:32
  • I noticed that you have the same file path listed twice, meaning you could end up removing your only copy of mp3/25.mp3. – Jeff Schaller, Jan 15 at 0:38
  • If you don't want to use external programs why are you yourself using find, awk, md5sum, etc.? What is wrong with using a tool such as fdupes? – roaima, Jan 15 at 1:27
  • Also, paragraphs are a good thing. Please use them instead of this half sentence per line nonsense. And how many problems do you have? grep -c 'Now the problem' – muru, Jan 15 at 1:35
1 Answer (accepted, score 3)
awk '{ if (seen[$1]++) print }' < file3 > file4


This builds up an awk array of the md5sums in column 1; if the array value for a particular md5sum has already been incremented (i.e. this is not the first time the sum has been seen), then it prints the line. Either way, it increments the array entry for that md5sum, starting from the default of zero.
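
As a side note, because a bare awk pattern with no action triggers the default action of printing the line, the same filter can be written even more compactly:

# The post-increment returns the old count, so the first occurrence of a
# checksum evaluates to 0 (false, skipped) and every later one is printed.
awk 'seen[$1]++' < file3 > file4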




Another way, using bash associative arrays:



unset md5sums
declare -A md5sums                # associative array: md5sum -> count
while read -r md5sum path
do
    (( md5sums[$md5sum]++ ))      # count occurrences of this checksum
    [[ ${md5sums[$md5sum]} -gt 1 ]] && printf '%s %s\n' "$md5sum" "$path"
done < file3 > file4
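
Note that read -r md5sum path is safe here even for paths containing spaces: read assigns all remaining words on the line to the last variable, so everything after the checksum lands in $path (apart from leading and trailing whitespace, which is trimmed). The -r flag stops backslashes in filenames from being treated as escape sequences.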
answered Jan 15 at 1:13 by Jeff Schaller (edited Jan 15 at 1:30)