A lot of duplicates… no fdupes, I want to make a script

No fdupes please; I want to make a script.

I have a lot of duplicate files, more than 200 of them.

I made a bash script (still under construction) that runs md5sum on every file, then uses uniq to collect the duplicated md5 sums in a second file, and finally looks those sums up again to put the matching full lines in a third file.

Now the problem: I can remove the dups one by one, but is it possible to isolate only the redundant copies and put them in a 4th file so that they can be deleted safely?

This is the script:

#!/bin/bash

# Script is "under construction"

# First we make the md5sum of every file
find mp3 -type f -print0 | xargs -0 md5sum | tee firstfile.txt

# Then we find all the identical md5sums and put them in secondfile.txt
# (sort on the checksum field so uniq -d sees duplicates on adjacent lines)
sort -k1,1 firstfile.txt | awk '{print $1}' | uniq -d > secondfile.txt

# Then we extract the matching "md5sum  name" lines from firstfile.txt
while IFS= read -r line; do grep "^$line" firstfile.txt; done < secondfile.txt > thirdfinal.txt


Now the problem: thirdfinal.txt contains a lot of lines similar to these:



625e8fd5f878b19b39826db539e01cda mp3/16.mp3
625e8fd5f878b19b39826db539e01cda mp3/12.mp3
625e8fd5f878b19b39826db539e01cda mp3/20.mp3
625e8fd5f878b19b39826db539e01cda mp3/21.mp3
625e8fd5f878b19b39826db539e01cda mp3/19.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/9.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/5.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/7.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/10.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/8.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/3.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/2.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/1.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/11.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/6.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/4.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3


My question is: how do I filter the third file to obtain a 4th file that contains ALL the duplicates except the first line of each group (otherwise you delete ALL the files, including the original!)? That way each group of duplicates loses its extra copies but the original is preserved.

The 4th file must look like this:



625e8fd5f878b19b39826db539e01cda mp3/12.mp3
625e8fd5f878b19b39826db539e01cda mp3/20.mp3
625e8fd5f878b19b39826db539e01cda mp3/21.mp3
625e8fd5f878b19b39826db539e01cda mp3/19.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/5.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/7.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/10.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/8.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/3.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/2.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/1.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/11.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/6.mp3
7eac02c26935323fe167d6e39ef6bd0a mp3/4.mp3
d7fbd596e86dfdb546092f34ab8ca576 mp3/25.mp3
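
The idea is that once the 4th file exists, everything listed in it could be removed with a simple loop over its second column. A minimal sketch (fourthfile.txt is a placeholder name, and this assumes filenames contain no newlines or leading whitespace):

# Each line is "md5sum  path"; the original of each group is NOT listed,
# so removing every path here keeps one copy of every file.
while read -r sum path; do
    rm -v -- "$path"    # -v reports each removal; -- guards against odd names
done < fourthfile.txt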


Suggestions? Please don't answer "use fdupes" or other external programs; I prefer bash automation.

asked Jan 15 at 0:03 by elbarna
Comments:

  • Please format your question for readability and include your attempts so far. – l0b0, Jan 15 at 0:32
  • I noticed that you have the same file path listed twice, meaning you could end up removing your only copy of mp3/25.mp3. – Jeff Schaller, Jan 15 at 0:38
  • If you don't want to use external programs why are you yourself using find, awk, md5sum, etc.? What is wrong with using a tool such as fdupes? – roaima, Jan 15 at 1:27
  • Also, paragraphs are a good thing. Please use them instead of this half sentence per line nonsense. And how many problems do you have? grep -c 'Now the problem' – muru, Jan 15 at 1:35
1 Answer (accepted, score 3)
awk '{ if (seen[$1]++) print }' < file3 > file4


This builds up an awk array of the md5sums in column 1; if the array value for a particular md5sum has already been incremented (i.e. this is not the first time the sum has been seen), then it prints the line. Either way, it increments the array entry for that md5sum, starting from the default of zero.
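
As a side note, because a bare awk pattern with no action triggers the default action of printing the line, the same filter can be written even more compactly:

# The post-increment returns the old count, so the first occurrence of a
# checksum evaluates to 0 (false, skipped) and every later one is printed.
awk 'seen[$1]++' < file3 > file4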




Another way, using bash associative arrays:



unset md5sums
declare -A md5sums                # associative array: md5sum -> count
while read -r md5sum path
do
    (( md5sums[$md5sum]++ ))      # count occurrences of this checksum
    [[ ${md5sums[$md5sum]} -gt 1 ]] && printf '%s %s\n' "$md5sum" "$path"
done < file3 > file4
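
Note that read -r md5sum path is safe here even for paths containing spaces: read assigns all remaining words on the line to the last variable, so everything after the checksum lands in $path (apart from leading and trailing whitespace, which is trimmed). The -r flag stops backslashes in filenames from being treated as escape sequences.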
answered Jan 15 at 1:13 by Jeff Schaller (edited Jan 15 at 1:30)