
I want to delete duplicate files based on their MD5 values. I already have the script below, but how do I modify it so it works recursively?

So, for example, I have a folder containing three subfolders: A, B and C.

I want ALL of the files in ./, ./A/, ./B/ and ./C/ checked for their MD5 and compared to each other; if a positive match is found, just delete either match at random. In the end, no more duplicates should exist. I don't care which match gets deleted first.

I hope I expressed what I need to achieve clearly enough; if not, kindly let me know :)

#!/bin/bash
while true
do
  echo "Enter the directory:"
  read directory
  if [ -d $directory ]; then
    break
  else
    echo "Invalid directory"
  fi
done

for FILE in `ls $directory`
do
  if [ ! -f $FILE ]; then
    break;
  fi
  h=`md5sum $directory/$FILE | awk '{ print $1 }'`
  for f in `ls $directory`
  do
    if [ -f $f ] && [ $FILE != $f ]; then
      s=`md5sum $directory/$f | awk '{ print $1 }'`
      if [ "$s" = "$h" ]; then
        echo Removing $f
        rm -rf $directory/$f
      fi
    fi
  done
done
  • Is this a script-writing exercise? There are tools (e.g. jdupes or fdupes for Linux) that identify duplicates and hardlink or delete the excess copies. Commented Oct 2, 2018 at 10:35
  • Did you try using find? It's probably also worth using while read FILE instead of for FILE in $(...) to handle massive lists. Commented Oct 2, 2018 at 10:39
  • Thanks for your responses. The script shall work under Windows, not Linux. I think I'm in the wrong place. Commented Oct 2, 2018 at 11:29
  • You're in a good place, just edit the question and clearly announce your environment (WSL? Cygwin?) and purpose (getting the job done, no matter the tool? or getting the job done with Bash only? etc.) Commented Oct 2, 2018 at 11:42
  • The script you show is for Linux... But the performance is abysmal. It is O(n²): in other words, twice as many files will take four times longer. For 100 files it will run md5sum roughly 10,000 times! I doubt it was ever really used. Commented Oct 2, 2018 at 11:53

5 Answers


I'd recommend something like the following instead:

find . -type f \
    | xargs md5sum \
    | sort -k1,1 \
    | uniq -Dw32

This will list all duplicated files in groups of files that have an identical MD5 hash.

Watch out, because the -w32 argument to uniq will only compare the first 32 characters... if you change the hash's length, you'll need to update this.


Consider the following tree, with the following content:

./a/1: foo
./a/2: bar
./b/3: hello world
./b/d/5: bar
./c/4: foo
$ find . -type f \
>     | xargs md5sum \
>     | sort -k1,1 \
>     | uniq -Dw32
c157a79031e1c40f85931829bc5fc552  ./a/2
c157a79031e1c40f85931829bc5fc552  ./b/d/5
d3b07384d113edec49eaa6238ad5ff00  ./a/1
d3b07384d113edec49eaa6238ad5ff00  ./c/4

You can now process the lines one-by-one... each line with a matching hash at the front points at a file that can be de-duplicated.

If you're not too bothered about which file gets deleted, then something like this works:

find . -type f \
    | xargs md5sum \
    | sort -k1,1 \
    | uniq -Dw32 \
    | while read -r hash file; do
        [ "${prev_hash}" = "${hash}" ] && rm -v "${file}"
        prev_hash="${hash}"
    done

Note that MD5 is no longer considered secure... so if you're using this in a system where users have control of files, then it is feasible for them to engineer a collision - and thus for you to accidentally remove a legitimate / target file instead of de-duplicating as you had hoped. Prefer a stronger hash like SHA-256.
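A hardened sketch along those lines (assumptions: GNU coreutils, and that sort order decides which copy survives): it swaps in sha256sum, bumps the uniq width to 64 hex characters, uses -print0/-0 so filenames with spaces survive, and builds its own throwaway tree so it is safe to run as-is:

```shell
#!/usr/bin/env bash
# Sketch only: SHA-256 variant of the pipeline above (assumes GNU coreutils).
# Builds a throwaway tree so running it cannot touch your real files.
set -euo pipefail

demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
printf 'foo\n' > "$demo/a/1"
printf 'foo\n' > "$demo/b/2 with space"   # duplicate of a/1, name has a space
printf 'bar\n' > "$demo/b/3"

cd "$demo"
find . -type f -print0 \
    | xargs -0 sha256sum \
    | sort -k1,1 \
    | uniq -Dw64 \
    | while IFS= read -r line; do
        hash=${line%%  *}     # SHA-256 is 64 hex chars; two-space separator before the name
        file=${line#*  }
        [ "${prev_hash:-}" = "$hash" ] && rm -v -- "$file"
        prev_hash=$hash
    done
```

Filenames containing newlines would still break the line-oriented while loop; the jdupes/fdupes tools mentioned in the comments handle those cases too.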


First, a caveat: assuming identity based on a checksum alone is very dangerous. Not recommended.

Using a checksum as a filter to remove definite non-duplicates is OK.

If I were doing this, I'd approach it like this:

  1. Create a list of files based on length (length, full pathname).

  2. Scan that list looking for potential duplicate lengths.

  3. Any matches are potential duplicates and I'd compare the suspect files properly if possible.

The reason to use lengths is that this info is available very quickly without scanning the file byte-by-byte as it's normally in the filesystem stats for quick access.

You can add another stage comparing checksums (on same-length files) if you think it's quicker than comparing files directly, using a similar approach: start from the matching-lengths list and calculate checksums only for those files, each checksum calculated once.

Doing the checksum calculation only benefits you if there are multiple files with the same length, and even then a direct comparison byte-by-byte will likely find non-matches very quickly.
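The steps above can be sketched as a pipeline (assumptions: GNU find/awk/coreutils; dupes_by_size is a made-up name, and filenames containing tabs or newlines would break it): hash only the files whose size occurs more than once, then group by checksum as the second-stage filter:

```shell
#!/usr/bin/env bash
# Sketch of the size-first filter: only sizes seen more than once get hashed.
# Assumes GNU find/awk/coreutils; breaks on filenames with tabs or newlines.
set -euo pipefail

dupes_by_size() {   # usage: dupes_by_size <dir>
    find "$1" -type f -printf '%s\t%p\n' \
        | awk -F'\t' '{
              if (++n[$1] == 1) first[$1] = $2          # remember the first file of each size
              else { if (n[$1] == 2) print first[$1]; print $2 }
          }' \
        | xargs -r -d '\n' md5sum \
        | sort -k1,1 \
        | uniq -Dw32                                    # keep only hash-identical groups
}

demo=$(mktemp -d)
printf 'foo\n' > "$demo/x"
printf 'foo\n' > "$demo/y"      # same size and content as x
printf 'hello\n' > "$demo/z"    # unique size, never hashed
dupes_by_size "$demo"
```

On large trees this hashes far fewer files than an unconditional md5sum pass, because unique-sized files are filtered out by the awk stage before any file content is read.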

  • I agree about checking lengths first. But using hashes is better: the chance of two identical MD5 sums for different files is abysmally small, and you could use better hash algorithms. If you have large files, the checksum is better, because you read each file only once. Assume you have three 1 GB videos of the same size: with checksums you read 3 GB, with byte-by-byte comparison you read 6 GB. Also, byte-by-byte comparison is O(n²), whereas comparing hashes can be linear. Commented Oct 2, 2018 at 16:49
  • Best practice would be to never rely on a checksum for file identity. I'd consider multiple (different-algorithm) hashes better, but ultimately you're really not checking for a duplicate, only for a duplicate signature. Hands up, who wants to explain to the CEO why those vital files were deleted by accident? :-) Unless there's a major time/performance constraint that justifies risking data loss, always do a proper check to be sure. That said, note that using checksums at all risks not deleting some duplicates, but that's always a safer bet than an accidental deletion. Commented Oct 2, 2018 at 17:09
  • Hashes work well. Git is based on hashes, as are a whole raft of security applications (for instance the HTTPS certificate of your bank, or your company website). If you are truly paranoid, you can complement the hash with a byte-by-byte compare, which is going to be linear-time since it will find equality every time. Commented Oct 2, 2018 at 20:19

There is a beautiful solution at https://stackoverflow.com/questions/57736996/how-to-remove-duplicate-files-in-linux/57737192#57737192:

md5sum prime-* | awk 'n[$1]++' | cut -d " " -f 3- | xargs -I {} echo rm {}
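A recursive variant of the same idea (a sketch: it still echoes rm rather than deleting, and filenames containing newlines would break it), using find instead of a glob:

```shell
#!/usr/bin/env bash
# Recursive version of the one-liner above, still echoing instead of deleting.
# Sketch only: filenames containing newlines would break the xargs stage.
set -euo pipefail

demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'foo\n' > "$demo/a"
printf 'foo\n' > "$demo/sub/b"   # duplicate in a subdirectory

find "$demo" -type f -exec md5sum {} + \
    | awk 'n[$1]++' \
    | cut -d ' ' -f 3- \
    | xargs -r -d '\n' -I {} echo rm {}
```

The awk 'n[$1]++' pattern prints only the second and later occurrences of each hash, so the first-seen copy of every group survives; drop the echo once you trust the output.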
pcre2 () {
    local s='--'
    if [[ $1 = -i ]]; then
        s="$1 $s"
        set -- "${@:2}"
    fi
    perl -0777spe 'my @arr; BEGIN{ @arr = @ARGV } END{ foreach my $arg (@arr) { -f $arg or exit 2 } } '"$1" "$s" "${@:2}"
    return $?
}

remove-md5 () {
    local function_name="${FUNCNAME[0]}"
    local i
    local f=''
    local maxdepth=0
    local submaxdepth=-1
    local abs
    local file
    local md5
    local exit_code=0
    local dotglob
    local extglob
    local nullglob
    if [[ -z ${dotglob} ]]; then dotglob="$(shopt -p dotglob)"; fi
    if [[ -z ${extglob} ]]; then extglob="$(shopt -p extglob)"; fi
    if [[ -z ${nullglob} ]]; then nullglob="$(shopt -p nullglob)"; fi
    ${dotglob}; ${extglob}; ${nullglob}
    for (( i=1; i<=$#; i++ )); do
        if [[ ${!i} = -- ]]; then
            set -- "${@:1:i-1}" "${@:i+1}"
            break
        fi
        if [[ ${!i} = -f ]]; then
            f='-f'
            set -- "${@:1:i-1}" "${@:i+1}"
            ((i--))
            continue
        fi
        if [[ ${!i,,} = -r ]]; then
            set -- "${@:1:i-1}" "${@:i+1}"
            if [[ ${!i} =~ ^[+-]?[0-9]+$ ]]; then
                maxdepth=${!i/#+}
                abs=${maxdepth/#-}
                if [[ ${#abs} -gt 9 ]]; then
                    echo "$function_name: $maxdepth: Numerical result out of range=[-999999999, 999999999]" >&2
                    return 1
                fi
                set -- "${@:1:i-1}" "${@:i+1}"
            else
                maxdepth=-1
            fi
            ((i--))
            continue
        fi
    done
    if [[ $maxdepth -gt 0 ]]; then submaxdepth=$((maxdepth-1)); fi
    if [[ $# -lt 2 ]]; then
        echo "$function_name: Too few arguments (min 2)." >&2
        return 4
    fi
    if [[ ! -d ${!#} ]]; then
        echo "$function_name: The target must be an existing directory." >&2
        return 3
    fi
    for (( i=1; i<$#; i++ )); do
        if [[ ! ${!i} =~ ^[0-9a-f]{32}$ ]]; then
            echo "$function_name: \"${!i}\" is not a valid md5 hash." >&2
            return 2
        fi
    done
    shopt -s dotglob extglob nullglob
    for file in "${!#}"/*; do
        if [[ ! -d ${file} ]]; then
            md5="$(md5sum -- "${file}" | awk '{print $1}')"
            for (( i=1; i<$#; i++ )); do
                if [[ $md5 = ${!i} ]]; then
                    rm $f -- "${file}" 2> >(pcre2 's/^rm: /$function_name: /gm' -function_name="$function_name" >&2) || exit_code=$?
                    break
                fi
            done
        elif [[ $maxdepth -ne 0 ]]; then
            dotglob="${dotglob}" extglob="${extglob}" nullglob="${nullglob}" $function_name -R $submaxdepth $f -- "${@:1:$#-1}" "${file}" || exit_code=$?
        fi
    done
    ${dotglob}; ${extglob}; ${nullglob}
    return $exit_code
}

pcre2: Perl-Compatible Regular Expressions v2.
This helper works much like the regex101.com site: if at least one file in the file list is not found, it returns an exit code other than 0, but it still performs the replace operations on the files that were found. pcre2 accepts the -i option (only as the first argument) to write the output back to the file; the -s option for variable interpolation is already enabled, so the following invocation is correct: pcre2 -i 's/$var/Replaced/gm' -var='Replace me' file.txt

remove-md5 md5-1 [md5-2 ... md5-n] target-directory

remove-md5 accepts the -R option (case-insensitive, so -r also works) for recursion with unlimited depth.
If the -R option is followed by a number n, recursion goes to depth n. For example, -R 3 recurses to depth 3.
If n is less than 0 or absent, the recursion is unlimited, e.g. -R -1 or simply -R.
If the -R option is not present, the removal is non-recursive, i.e. the same as writing -R 0.
If there are multiple -R options, only the rightmost one is considered. For example, in the following case:

remove-md5 ... -R 3 ... -R ...

the recursion will be of unlimited depth.

-f has the same function as when used in the rm command.
-- is used to stop reading parameters at the specified point.


What about entering the folder you want to check, listing the files, and checking each one against all the others? If an MD5 match is found under a different filename, suggest deleting that file.

The script below does exactly this. Bear in mind that it is a template: it prints all the filenames and checksums for debugging purposes, and it does not actually delete anything but echoes the filenames that you could delete.

Edit it according to your needs.

#!/bin/bash

function getone(){
h=$(md5sum "${a}" | awk '{print $1}')   
}

function gettwo(){
s=$(md5sum "${x}" | awk '{print $1}')
}

echo "Type the directory NAME"
read directory

if [ -d "${directory}" ]
then cd "${directory}"
    for a in *.*
        do echo checking "${a}"
        getone
        echo $h # irrelevant echo, just for debug, you can remove it
            for x in *.*
            do echo scanning "${x}" # irrelevant echo, just for debug, you can remove it
            gettwo
            echo $s # irrelevant echo, just for debug, you can remove it
                if [ "${a}" = "${x}" ]
                then echo "Original file, skipping" # irrelevant echo, just for debug, you can remove it by leaving empty quotes.
                elif [ "${h}" = "${s}" ]
                then echo "Delete ${x}"  # This should be replaced by rm once you are happy with the script
                fi
            done
        done
else echo "The directory name does not exist"
fi

This is not the best approach, however: if you are checking file A and it is the same as B, it will tell you to delete B, and when it later checks B it will tell you to delete A... so for the first match it finds, it will delete the second file. In this example B is deleted first. Will the loop break once it tries to check B and B no longer exists? I do not know; I did not check...
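One simple guard for that re-check problem (a sketch of the answer's two-loop approach, not the author's exact script): skip names that no longer exist before hashing them. The demo builds its own directory; like the answer's glob, *.* only matches names containing a dot.

```shell
#!/usr/bin/env bash
# The answer's two-loop approach with an existence guard (sketch only).
# Builds its own demo directory; *.* only matches names containing a dot.
set -u

demo=$(mktemp -d)
cd "$demo"
printf 'foo\n' > A.txt
printf 'foo\n' > B.txt   # duplicate of A.txt
printf 'bar\n' > C.txt

for a in *.*; do
    [ -e "$a" ] || continue                 # already removed in an earlier pass
    h=$(md5sum "$a" | awk '{print $1}')
    for x in *.*; do
        [ -e "$x" ] || continue             # ditto
        [ "$a" = "$x" ] && continue         # same file, skip
        s=$(md5sum "$x" | awk '{print $1}')
        [ "$h" = "$s" ] && rm -- "$x"
    done
done
ls
```

The guard costs one stat per name and keeps the loops from ever handing an already-deleted file to md5sum; the quadratic hashing cost the comment below complains about remains, of course.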

  • This is an incredibly inefficient implementation that will hash every file multiple times. Also, it won't recurse, handle files with no extension (or no basename), or handle directories... and it's spaghetti code... :-( Commented May 19, 2019 at 21:33
