By the way, using a checksum or hash is a good idea; my script doesn't use one. If the files are small and there are only a few of them (say 10-20), the script runs quite fast. With 100 or more files of around 1000 lines each, it can take more than 10 seconds.
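For a hash-based check, a minimal sketch could look like this (assuming GNU coreutils md5sum, sort, and awk, and filenames without whitespace); it hashes every argument and reports each file whose checksum has already been seen:

#!/bin/bash
# Sketch only: the first file of every hash group stays silent,
# later files with the same MD5 hash are reported as duplicates.
md5sum "$@" | sort | awk 'seen[$1]++ { print $2, "duplicates an earlier file" }'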
Usage: ./duplicate_removing.sh files/*
#!/bin/bash
# Compare each file against every file that follows it on the command line.
for target_file in "$@"; do
    shift    # drop the current file so the inner loop only sees the remaining ones
    for candidate_file in "$@"; do
        # diff -q prints nothing when the files are identical
        compare=$(diff -q "$target_file" "$candidate_file")
        if [ -z "$compare" ]; then
            echo "$target_file" is a copy of "$candidate_file"
            echo rm -v "$candidate_file"
        fi
    done
done
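Note that the script only prints the rm commands, so nothing is actually deleted; it is effectively a dry run. To really remove the copies you can either drop the leading echo in front of rm, or, assuming plain filenames without spaces or shell metacharacters, feed the printed commands back to a shell:

./duplicate_removing.sh files/* | grep '^rm ' | sh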
Testing
Create random files: ./creating_random_files.sh
#!/bin/bash
# Create pairs of identical files: N.txt and N.txt.copied
file_amount=10
files_dir="files"
mkdir -p "$files_dir"
while ((file_amount)); do
    content=$(shuf -i 1-1000)
    # tee writes the same random content to both N.txt and N.txt.copied
    echo "$RANDOM" "$content" | tee "${files_dir}/${file_amount}".txt{,.copied} > /dev/null
    ((file_amount--))
done
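The tee line is what produces the duplicate pairs: the brace expansion "${files_dir}/${file_amount}".txt{,.copied} expands to two paths (for example files/10.txt and files/10.txt.copied), and tee writes the same content into both. A quick standalone check of that behaviour (demo.txt is just an illustrative name):

echo hello | tee demo.txt{,.copied} > /dev/null
diff -s demo.txt demo.txt.copied    # reports that the two files are identical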
Run ./duplicate_removing.sh files/* and you get the following output:
files/10.txt is a copy of files/10.txt.copied
rm -v files/10.txt.copied
files/1.txt is a copy of files/1.txt.copied
rm -v files/1.txt.copied
files/2.txt is a copy of files/2.txt.copied
rm -v files/2.txt.copied
files/3.txt is a copy of files/3.txt.copied
rm -v files/3.txt.copied
files/4.txt is a copy of files/4.txt.copied
rm -v files/4.txt.copied
files/5.txt is a copy of files/5.txt.copied
rm -v files/5.txt.copied
files/6.txt is a copy of files/6.txt.copied
rm -v files/6.txt.copied
files/7.txt is a copy of files/7.txt.copied
rm -v files/7.txt.copied
files/8.txt is a copy of files/8.txt.copied
rm -v files/8.txt.copied
files/9.txt is a copy of files/9.txt.copied
rm -v files/9.txt.copied
From the comments: $(command) is preferred to using backticks. Another approach is md5sum `find -type f` | sort | uniq -D -w 32 (superuser.com/questions/259148/…). It does not automatically delete duplicates, but it could be changed to do so if you really want to.
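If you did want that pipeline to produce removal commands, a minimal sketch (assuming GNU coreutils and filenames without whitespace; it keeps the first file of each hash group and only prints the rm commands, in the same dry-run style as above):

md5sum $(find files -type f) | sort | uniq -D -w 32 \
    | awk 'seen[$1]++ { print "rm -v", $2 }'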