Find Duplicate Files (based on size, name, and md5sum)

find -type f -printf '%20s\t%100f\t%p\n' | sort -n | uniq -Dw121 | awk -F'\t' '{print $3}' | xargs -d '\n' md5sum | uniq -Dw32 | cut -b 35- | xargs -d '\n' ls -lU
It works extremely fast because it calculates the md5sum only for files that have the same size and name. But there is nothing for free: it won't find duplicates that have different names.
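
For readability, here is the same pipeline split across lines with comments (a sketch assuming GNU find and coreutils: the printf widths 20 + 1 + 100 = 121 are what uniq -Dw121 compares on, and md5sum's 32-character hash plus two spaces is why the path starts at byte 35):

    find -type f -printf '%20s\t%100f\t%p\n' |  # size (20 cols), basename (100 cols), full path
      sort -n |                                 # bring equal sizes (then names) together
      uniq -Dw121 |                             # keep only lines whose size+name prefix repeats
      awk -F'\t' '{print $3}' |                 # extract the full path
      xargs -d '\n' md5sum |                    # hash only those candidates
      uniq -Dw32 |                              # keep only repeated 32-character hashes
      cut -b 35- |                              # strip the "hash  " prefix
      xargs -d '\n' ls -lU                      # list the duplicate files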
Sample Output
-rw-r--r--  4 root root  728724 Nov  6  2014 ./gstreamer1.0-alsa/changelog.gz
-rw-r--r--  4 root root  728724 Nov  6  2014 ./gstreamer1.0-plugins-base/changelog.gz
-rw-r--r--  4 root root  728724 Nov  6  2014 ./gstreamer1.0-x/changelog.gz
-rw-r--r--  4 root root  728724 Nov  6  2014 ./libgstreamer-plugins-base1.0-0/changelog.gz
-rw-r--r--  2 root root 1012313 Nov  6  2014 ./gstreamer1.0-plugins-bad/changelog.gz
-rw-r--r--  2 root root 1012313 Nov  6  2014 ./libgstreamer-plugins-bad1.0-0/changelog.gz
-rw-r--r--  3 root root 1146705 Nov  9  2014 ./libglib2.0-0/changelog.gz
-rw-r--r--  3 root root 1146705 Nov  9  2014 ./libglib2.0-bin/changelog.gz
-rw-r--r--  3 root root 1146705 Nov  9  2014 ./libglib2.0-data/changelog.gz
-rw-r--r--  2 root root 1209168 May 31  2013 ./epiphany-browser/changelog.gz
-rw-r--r--  2 root root 1209168 May 31  2013 ./epiphany-browser-data/changelog.gz
-rw-r--r--  1 root root 1899053 Nov 11  2016 ./xserver-common/changelog.gz
-rw-r--r--  1 root root 1899053 Nov 11  2016 ./xserver-xorg-core/changelog.gz

0 votes · By: ant7 · 2017-05-21 02:26:16

These Might Interest You

  • An improvement of the command "Find Duplicate Files (based on size first, then MD5 hash)" for searching for duplicate files in a directory that contains a Subversion working copy. This way the (multiple) duplicates in the meta-information directories are ignored. It can easily be adapted to other VCSs as well; for CVS, for example, change ".svn" to "CVS": find -type d -name "CVS" -prune -o -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type d -name "CVS" -prune -o -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate


    2
    find -type d -name ".svn" -prune -o -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type d -name ".svn" -prune -o -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
    2chg · 2010-01-28 09:45:29
  • This dup finder saves time by comparing size first, then md5sum. It doesn't delete anything; it just lists the duplicates.


    73
    find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
    syssyphus · 2009-09-21 00:24:14
  • Finds duplicates based on MD5 sum, comparing only files that have the same size. A performance improvement on: find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate The new version takes around 3 seconds where the old version took around 17 minutes. The bottleneck in the old command was the second find, which searches for files of each given size; the new version keeps the file path and size from the beginning (see the annotated sketch after this list).


    1
    find -not -empty -type f -printf "%-30s'\t\"%h/%f\"\n" | sort -rn -t$'\t' | uniq -w30 -D | cut -f 2 -d $'\t' | xargs md5sum | sort | uniq -w32 --all-repeated=separate
    fobos3 · 2014-10-19 02:00:55
  • Avoids the nested 'find' commands but doesn't seem to run any faster than syssyphus's solution.


    0
    find . -type f -size +0 -printf "%-25s%p\n" | sort -n | uniq -D -w 25 | sed 's/^\w* *\(.*\)/md5sum "\1"/' | sh | sort | uniq -w32 --all-repeated=separate
    jimetc · 2013-02-23 20:44:20
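
As a rough illustration of the "keep the path and size from the beginning" idea described above, here is fobos3's variant split across lines with comments (a lightly simplified sketch assuming GNU find and coreutils; the stray apostrophe in the printf format and the explicit tab delimiter for sort are dropped):

    find -not -empty -type f -printf '%-30s\t"%h/%f"\n' |  # size padded to 30 cols, then the quoted path
      sort -rn |                                            # bring equal sizes together
      uniq -w30 -D |                                        # keep only sizes that occur more than once
      cut -f2 -d$'\t' |                                     # keep just the quoted path
      xargs md5sum |                                        # hash the candidates (quotes protect spaces)
      sort |                                                # bring equal hashes together
      uniq -w32 --all-repeated=separate                     # print groups of identical files

Unlike the command at the top of the page, this approach matches candidates on size alone, so duplicates with different names are also found.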
