Commands tagged duplicate files (3)

  • Finds duplicates based on MD5 sum, comparing only files that share the same size. This is a performance improvement on:
      find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
    The new version takes around 3 seconds where the old version took around 17 minutes. The bottleneck in the old command was the second find, which searched the tree again for every duplicated file size. The new version keeps each file's path alongside its size from the beginning, so no second search is needed.


    2
    find -not -empty -type f -printf "%-30s'\t\"%h/%f\"\n" | sort -rn -t$'\t' | uniq -w30 -D | cut -f 2 -d $'\t' | xargs md5sum | sort | uniq -w32 --all-repeated=separate
    fobos3 · 2014-10-19 02:00:55 1
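
The size-then-hash idea can also be written out as a small Bash function. This is only a minimal sketch, not fobos3's command: the name find_dupes is hypothetical, it assumes GNU find, xargs and coreutils, and it swaps the padded-width "uniq -w30 -D" step for an awk count of each size. Paths are passed one per line, so file names containing newlines are not handled.

```shell
#!/usr/bin/env bash
# find_dupes is a hypothetical helper name used for illustration only.
find_dupes() {
    local dir="${1:-.}"
    # Emit "size<TAB>path" for every non-empty regular file.
    find "$dir" -type f -not -empty -printf '%s\t%p\n' |
        # Keep only the lines whose size occurs more than once.
        awk -F'\t' '
            { count[$1]++; size[NR] = $1; line[NR] = $0 }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[size[i]] > 1) print line[i]
            }' |
        # Drop the size column, leaving one path per line.
        cut -f2- |
        # Hash the candidates; -d "\n" keeps spaces and quotes in names intact.
        xargs -r -d '\n' md5sum |
        # Sort by hash and print groups of identical hashes.
        sort |
        uniq -w32 --all-repeated=separate
}
```

Running find_dupes on a directory prints groups of identical files separated by blank lines, in the same "hash  path" format as the one-liner.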
  • The pipeline works in these steps:
    * Find all file sizes and file names from the current directory down (replace "." with a target directory as needed).
    * Sort the file sizes in numeric order.
    * List only the duplicated file sizes.
    * Drop the file sizes so that only a list of files remains (order retained).
    * Calculate md5sums on all of the files.
    * Replace the first instance of two spaces (md5sum output) with a \0.
    * Drop the unique md5sums so only duplicate files remain listed.
    * Use AWK to aggregate identical files on one line.
    * Remove the blank line from the beginning. (This was done more efficiently by putting another "IF" into the AWK command, but then the whole line exceeded the 255 char limit.)
    Each output line contains the md5sum and then all of the files that have that identical md5sum. All fields are \0 delimited. All records are \n delimited.


    0
    find . -type f -not -empty -printf "%-25s%p\n"|sort -n|uniq -D -w25|cut -b26-|xargs -d"\n" -n1 md5sum|sed "s/  /\x0/"|uniq -D -w32|awk -F"\0" 'BEGIN{l="";}{if(l!=$1||l==""){printf "\n%s\0",$1}printf "\0%s",$2;l=$1}END{printf "\n"}'|sed "/^$/d"
    alafrosty · 2013-10-22 13:34:19 0
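
The AWK aggregation step can be easier to follow when spread over several lines. The sketch below is an illustration under stated assumptions rather than alafrosty's code: group_by_hash is a hypothetical name, the input is assumed to be md5sum output already sorted by hash, and fields are joined with tabs instead of \0 for readability, so paths containing tabs or newlines are not handled.

```shell
#!/usr/bin/env bash
# group_by_hash is a hypothetical helper name used for illustration only.
# It reads md5sum output (sorted by hash) on stdin and prints one line per
# hash: the hash, then every path that shares it, separated by tabs.
group_by_hash() {
    awk '
        {
            hash = substr($0, 1, 32)   # md5 hex digest is 32 characters
            path = substr($0, 35)      # skip the digest and the 2-char separator
            if (hash != prev) {
                if (NR > 1) printf "\n"    # close the previous record
                printf "%s", hash
                prev = hash
            }
            printf "\t%s", path
        }
        END { if (NR > 0) printf "\n" }'
}
```

It could be fed with something like "md5sum * | sort | uniq -D -w32 | group_by_hash" to list only hashes that occur more than once.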
  • To recurse into subdirectories:
      find -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c1-7,41-
    To display only file names:
      find -maxdepth 1 -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c43-


    0
    find -maxdepth 1 -type f -exec md5sum '{}' ';' | sort | uniq -c -w 33 | sort -gr | head -n 5 | cut -c1-7,41-
    MaDCOw · 2017-02-09 11:36:31 0
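
The counting step relies on fixed column offsets in the "uniq -c" output. A rough alternative sketch, assuming GNU find and md5sum, counts the copies per hash in awk instead. The name count_copies is hypothetical, "-exec ... +" is used to batch files into fewer md5sum calls, and file names containing newlines or backslashes are not handled.

```shell
#!/usr/bin/env bash
# count_copies is a hypothetical helper name used for illustration only.
# Prints "count<TAB>path" for each distinct hash in a directory, highest
# count first; the path shown is the first file seen with that hash.
count_copies() {
    local dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + |
        awk '
            {
                hash = substr($0, 1, 32)
                n[hash]++
                if (!(hash in first)) first[hash] = substr($0, 35)
            }
            END { for (h in n) printf "%d\t%s\n", n[h], first[h] }' |
        # Sort numerically on the count column, largest first.
        sort -k1,1nr
}
```

Piping the result through "head -n 5" gives the same top-five view as the one-liner.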
